Best Practices from Lessons Learned
In this section we highlight some of the best practices from the lessons learned during implementation of GrepTheWeb.
Best Practices of Amazon S3
Upload Large Files, Retrieve Small Offsets
End-to-end transfer data rates in Amazon S3 are best when large files are stored instead of many small files (sizes in the lower KBs). So instead of storing individual files on Amazon S3, multiple files were bundled and compressed (gzip) into a blob (.gz) and then stored on Amazon S3 as objects. The individual files were retrieved using a standard HTTP GET request by providing a URL (bucket and key), an offset (byte range), and a size (byte length). As a result, the overall cost of storage was reduced: compression shrank the overall size of the dataset, and consequently fewer PUT requests were required than otherwise.
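The bundle-and-range-fetch idea can be sketched as follows. This is an illustrative example, not the GrepTheWeb source: `bundle` and `fetch` are hypothetical helpers, and the local blob slice stands in for an S3 GET with an HTTP `Range` header.

```python
import gzip
import io

def bundle(files):
    """files: dict of name -> bytes. Returns (blob, index of name -> (offset, length))."""
    blob = io.BytesIO()
    index = {}
    for name, data in files.items():
        start = blob.tell()
        # Each file is compressed as an independent gzip member, so the byte
        # range covering one member decompresses on its own.
        blob.write(gzip.compress(data))
        index[name] = (start, blob.tell() - start)
    return blob.getvalue(), index

def fetch(blob, index, name):
    # In production this slice would be an S3 GET with a Range header
    # ("Range: bytes=offset-(offset+length-1)"); here we slice the local
    # blob to illustrate the offset arithmetic.
    offset, length = index[name]
    return gzip.decompress(blob[offset:offset + length])
```

The index (name to offset/length) is the piece that must be stored alongside the blob so that individual files remain addressable after bundling.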
Sort the Keys and Then Upload Your Dataset
Amazon S3 reconcilers show a performance improvement if the keys are pre-sorted before upload. By running a small script, the keys (URL pointers) were sorted and then uploaded to Amazon S3 in sorted order.
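The pre-sort step amounts to a few lines. In this sketch, `upload` is a hypothetical stand-in for the real S3 PUT call:

```python
# Sort the URL keys lexicographically before uploading, so Amazon S3
# receives the PUTs in key order. `upload` is a hypothetical stand-in
# for the real S3 PUT call.
def upload_sorted(keys, upload):
    for key in sorted(keys):
        upload(key)
```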
Use Multi-threaded Fetching
Instead of fetching objects one by one from Amazon S3, multiple concurrent fetch threads were started within each map task. However, it is not advisable to spawn hundreds of threads, because every node has bandwidth constraints. Ideally, ramp up the number of concurrent threads slowly until you find the point where adding more threads offers no further speed improvement.
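The ramp-up experiment can be sketched with a thread pool. This is an illustrative harness, not the GrepTheWeb code: `fetch_object` is a hypothetical stand-in for an S3 GET, and the candidate thread counts are example values.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def best_thread_count(keys, fetch_object, candidates=(1, 2, 4, 8, 16)):
    """Time the same batch of fetches at increasing levels of concurrency
    and return the thread count with the best wall-clock time."""
    timings = {}
    for n in candidates:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(fetch_object, keys))   # fetch all keys with n threads
        timings[n] = time.perf_counter() - start
    return min(timings, key=timings.get)
```

In practice you would stop increasing the pool size once the marginal improvement flattens out, rather than exhaustively timing every candidate.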
Use Exponential Back-off and Then Retry
A reasonable approach for any application is to retry every failed web service request. What is less obvious is what strategy to use to determine the retry interval. Our recommended approach is truncated binary exponential back-off: the maximum possible delay is successively doubled on each failure (up to a truncation cap), and the actual time to sleep is chosen randomly from that range. We recommend that you build the exponential back-off, sleep, and retry logic into the error-handling code of your client. Exponential back-off reduces the number of requests made to Amazon S3, and thereby the overall cost, while not overloading any part of the system.
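The strategy above can be written compactly. The base delay, cap, and retry count below are illustrative choices, not values from the paper:

```python
import random
import time

def call_with_backoff(request, max_retries=5, base=1.0, cap=32.0):
    """Retry `request` with truncated binary exponential back-off."""
    for attempt in range(max_retries):
        try:
            return request()
        except Exception:
            if attempt == max_retries - 1:
                raise                                  # give up after the last attempt
            window = min(cap, base * (2 ** attempt))   # doubling window, truncated at cap
            time.sleep(random.uniform(0, window))      # random sleep within the window
```

The random choice within the window spreads retries from many clients apart in time, which is what prevents a thundering herd against the service.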
Best Practices of Amazon SQS
Store Reference Information in the Message
Amazon SQS is ideal for small, short-lived messages in workflows and processing pipelines. To stay within the message size limits, it is advisable to store reference information as part of the message and to store the actual input file on Amazon S3. In GrepTheWeb, the launch queue message contains the URL of the input file (.dat.gz), which is a small subset of a result set (a million search results can have up to 10 million links). Likewise, the shutdown queue message contains the URL of the output file (.dat.gz), which is a filtered result set containing the links that match the regular expression. The following tables show the message format of the queue and their statuses.
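An illustrative message shape (not the paper's exact schema) shows the pattern: the queue message carries only a pointer to the file on Amazon S3 plus a little metadata, keeping it well under the SQS size limit.

```python
import json

def make_launch_message(job_id, input_url, regex):
    """Build a launch-queue message that references the input file on S3
    instead of embedding its contents. Field names are illustrative."""
    return json.dumps({
        "jobId": job_id,
        "inputFile": input_url,   # e.g. an S3 URL pointing at a .dat.gz object
        "regEx": regex,
    })
```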
Use Process-oriented Messaging and Document oriented Messaging
There are two messaging approaches that have worked effectively for us: process-oriented and document-oriented messaging. Process-oriented messaging is defined by processes or actions: the typical approach is to delete the old message from the “from” queue and then add a new message with new attributes to the “to” queue. Document-oriented messaging passes one message per user/job through the entire system, with different message attributes added along the way. It is often implemented using XML/JSON because of their extensible models; messages can evolve over time, and each receiver only needs to understand the parts of the message that are important to it. For GrepTheWeb, we decided to use the process-oriented approach.
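The process-oriented hand-off can be sketched with plain lists standing in for SQS queues. This is an illustrative sketch, not the GrepTheWeb implementation:

```python
def advance(from_queue, to_queue, new_attrs):
    """Process-oriented hand-off: delete the message from the "from" queue
    and enqueue a new message, with the next phase's attributes, on the
    "to" queue. Plain lists stand in for SQS queues here."""
    message = from_queue.pop(0)          # read and delete from the "from" queue
    message = {**message, **new_attrs}   # rebuild with new attributes
    to_queue.append(message)
    return message
```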
Take Advantage Of Visibility Timeout Feature
Amazon SQS has a special functionality that is not present in many other messaging systems: when a message is read from the queue, it is not automatically deleted; instead, it becomes invisible to other readers of the queue for a period of time. The consumer must explicitly delete the message from the queue. If this hasn’t happened within a certain period after the message was read, the consumer is assumed to have failed and the message reappears in the queue to be consumed again. This period is the so-called visibility timeout, set when creating the queue. In GrepTheWeb, the visibility timeout is very important because certain processes (such as the shutdown controller) might fail and not respond (e.g., instances would stay up). With the visibility timeout set to a certain number of minutes, another controller thread would pick up the old message and resume the task (of shutting down).
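A toy in-memory queue illustrates the visibility-timeout semantics described above (this is not the SQS API): a received message is hidden rather than deleted, and it reappears if the consumer fails to delete it before the timeout expires.

```python
import time

class VisibilityQueue:
    def __init__(self, visibility_timeout):
        self.timeout = visibility_timeout
        self.messages = {}   # message id -> (body, invisible_until)

    def send(self, msg_id, body):
        self.messages[msg_id] = (body, 0.0)

    def receive(self):
        now = time.monotonic()
        for msg_id, (body, until) in self.messages.items():
            if until <= now:
                # Hide the message until the visibility timeout expires;
                # if the consumer never deletes it, it becomes visible again.
                self.messages[msg_id] = (body, now + self.timeout)
                return msg_id, body
        return None   # nothing currently visible

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)
```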
Best Practices of Amazon SimpleDB
Multithread GetAttributes() and PutAttributes()
In Amazon SimpleDB, domains have items, and items have attributes. Querying Amazon SimpleDB returns a set of items, but often the attribute values are needed to perform a particular task. In that case, a query call is followed by a series of GetAttributes calls, one for each item in the list, and run serially this is slow. To address this, it is highly recommended to multi-thread your GetAttributes calls and run them in parallel; overall performance increases dramatically (up to 50 times). In the GrepTheWeb application, this approach helped create more dynamic monthly activity reports.
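The query-then-fetch pattern can be parallelized with a thread pool. In this sketch, `get_attributes` is a hypothetical stand-in for the real SimpleDB GetAttributes call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all_attributes(item_names, get_attributes, max_workers=16):
    """Issue one GetAttributes call per item returned by the query, in
    parallel rather than serially. `get_attributes` stands in for the
    real per-item SimpleDB call."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(item_names, pool.map(get_attributes, item_names)))
```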
Use Amazon SimpleDB in Conjunction With Other Services
Build frameworks, libraries, and utilities that use the functionality of two or more services together. For GrepTheWeb, we built a small framework that uses Amazon SQS and Amazon SimpleDB together to externalize the appropriate state. For example, all controllers inherit from the BaseController class. The BaseController class’s main responsibility is to dequeue the message from the “from” queue, validate the statuses from a particular Amazon SimpleDB domain, execute the function, update the statuses with a new timestamp and status, and put a new message in the “to” queue. The advantage of such a setup is that in the event of a hardware failure, or when a controller instance dies, a new node can be brought up almost immediately and resume where the old one left off by retrieving the messages from the Amazon SQS queue and their statuses from Amazon SimpleDB, making the overall system more resilient. Although not used in this design, a common practice is to store the actual files as objects on Amazon S3 and to store all the metadata related to each object on Amazon SimpleDB; using the object’s Amazon S3 key as the item name in Amazon SimpleDB is also a common practice.
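A hedged sketch of the BaseController pattern described above, with plain callables standing in for the SQS and SimpleDB clients. The method and field names are illustrative, not the GrepTheWeb source:

```python
import time

class BaseController:
    def __init__(self, dequeue, enqueue, get_status, put_status):
        self.dequeue = dequeue          # read a message from the "from" queue (SQS)
        self.enqueue = enqueue          # write a message to the "to" queue (SQS)
        self.get_status = get_status    # read a job's status (SimpleDB)
        self.put_status = put_status    # write status + timestamp (SimpleDB)

    def execute(self, message):
        """Overridden by concrete controllers (launch, monitor, shutdown...)."""
        raise NotImplementedError

    def run_once(self):
        message = self.dequeue()
        if message is None:
            return None
        # Validate the job's current status before doing any work.
        if self.get_status(message["jobId"]) != message["expectedStatus"]:
            return None
        result = self.execute(message)
        # Externalize the new state: status to SimpleDB, message to SQS.
        self.put_status(message["jobId"], result["status"], time.time())
        self.enqueue(result)
        return result
```

Because all state lives in SQS and SimpleDB rather than on the instance, a replacement controller constructed with the same clients resumes exactly where the failed one stopped.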
Best Practices of Amazon EC2
Launch Multiple Instances All At Once
Instead of waiting for your EC2 instances to boot up one by one, we recommend that you start all of them at once with a single run-instances command that specifies the number of instances of each type.
Automate As Much As Possible
This applies to everything we do and deserves special mention because automation of Amazon EC2 is often overlooked. One of the biggest features of Amazon EC2 is that you can provision any number of compute instances with a simple web service call. Automation empowers the developer to run a dynamic, programmable datacenter that expands and contracts based on need. For example, automating your build-test-deploy cycle in the form of an Amazon Machine Image (AMI) and then running it automatically on Amazon EC2 every night (using a cron job) will save a lot of time. By automating the AMI creation process, one can save a lot of time in configuration and optimization.
Add Compute Instances On-The-Fly
With Amazon EC2, we can fire up a node within minutes. Hadoop supports the dynamic addition of new nodes and task tracker nodes to a running cluster. One can simply launch new compute instances, start the Hadoop processes on them, point them to the master, and dynamically grow (and shrink) the cluster in real time to speed up the overall process.
Safeguard Your AWS Credentials When Bundling an AMI
If your AMI is running processes that need to communicate with other AWS web services (for polling the Amazon SQS queue or for reading objects from Amazon S3), one common design mistake is embedding the AWS credentials in the AMI. Instead of embedding the credentials, they should be passed in as arguments using the parameterized launch feature and encrypted before being sent over the wire.
General steps are:
1. Generate a new RSA keypair (use OpenSSL tools).
2. Copy the private key onto the image, before you bundle it (so it will be embedded in the final AMI).
3. Post the public key along with the image details, so users can use it.
4. When a user launches the image, they must first encrypt their AWS secret key (or private key, if they want to use SOAP) with the public key you gave them in step 3. This encrypted data should be injected via user-data at launch (i.e., the parameterized launch feature).
5. Your image can then decrypt the user-data at boot time with the private key and recover the credentials needed to contact Amazon S3. Also be sure to delete this private key at boot, before installing the SSH key (i.e., before users can log into the machine). If users won’t have root access, you don’t have to delete the private key; just make sure it’s not readable by users other than root.
Thanks to Kenji Matsuoka and Tinou Bao – the core team that developed the GrepTheWeb Architecture.
Further Reading
Amazon SimpleDB White Papers
Amazon SQS White paper
Hadoop Wiki
Hadoop Website
Distributed Grep Examples
Map Reduce Paper
Blog: Taking Massive Distributed Computing to the Common man – Hadoop on Amazon EC2/S3
Since the Amazon Web Services are primitive building-block services, the most value is derived when they are used in conjunction with other services:
• Use Amazon S3 and Amazon SimpleDB together whenever you want to query Amazon S3 objects using their metadata. We recommend that you store large files on Amazon S3 and the associated metadata and reference information on Amazon SimpleDB, so that developers can query the metadata. Read-only metadata can also be stored on Amazon S3 as metadata on the object (e.g., author, create date, etc.).
• Use Amazon SimpleDB and Amazon SQS together whenever you want an application to run in phases. Store transient messages in Amazon SQS and the statuses of jobs/messages in Amazon SimpleDB, so that you can update statuses frequently and get the status of any request at any time by simply querying the item. This works especially well in asynchronous systems.
• Use Amazon S3 and Amazon SQS together whenever you want to create processing pipelines or producer-consumer solutions. Store raw files on Amazon S3 and insert a corresponding message in an Amazon SQS queue with the reference and metadata (S3 URI, etc.).