(CONTINUED FROM 9/09/15)
Zooming in further, the GrepTheWeb architecture looks as shown in Figure 2 (above). It uses the following AWS components:
Amazon S3 for retrieving input datasets and for storing the output dataset
Amazon SQS for durably buffering requests, acting as the "glue" between controllers
Amazon SimpleDB for storing intermediate status, logs, and user data about tasks
Amazon EC2 for running a large distributed processing Hadoop cluster on-demand
Hadoop for distributed processing, automatic parallelization, and job scheduling
GrepTheWeb is modular. It does its processing in four phases, as shown in Figure 3. The launch phase is responsible for validating and initiating the processing of a GrepTheWeb request, instantiating Amazon EC2 instances, launching the Hadoop cluster on them, and starting all the job processes. The monitor phase is responsible for monitoring the EC2 cluster and the map and reduce tasks, and for checking for success and failure. The shutdown phase is responsible for billing and for shutting down all Hadoop processes and Amazon EC2 instances, while the cleanup phase deletes the transient Amazon SimpleDB data.
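The four phases above can be sketched as a simple sequential pipeline. The sketch below is illustrative only: the handler functions, state fields, and job values are hypothetical stand-ins, not GrepTheWeb's actual code.

```python
# Illustrative sketch of GrepTheWeb's four-phase pipeline.
# Each phase takes the job state and returns an updated state.

def launch(state):
    """Validate the request, start EC2 instances, launch the Hadoop jobs."""
    assert state.get("regex"), "a GrepTheWeb request must carry a regex"
    return {**state, "status": "running"}

def monitor(state):
    """Poll the cluster and the map/reduce tasks; record the outcome."""
    return {**state, "status": "succeeded"}  # assume the job completed

def shutdown(state):
    """Bill the user, then terminate Hadoop processes and EC2 instances."""
    return {**state, "instances": 0}

def cleanup(state):
    """Delete the transient Amazon SimpleDB records for this job."""
    return {**state, "transient_records": 0}

def run_job(request):
    state = request
    for phase in (launch, monitor, shutdown, cleanup):
        state = phase(state)
    return state

result = run_job({"regex": r"cloud", "instances": 20, "transient_records": 5})
print(result["status"], result["instances"], result["transient_records"])
```

Running the pipeline end-to-end leaves the job succeeded with no instances or transient records remaining, mirroring the launch-monitor-shutdown-cleanup ordering described above.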
The Use of Amazon Web Services
In the next four subsections we present the rationale for each service and describe how GrepTheWeb uses it.
How Was Amazon S3 Used
In GrepTheWeb, Amazon S3 acts as both the input and the output data store. The input to GrepTheWeb is the web itself (a compressed form of Alexa's Web Crawl), stored on Amazon S3 as objects and updated frequently. Because the web crawl dataset can be huge (usually terabytes) and is always growing, there was a need for distributed, bottomless, persistent storage. Amazon S3 proved to be a perfect fit.
How Was Amazon SQS Used
Amazon SQS was used as the message-passing mechanism between components. It acts as the "glue" that wires the different functional components together. This not only helped make the different components loosely coupled, but also helped in building an overall more failure-resilient system.
If one component is receiving and processing requests faster than other components (an unbalanced producer-consumer situation), buffering will help make the overall system more resilient to bursts of traffic (or load). Amazon SQS acts as a transient buffer between two components (controllers) of the GrepTheWeb system. If a message were sent directly to a component, the receiver would need to consume it at a rate dictated by the sender. For example, if the billing system was slow or if the launch time of the Hadoop cluster took longer than expected, the overall system would slow down, as it would just have to wait. With message queues, sender and receiver are decoupled, and the queue service smooths out any "spiky" message traffic.
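This decoupling can be demonstrated locally with Python's standard-library `queue.Queue` standing in for an SQS queue; the producer, consumer, and message names below are illustrative, not part of GrepTheWeb.

```python
import queue
import threading
import time

# A minimal stand-in for SQS: a fast producer bursts 100 messages into a
# buffer queue, while a slower consumer drains them at its own pace.
# Neither side dictates the other's rate - the buffer absorbs the spike.

buffer = queue.Queue()  # unbounded, like an SQS queue for practical purposes
consumed = []

def producer():
    for i in range(100):              # a sudden burst of traffic
        buffer.put(f"request-{i}")    # en-queue returns immediately

def consumer():
    while len(consumed) < 100:
        msg = buffer.get()            # blocks until a message is available
        time.sleep(0.001)             # simulate slower downstream processing
        consumed.append(msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(consumed))  # 100
```

The producer finishes almost instantly even though the consumer is three orders of magnitude slower per message; without the buffer, the producer would be throttled to the consumer's pace.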
Interaction between any two controllers in GrepTheWeb is through messages in a queue; no controller directly calls any other controller. All communication and interaction happens by storing messages in a queue (en-queue) and retrieving messages from a queue (de-queue). This makes the entire system loosely coupled and the interfaces simple and clean. Amazon SQS provided a uniform way of transferring information between the different application components. Each controller's function is to retrieve a message, process it (execute the function), and store the result in the next queue, while remaining completely isolated from the other controllers.
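The controller pattern described above can be sketched with local queues; the queue names, controller stages, and message fields below are hypothetical, chosen only to mirror the phase names used earlier.

```python
import queue

# Sketch of the GrepTheWeb controller pattern: each controller only
# de-queues from its input queue, processes the message, and en-queues
# the result to the next queue. Controllers never call one another.

launch_q, monitor_q, shutdown_q, done_q = (queue.Queue() for _ in range(4))

def make_controller(in_q, out_q, step_name):
    def controller():
        msg = in_q.get()                                   # de-queue
        msg = {**msg, "history": msg["history"] + [step_name]}  # process
        out_q.put(msg)                                     # en-queue next
    return controller

controllers = [
    make_controller(launch_q, monitor_q, "launched"),
    make_controller(monitor_q, shutdown_q, "monitored"),
    make_controller(shutdown_q, done_q, "shut down"),
]

launch_q.put({"job": "grep-42", "history": []})
for run in controllers:
    run()

final = done_q.get()
print(final["history"])  # ['launched', 'monitored', 'shut down']
```

Note that each controller is built from the same factory and knows only its own input and output queues; swapping in a new stage means adding a queue, not changing any existing controller.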
(TO BE CONTINUED)