(CONTINUED FROM 7/07/16)
As it was difficult to know how much time each phase would take to execute (e.g., the launch phase decides dynamically how many instances need to start based on the request and hence execution time is unknown) Amazon SQS helped in building asynchronous systems. Now, if the launch phase takes more time to process or the monitor phase fails, the other components of the system are not affected and the overall system is more stable and highly available.
How Was Amazon SimpleDB Used
One use for a database in Cloud Architectures is to track statuses. Since the components of the system are asynchronous, there is a need to obtain the status of the system at any given point in time. Moreover, since all components are autonomous and discrete there is a need for a query-able datastore that captures the state of the system.
Because Amazon SimpleDB is schema-less, there is no need to define the structure of a record beforehand. Every controller can define its own structure and append data to a ―job‖ item. For example: For a given job, ―run email address regex over 10 million documents‖, the launch controller will add/update the ‖launch_status‖ attribute along with the ‖launch_starttime‖, while the monitor controller will add/update the ―monitor_status‖ and ‖hadoop_status‖ attributes with enumeration values (running, completed, error, none). A GetStatus() call will query Amazon SimpleDB and return the state of each controller and also the overall status of the system.
Component services can query Amazon SimpleDB anytime because controllers independently store their states–one more nice way to create asynchronous highly-available services. Although, a simplistic approach was used in implementing the use of Amazon SimpleDB in GrepTheWeb, a more sophisticated approach, where there was complete, almost real-time monitoring would also be possible. For example, storing the Hadoop JobTracker status to show how many maps have been performed at a given moment.
Amazon SimpleDB is also used to store active Request IDs for historical and auditing/billing purposes.
In summary, Amazon SimpleDB is used as a status database to store the different states of the components and a historical/log database for querying high performance data.
How Was Amazon EC2 Used
In GrepTheWeb, all the controller code runs on Amazon EC2 Instances. The launch controller spawns master and slave instances using a pre-configured Amazon Machine Image (AMI). Since the dynamic provisioning and decommissioning happens using simple web service calls, GrepTheWeb knows how many master and slave instances needs to be launched.
The launch controller makes an educated guess, based on reservation logic, of how many slaves are needed to perform a particular job. The reservation logic is based on the complexity of the query (number of predicates etc) and the size of the input dataset (number of documents to be searched). This was also kept configurable so that we can reduce the processing time by simply specifying the number of instances to launch.
After launching the instances and starting the Hadoop cluster on those instances, Hadoop will appoint a master and slaves, handles the negotiating, handshaking and file distribution (SSH keys, certificates) and runs the grep job.
Hadoop Map Reduce
Hadoop is an open source distributed processing framework that allows computation of large datasets by splitting the dataset into manageable chunks, spreading it across a fleet of machines and managing the overall process by launching jobs, processing the job no matter where the data is physically located and, at the end, aggregating the job output into a final result.
It typically works in three phases. A map phase transforms the input into an intermediate representation of key value pairs, a combine phase (handled by Hadoop itself) combines and sorts by the keys and a reduce phase recombines the intermediate representation into the final output. Developers implement two interfaces, Mapper and Reducer, while Hadoop takes care of all the distributed processing (automatic parallelization, job scheduling, job monitoring, and result aggregation).
In Hadoop, there’s a master process running on one node to oversee a pool of slave processes (also called workers) running on separate nodes. Hadoop splits the input into chunks. These chunks are assigned to slaves, each slave performs the map task (logic specified by user) on each pair found in the chunk and writes the results locally and informs the master of the completed status. Hadoop combines all the results and sorts the results by the keys. The master then assigns keys to the reducers. The reducer pulls the results using an iterator, runs the reduce task (logic specified by user), and sends the ―final‖ output back to distributed file system.
Hadoop suits well the GrepTheWeb application. As each grep task can be run in parallel independently of other grep tasks using the parallel approach embodied in Hadoop is a perfect fit.
For GrepTheWeb, the actual documents (the web) are crawled ahead of time and stored on Amazon S3. Each user starts a grep job by calling the StartGrep function at the service endpoint. When triggered, masters and slave nodes (Hadoop cluster) are started on Amazon EC2 instances. Hadoop splits the input (document with pointers to Amazon S3 objects) into multiple manageable chunks of 100 lines each and assign the chunk to a slave node to run the map task. The map task reads these lines and is responsible for fetching the files from Amazon S3, running the regular expression on them and writing the results locally. If there is no match, there is no output. The map tasks then passes the results to the reduce phase which is an identity function (pass through) to aggregate all the outputs. The ―final‖ output is written back to Amazon S3.
(TO BE CONTINUED)
Amazon Web Services