(BEING CONTINUED FROM 29/03/17)
Tips for Designing a Cloud Architecture Application
1. Ensure that your application is scalable by designing each component to be scalable on its own. If every component implements a service interface, responsible for its own scalability in all appropriate dimensions, then the overall system will have a scalable base.
2. For better manageability and high-availability, make sure that your components are loosely coupled. The key is to build components without having tight dependencies between each other, so that if one component were to die (fail), sleep (not respond) or remain busy (slow to respond) for some reason, the other components in the system are built so as to continue to work as if no failure is happening.
3. Implement parallelization for better use of the infrastructure and for performance. Distributing the tasks on multiple machines, multithreading your requests and effective aggregation of results obtained in parallel are some of the techniques that help exploit the infrastructure.
4. After designing the basic functionality, ask the question ―What if this fails?‖ Use techniques and approaches that will ensure resilience. If any component fails (and failures happen all the time), the system should automatically alert, failover, and re-sync back to the ―last known state‖ as if nothing had failed.
5. Don’t forget the cost factor. The key to building a cost-effective application is using on-demand resources in your design. It’s wasteful to pay for infrastructure that is sitting idle.
Each of these points is discussed further in the context of GrepTheWeb.
Use Scalable Ingredients
The GrepTheWeb application uses highly-scalable components of the Amazon Web Services infrastructure that not only scale on-demand, but also are charged for on-demand.
All components of GrepTheWeb expose a service interface that defines the functions and can be called using HTTP requests and get back XML responses. For programming convenience small client libraries wrap and abstract the service specific code.
Each component is independent from the others and scales in all dimensions. For example, if thousands of requests hit Amazon SimpleDB, it can handle the demand because it is designed to handle massive parallel requests.
Likewise, distributed processing frameworks like Hadoop are designed to scale. Hadoop automatically distributes jobs, resumes failed jobs, and runs on multiple nodes to process terabytes of data.
Have Loosely Coupled Systems
The GrepTheWeb team built a loosely coupled system using messaging queues. If a queue/buffer is used to “wire” any two components together, it can support concurrency, high availability and load spikes. As a result, the overall system continues to perform even if parts of components become unavailable. If one component dies or becomes temporarily unavailable, the system will buffer the messages and get them processed when the component comes back up.
In GrepTheWeb, for example, if lots of requests suddenly reach the server (an Internet-induced overload situation) or the processing of regular expressions takes a longer time than the median (slow response rate of a component), the Amazon SQS queues buffer the requests durably so those delays do not affect other components.
As in a multi-tenant system is important to get statuses of message/request, GrepTheWeb supports it. It does it by storing and updating the status of your each request in a separate query-able data store. This is achieved using Amazon SimpleDB. This combination of Amazon SQS for queuing and Amazon SimpleDB for state management helps achieve higher resilience by loose coupling.
In this ‖era of tera‖ and multi-core processors, when programming we ought to think multi-threaded processes.
In GrepTheWeb, wherever possible, the processes were made thread-safe through a share-nothing philosophy and were multi-threaded to improve performance. For example, objects are fetched from Amazon S3 by multiple concurrent threads as such access is faster than fetching objects sequentially one at the time.
If multi-threading is not sufficient, think multi-node. Until now, parallel computing across large cluster of machines was not only expensive but also difficult to achieve. First, it was difficult to get the funding to acquire a large cluster of machines and then once acquired, it was difficult to manage and maintain them. Secondly, after it was acquired and managed, there were technical problems. It was difficult to run massively distributed tasks on the machines, store and access large datasets. Parallelization was not easy and job scheduling was error-prone. Moreover, if nodes failed, detecting them was difficult and recovery was very expensive. Tracking jobs and status was often ignored because it quickly became complicated as number of machines in cluster increased.
But now, computing has changed. With the advent of Amazon EC2, provisioning a large number of compute instances is easy. A cluster of compute instances can be provisioned within minutes with just a few API calls and decommissioned as easily. With the arrival of distributed processing frameworks like Hadoop, there is no need for high-caliber, parallel computing consultants to deploy a parallel application. Developers with no prior experience in parallel computing can implement a few interfaces in few lines of code, and parallelize the job without worrying about job scheduling, monitoring or aggregation.
On-Demand Requisition and Relinquishment
In GrepTheWeb each building-block component is accessible via the Internet using web services, reliably hosted in Amazon’s datacenters and available on-demand. This means that the application can request more resources (servers, storage, databases, queues) or relinquish them whenever needed.
A beauty of GrepTheWeb is its almost-zero-infrastructure before and after the execution. The entire infrastructure is instantiated in the cloud triggered by a job request (grep) and then is returned back to the cloud, when the job is done. Moreover, during execution, it scales on-demand; i.e. the application scales elastically based on number of messages and the size of the input dataset, complexity of regular expression and so-forth.
For GrepTheWeb, there is reservation logic that decides how many Hadoop slave instances to launch based on the complexity of the regex and the input dataset. For example, if the regular expression does not have many predicates, or if the input dataset has just 500 documents, it will only spawn 2 instances. However, if the input dataset is 10 million documents, it will spawn up to 100 instances.
Use Designs that Are Resilient to Reboot and Re-Launch
Rule of thumb: Be a pessimist when using Cloud Architectures; assume things will fail. In other words, always design, implement and deploy for automated recovery from failure.
In particular, assume that your hardware will fail. Assume that outages will occur. Assume that some disaster will strike your application. Assume that you will be slammed with more requests per second some day. By being pessimist, you end up thinking about recovery strategies during design time, which helps in designing an overall system better. For example, the following strategies can help in event of adversity:
1. Have a coherent backup and restore strategy for your data
2. Build process threads that resume on reboot
3. Allow the state of the system to re-sync by reloading messages from queues
4. Keep pre-configured and pre-optimized virtual images to support (2) and (3) on launch/boot
Good cloud architectures should be impervious to reboots and re-launches. In GrepTheWeb, by using a combination of Amazon SQS and Amazon SimpleDB, the overall controller architecture is more resilient. For instance, if the instance on which controller thread was running dies, it can be brought up and resume the previous state as if nothing had happened. This was accomplished by creating a pre-configured Amazon Machine Image, which when launched dequeues all the messages from the Amazon SQS queue and their states from the Amazon SimpleDB domain item on reboot.
If a task tracker (slave) node dies due to hardware failure, Hadoop reschedules the task on another node automatically. This fault-tolerance enables Hadoop to run on large commodity server clusters overcoming hardware failures.
Results and Costs
We ran several tests. Email Address Regular Expression was ran against 10 million documents. While 48 concurrent instances took 21 minutes to process, 92 concurrent instances took less than 6 min to process. This time includes instance launch time and start time of the Hadoop cluster. The total cost for 48 instances was around $5 and 92 instances was less than $10.
Instead of building your applications on fixed and rigid infrastructures, Cloud Architectures provide a new way to build applications on on-demand infrastructures.
GrepTheWeb demonstrates how such applications can be built.
Without having any upfront investment, we were able to run a job massively distributed on multiple nodes in parallel and scale incrementally based on the demand (users, size of the input dataset). With no idle time, the application infrastructure was never underutilized.
In the next section, we will learn how each of the Amazon Infrastructure Service (Amazon EC2, Amazon S3, Amazon SimpleDB and Amazon SQS) was used and we will share with you some of the lessons learned and some of the best practices.
(TO BE CONTINUED)
Amazon Web Services