This paper illustrates the style of building applications using services available in the Internet cloud.
Cloud Architectures are designs of software applications that use Internet-accessible, on-demand services. Applications built on Cloud Architectures use the underlying computing infrastructure only when it is needed (for example, to process a user request), draw the necessary resources on-demand (such as compute servers or storage), perform a specific job, and then relinquish the unneeded resources, often disposing of themselves after the job is done. While in operation, the application scales up or down elastically based on resource needs.
This paper is divided into two sections. In the first section, we describe an example of an application that is currently in production using the on-demand infrastructure provided by Amazon Web Services. This application allows a developer to do pattern-matching across millions of web documents. The application brings up hundreds of virtual servers on-demand, runs a parallel computation on them using an open source distributed processing framework called Hadoop, then shuts down all the virtual servers releasing all its resources back to the cloud—all with low programming effort and at a very reasonable cost for the caller.
In the second section, we discuss some best practices for using each Amazon Web Service – Amazon S3, Amazon SQS, Amazon SimpleDB and Amazon EC2 – to build an industrial-strength scalable application.
Why Cloud Architectures?
Cloud Architectures address key difficulties surrounding large-scale data processing. In traditional data processing, first, it is difficult to get as many machines as an application needs. Second, it is difficult to get the machines when one needs them. Third, it is difficult to distribute and coordinate a large-scale job on different machines, run processes on them, and provision another machine to recover if one machine fails. Fourth, it is difficult to autoscale up and down based on dynamic workloads. Fifth, it is difficult to get rid of all those machines when the job is done. Cloud Architectures solve such difficulties.
Applications built on Cloud Architectures run in-the-cloud, where the physical location of the infrastructure is determined by the provider. They take advantage of simple APIs of Internet-accessible services that scale on-demand, that are industrial-strength, and where the complex reliability and scalability logic of the underlying services remains implemented and hidden inside-the-cloud. Resources in Cloud Architectures are used as needed, sometimes ephemerally or seasonally, thereby providing the highest utilization and optimum bang for the buck.
Business Benefits of Cloud Architectures
There are some clear business benefits to building applications using Cloud Architectures. A few of these are listed here:
1. Almost zero upfront infrastructure investment: If you have to build a large-scale system, it may cost a fortune to invest in real estate, hardware (racks, machines, routers, backup power supplies), hardware management (power management, cooling), and operations personnel. Because of the upfront costs, it would typically take several rounds of management approvals before the project could even get started. Now, with utility-style computing, there is no fixed cost or startup cost.
2. Just-in-time Infrastructure: In the past, if you got famous and your systems or your infrastructure did not scale you became a victim of your own success.
Conversely, if you invested heavily and did not get famous, you became a victim of your failure. By deploying applications in-the-cloud with dynamic capacity management, software architects do not have to worry about pre-procuring capacity for large-scale systems. The solutions are low-risk because you scale only as you grow. Cloud Architectures can relinquish infrastructure as quickly as it was acquired in the first place (in minutes).
3. More efficient resource utilization: System administrators usually worry about procuring hardware (when they run out of capacity) and about better infrastructure utilization (when they have excess and idle capacity). With Cloud Architectures they can manage resources more effectively and efficiently by having applications request and relinquish only the resources they need (on-demand).
Examples of Cloud Architectures
There are plenty of examples of applications that could utilize the power of Cloud Architectures. These range from back-office bulk processing systems to web applications. Some are listed below:
Document processing pipelines – convert hundreds of thousands of documents from Microsoft Word to PDF, OCR millions of pages/images into raw searchable text
Image processing pipelines – create thumbnails or low-resolution variants of an image, resize millions of images
Video transcoding pipelines – transcode AVI to MPEG movies
Indexing – create an index of web crawl data
Data mining – perform search over millions of records
Batch Processing Systems
Back-office applications (in financial, insurance or retail sectors)
Log analysis – analyze and generate daily/weekly reports
Nightly builds – perform automated builds of the source code repository every night in parallel
Automated Unit Testing and Deployment Testing – perform automated unit testing (functional, load, quality) on different deployment configurations every night
Websites that "sleep" at night and auto-scale during the day
Instant Websites – websites for conferences or events (Super Bowl, sports tournaments)
"Seasonal Websites" – websites that only run during the tax season or the holiday season ("Black Friday" or Christmas)
In this paper, we will discuss one application example in detail, code-named "GrepTheWeb".
Cloud Architecture Example: GrepTheWeb
The Alexa Web Search web service allows developers to build customized search engines against the massive data that Alexa crawls every night. One of the features of their web service allows users to query the Alexa search index and get Million Search Results (MSR) back as output. Developers can run queries that return up to 10 million results.
The resulting set, which represents a small subset of all the documents on the web, can then be processed further using a regular expression language. This allows developers to filter their search results using criteria that are not indexed by Alexa (Alexa indexes documents based on fifty different document attributes) thereby giving the developer power to do more sophisticated searches. Developers can run regular expressions against the actual documents, even when there are millions of them, to search for patterns and retrieve the subset of documents that matched that regular expression.
This application is currently in production at Amazon.com and is code-named GrepTheWeb because it can "grep" (grep is a popular Unix command-line utility for searching patterns) the actual web documents. GrepTheWeb allows developers to do some pretty specialized searches, like selecting documents that have a particular HTML tag or META tag, finding documents with particular punctuation ("Hey!", he said. "Why Wait?"), or searching for mathematical equations ("f(x) = Σx + W"), source code, e-mail addresses, or other patterns such as "(dis)integration of life".
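The kinds of patterns described above can be illustrated with a small sketch. The documents and pattern names below are purely hypothetical examples, not GrepTheWeb's actual data or API; the sketch only shows how such regular expressions select a subset of documents.

```python
import re

# Hypothetical sample documents; GrepTheWeb runs such patterns over
# millions of actual crawled web pages.
documents = {
    "doc1": '<html><head><meta name="keywords" content="cloud"></head></html>',
    "doc2": '"Hey!", he said. "Why Wait?"',
    "doc3": "Contact us at support@example.com for details.",
}

# A few illustrative patterns of the kind mentioned above.
patterns = {
    "meta_tag": re.compile(r"<meta\b[^>]*>", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "exclamation_quote": re.compile(r'"[^"]*!"'),
}

def grep(documents, pattern):
    """Return the sorted ids of documents whose content matches the pattern."""
    return sorted(doc_id for doc_id, text in documents.items()
                  if pattern.search(text))

print(grep(documents, patterns["meta_tag"]))   # doc1 has a META tag
print(grep(documents, patterns["email"]))      # doc3 has an e-mail address
```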
While the functionality is impressive, for us the way it was built is even more so. In the next section, we will zoom in to see different levels of the architecture of GrepTheWeb.
Figure 1 shows a high-level depiction of the architecture. The output of the Million Search Results Service, a sorted list of links gzipped (compressed using the Unix gzip utility) into a single file, is given to GrepTheWeb as input. It takes a regular expression as a second input. It then returns a filtered subset of document links, sorted and gzipped into a single file. Since the overall process is asynchronous, developers can check the status of their jobs by calling GetStatus() to see whether the execution is completed.
Running a regular expression against millions of documents is not trivial. Different factors could combine to cause the processing to take a lot of time:
Regular expressions could be complex
Dataset could be large, even hundreds of terabytes
Unknown request patterns, e.g., any number of people can access the application at any given point in time
Hence, the design goals of GrepTheWeb included the ability to scale in all dimensions (more powerful pattern-matching languages, more concurrent users of common datasets, larger datasets, better result quality) while keeping the cost of processing down.
The approach was to build an application that not only scales with demand, but does so without a heavy upfront investment and without the cost of maintaining idle machines. To get a response in a reasonable amount of time, it was important to split the job into multiple tasks and to perform a Distributed Grep operation that runs those tasks on multiple nodes in parallel.
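The split-and-parallelize idea can be sketched locally. This is a minimal stand-in, not the actual implementation: the workers here are threads in one process, whereas in the real system each task runs on its own node under Hadoop, and the document contents would be fetched rather than held in memory.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def grep_task(chunk, pattern_text):
    """Grep one partition of (link, text) pairs; runs on a single worker."""
    pattern = re.compile(pattern_text)
    return [link for link, text in chunk if pattern.search(text)]

def distributed_grep(documents, pattern_text, workers=4):
    items = list(documents.items())
    # Partition the documents into roughly equal chunks, one per worker.
    size = max(1, len(items) // workers)
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(grep_task, chunks, [pattern_text] * len(chunks))
    # Merge the per-task results into one sorted list of matching links.
    return sorted(link for part in results for link in part)

docs = {"http://a.example": "foo bar", "http://b.example": "baz",
        "http://c.example": "food fight"}
print(distributed_grep(docs, r"foo"))  # links whose documents contain "foo"
```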
Figure 1 : GrepTheWeb Architecture – Zoom Level 1
Amazon Web Services