Cloud Architectures (A)


 

Introduction
This paper illustrates the style of building applications using services available in the Internet cloud.
Cloud Architectures are designs of software applications that use Internet-accessible on-demand services. Applications built on Cloud Architectures use the underlying computing infrastructure only when it is needed (for example, to process a user request): they draw the necessary resources on demand (such as compute servers or storage), perform a specific job, then relinquish the unneeded resources and often dispose of themselves after the job is done. While in operation, the application scales up or down elastically based on resource needs.
This paper is divided into two sections. In the first section, we describe an example of an application that is currently in production using the on-demand infrastructure provided by Amazon Web Services. This application allows a developer to do pattern-matching across millions of web documents. The application brings up hundreds of virtual servers on-demand, runs a parallel computation on them using an open source distributed processing framework called Hadoop, then shuts down all the virtual servers releasing all its resources back to the cloud—all with low programming effort and at a very reasonable cost for the caller.
In the second section, we discuss some best practices for using each Amazon Web Service – Amazon S3, Amazon SQS, Amazon SimpleDB and Amazon EC2 – to build an industrial-strength scalable application.
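The lifecycle described in this introduction (draw resources on demand, perform a job, relinquish the resources) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FakeCloud`, `on_demand_servers` and `run_job` are hypothetical names invented for this sketch, and the in-memory "cloud" stands in for a real provider rather than reproducing any Amazon Web Services API.

```python
from contextlib import contextmanager


class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud provider (not a real AWS API)."""

    def __init__(self):
        self.running = set()
        self._next_id = 0

    def launch(self, count):
        """Start `count` servers and return their ids."""
        ids = []
        for _ in range(count):
            self._next_id += 1
            server_id = f"server-{self._next_id}"
            self.running.add(server_id)
            ids.append(server_id)
        return ids

    def terminate(self, ids):
        """Release the given servers back to the provider."""
        self.running.difference_update(ids)


@contextmanager
def on_demand_servers(cloud, count):
    """Acquire servers for the duration of a job and always release them."""
    ids = cloud.launch(count)
    try:
        yield ids
    finally:
        cloud.terminate(ids)


def run_job(cloud, work_items, servers_needed):
    """Assign work items to servers round-robin while the servers exist."""
    with on_demand_servers(cloud, servers_needed) as servers:
        return {item: servers[i % len(servers)] for i, item in enumerate(work_items)}
```

The `finally` clause is the point of the sketch: the resources are relinquished even if the job fails, so nothing keeps running (and costing money) after the work is done.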

Why Cloud Architectures?

Cloud Architectures address key difficulties surrounding large-scale data processing. First, in traditional data processing it is difficult to get as many machines as an application needs. Second, it is difficult to get the machines when one needs them. Third, it is difficult to distribute and coordinate a large-scale job on different machines, run processes on them, and provision another machine to recover if one machine fails. Fourth, it is difficult to auto-scale up and down based on dynamic workloads. Fifth, it is difficult to get rid of all those machines when the job is done. Cloud Architectures solve such difficulties.
Applications built on Cloud Architectures run in-the-cloud, where the physical location of the infrastructure is determined by the provider. They take advantage of simple APIs of Internet-accessible services that scale on demand and are industrial-strength, where the complex reliability and scalability logic of the underlying services remains implemented and hidden inside-the-cloud. The usage of resources in Cloud Architectures is as needed, sometimes ephemeral or seasonal, thereby providing the highest utilization and optimum bang for the buck.

Business Benefits of Cloud Architectures

There are some clear business benefits to building applications using Cloud Architectures. A few of these are listed here:
1. Almost zero upfront infrastructure investment: If you have to build a large-scale system, it may cost a fortune to invest in real estate, hardware (racks, machines, routers, backup power supplies), hardware management (power management, cooling), and operations personnel. Because of the upfront costs, the project would typically need several rounds of management approvals before it could even get started. Now, with utility-style computing, there is no fixed cost or startup cost.
2. Just-in-time Infrastructure: In the past, if you got famous and your systems or your infrastructure did not scale, you became a victim of your own success. Conversely, if you invested heavily and did not get famous, you became a victim of your failure. By deploying applications in-the-cloud with dynamic capacity management, software architects do not have to worry about pre-procuring capacity for large-scale systems. The solutions are low risk because you scale only as you grow. Cloud Architectures can relinquish infrastructure as quickly as it was acquired in the first place (in minutes).
3. More efficient resource utilization: System administrators usually worry about procuring hardware (when they run out of capacity) and about better infrastructure utilization (when they have excess and idle capacity). With Cloud Architectures they can manage resources more effectively and efficiently by having the applications request only the resources they need and relinquish them when they are no longer needed (on-demand).

Examples of Cloud Architectures

There are plenty of examples of applications that could utilize the power of Cloud Architectures. These range from back-office bulk processing systems to web applications. Some are listed below:
• Processing Pipelines
  • Document processing pipelines – convert hundreds of thousands of documents from Microsoft Word to PDF, OCR millions of pages/images into raw searchable text
  • Image processing pipelines – create thumbnails or low resolution variants of an image, resize millions of images
  • Video transcoding pipelines – transcode AVI to MPEG movies
  • Indexing – create an index of web crawl data
  • Data mining – perform search over millions of records
• Batch Processing Systems
  • Back-office applications (in financial, insurance or retail sectors)
  • Log analysis – analyze and generate daily/weekly reports
  • Nightly builds – perform automated builds of the source code repository every night in parallel
  • Automated Unit Testing and Deployment Testing – test, deploy and perform automated unit testing (functional, load, quality) on different deployment configurations every night
• Websites
  • Websites that "sleep" at night and auto-scale during the day
  • Instant Websites – websites for conferences or events (Super Bowl, sports tournaments)
  • Promotion websites
  • "Seasonal Websites" – websites that only run during the tax season or the holiday season ("Black Friday" or Christmas)

In this paper, we will discuss one application example in detail, code-named "GrepTheWeb".

Cloud Architecture Example: GrepTheWeb

The Alexa Web Search web service allows developers to build customized search engines against the massive data that Alexa crawls every night. One of the features of their web service allows users to query the Alexa search index and get Million Search Results (MSR) back as output. Developers can run queries that return up to 10 million results.
The resulting set, which represents a small subset of all the documents on the web, can then be processed further using a regular expression language. This allows developers to filter their search results using criteria that are not indexed by Alexa (Alexa indexes documents based on fifty different document attributes) thereby giving the developer power to do more sophisticated searches. Developers can run regular expressions against the actual documents, even when there are millions of them, to search for patterns and retrieve the subset of documents that matched that regular expression.
This application is currently in production at Amazon.com and is code-named GrepTheWeb because it can "grep" (a popular Unix command-line utility for searching patterns) the actual web documents. GrepTheWeb allows developers to do some pretty specialized searches, such as selecting documents that have a particular HTML tag or META tag, finding documents with particular punctuation ("Hey!", he said. "Why Wait?"), or searching for mathematical equations ("f(x) = Σx + W"), source code, e-mail addresses, or other patterns such as "(dis)integration of life".
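To make these kinds of searches concrete, here is a minimal sketch using Python's standard `re` module. The regular expression dialect that GrepTheWeb actually accepts is not specified here, so the patterns below are assumptions written for Python; the `PATTERNS` table, the `matching_documents` helper and the sample documents are hypothetical and exist only for illustration.

```python
import re

# Hypothetical patterns for three of the search types mentioned above.
PATTERNS = {
    # Any META tag, regardless of case or attributes.
    "meta_tag": re.compile(r"<meta\b[^>]*>", re.IGNORECASE),
    # A deliberately simple e-mail pattern (not a full RFC 5322 matcher).
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # A literal phrase; re.escape keeps the parentheses from acting as a group.
    "literal_phrase": re.compile(re.escape("(dis)integration of life")),
}


def matching_documents(documents, pattern):
    """Return the sorted URLs of documents whose text matches the pattern."""
    return sorted(url for url, text in documents.items() if pattern.search(text))


# Made-up sample documents keyed by URL.
documents = {
    "http://example.com/a": '<html><head><meta name="author" content="x"></head></html>',
    "http://example.com/b": "Contact us at info@example.com for details.",
    "http://example.com/c": "An essay on the (dis)integration of life.",
}
```

Calling `matching_documents(documents, PATTERNS["email"])` returns only the URL of the second document. Note that the literal phrase needs escaping: written unescaped, `(dis)` would be treated as a capture group and would match the text "disintegration of life" instead.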
While the functionality is impressive, for us the way it was built is even more so. In the next section, we will zoom in to see different levels of the architecture of GrepTheWeb.

Figure 1 shows a high-level depiction of the architecture. The output of the Million Search Results Service, which is a sorted list of links gzipped (compressed using the Unix gzip utility) into a single file, is given to GrepTheWeb as input. GrepTheWeb takes a regular expression as a second input. It then returns a filtered subset of document links, sorted and gzipped into a single file. Since the overall process is asynchronous, developers can get the status of their jobs by calling GetStatus() to see whether the execution is completed.
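The input/output contract described above can be sketched as follows. This is a simplified illustration, not the service's implementation: the `GrepJob` class, the status strings and the `fetch` callable (which maps a link to the document text) are assumptions made for the sketch, and `run()` executes synchronously here, whereas the real process is asynchronous. Only the `GetStatus()` call itself comes from the description above; `get_status` mirrors it.

```python
import gzip
import re


def read_links(gz_bytes):
    """Decode a gzipped, newline-separated list of links."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line for line in text.splitlines() if line]


def write_links(links):
    """Sort the links and gzip them into a single blob."""
    return gzip.compress("\n".join(sorted(links)).encode("utf-8"))


class GrepJob:
    """Hypothetical job object: gzipped links and a regex in, gzipped links out."""

    def __init__(self, gz_links, regex, fetch):
        self.links = read_links(gz_links)
        self.pattern = re.compile(regex)
        self.fetch = fetch          # assumed helper: link -> document text
        self.status = "PENDING"     # status names are assumptions
        self.output = None

    def get_status(self):
        """Mirror of the GetStatus() call a developer would poll."""
        return self.status

    def run(self):
        """Filter the links by matching the regex against each document."""
        self.status = "RUNNING"
        matched = [link for link in self.links
                   if self.pattern.search(self.fetch(link))]
        self.output = write_links(matched)
        self.status = "COMPLETED"
```

A caller would submit the job, poll `get_status()` until it reports completion, and then decode `output` with `read_links` to obtain the filtered, sorted subset of links.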
Running a regular expression against millions of documents is not trivial. Different factors could combine to cause the processing to take a lot of time:
• Regular expressions could be complex
• The dataset could be large, even hundreds of terabytes
• Request patterns are unknown, e.g., any number of people can access the application at any given point in time
Hence, the design goals of GrepTheWeb included scaling in all dimensions (more powerful pattern-matching languages, more concurrent users of common datasets, larger datasets, better result qualities) while keeping the costs of processing down.
The approach was to build an application that not only scales with demand but also does so without a heavy upfront investment and without the cost of maintaining idle machines ("downbottom"). To get a response in a reasonable amount of time, it was important to distribute the job into multiple tasks and to perform a Distributed Grep operation that runs those tasks on multiple nodes in parallel.
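The idea of splitting a grep job into tasks and running them in parallel can be sketched on a single machine. In this illustration, worker threads from Python's `concurrent.futures` stand in for the multiple nodes; the production system uses Hadoop on virtual servers, which this sketch does not attempt to reproduce. The function names, the `task_size` and `workers` parameters, and the in-memory `documents` mapping (URL to text) are all assumptions.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(items, task_size):
    """Cut the input into fixed-size chunks, one chunk per task."""
    return [items[i:i + task_size] for i in range(0, len(items), task_size)]


def grep_task(task, regex):
    """Run the regex over one chunk and return the URLs that match."""
    pattern = re.compile(regex)
    return [url for url, text in task if pattern.search(text)]


def distributed_grep(documents, regex, task_size=2, workers=4):
    """Fan the tasks out to workers, then merge and sort the partial results."""
    tasks = split_into_tasks(sorted(documents.items()), task_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(lambda task: grep_task(task, regex), tasks))
    return sorted(url for part in partial_results for url in part)
```

Because each task depends only on its own chunk, tasks can run in any order and on any worker; the final merge step is what restores a single sorted result. That independence is what allows the same structure to be spread over many machines.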

Figure 1: GrepTheWeb Architecture – Zoom Level 1

(TO BE CONTINUED)

Jinesh Varia
Technology Evangelist
Amazon Web Services

