Using Amazon’s Elastic MapReduce (EMR) for Hadoop Streaming

While we can install Hadoop locally, we can also use Amazon’s standard EMR (Elastic MapReduce) service and run our MapReduce jobs in the cloud.

The following example runs Tom White’s Ruby programs from “Hadoop: The Definitive Guide”, which can be found on GitHub as max_temperature_map.rb and max_temperature_reduce.rb, along with the accompanying temperature data for the years 1901 and 1902 (see Appendix C of Tom White’s book).


First, we create an S3 bucket with the appropriate permissions and upload the programs:

Create Buckets
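
The same step can be scripted instead of done through the console shown in the screenshots. Below is a minimal sketch using boto3, the Python AWS SDK; the bucket name, region, and key prefixes are hypothetical placeholders, not values from the original post:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket that will hold the programs, input data, logs, and output.
# (In us-east-1 no LocationConstraint is required.)
s3.create_bucket(Bucket="my-emr-max-temperature")

# Upload Tom White's streaming mapper and reducer.
for script in ("max_temperature_map.rb", "max_temperature_reduce.rb"):
    s3.upload_file(script, "my-emr-max-temperature", f"scripts/{script}")
```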

Then we upload the data as well:

1901 and 1902
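
Uploading the data can be scripted the same way. A short sketch, assuming the sample files are named 1901 and 1902 as in the book’s sample data, and reusing the hypothetical bucket from above:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Upload the NCDC temperature records for 1901 and 1902
# (Appendix C of "Hadoop: The Definitive Guide").
for year in ("1901", "1902"):
    s3.upload_file(year, "my-emr-max-temperature", f"input/{year}")
```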

Then use the following settings: enable logging, turn on termination protection, and pick an Amazon distribution of Hadoop that you know works (here 2.4.2). These settings appear again in the scripted sketch after the streaming-job screenshot below.

Settings1

Configure one master and two core nodes of the m1.small EC2 instance type. This job is not very CPU intensive, so three small virtual machines running in Amazon’s cloud are good enough:

Amazon EC2 for EMR Settings

Set the job up as a streaming job, specifying the mapper and reducer programs:

Streaming Job Settings
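
All of the settings above — logging, termination protection, the Hadoop distribution, the instance counts and types, and the streaming step — can be combined into a single cluster-launch call. Here is a hedged boto3 sketch; the legacy AmiVersion field, the streaming-jar path, and every S3 path are assumptions based on EMR of that era, not values confirmed by the post:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="max-temperature-streaming",
    # Enable logging and pin an AMI whose Hadoop distribution is
    # known to work (2.4.2, as in the screenshots).
    LogUri="s3://my-emr-max-temperature/logs/",
    AmiVersion="2.4.2",
    Instances={
        # One master and two core m1.small instances, with
        # termination protection turned on.
        "MasterInstanceType": "m1.small",
        "SlaveInstanceType": "m1.small",
        "InstanceCount": 3,
        "TerminationProtected": True,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "max-temperature",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                # Streaming jar path as found on the 2.x EMR AMIs (an assumption).
                "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
                "Args": [
                    "-input", "s3://my-emr-max-temperature/input",
                    "-output", "s3://my-emr-max-temperature/output",
                    "-mapper", "s3://my-emr-max-temperature/scripts/max_temperature_map.rb",
                    "-reducer", "s3://my-emr-max-temperature/scripts/max_temperature_reduce.rb",
                ],
            },
        }
    ],
)
print(response["JobFlowId"])
```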

Now the Amazon EMR job is ready to run. Run it (you might need to retry if, for one reason or another, creating and provisioning the machines times out).

Cluster Details
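
If you launch the cluster from code, you can poll its state rather than watch the console. A sketch, with a placeholder cluster id standing in for the JobFlowId returned above:

```python
import time
import boto3

emr = boto3.client("emr", region_name="us-east-1")

cluster_id = "j-XXXXXXXXXXXXX"  # placeholder
while True:
    state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
    print(state)
    # Stop polling once the cluster is past STARTING/BOOTSTRAPPING.
    if state in ("RUNNING", "WAITING", "TERMINATED", "TERMINATED_WITH_ERRORS"):
        break
    time.sleep(30)
```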

As you can see, it can take as long as ten minutes for the job to actually run once the streaming step is submitted.

Steps
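
The per-step state and timing visible in the console can also be read programmatically; again, the cluster id is a placeholder:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Print each step's name, state, and timing information.
for step in emr.list_steps(ClusterId="j-XXXXXXXXXXXXX")["Steps"]:
    status = step["Status"]
    print(step["Name"], status["State"], status.get("Timeline", {}))
```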

We can also query the Hadoop master virtual machine on port 9100 to see what Hadoop is up to:

Hadoop Port 9100
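
If the master node’s security group allows access from your machine, the same page can be fetched in code. A sketch; the jobtracker.jsp path is the standard Hadoop 1.x JobTracker page, which is an assumption about this AMI’s Hadoop rather than something stated in the post:

```python
import urllib.request
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Look up the master node's public DNS name, then fetch the
# JobTracker web UI on port 9100.
master = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")["Cluster"]["MasterPublicDnsName"]
with urllib.request.urlopen(f"http://{master}:9100/jobtracker.jsp") as page:
    print(page.read()[:500])
```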

When Hadoop and MapReduce complete, the results can be found in the bucket we specified:

Hadoop Results
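
The output can then be pulled down from S3 as well. A sketch, assuming the hypothetical bucket and output prefix used earlier; Hadoop streaming writes its results as part-* files under the output directory:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# List the output prefix and print the contents of each part file.
listing = s3.list_objects_v2(Bucket="my-emr-max-temperature", Prefix="output/")
for obj in listing.get("Contents", []):
    if "part-" in obj["Key"]:
        body = s3.get_object(Bucket="my-emr-max-temperature", Key=obj["Key"])["Body"]
        print(body.read().decode())
```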
