While we can install a local copy of Hadoop and run MapReduce jobs on it, we can also use Amazon's managed service, EMR (Elastic MapReduce), and run in the cloud.
The following example runs Tom White’s Ruby programs from “Hadoop: The Definitive Guide”, available on GitHub as max_temperature_map.rb and max_temperature_reduce.rb, together with the accompanying temperature data for the years 1901 and 1902 (see Appendix C of Tom White’s book).
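For reference, the streaming mapper and reducer work roughly as follows. This is a sketch in the spirit of Tom White’s programs, not his exact code; the fixed-column field offsets follow the NCDC record format described in the book:

```ruby
# Sketch of the streaming logic: the mapper extracts (year, temperature)
# pairs from fixed-width NCDC weather records; the reducer keeps the
# maximum temperature per year. Offsets (year at 15..18, temperature at
# 87..91, quality code at 92) follow the NCDC format from the book.

def map_line(line)
  year, temp, quality = line[15, 4], line[87, 5], line[92, 1]
  # Skip missing readings (+9999) and records with bad quality codes
  return nil if temp == "+9999" || quality !~ /[01459]/
  "#{year}\t#{temp.to_i}"
end

def reduce(pairs)
  # Hadoop's shuffle delivers pairs grouped by key; here we fold them
  # into a hash keyed by year, keeping the maximum value seen
  max = Hash.new(-9999)
  pairs.each { |year, temp| max[year] = [max[year], temp].max }
  max.map { |year, temp| "#{year}\t#{temp}" }
end
```

On the cluster, the two scripts simply read records from STDIN and print tab-separated key/value lines to STDOUT, which is all that Hadoop streaming requires.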
First we create a bucket with the appropriate permissions, and upload the programs:
Then we upload the input data as well:
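The same bucket setup and uploads can be done from the command line. This is a sketch assuming the AWS CLI is installed and configured; the bucket name and local paths are hypothetical placeholders:

```shell
# Create the bucket, then upload the mapper, reducer, and input data
# (bucket name and local file paths are placeholders)
aws s3 mb s3://my-emr-demo-bucket
aws s3 cp max_temperature_map.rb    s3://my-emr-demo-bucket/code/
aws s3 cp max_temperature_reduce.rb s3://my-emr-demo-bucket/code/
aws s3 cp input/1901 s3://my-emr-demo-bucket/input/
aws s3 cp input/1902 s3://my-emr-demo-bucket/input/
```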
Then use the following settings: enable logging, enable termination protection, and pick an Amazon distribution of Hadoop that you know works (here, 2.4.2).
Configure one master and two core nodes of the m1.small EC2 instance type. This job is not very CPU intensive, so three virtual machines running in Amazon’s cloud are good enough:
Set the job as a streaming job, with the mapper and reducer functions specified:
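The cluster configuration above can equivalently be expressed as a single AWS CLI call. This is a sketch: the cluster name, key pair, and bucket are placeholders, and the exact flags may differ across CLI versions:

```shell
# One master + two core m1.small nodes, logging and termination
# protection enabled, with the streaming step attached
# (names, key pair, and bucket are placeholders)
aws emr create-cluster \
  --name "max-temperature" \
  --log-uri s3://my-emr-demo-bucket/logs/ \
  --termination-protected \
  --ami-version 2.4.2 \
  --ec2-attributes KeyName=my-key-pair \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.small \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.small \
  --steps Type=STREAMING,Name="Max temperature",ActionOnFailure=TERMINATE_CLUSTER,\
Args=[-files,"s3://my-emr-demo-bucket/code/max_temperature_map.rb,s3://my-emr-demo-bucket/code/max_temperature_reduce.rb",\
-mapper,max_temperature_map.rb,-reducer,max_temperature_reduce.rb,\
-input,s3://my-emr-demo-bucket/input/,-output,s3://my-emr-demo-bucket/output/]
```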
Now the Amazon EMR job is ready to run. Run it (you may need to retry if, for one reason or another, creating and provisioning the machines times out).
As you can see, it can take as long as ten minutes for the job to actually run once the streaming step is finalized.
We can also query the Hadoop master virtual machine on port 9100 to see what Hadoop is up to:
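One way to reach that web UI is to tunnel through SSH. This is a sketch; the key file and the master node's public DNS name are placeholders, and EMR clusters accept logins as the hadoop user:

```shell
# Forward local port 9100 to the Hadoop web UI on the master node
# (key file and DNS name are placeholders)
ssh -i my-key-pair.pem -N -L 9100:localhost:9100 \
    hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
# then browse to http://localhost:9100
```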
When the Hadoop MapReduce job completes, the results can be found in the bucket we specified:
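The output can then be pulled down and inspected locally. This is a sketch assuming the AWS CLI, with the bucket name again a placeholder; the part-* naming is Hadoop's standard convention for reducer output files:

```shell
# Download the reducer output files and inspect them
# (bucket name is a placeholder)
aws s3 cp s3://my-emr-demo-bucket/output/ results/ --recursive
cat results/part-*
```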