Getting Started with AWS EMR

AWS Elastic MapReduce (EMR) is basically a front end to an army of large EC2 instances running Hadoop. The idea is that it pulls its data from S3 buckets, runs the jobs, and then stores the results back in S3 buckets. I skimmed through a book on it, but didn't get much out of it. You are better off learning cool algorithms and general theory instead of specializing in EMR. (Plus, the book was dated.) Also, the EMR interface is pretty intuitive.

To get data up to be worked on, we first have to upload it to S3. I used s3cmd, but you could use the web interface as well. I have a Mac, so I ran the commands below to install and configure the command-line client:
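Roughly, it looks like this, assuming you already have Homebrew installed (the exact commands aren't preserved here):

```bash
# install the s3cmd command-line client via Homebrew
brew install s3cmd

# interactive setup: paste in your AWS access key and secret key,
# then let it run the connection test at the end
s3cmd --configure
```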

The configure command should have run a test at the end to make sure you have access. Once you do, you can create a bucket. I'm going to make one for storing sacred texts.
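Something like this works (the bucket name matches the job input path used later; yours will need to be globally unique):

```bash
# create a bucket for the texts ("mb" = make bucket)
s3cmd mb s3://sacred-texts
```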

Now let's upload a few files:
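For example (the local filenames here are made up; the texts/ prefix matches the input path the job will read from):

```bash
# upload a handful of plain-text files into a texts/ prefix
s3cmd put king-james-bible.txt quran.txt bhagavad-gita.txt s3://sacred-texts/texts/
```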

Side Note:

How much does it cost to store data? Not much. According to the pricing guide, we get charged $0.03 per GB per month. Since this data so far isn't even 1 GB, we're not hurting. But then there are also the request charges. GET requests are $0.004 per 10,000 requests. Since I won't be making many of those, we should be OK. There's also the data transfer pricing. Transferring into AWS is free. Transferring out (via the Internet) costs nothing for the first GB per month.

You can see how this can add up. Suppose I had 300 TB of data. It costs $0 to put in, but then costs $8,850/month (300,000 GB * $0.0295/GB) to sit there. That adds up to $106,200/yr. If you wanted to take that data out of AWS, it would cost $15,000 to move it (300,000 GB * $0.050/GB).

Creating an EMR cluster and running a job

Now let's create an EMR cluster and run a job on it. EMR is really just a front end for people to launch jobs on preconfigured EC2 instances. It's almost like a PaaS for Hadoop, Spark, etc. The nice thing is that it comes with useful higher-level tools like Pig. (Things that Nathan Marz discourages us from using.)

First create a cluster.

[Screenshot: the EMR Create Cluster page]

We chose the sample application at the top that does a word count for us. We then modified it to read from our own directory (s3://sacred-texts/texts/). This will load all of our texts and compute the word count for each of the files.

[Screenshot: the step configuration pointing at s3://sacred-texts/texts/]
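Incidentally, if you'd rather skip the console, the AWS CLI can do roughly the same thing. This is only a sketch: the output path, AMI version, and role flag are my assumptions, and the step mirrors the classic streaming word-count sample (the wordSplitter.py mapper with the aggregate reducer), which may not be exactly what the console configures for you.

```bash
# launch a 3-node m3.xlarge cluster, run the streaming word-count step,
# and shut the cluster down when the step finishes
aws emr create-cluster \
  --name "sacred-texts-wordcount" \
  --ami-version 3.8.0 \
  --use-default-roles \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --steps Type=STREAMING,Name=WordCount,ActionOnFailure=TERMINATE_CLUSTER,Args=[-files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://sacred-texts/texts/,-output,s3://sacred-texts/output/] \
  --auto-terminate
```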

The cluster then provisions and we wait for the setup to complete.  The setup takes a lot longer than the actual job takes to run! The job soon finishes:

[Screenshot: the cluster with the word-count step completed]

Once it's done, we can look at our output. It's all in the S3 bucket we told it to go to. Traversing the directory, we have a bunch of output files:

[Screenshot: the part-* output files in the S3 bucket]

Each one of these is a piece of the word-count output (some of part-0000 is shown below):

This is similar to what we did in the previous post, but this time we used Hadoop and we ran it over more files than just one piece of text. We also wrote no code to do this. However, it's not giving us the most meaningful information. In fact, this output doesn't give us the combined counts. To get those, we can combine the part files and then run a simple Unix sort on them:
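Something along these lines does the trick (assuming the job wrote to s3://sacred-texts/output/ and each output line is word<TAB>count; adjust the paths to match your own bucket):

```bash
# pull the job output down locally
s3cmd get --recursive s3://sacred-texts/output/ results/

# combine the part files into one list sorted by count, biggest first
cat results/part-* | sort -t$'\t' -k2,2 -nr > combined-counts.txt
head combined-counts.txt
```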

Now there are so many questions we could start asking once we have data and the computing power to help answer them. For example, we could search Twitter for 'happy because' and find out what people are happy about, or for 'bummed' or 'sad because' and find out why people are sad, all using simple word counts.

At the end, I deleted all my stuff on S3:
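Roughly like this (assuming everything, including the EMR logs, landed under the same bucket; if your logs went to a separate bucket, delete that one too):

```bash
# delete everything in the bucket, then remove the bucket itself
s3cmd del --recursive s3://sacred-texts/
s3cmd rb s3://sacred-texts
```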

I had to clear the logs as well.  How much did this cost?

Well, I had to do it three times to get it right. Each time it launched a cluster with 3 m3.xlarge instances. If we were using them as standard EC2 instances, it would be $0.280/hr, but since we used them through EMR, it only cost us $0.07/hr. So 9 instance-hours * $0.07 = $0.63 to try that out.

You can see how this can be a pretty compelling setup for small data and for experimenting. That is the main point. Experimenting is great with EMR, but at any real scale of infrastructure the costs can get high pretty quickly, especially if you are always churning the data and constantly creating new batches, with jobs running all the time as new data comes in.

If you are curious, I put the data out on GitHub. Also, for the record, the total cost for this experiment was about $0.66 ($0.63 for the EMR instances + $0.03 for S3 storage). A pretty cheap way to get into the world of big data!