Data Analysis

Data analytics is the hottest thing around. Knowing how to structure and manipulate data, and then find answers in it, is the sexiest skill on the job market today!

Cisco Live

Before getting into cool things like Apache Spark, Hadoop, or EMR, let’s start with a basic example: a word count on a book.

Project Gutenberg has a ton of books available. I’m going to choose one and do some basic text manipulation. Not everything needs big data, and there’s a lot you can do from your own laptop.

I’m going to borrow from a great post here and do some manipulation on some text I found. I’m not going to use any Hadoop. Just plain Python.

I’ll use the same script, but I stripped out the punctuation with some help from Stack Overflow:
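As a rough idea of what a punctuation-stripping mapper can look like, here is a minimal sketch. It is not the exact script from the post I borrowed from: the function name `map_words`, the tab separator, and the sample lines are all assumptions made for illustration. It removes ASCII punctuation with `str.translate` and emits one record per word with a count of 1.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def map_words(lines):
    """Yield one 'word<TAB>1' record per word, with ASCII punctuation removed."""
    for line in lines:
        for word in line.translate(_PUNCT_TABLE).split():
            yield f"{word}\t1"


# Hypothetical sample input; a real run would iterate over the book's lines.
sample = ["Hello, world!", "Hello again."]
for record in map_words(sample):
    print(record)
```

To use something like this in a pipe, you would pass `sys.stdin` to `map_words` instead of the sample list. Note that `string.punctuation` only covers ASCII, so curly quotes in a book's text would survive, and apostrophes are dropped, so "don't" becomes "dont".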


Here’s the pass:

We’re simply printing all the text, piping it to our mapper program, and storing the results in the bom-out.txt file.

So now we have bom-out.txt, which is just one word per line with a count next to it:

foo 1

Now we need to sort it.
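The reason for sorting is that it puts identical words next to each other, which makes counting them trivial. A minimal Python sketch of that step, assuming tab-separated `word<TAB>1` records (the function name `sort_records` and the sample records are hypothetical):

```python
def sort_records(records):
    """Return the mapper records sorted so that equal words sit next to each other."""
    return sorted(records)


# Hypothetical mapper output, in the order the words appeared in the text.
records = ["the\t1", "book\t1", "the\t1", "a\t1"]
for record in sort_records(records):
    print(record)
```

Python compares strings by code point, so capitalized words sort ahead of lowercase ones. That is why "The" and "the" end up in different groups until case is normalized.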

Now we have bom-sorted-out.txt. Next up, we do the word count.
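A word count over sorted records can be sketched with `itertools.groupby`, which groups adjacent items that share a key. This is one possible reducer rather than the script actually used here. It assumes well-formed `word<TAB>count` records with no blank lines, and the name `reduce_counts` is made up for the example.

```python
from itertools import groupby


def reduce_counts(sorted_records):
    """Sum the counts of adjacent records that share the same word."""
    # Split each record into [word, count]; input must already be sorted by word.
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


# Hypothetical sorted mapper output.
sorted_records = ["a\t1", "book\t1", "the\t1", "the\t1"]
for word, total in reduce_counts(sorted_records):
    print(f"{word}\t{total}")
```

Because `groupby` only merges neighbours, feeding it unsorted input would produce several partial counts for the same word. That is why the sort has to come first.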

This gives us the output, but now let’s see which word is used the most.  This is another sort.
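Ranking by frequency is just a sort on the count instead of the word. A small sketch, assuming the totals are available as `(word, count)` pairs (the helper `rank_by_count` and the sample numbers are invented for illustration, not taken from the book):

```python
def rank_by_count(word_totals, top=10):
    """Return the `top` most frequent (word, total) pairs, highest count first."""
    # Negate the count for descending order; break ties alphabetically.
    return sorted(word_totals, key=lambda pair: (-pair[1], pair[0]))[:top]


# Made-up totals, purely to show the ordering.
totals = [("a", 3), ("book", 1), ("the", 7), ("of", 3)]
for word, total in rank_by_count(totals, top=3):
    print(f"{total}\t{word}")
```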

This gives some of the usual suspects:

We could probably do better if we made it so case didn’t matter. We could then also do it in a one-pass script. Let’s try it.

Version 2

In the mapper script we change the last line to be:

This way it spits everything out in lowercase.  Now to run the script in one line, we do the following:
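For comparison, the whole case-insensitive count can also be collapsed into a single pass inside Python. This sketch swaps the map, sort, and reduce pipeline for `collections.Counter`, so it is an alternative rather than the one-liner itself. The name `count_words` and the sample lines are assumptions.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def count_words(lines):
    """Count words in one pass, ignoring case and ASCII punctuation."""
    counts = Counter()
    for line in lines:
        counts.update(line.translate(_PUNCT_TABLE).lower().split())
    return counts


# Hypothetical sample input; a real run would iterate over the book's lines.
sample = ["The cat saw the dog.", "THE END"]
for word, total in count_words(sample).most_common(3):
    print(f"{word}\t{total}")
```

The trade-off is that `Counter` keeps every distinct word in memory. That is fine for one book on a laptop, but it is exactly the assumption that stops holding at Hadoop scale.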

Now our output looks a little different:

What we’ve shown here is the beginning of what things like Hadoop do for us. We have unstructured data and we apply two operations. Map: this is where we emit each word with a count of 1. Reduce: this is where we add up those counts to get how many times each word appeared. In this case our data set wasn’t too huge and could be processed on our laptop.
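To make the two operations concrete, here is a toy sketch that also spells out the grouping step in between (often called the shuffle), which is the job our sort was doing. The three function names are illustrative and not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: collect the emitted values under their word."""
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each word's list of ones into a single total."""
    return {word: sum(ones) for word, ones in groups.items()}


lines = ["to be or not to be"]
print(reduce_phase(shuffle_phase(map_phase(lines))))
# prints {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

On a cluster the map and reduce calls run on many machines at once and the framework handles the shuffle, but the shape of the computation is the same.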

Here’s another book:

There are still a few problems, but this seems to work well. The next step is to use a natural language processing toolkit to find similar phrases. We could then dump this into HDFS and process all kinds of books. Lots of interesting places to go from here!

Last note: I did upload the data and scripts to GitHub.