{"id":3379,"date":"2015-06-15T17:56:15","date_gmt":"2015-06-15T23:56:15","guid":{"rendered":"http:\/\/benincosa.com\/?p=3379"},"modified":"2015-06-16T14:37:34","modified_gmt":"2015-06-16T20:37:34","slug":"data-analysis","status":"publish","type":"post","link":"https:\/\/benincosa.com\/?p=3379","title":{"rendered":"Data Analysis"},"content":{"rendered":"<p>Data analytics is the hottest thing around. \u00a0Knowing how to structure and manipulate data, and then find answers is the <a href=\"https:\/\/twitter.com\/JulieWojewoda\/status\/608701948481523712\/photo\/1\">sexiest job out on the market today<\/a>!<\/p>\n<figure style=\"width: 1024px\" class=\"wp-caption alignnone\"><img decoding=\"async\" loading=\"lazy\" class=\"\" src=\"https:\/\/pbs.twimg.com\/media\/CHKLMVbVAAA6I59.jpg:large\" alt=\"\" width=\"1024\" height=\"768\" \/><figcaption class=\"wp-caption-text\">Cisco Live<\/figcaption><\/figure>\n<p>Before getting into cool things like apache spark, hadoop, or using EMR, let&#8217;s just start out with a basic example: \u00a0Word count on a book.<\/p>\n<p>Project Guttenburg has a ton of books out there. \u00a0I&#8217;m going to choose one and then just do some basic text manipulation. \u00a0Not everything needs big data and there&#8217;s a lot you can do just from your own laptop.<\/p>\n<p>I&#8217;m going to borrow from a<a href=\"http:\/\/www.michael-noll.com\/tutorials\/writing-an-hadoop-mapreduce-program-in-python\/\"> great post here<\/a> and do some manipulation on some text I found. \u00a0I&#8217;m not going to use any hadoop. \u00a0Just plain python.<\/p>\n<p>I&#8217;ll use the same script, but I stripped out the punctuation with some help from <a href=\"http:\/\/stackoverflow.com\/questions\/5843518\/remove-all-special-characters-punctuation-and-spaces-from-string\">Stack Overflow<\/a>:<\/p>\n<pre class=\"lang:python decode:true\" title=\"mapper.py\">!\/usr\/bin\/env python\r\n\r\nimport sys \r\n\r\n# input comes from STDIN (standard input)\r\nfor line in sys.stdin:\r\n    # remove leading and trailing whitespace\r\n    line = line.strip()\r\n    # split the line into words\r\n    words = line.split()\r\n    # increase counters\r\n    for word in words:\r\n        w = ''.join(e for e in word if e.isalnum())\r\n        # write the results to STDOUT (standard output);\r\n        # what we output here will be the input for the\r\n        # Reduce step, i.e. the input for reducer.py\r\n        #   \r\n        # tab-delimited; the trivial word count is 1\r\n        print '%s\\t%s' % (w, 1)<\/pre>\n<p>&nbsp;<\/p>\n<p>Here&#8217;s the pass:<\/p>\n<pre class=\"lang:sh decode:true \">cat ..\/data\/bom.txt | .\/mapper.py | tee -a ..\/out\/bom-out.txt<\/pre>\n<p>We&#8217;re simply printing all the text and then piping to our mapper program and then storing the results in the bom-out.txt file.<\/p>\n<p>So now we have bom-out.txt which is just a word with a count next to it:<\/p>\n<p>foo 1<\/p>\n<p>Now we need to sort it.<\/p>\n<pre class=\"lang:sh decode:true \">cat bom-out.txt | sort -k1,1 | tee bom-sorted-out.txt<\/pre>\n<p>So now we have bom-sorted-out.txt. \u00a0So next up, we do the word count.<\/p>\n<pre class=\"lang:sh decode:true \">cat ..\/out\/bom-sorted-out.txt | .\/reduce.py | tee ..\/out\/bom-reduced-out.txt<\/pre>\n<p>This gives us the output, but now let&#8217;s see which word is used the most. 
This gives us the output, but now let's see which word is used the most. This is another sort:

```sh
cat ../out/bom-reduced-out.txt | sort -k2n
```

This gives some of the usual suspects:

```
...
their	2800
it	3061
he	3145
I	3306
unto	3641
in	3674
they	4446
And	4565
to	6445
that	6842
and	11765
of	11787
the	19120
```

We could probably do better by making the count case-insensitive. We could then also run the whole thing as a one-pass pipeline. Let's try it.

### Version 2

In the mapper script we change the last line to:

```python
print '%s\t%s' % (w.lower(), 1)
```

This way it spits everything out in lowercase. Now, to run everything in one line, we do the following:

```sh
cat ../data/bom.txt | ./mapper.py | sort -k1,1 | ./reduce.py | sort -k2n | tee ../out/results2.txt
```

Now our output looks a little different:

```
...
their	2807
it	3075
he	3173
i	3306
unto	3641
in	3693
they	4485
to	6450
that	6864
of	11814
and	16331
the	19230
```

What we've shown here is the beginning of what things like Hadoop do for us. We have unstructured data and we apply two operations. Map: emit each word with a count of 1. Reduce: sum up the counts for each word. In this case our data set wasn't too huge, so the whole thing fits on a laptop (see the single-script sketch at the end of this post).

Here's another book:

```
...
a	8177
his	8473
i	8854
for	8970
unto	8997
shall	9838
he	10420
in	12667
that	12913
to	13562
of	34618
and	51696
the	63924
```

There are still a few problems, but this seems to work well. The next step is to use a [natural language processing toolkit](http://www.nltk.org) and find similar phrases (there's a small teaser sketch at the very end of this post). We then could dump this into HDFS and process all kinds of books. Lots of interesting places to go from here!

Last note: I did upload the data and scripts to [github](https://github.com/vallard/Data-Science/tree/master/01-WordCount-Python).
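For comparison, here's a hypothetical single-script version of the whole map/sort/reduce pipeline using Python's built-in `collections.Counter`. It isn't in the repo above; it's just a sketch of how small this problem really is once you drop the shell plumbing:

```python
#!/usr/bin/env python
# wordcount.py -- hypothetical one-script version of the pipeline above
# (not part of the linked repo; just a sketch)

import sys
from collections import Counter

counts = Counter()

# same normalization as mapper.py version 2: strip punctuation, lowercase
for line in sys.stdin:
    for word in line.strip().split():
        w = ''.join(e for e in word if e.isalnum()).lower()
        if w:
            counts[w] += 1

# print least-common first, mimicking sort -k2n
for word, count in reversed(counts.most_common()):
    print '%s\t%s' % (word, count)
```

Run it the same way: `cat ../data/bom.txt | ./wordcount.py`. The Counter does in memory what the sort-then-reduce steps did on disk, which is exactly the trade-off Hadoop makes in the other direction when the data no longer fits.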
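And since I mentioned NLTK, here's a tiny teaser of what the toolkit gives you for free. This is a sketch, not anything from the repo: it assumes you've done `pip install nltk` and downloaded the `punkt` tokenizer data with `nltk.download('punkt')`, and `book.txt` is a placeholder for any Project Gutenberg text.

```python
#!/usr/bin/env python
# nltk_teaser.py -- a sketch of the NLTK next step (assumes punkt is installed;
# 'book.txt' is a placeholder filename)

import nltk

with open('book.txt') as f:
    text = f.read()

# NLTK's tokenizer handles punctuation for us -- no manual stripping needed
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalnum()]

# frequency distribution: the same word count we built by hand above
fdist = nltk.FreqDist(tokens)
print fdist.most_common(10)

# common two-word phrases (bigram collocations), a first step toward
# the "similar phrases" idea
finder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print finder.nbest(bigram_measures.pmi, 10)
```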