Cisco Spark APIs

I’m super excited that Cisco Spark finally gave us some APIs that we can develop into 3rd party apps.  As I’ve been doing more development and have several projects I need to do, we’ve been using Cisco Spark for collaboration.  Not having APIs was the one thing keeping me off of it for several projects.

Today I started doing the DevNet tutorials, but as I already know how to work with APIs I didn’t want to just do the Postman stuff, so I started writing some libraries in Go.  I posted what I’ve done so far up on GitHub.  There’s not too much there yet, but as I start to integrate things I expect I’ll have more to do on this.

One thing that has frustrated me is the inability to delete chat rooms.  I have one room that’s up that I can’t even log in to see what it is.  I wish this room would just go away.  I’m thinking this may be user error but wish I could figure it out more easily.

Finally, I should mention the DevNet tutorials are excellent to work with.  They have a promotion now of giving away Github credits if you finish the 3 tutorials.  I can’t wait to see what happens next 🙂

Finally Finally:  If you have Jenkins integration with Spark already done, please let me know!  I’d like to use it!


AWS Re:Invent 2015

AWS was an amazing conference.  All of my notes of the events I went to are here. (Scroll down to read the file)

Just some quick overall thoughts:

1.  Compared to Cisco, AWS really skips out on the food and entertainment.  I mean, come on, we had Aerosmith at Cisco Live and AWS gives us what?  I can’t even remember the name.  Doesn’t really matter, cause I went home that night anyway.

2.  This should be a longer event.  There were too many sessions I wanted to attend.  I was fortunate enough to attend an IoT bootcamp and that could have easily gone another day if they would have added some analysis.  I wish it would have.

3.  The announcements never stopped.  I lost count around 20, but there were a ton of new features and services.  Take Amazon snowball:  $200 to send 50TB into AWS.  Best comment on that?  Costs $1500 to move it out.  (50,000 * $0.03)

4.  The biggest surprise to me was hearing the number of customers that use the Cisco CSR1000v.  It’s not my product to know, so I don’t feel bad saying this.  I didn’t think there were so many users of it!  Wow.  The use case was “Transitive Routing”.  Imagine having 3 VPCs.  One of them is externally connected.  Placing one pair of CSR 1000vs in that externally connected VPC allows the other VPCs to communicate with each other using BGP internally.  Pretty cool.

5.  Everyone is in trouble.  When Amazon QuickSight was announced I thought:  Wow, if you’re into analytics in the cloud you are in trouble. I don’t know which companies may have been affected by that, but I suspect they are the tip of the iceberg.  Take New Relic for example.  Right now they are doing really well for admin analytics.  How long before AWS puts out a service to do that?

6.  What I was wondering about is if they were ever going to announce some sort of on-prem solution. The closest they got to that was Amazon Snowball, bless their hearts.  It probably doesn’t make sense for them to complicate things with that, and it raises the risk of intellectual property getting loose.  After all, these are Linux machines, and if a managed service happened on prem, that would be easy to get into.

7.  Look out Oracle!  Woah, that was some serious swinging.  And Oracle, you have a lot to worry about.  First of all, nobody I talk to really likes you.  People have nostalgic feelings for Sun but I’ve not really talked to people that like Oracle.  Perhaps that’s because I don’t talk to database administrators as much.  But guess what?  Nobody really likes them either.  So you have a hated product run by hated people.  Probably won’t take long for people to dump that when refresh season comes up.

8.  Lambda.  Last year, AWS introduced Lambda.  I don’t think people really get yet how important Lambda is.  It’s the glue that makes a serverless architecture in AWS work.  “The easiest server to manage is no server” said Werner Vogels.  This is the real future of the cloud.  Like my previous post on getting rid of the operating system said:  managing operating systems is the last mile.  VMs are a thing of the past.  Even containers are less exciting when you think about a serverless architecture.  Just a place to execute code and APIs to do all the work.  Database, storage, streaming, any service you want is just an API.  Where AWS Lambda fails in my book is that it’s limited to only AWS services.  Imagine if this were available to extend to any cloud service.  That to me would be the real Intercloud Cisco dreams about. As more cloud APIs develop, extending Lambda to an “API Store” is something more people would find value in.  Amazon probably wouldn’t because it means people using non-AWS services.  But this is where I would be investing if I were trying to compete against AWS.  Nothing else seems to be working.

Anyway, that’s my take.  What did you think?


Using Packer with Cisco Metapod

Packer is such a cool tool for creating images that are used with any cloud provider.  With the same packer recipe you can create common images on AWS, Digital Ocean, OpenStack, and more!  The Packer documentation already comes with an example of running with Metapod (formerly Metacloud).

I wanted to show an example of how I use Packer to create images for both AWS and OpenStack.  Without any further commentary, here is what it looks like:

There are two sections to this file:  the builders (AWS and OpenStack) and the provisioners (the code that gets executed on both of them).  We specify as many different platforms as we have and then tell it what to do.  When we are finished executing this script, our Jenkins slaves will be ready on all of our platforms.  This is super nice because now we just point our robots to whichever system we want and we have automated control!

With Metapod there were a few hooks:

  • By default images are not created with public IP addresses, so you need to grab one for it to work correctly.
  • Packer actually creates a running image but then logs into it via SSH to configure it.  So this is required if you are running this from outside the firewall!

Hybrid Cloud isn’t about moving workloads between two clouds, it’s about having tools that can operate, deploy, and tear down workloads on any cloud.  Packer is a great foundation tool because it creates the same image in both places.


It’s time to get rid of the operating system

We’ve abstracted many things, you and I.  But it’s time to crack the next nut.  Let me explain why.

Bare Metal

I started out managing applications on bare metal x86 and AIX servers.  In those days we were updating the operating systems to the latest releases, patching with security updates, and ensuring our dependencies were set up in such a way that the whole system would run.  In a way, it’s pretty amazing that the whole complex stack operated as well as it did.  So many dependencies, abstracted away so that we didn’t have to worry about the application opening up the token ring port and we could just use a trusted stack to send things on their way.  Life was good, but it sucked at the same time:  It was slow to provision servers, boot times were atrocious (especially when UEFI made them even longer), and it was monolithic and scary to touch.

I remember one company I visited had their entire business records (not backed up) on one server that was choking under the load.  I came in there and pxebooted that server, ran some partimage magic and migrated the server to another big beefier server.  It was the first cold migration I had ever performed.  I felt like an all powerful sorcerer.  I had intergalactic cosmic powers.  High Fives all around.


I started virtualization with Xen, then KVM.  VMware was something the Windows guys were doing, so I wasn’t interested.  Then I realized it was based on Red Hat Linux (it has since changed) so my powers were summoned again. I started helping get applications running on Virtual Machines.  I did the same thing:  I installed the operating system to the latest release.  I applied security patches, and then I made sure the latest dependencies were set up in such a way that the application would run.

But here my magical powers were obsolete.  People were desensitized to vMotion demos.  Oh, you had to shut off that machine to migrate it to another server?  We can migrate it while it’s still running.  Watch this, you won’t even drop a ping.  People started making the case that hardware was old news.  Why do you care which brand it is?

We could make new VMs quickly, more efficiently use our servers, (remember how much people said your power bill would drop?), and maintenance of hardware was easier.

All our problems were solved, as long as we paid VMware our dues.

Cloud IaaS

But VMware was for infrastructure guys.  The new breed of hipsters and brogrammers were like:  I don’t care what my virtualization is, nor my hardware; as long as it’s got an API, I can spin it up.  So we would start getting our applications working on AWS.  But here we as systems dudes started abusing Ruby to create super cool tools like Puppet or Chef and started preaching the gospel of automation.  And so when I would get applications running on the cloud, I would install the latest operating system, apply security patches, make sure my latest application dependencies were there and then run the application.

My magical powers of scripting came back in force.  I was a scripting machine.  Now I didn’t care about the hardware, nor the virtualization platform.  I just got things working.  All the problems were solved as long as I paid my monthly AWS bill.

Adrian Cockcroft told me and thousands of my closest friends at conferences that I didn’t have to worry about the cost, because I needed to optimize for speed.  If I optimized for speed projects would get done ahead of time so I would save money and because my projects were iterating quickly I would make more money.  We took our scripts and fed them to robots like Jenkins so we could try all our experiments.  We would take the brightest minds of our days and instead of having them work out ways to get people to Mars or find alternate power sources we would have them figure out how to get people to click on ads to make us money.  God bless us, everyone.

But we still had to worry about the operating system.

Cloud PaaS

We took a side look at PaaS for a second.  Because they said we wouldn’t have to care about the OS, nor the dependencies because they would manage it for us.  The problem was

1.  Our applications were artistic snowflakes.  We needed our own special libraries and kernel versions.  You think our applications are generic?  Yeah, it was great if we were setting up a wordpress blog, but we’re doing real science here:  We’re getting people to click on ads for A/B testing.  ‘Murica.  So your PaaS wasn’t good enough for us.  And our guys know how to do it better.

2.  We heard that it wouldn’t scale.  Never mind that we were using Python and Ruby, which were never really meant to scale into the atrocious applications that became Twitter and others.  Typed languages are so easy, so we used them.

So for our one-offs, we still use PaaS but for the most part, we still install operating systems to the latest versions, install security policies and patches, and ensure our dependencies are up so we can run the applications.

We weren’t supposed to worry about the operating system, but we did.

Containers and Microservices

A cute little whale, handcrafted with love in San Francisco, then stole our hearts by making it easy to do things that we were able to do years ago.  The immutable architecture with loads of services in separate pieces would come our way and save us from the monolith.  A container a day keeps the monolith away is what it says on the T-shirt they were handing out.  I got tons of cool stickers that made me feel like a kid again and I plastered them all over my computer.

I started breaking up monoliths.  At Cisco, our giant ordering tool based on legacy Oracle databases and big iron servers was broken up and each piece was more agile than the next.  We saw benefits.  Mesos and Kubernetes are the answers to managing them and Cisco’s Mantl project will even orchestrate that.  It’s really cool actually!

So how do I get a modern micro services application running today?  I create a Dockerfile that has the OS.  Then I do apt-get update to make sure all the dependencies are in place.  I use Mesos or Kubernetes to expose ports for security.  Then I make sure the dependencies are installed in the Operating system in the container.  And we’re off.

Mesos even has something called the Data Center Operating System (DCOS).  It runs containers.  But containers still run Operating Systems.  We’re still worrying about the operating system for our applications!

What’s Next? 

We’re still crafting operating systems.  We’re still managing them.  We’ve started down a journey of abstraction to deliver applications, but we haven’t cracked the final nut:  We need to make the operating system irrelevant, just as we’ve made the hardware and virtualization platform irrelevant.  The things we’re doing with scheduling across other containers used to be something the OS would deliver on a single box, but that’s not happening anymore due to the distributed nature of applications.

AWS has shown us Lambda, which is a great start in this direction.  It’s a system that just executes code.  There’s no operating system, just a configuration of services.  It’s a glimpse into the future of what the new modern day art of the possible will be.  As we start to break down these micro services deeper into nano services or just function calls, we need to get away from having to worry about an operating system and just have a platform that runs the various components of our micro service-ized application.

We’ve gotten leaner and leaner and the people that figure this out first, and give applications the best experience to run without requiring maintenance of operating systems will win the next battle of delivering applications.

We abstracted hardware, the virtualization platform, and our services.  Now it’s time to go to eleven:  Get rid of the operating system.

A Full Bitcoin Client

I’ve been using bitcoin for a few years now but have only used my own wallets, Coinbase, and some other stuff.  I thought I should make a full client and put it on the network!

I used Metacloud (Cisco OpenStack Private Cloud) and spun up an extra large Ubuntu 14.04 instance that we had!  I logged in and did the following:

From there I looked at the doc/ and followed the instructions. They worked perfectly!

Once that is done, you can start the bitcoin client by running:

It will tell you that you need to set a password and will suggest one for you.  Take the suggestion and create the file ~/.bitcoin/bitcoin.conf and copy the password in there.  It might look like:

Now you can start it:

This will take a while to download the blockchain.  You can see all the blocks as they’ll be downloaded to the ~/.bitcoin/blocks directory.

But now we have it!  A node in the bitcoin service!

Kubernetes on Metacloud (COPC)

The Kubernetes 1.0 launch happened July 21st at OSCON here in Portland, OR and I was super happy to be there in the back of the room picking up loads of free stickers while the big event happened.  I spent the day before at a Kubernetes bootcamp, which was really just a lab on using it on GCE (or GKE for containers) and it was pretty cool.  But now I felt I really should do a little more to understand it.

To install Kubernetes on Metacloud (or what Cisco now calls Cisco OpenStack Private Cloud) I’m using CoreOS.  I like CoreOS because it’s lightweight and built for containers.  There are a few guides out there, like the one on Digital Ocean, which was good but is already pretty outdated (not even a year old!).  Installing Kubernetes on CoreOS on OpenStack is pretty easy now!

I should note that I’m using Cisco OpenStack Private Cloud, but these steps can be used with any OpenStack distribution.  I followed most of the steps from the Kubernetes documentation.  (You’ll notice on their site that there are no instructions for OpenStack.  I opened an issue that I hope to help with.)

Anyway, the gist is here with all the instructions, but I’m more of the mindset to use Ansible.

Install Kubernetes

First download the cloud-init files that Kelsey Hightower created.  These make installing this super simple.  Get the master and the node.

We then create a master task that looks something like this:

Here I’m heavily using environment variables that should be defined elsewhere.  I call them out with a vars_file that has most of these.  The credentials are stored in the ~/.bash_profile and so live externally to the vars_file.  That’s where we keep our username, endpoints, and password.

You’ll have to have a CoreOS image already created in your cloud to use this.  I got mine from here.  Then I used glance and uploaded it.

The user_data points to the file that was created by the Kubernetes community; on boot it configures the parameters required for Kubernetes.

The minion nodes configuration is similar:

Note that you have to edit the node.yml file to point to the master (kube01 in our example).

At this point I’ve been a little lazy and didn’t go do the variable substitution.  Someday, I’ll get around to that.  But as a hint, since we registered ‘nova’ in the first task we can get the private IP address with this flag:

Just put that after the creation of the master.

The github repo for this is here.

Using Kubectl

Once our cluster is installed we can now run stuff.  I have a mac so I set it up like this following the instructions here:

Now we set the proxy up so we can run kubectl on our master:

Make sure that when you check your path, kubectl from /usr/local/bin shows up instead of maybe one from Google’s GCE stuff.

Check that it works by running:

Now we can launch something!  Let’s use the Hello World example on the Kubernetes documentation site.  Create this file and name it hello-world.yaml

Then create it:

There are several other examples as well and I encourage you to go to the official guides!  Let me know if this was helpful to you with a quick hello on Twitter!


Go with NX-API

I’ve been working on a project to collect data from Cisco Nexus switches.  I first tackled making SNMP calls to collect counter statistics but then I thought, why not try it with the NX-API that came with the Nexus 9ks?

The documentation for the APIs I hoped would be better, but the samples on github were enough to get anybody going… as long as you do it in Python.  But these days the cool systems programmers are moving to Go for several reasons:

  1. The concurrency capabilities will take your breath away.  Yes, I go crazy with all the goroutines and channels.  It’s a really cool feature in the language and great for when you want to run a lot of tasks in parallel.  (Like logging into a bunch of switches and capturing data maybe? )
  2. The binaries can be distributed without any hassle of dependencies.  Why does this matter?  Well, for example, if I want to run the python-novaclient commands on my machine, I have to first install python and its dependencies, then run pip to install the packages and those dependencies.  I’m always looking for more lube to make trying new things out easier.  Static binaries ease the friction.
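To make the concurrency point concrete, here is a minimal sketch of fanning out over a list of switches with goroutines and a channel.  The `pollSwitch` function and the addresses are hypothetical stand-ins, not code from the sticky pipe project; a real version would log in and collect data where the placeholder returns a string.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// result pairs a switch address with whatever the poll returned.
type result struct {
	Switch string
	Output string
}

// pollSwitch is a hypothetical placeholder for a real login-and-collect call.
func pollSwitch(addr string) string {
	return "polled " + addr
}

// pollAll starts one goroutine per switch and gathers the results from a channel.
func pollAll(switches []string) []result {
	// Buffer the channel so no goroutine blocks while sending its result.
	results := make(chan result, len(switches))
	var wg sync.WaitGroup
	for _, addr := range switches {
		wg.Add(1)
		go func(a string) {
			defer wg.Done()
			results <- result{Switch: a, Output: pollSwitch(a)}
		}(addr)
	}
	wg.Wait()
	close(results)

	var out []result
	for r := range results {
		out = append(out, r)
	}
	// Goroutines finish in no particular order, so sort for stable output.
	sort.Slice(out, func(i, j int) bool { return out[i].Switch < out[j].Switch })
	return out
}

func main() {
	for _, r := range pollAll([]string{"10.0.0.2", "10.0.0.1", "10.0.0.3"}) {
		fmt.Println(r.Switch, "->", r.Output)
	}
}
```

Because the work for each switch runs in its own goroutine, one slow switch does not hold up the others.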

After playing around with the switch I finally got something working so I thought I’d share it.  The code I developed is on my sticky pipe project.  For the TL;DR version of this post check out the function getNXAPIData for the working stuff.  The rest of this will walk through making a call to the switch.

1.  Get the parameters.

You’ll have to figure out a way to get the username and password from the user.  In my program I used environment variables, but you may also want to take command line variables.  There are lots of places on the internet you can find that so I’m not going into that with much detail other than something simple like:

2. Create the message to send to the Nexus

The NX-API isn’t a RESTful API.  Instead, you just enter Nexus commands like you would if you were on the command line.  The NX-API then responds with output back in JSON notation.  You can also get XML, but why in the world would you do that to yourself?  XML was cool like 10 years ago, but let’s move on people!  There’s a JSON RPC format, but I don’t get what it gives you that plain JSON doesn’t, other than flags for ordering things.  Stick with JSON and your life will not suck.

Here’s how we do that:

This format of a message seems to handle any JSON that we want to throw at the Nexus.  This is really all you need to send your go robots forth to manage your world.

3.  Connect to the Nexus Switch

I start by creating an http.NewRequest.  The parameters are

  • POST – This is the type of HTTP request I’m sending
  • The switch – This needs to be either http or https (I haven’t tried with https yet).  Then the switch IP address (or hostname) has to be terminated with the /ins directory.  See the example below.
  • The body of the POST request.  This is the bytes.NewBuffer(jsonStr) that we created in the previous step.

After checking for errors, now we need to set some headers, including the credentials.

This header also tells us that we are talking about JSON data.

Finally, we make the request and check for errors, etc:

That last line with the defer statement closes the connection after we leave this function.  Launching this should actually get you the command executed that you are looking to do.  Closing is important cause if you’re banging that switch a lot, you don’t want zombie connections blocking your stack. Let him that readeth understand.
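Putting the request, the headers, and the deferred close together, a minimal sketch looks like the following.  It talks to a fake switch built with `net/http/httptest` so it runs without hardware; `postCommand`, `fakeSwitch`, and the admin/secret credentials are all made up for the example, and basic auth is an assumption about how your switch expects credentials.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// postCommand sends an already-encoded NX-API message to url using basic auth
// and returns the raw response body.
func postCommand(url, user, pass string, jsonStr []byte) ([]byte, error) {
	req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonStr))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.SetBasicAuth(user, pass)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	// Close the body when we leave so connections are not left hanging.
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// fakeSwitch is a stand-in for a Nexus: it checks the path and credentials
// and echoes the request body back inside a small JSON document.
func fakeSwitch() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		user, pass, ok := r.BasicAuth()
		if !ok || user != "admin" || pass != "secret" || r.URL.Path != "/ins" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		body, _ := io.ReadAll(r.Body)
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprintf(w, `{"echo":%q}`, string(body))
	}))
}

func main() {
	srv := fakeSwitch()
	defer srv.Close()

	// Against a real switch the URL would be http://<switch>/ins.
	out, err := postCommand(srv.URL+"/ins", "admin", "secret", []byte(`{"ins_api":{}}`))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(string(out))
}
```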

4.  Parse the Output

At this point, we should be able to get something to talk to the switch and have all kinds of stuff show up in the resp.Body.  You can see the raw output with something like the following:

But most likely you’ll want to get the information from the JSON output.  To do that I created a few structures that this command should respond back with nearly every time.  I put those in a separate package called nxapi.  Then I call those from my other functions as will be shown later.  Those structs are:

These structs map to what the NX-API usually returns in JSON:

The outputs may also be an array if there are multiple commands entered.  (At least that’s what I saw via the NX-API developer sandbox.)  If there is an error, then instead of Body for the output you’ll see “clierror”.

So this should be mostly type safe.  It may be better to omit the Body from the type Output struct.

Returning to our main program, we can get the JSON data and load it into a struct where we can parse through it.

In the above command, I’m looking to parse output from the “show version” command.  When I find that the input was the show version command, then I can use the keys from the body to get information from what was returned to us by the switch.  In this case the output will give us the hostname of the switch.
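As a rough illustration of the whole parse step, here is a minimal sketch that decodes a reply and pulls out the hostname.  The struct shapes and the `host_name` key are assumptions about a single-command reply, and the sample JSON is hand-written and abbreviated, not captured from a switch.  It does not handle the array form of the outputs or the “clierror” case.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Output is an assumed shape for one command's result.
type Output struct {
	Input string                 `json:"input"`
	Msg   string                 `json:"msg"`
	Code  string                 `json:"code"`
	Body  map[string]interface{} `json:"body"`
}

// Outputs wraps a single Output (the multi-command array form is not handled).
type Outputs struct {
	Output Output `json:"output"`
}

// InsAPI is the assumed top-level reply envelope.
type InsAPI struct {
	Type    string  `json:"type"`
	Version string  `json:"version"`
	SID     string  `json:"sid"`
	Outputs Outputs `json:"outputs"`
}

// Response is the whole reply document.
type Response struct {
	InsAPI InsAPI `json:"ins_api"`
}

// sample is a hand-written, abbreviated reply to "show version".
const sample = `{"ins_api":{"type":"cli_show","version":"1.0","sid":"eoc",
 "outputs":{"output":{"input":"show version","msg":"Success","code":"200",
 "body":{"host_name":"n9k-lab"}}}}}`

// hostname decodes a reply and returns host_name from a "show version" body.
func hostname(raw []byte) (string, error) {
	var r Response
	if err := json.Unmarshal(raw, &r); err != nil {
		return "", err
	}
	out := r.InsAPI.Outputs.Output
	if out.Input != "show version" {
		return "", fmt.Errorf("unexpected command %q", out.Input)
	}
	name, ok := out.Body["host_name"].(string)
	if !ok {
		return "", fmt.Errorf("host_name missing from body")
	}
	return name, nil
}

func main() {
	name, err := hostname([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("hostname:", name)
}
```

Keeping Body as a generic map means the same structs work for any show command; only the keys you look up change.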


This brief tutorial left out all the goroutines and other fanciness of Go in order to make it simple.  Once you have this part, you are ready to write a serious monitoring tool or configuration tool.  Armed with this, you can now make calls to the NX-API using Go.  Do me a favor and let me know on Twitter if this was useful to you!  Thanks!


Notes from Dockercon 2015


I was fortunate enough to be able to attend my first Dockercon this year in San Francisco.  I wanted to write up a few thoughts I had while attending the conference.

1.  The Marketing

Docker and other startups win this so well.  Every logo is cute and the swag made for good things to take home to my kids.  But seriously, the docker whale made out of legos, the docker plush toy distributed at the day 2 keynote, the lego docker kits, the stickers, the shirts!  Wow!  I think the work that the team has done is fantastic.  When I compare that to the work we do at Cisco with a product called “Cisco OpenStack Private Cloud” I just shudder to think that we could do a lot better.

I want to call out especially the work that Laurel does for Docker.  The design, the comics, everything just worked so well and I thought this was by far the star of the show.  She was even nice enough to respond to some questions I had on Twitter!

I will say this though.  There were lots of booths stacked with flashy logos, cool T-shirts and stickers, but they may have been more frosting than cake.  I decided I needed to invent a product and call it Cisco Spud and make a cute potato logo and see if I could get some interest around here.

2.  Microsoft

Microsoft’s demo was beyond impressive.  If it works like it showed in the demo then this is something Microsoft developers can be really excited about.  The demo showed a fully integrated solution of running Visual Studio on a Mac, then submitting through Microsoft’s own continuous integration deployment, all with containers.  The demo then went on to show containers running in Azure.  Microsoft’s booth was full of swag showing love for Linux via Docker containers.  Good show by Microsoft!

I’ll add one more thing here:  Microsoft said they were the number one contributor to Docker since last April.  Now, why do you think that is?  Pretty simple:  Lots of Windows code.  It’s funny how you can spin something that is in your own self-serving interest as something that is good for the community.


3.  What are people paying for?

It was pretty obvious from this conference and from a previous talk given by Adrian Cockcroft of Battery Ventures at DockerCon EU that people are not willing to pay for middleware like Docker.  I would extend that to say people don’t seem to be willing to pay for plumbing.  There were several networking companies I spoke with including Weave Networks where they are basically giving away their open source networking stacks for people to use.  That doesn’t bode well for a company like Cisco that makes its money on plumbing.  So what are people paying for and what can we learn from DockerCon?

  1. Subscriptions to Enterprise versions of Free things.  People are paying for subscription services and support like RedHat has shown.  Docker introduced its commercial trusted registries for businesses.  This is great for people who need a little hand holding and want a nice supported version of the Registry.  It’s not too hard for an organization to just spin one of these up themselves (as I showed in a previous blog post) but that is fraught with security problems and cumbersome to secure.  Consumption Economics FTW.  But it seems the key is to launch a successful open source product and then offer the commercial support package.
  2. Logging & Analytics.  As shown by Splunk and others, people are still willing to pay to visualize the logs and data, and to manage all the overwhelming information.  I thought this slide shown by Ben Golub was insightful for the enterprise.  People are looking to harness big data, logging, analytics.  I was surprised to see how high HortonWorks was in this.  There were several visualization companies in the booths, and I’m sad I didn’t have time to talk to all of them.
  3. Cloud Platforms and Use based Services.  This should be no surprise, but what was surprising was the number of talks I attended where Docker was used on Prem in customers’ own data centers.  I was half expecting this conference to be an AWS love fest as well, but it wasn’t.  With Azure’s show of containers in the marketplace and AWS’s continued development of ECS we have a sure place that companies can make money:  Offering a cloud platform where people can run these darn things!

Maybe there were other things you noticed there that people were willing to pay for? Not T-shirts!  Those were given out as freely as the software!

4.  The future of PaaS

I’ve been a strong proponent of how current PaaS platforms are doomed and already irrelevant.  I think Cloud Foundry and OpenShift may have some relevance today, but I certainly see no need for them (yes, I’m myopic, yes, I lack vision, fine).  Instead Containers provide the platform as a Service required.  While several PaaS vendors were on site to show their open source wares, I just don’t get why I need it when I can just have a set of containers available and people can use those.  This was further cemented by the demonstration of Docker’s Project Orca.


Orca made me quickly forget all of Microsoft’s shiny demos.  This was what I was really expecting to see unveiled at DockerCon: The vCenter of Docker containers.  While this is still in locked down mode, the demo was great.  It had a lot of the features you’d want to be able to see where your containers are, what they’re running, etc. If there were a mode where users could login and then get a self service view of containers and what they can launch, this would be all you needed from a PaaS.  Maybe that’s what OpenShift and Cloud Foundry do today with a lot of extra bloat, but I am expecting big things from this project as well as another monetization stream for Docker.  As Scott Johnston said when announcing the commercial offerings:  “There’s got to be a commercial offering in here somewhere!”  I suspect this one could eventually lead to even greater revenues.

The Vision

I had a front row seat as you can see :-)

I enjoyed the opening keynote presentation by Solomon Hykes best.  He laid out 3 Goals and some subgoals while introducing various things you’ve probably already read about (appC, Docker Networking, etc. )

Goal 1: Program the Internet for the Next 5 years.

  1. Runtime: a container (Docker Engine)
  2. Package and Distribution (Docker Hub)
  3. Service Composition: (Docker Compose previously loved as Fig)
  4. Machine Management (Docker Machine)
  5. Clustering (Docker Swarm)
  6. Networking (Docker Network)
  7. Extensibility (Docker Plugins)

Goal 2: Focus on Infrastructure Plumbing

  1. Break up the Monolith that is Docker, introduced RunC (which is Docker Engine, which is only 5% of existing code)
  2. The Docker Plumbing project will take a long time, but will be useful.  Make things using the Unix way:  Small, simple, single purpose built tools.
  3. Docker Notary: Secure the plumbing.  How can we do better downloads.  I was sad to see this didn’t use the BlockChain method, but maybe that’s cause I’m too much of a BitCoin zealot.

Goal 3: Promote Open Standards

This part was great.  This is where CoreOS and Docker kissed and made up on stage.  I loved the idea of an open container project, and I loved how every titan and their spinoff was a logo on it lending their support.

runC is now the default container format, and I’m expecting big things as we move forward with it.


This was a really neat conference to attend.  What I liked best was talking to the strangers at some of the tables I dined at.  I wished I could have done more of it.  I’m not in it for the networking of people, but selfishly for the networking of ideas.  I asked a lot of questions:  Are you running a private registry?  How are you securing it?  Are you running in production?  What are you currently working on?  What are you trying to solve?

It’s hard though, cause I’m not a complete extrovert.  I’m more of an extroverted introvert. I’m also sad to say I didn’t do enough of it, and that will be my resolve for the next conference:  Connect with more strangers!


Getting Started with AWS EMR

AWS Elastic Map Reduce (EMR) is basically a front end to an army of large EC2 instances running Hadoop.  The idea is that it gets its data from S3 buckets, runs the jobs, and then stores the results back in S3 buckets.  I skimmed through a book on doing it, but didn’t get much out of it.  You are better off learning cool algorithms and general theory instead of specializing in EMR.  (Plus the book was dated.)  Also, the EMR interface is pretty intuitive.

To get data up to be worked on, we first have to upload it to S3.  I used s3cmd, but you could use the web interface as well.  I have a Mac, so I ran the below scripts to install and configure the command line:

The configure command should have tested to make sure you have access.  Once you do, you can create a bucket. I’m going to make one for storing sacred texts.

Now let’s upload a few files

Side Note:

How much does it cost to store data? Not much.  According to the pricing guide, we get charged $0.03 per GB per month.  Since this data so far isn’t even over 1GB, we’re not hurting.  But then there are also the request charges.  GET requests are $0.004 per 10,000 requests.  Since I’m not going to be making many of those, we should be ok.  There’s also the data transfer pricing.  To transfer into AWS it’s free.  To transfer out (via the Internet) the first GB each month is free.

You can see how this can add up.  Suppose I had 300 TB of data.  It costs $0 to put in, but then costs $8,850/month (300,000GB * $0.0295/GB) to sit there. That adds up to $106,200/yr.  If you wanted to take that out of AWS, then it costs $15,000 to move it (300,000GB * $0.050/GB).
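To make the arithmetic easy to rerun with your own numbers, here is a minimal sketch in Python.  The rates are just the 2015 per-GB figures quoted above, and the function and constant names are my own, not anything from AWS.

```python
# Rough S3 cost estimate using the per-GB rates quoted above (2015 figures).
STORAGE_PER_GB_MONTH = 0.0295   # storage rate used for the 300 TB example
TRANSFER_OUT_PER_GB = 0.050     # rate for moving data out over the Internet


def s3_costs(gigabytes):
    """Return (monthly storage, yearly storage, one-time transfer out) in dollars."""
    monthly = gigabytes * STORAGE_PER_GB_MONTH
    yearly = monthly * 12
    transfer_out = gigabytes * TRANSFER_OUT_PER_GB
    return monthly, yearly, transfer_out


monthly, yearly, transfer_out = s3_costs(300_000)  # 300 TB expressed in GB
print(f"storage: ${monthly:,.0f}/month, ${yearly:,.0f}/year")
print(f"transfer out: ${transfer_out:,.0f}")
```

Real AWS pricing is tiered by volume, so treat a single flat rate like this as a ballpark only.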

Creating EMR cluster and running

Now let’s create an EMR cluster and run it.  EMR is really just a front end for people to launch jobs on preconfigured EC2 instances.  It’s almost like a PaaS for Hadoop / Spark / etc.  The nice thing is that it comes with useful tools and special language processing tools like Pig.  (Things that Nathan Marz discourages us from using.)

First create a cluster.



We chose the sample application at the top that does word count for us.  We then modified it by telling it to read from our own directory (s3://sacred-texts/texts/).  This will then load all of our texts and get the word count of each of the files.

The cluster then provisions and we wait for the setup to complete.  The setup takes a lot longer than the actual job takes to run!  The job soon finishes.

Once done we can look at our output.  It’s all in the S3 bucket we told it to go to.  Traversing the directory, we find a bunch of output files.

Each one of these is a word count of each of the parts: (some of part-0000 is shown below)

This is similar to what we did in the previous post, but we used Hadoop and we did it over more files than just one piece of text.  We also wrote no code to do this.  However, it’s not giving us the most meaningful information.  In fact, this output doesn’t give us the combined info.  To do that, we can process it by combining it and then using a simple Unix sort on it:
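If you would rather do the combining in Python than with shell tools, a minimal sketch might look like the following.  It assumes the part files have been downloaded to a local directory and that each line is a word followed by whitespace and a count.  I haven't verified that format against the EMR sample's actual output, and the function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def combine_counts(lines):
    """Sum the counts for each word across all output lines."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts
        totals[word] += int(count)
    return totals


def read_parts(directory):
    """Yield every line from the part-* files in a local directory."""
    for path in sorted(Path(directory).glob("part-*")):
        with path.open(encoding="utf-8") as handle:
            yield from handle


# Tiny inline sample standing in for lines from two downloaded part files.
sample = ["the\t3", "lord\t2", "the\t4", "and\t5"]
for word, count in combine_counts(sample).most_common(3):
    print(count, word)
```

For real output you would call combine_counts(read_parts("output")), where "output" is whatever local directory you synced the bucket contents into.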

Now there are so many questions we could start asking once we have the data and the computing power to help us ask them.  For example, we can search Twitter for ‘happy because’ and find out what people are happy about.  Or ‘bummed’ or ‘sad because’ and find out why people are sad using simple word counts.

At the end I deleted all my stuff on S3.

I had to clear the logs as well.  How much did this cost?

Well, I had to do it three times to get it right.  Each time it launched a cluster of three m3.xlarge instances.  As standard EC2 instances these cost $0.280/hr each, and EMR adds its own charge of $0.07/hr on top of that, so each instance-hour comes to $0.35.  So 9 * 0.35 = $3.15 to try that out.

You can see how this can be a pretty compelling setup for small data and for experimenting.  This is the main point.  Experimenting is great with EMR, but when it comes to any scale of infrastructure, the costs can get high pretty quickly.  Especially if you are always churning the data and constantly creating new batches with jobs running all the time as new data comes in.

If you are curious, I put the data out on GitHub.  Also, to note, the total cost for this experiment was about $3.18 ($3.15 for EMR instances + $0.03 for S3 storage). Still a pretty cheap way to get into the world of big data!

Data Analysis

Data analytics is the hottest thing around.  Knowing how to structure and manipulate data, and then find answers is the sexiest job out on the market today!


Before getting into cool things like Apache Spark, Hadoop, or using EMR, let’s just start out with a basic example:  Word count on a book.

Project Gutenberg has a ton of books out there.  I’m going to choose one and then just do some basic text manipulation.  Not everything needs big data and there’s a lot you can do just from your own laptop.

I’m going to borrow from a great post here and do some manipulation on some text I found.  I’m not going to use any Hadoop.  Just plain Python.

I’ll use the same script, but I stripped out the punctuation with some help from Stack Overflow:
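A minimal sketch of a mapper along those lines might look like this.  It assumes Python 3, tab-separated output, and that only ASCII punctuation needs stripping.  The names map_words and STRIP_PUNCTUATION are mine for illustration, not from the original script.

```python
import string

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def map_words(lines):
    """Yield one 'word<TAB>1' record per word, with punctuation removed."""
    for line in lines:
        for word in line.translate(STRIP_PUNCTUATION).split():
            yield f"{word}\t1"


# Small inline sample; in a pipeline you would pass sys.stdin instead.
sample = ["Hello, world! Hello again."]
for record in map_words(sample):
    print(record)
```

Keeping the logic in a function that takes any iterable of lines makes it easy to test on a short list before pointing it at a whole book.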


Here’s the pass:

We’re simply printing all the text and then piping to our mapper program and then storing the results in the bom-out.txt file.

So now we have bom-out.txt which is just a word with a count next to it:

foo 1

Now we need to sort it.

So now we have bom-sorted-out.txt.  So next up, we do the word count.
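The counting step only works because the input is sorted: identical words sit next to each other, so we can add up each run as we go.  Here is a minimal sketch of that reducer idea, assuming tab-separated "word, count" records.  The function name is hypothetical.

```python
from itertools import groupby


def reduce_counts(sorted_records):
    """Sum counts for runs of identical words in sorted 'word<TAB>count' records."""
    pairs = (record.rstrip("\n").split("\t") for record in sorted_records)
    pairs = (pair for pair in pairs if len(pair) == 2)  # drop malformed lines
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


# Inline sample; sorted() stands in for the sort step on the mapper output.
sample = sorted(["foo\t1", "bar\t1", "foo\t1"])
for word, total in reduce_counts(sample):
    print(f"{word}\t{total}")
```

If the input were not sorted, the same word would show up in several separate runs and get several partial totals, which is why the sort comes first.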

This gives us the output, but now let’s see which word is used the most.  This is another sort.

This gives some of the usual suspects:

We could probably do better if we were to make it so case didn’t matter.  We could then also do it in a one pass script.  Let’s try it.

Version 2

In the mapper script we change the last line to be:

This way it spits everything out in lowercase.  Now to run the script in one line, we do the following:

Now our output looks a little different:

What we’ve shown here is the beginning of what things like Hadoop do for us.  We have unstructured data and we apply two operations:  Map:  This is where we emit each word with a count of 1.  Reduce:  This is where we add up those counts to get how many times each word was used.  In this case our data set wasn’t too huge and could be done on our laptop.
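When everything fits in memory, the map and reduce steps can collapse into a few lines.  This is a minimal sketch, not the script from this post.  It assumes Python 3 and ASCII punctuation, and word_count is a name I made up.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def word_count(lines):
    """Count case-insensitive word occurrences across all lines."""
    totals = Counter()
    for line in lines:
        # Map step: normalize the line and split it into words.
        words = line.lower().translate(STRIP_PUNCTUATION).split()
        # Reduce step: add each occurrence to that word's running total.
        totals.update(words)
    return totals


sample = ["The quick fox.", "the lazy dog; THE end"]
for word, total in word_count(sample).most_common(1):
    print(total, word)
```

The Counter plays the role of the sort plus the reducer.  Hadoop earns its keep once the data no longer fits on one machine.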

Here’s another book:

There are still a few problems but this seems to work well.  The next step is to use a natural language processing kit and find similar phrases.  We then could dump this into HDFS and process all kinds of books.  Lots of interesting places to go from here!

Last note: I did upload the data and scripts to GitHub.