Category Archives: UCS

FCoE with UCS C-Series

I have a C210 in my lab that I want to turn into an FCoE storage target.  I’ll write more on that in another post.  The first challenge was to get it up with FCoE.  It’s attached to a pair of Nexus 5548s.  I installed RedHat Linux 6.5 on the C210 and booted up.  The big issue I had was that even though RedHat Linux 6.5 comes with the fnic and enic drivers, the FCoE login never happened.  It wasn’t until I installed the updated drivers from Cisco that I finally saw a flogi.  But there were other tricks needed to make the C210 actually work with FCoE.


The first step is to log in to the CIMC (with the machine powered on) and configure the vHBAs. From the GUI go to:

Server -> Inventory

Then in the work pane, select the ‘Network Adapters’ tab, then down below select vHBAs.  Here you will see two vHBAs by default.  From here you have to set the VLAN that each vHBA will go over: click ‘Properties’ on the interface and select the VLAN.  I set the MAC address to ‘AUTO’ based on a TAC case I looked at, but this never persisted.  From there I entered the VLAN: VLAN 10 for the first interface and VLAN 20 for the second interface.  VLAN 10 matches the FCoE VLAN and VSAN that I created on the first Nexus 5548.  On the other Nexus I created VLAN 20 to match FCoE VLAN 20 and VSAN 20.

This then seemed to require a reboot of the Linux Server for the VLANs to take effect.  In hindsight this is something I probably should have done first.

RedHat Linux 6.5

This needs to have the Cisco drivers for the fnic.  You might want to install the enic drivers as well.  I got these from Cisco.  I used the B series drivers, and it was a 1.2GB file that I had to download in full to get a 656KB driver package.  I installed the kmod-fnic- RPM.  I had a customer who had updated to a later kernel, and he had to install the kernel-devel RPM and recompile the driver.  After it came up, it worked for him.

With the C210 I wanted to bond the 10Gb NICs into a vPC.  So I did an LACP bond with Linux.  This was done as follows:

Created file: /etc/modprobe.d/bond.conf

alias bond0 bonding
options bonding mode=4 miimon=100 lacp_rate=1

Created file: /etc/sysconfig/network-scripts/ifcfg-bond0


Edited the /etc/sysconfig/network-scripts/ifcfg-eth2


Edited the /etc/sysconfig/network-scripts/ifcfg-eth3


Next restart the network and you should have a bond. You may need to restart this after you configure the Nexus 5548 side.

service network restart
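
The contents of the three ifcfg files are not shown above, so here is a rough idea of what a RHEL 6 LACP bond usually looks like, rendered from a small Python script.  This is a sketch under assumptions, not the exact files from this lab: the IP address and netmask are placeholders, and only the device names (bond0, eth2, eth3) and the bonding options come from the steps above.

```python
# Sketch: render RHEL 6 style ifcfg files for an LACP bond.
# The IP address and netmask are placeholders (assumptions);
# bond0, eth2, eth3 and the bonding options come from the text above.

BOND_OPTS = "mode=4 miimon=100 lacp_rate=1"


def bond_ifcfg(device, ipaddr, netmask):
    """Return the text of the ifcfg file for the bond master."""
    lines = [
        f"DEVICE={device}",
        "ONBOOT=yes",
        "BOOTPROTO=none",
        "USERCTL=no",
        f"IPADDR={ipaddr}",
        f"NETMASK={netmask}",
        f'BONDING_OPTS="{BOND_OPTS}"',
    ]
    return "\n".join(lines) + "\n"


def slave_ifcfg(device, master):
    """Return the text of the ifcfg file for one member NIC."""
    lines = [
        f"DEVICE={device}",
        "ONBOOT=yes",
        "BOOTPROTO=none",
        "USERCTL=no",
        f"MASTER={master}",
        "SLAVE=yes",
    ]
    return "\n".join(lines) + "\n"


def render_bond_files(ipaddr="192.0.2.10", netmask="255.255.255.0"):
    """Return a dict of file name -> file contents for bond0, eth2, eth3."""
    files = {"ifcfg-bond0": bond_ifcfg("bond0", ipaddr, netmask)}
    for nic in ("eth2", "eth3"):
        files[f"ifcfg-{nic}"] = slave_ifcfg(nic, "bond0")
    return files


if __name__ == "__main__":
    for name, body in render_bond_files().items():
        print(f"# /etc/sysconfig/network-scripts/{name}")
        print(body)
```

BONDING_OPTS here simply repeats the options from bond.conf; setting them in one place is normally enough.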

Nexus 5548 Top
Log in and create the vPCs.  Also don’t forget to set up the jumbo MTU system class.  I use this for jumbo frames in the data center.

policy-map type network-qos jumbo
  class type network-qos class-default
    mtu 9216
system qos
  service-policy type network-qos jumbo

One thing that drives me crazy is that you can’t do sh int po 4 to see the jumbo MTU. According to the documentation, you have to do

sh queuing int po 4

to see that your jumbo frames are enabled.

The C210 is attached to Ethernet port 1 on each of the switches.  Here’s the configuration.

The Ethernet interface:

interface Ethernet1/1
  switchport mode trunk
  switchport trunk allowed vlan 1,10
  spanning-tree port type edge trunk
  channel-group 4

The port channel:

interface port-channel4
  switchport mode trunk
  switchport trunk allowed vlan 1,10
  speed 10000
  vpc 4

As you can see, VLAN 10 carries the FCoE traffic for VSAN 10. We need to create the VSAN and map the VLAN to it.

feature fcoe
vsan database
  vsan 10
vlan 10
  fcoe vsan 10

Finally, we need to create the vfc for the interface:

interface vfc1
  bind interface Ethernet1/1
  switchport description Connection to NFS server FCoE
  no shutdown
vsan database
  vsan 10 interface vfc1
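
Since the same stanzas get repeated on each switch with only the VLAN and VSAN numbers changing, they are easy to template.  The sketch below just stitches together the commands shown above; the helper name and its parameters are hypothetical, and it leaves out the description line.

```python
# Sketch: render the FCoE VLAN/VSAN and vfc stanzas for one Nexus.
# The commands mirror the configuration above; the helper name and
# its parameters are made up for this example.


def fcoe_config(vlan, vsan, vfc, bind_interface):
    """Return NX-OS config text that maps an FCoE VLAN to a VSAN and
    binds a virtual Fibre Channel interface to an Ethernet port."""
    return "\n".join([
        "feature fcoe",
        "vsan database",
        f"  vsan {vsan}",
        f"vlan {vlan}",
        f"  fcoe vsan {vsan}",
        f"interface vfc{vfc}",
        f"  bind interface {bind_interface}",
        "  no shutdown",
        "vsan database",
        f"  vsan {vsan} interface vfc{vfc}",
    ])


if __name__ == "__main__":
    print(fcoe_config(vlan=10, vsan=10, vfc=1, bind_interface="Ethernet1/1"))
```

Calling it with vlan=20 and vsan=20 gives the equivalent stanzas for a second fabric.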

Nexus 5548 Bottom
The other Nexus has a similar configuration.  The difference is that instead of VSAN 10 and VLAN 10, we use VSAN 20 and VLAN 20, and bind the FCoE to VSAN 20.  In the SAN world, we don’t cross the streams.  You’ll see that the VLANs are not the same in the two switches.

Notice that in the output below, neither VLAN 20 nor VLAN 10 is defined on the peer link, so you’ll only see VLAN 1 enabled on the vPC:

N5k-bottom# sh vpc consistency-parameters interface po 4

Type 1 : vPC will be suspended in case of mismatch

Name                        Type  Local Value  Peer Value
--------------------------  ----  -----------  -----------
Shut Lan                    1     No           No
STP Port Type               1     Default      Default
STP Port Guard              1     None         None
STP MST Simulate PVST       1     Default      Default
mode                        1     on           on
Speed                       1     10 Gb/s      10 Gb/s
Duplex                      1     full         full
Port Mode                   1     trunk        trunk
Native Vlan                 1     1            1
MTU                         1     1500         1500
Admin port mode             1
lag-id                      1
vPC card type               1     Empty        Empty
Allowed VLANs               -     1            1
Local suspended VLANs       -     -            -

But on the individual nodes you’ll see that the VLAN is enabled in the vPC. VLAN 10 is carrying storage traffic.

# sh vpc 4

vPC status
id  Port  Status  Consistency  Reason   Active vlans
--  ----  ------  -----------  -------  ------------
4   Po4   up      success      success  1,10


How do you know you succeeded?

N5k-bottom# sh flogi database
vfc1 10 0x2d0000 20:00:58:8d:09:0f:14:c1 10:00:58:8d:09:0f:14:c1

Total number of flogi = 1.

You’ll see the login. If not, then try restarting the interface on the Linux side. You should see a different WWPN in each Nexus. Another issue you might have is that the VLANs may be mismatched, so make sure each vHBA’s VLAN matches the FCoE VLAN on the Nexus it connects to.
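
If you want to check for the login from a script, the flogi output is easy to pick apart.  The sketch below assumes each data row has five whitespace-separated fields (interface, VSAN, FCID, port name, node name), which matches the row shown above; the function name is made up for this example.

```python
# Sketch: parse `sh flogi database` rows and list who has logged in.
# Assumes each data row has five whitespace-separated fields:
# interface, VSAN, FCID, port name (WWPN), node name (WWNN).


def parse_flogi(output):
    """Return one dict per fabric login found in the command output."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly 5 fields and an FCID starting with 0x.
        if len(fields) == 5 and fields[2].startswith("0x"):
            iface, vsan, fcid, wwpn, wwnn = fields
            entries.append({
                "interface": iface,
                "vsan": int(vsan),
                "fcid": fcid,
                "wwpn": wwpn,
                "wwnn": wwnn,
            })
    return entries


SAMPLE = """\
vfc1 10 0x2d0000 20:00:58:8d:09:0f:14:c1 10:00:58:8d:09:0f:14:c1

Total number of flogi = 1.
"""

if __name__ == "__main__":
    for entry in parse_flogi(SAMPLE):
        print(entry["interface"], entry["vsan"], entry["wwpn"])
```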

Let me know how it worked for you!

Changing UCS IP addresses

I have a UCS lab machine that I sometimes take to different locations for proof of concept work.  One of the things I regularly have to do is change the IP addresses and hostname.  Here’s how you do it on the command line:

KCTest-A# scope fabric-interconnect a
KCTest-A /fabric-interconnect # set out-of-band ip netmask gw
Warning: When committed, this change may disconnect the current CLI session
KCTest-A /fabric-interconnect* # scope fabric-interconnect b
KCTest-A /fabric-interconnect* # set out-of-band ip netmask gw
Warning: When committed, this change may disconnect the current CLI session
KCTest-A /fabric-interconnect* # scope system
KCTest-A /system* # set virtual-ip
KCTest-A /system* # set name ccielab
KCTest-A /system* # commit-buffer


It’s great because you can change the IP addresses on each fabric interconnect, the virtual IP, and the hostname in one shot.
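
If you move the lab around a lot, it can help to generate the command list for each site.  The following is only a sketch: every address is a placeholder from the documentation range, the function name is made up, and the argument order simply follows the commands in the session above.

```python
# Sketch: build the UCS Manager CLI lines for re-addressing a domain.
# All addresses passed in the example are placeholders; substitute
# the values for your own site.


def readdress_commands(fi_a_ip, fi_b_ip, netmask, gateway, virtual_ip, name):
    """Return the CLI lines, in order, ending with a single commit."""
    return [
        "scope fabric-interconnect a",
        f"set out-of-band ip {fi_a_ip} netmask {netmask} gw {gateway}",
        "scope fabric-interconnect b",
        f"set out-of-band ip {fi_b_ip} netmask {netmask} gw {gateway}",
        "scope system",
        f"set virtual-ip {virtual_ip}",
        f"set name {name}",
        "commit-buffer",
    ]


if __name__ == "__main__":
    for line in readdress_commands(
        "192.0.2.11", "192.0.2.12", "255.255.255.0",
        "192.0.2.1", "192.0.2.10", "ccielab",
    ):
        print(line)
```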

Source of docs


Cloud Computing: How Do I Get There?

This post comes from a talk that I’ll be presenting at the Pacific Northwest Digital Government Summit Conference on October 2nd, 2013.

History shows us that those who embrace technology and change survive, while those who resist and stick with “business as usual” get left behind.  If we have the technology and we don’t use it to make IT look like magic, then we’re probably doing it wrong. (Read “The Innovator’s Dilemma” and Clarke’s Three Laws.)

I’ll be talking mainly about private cloud today, but many of these ideas can be taken into the public cloud as well.

Optimizing ROI on your Technology

My friend tells a story about when his wife first started using an iPhone.  To get directions on a map she’d open up Safari and go to a maps website.  To check Facebook she would open Safari and go to the Facebook website.  To check her mail she’d open up Safari again and navigate to her webmail site.  You get the idea.

She was still getting great use of her iPhone.  She could now do things she could never do before.  But there was a big part she was missing out on.  She wasn’t using the App ecosystem that makes all of these things easier and delivers a richer experience.

Today, most organizations have virtualization in the data center.  Because of this, IT is able to do things it has never been able to do before.  They’re shrinking their server footprints to once unimaginable levels, saving money in capital and management costs.  I’ve been in many data centers where people proudly point to where rows of racks have been consolidated to one UCS domain with only a few blades.  It’s pretty cool and very impressive.

But they’re missing something as big as the App Store.  They’re missing out on the APIs.  This is where ROI is not being optimized in the data center in a big way.

IT is shifting (or has shifted) to a DevOps model. DevOps means that your IT infrastructure team is more tightly aligned with your developers/application people.  This is a management perspective.  But from a trenches perspective, the operations team is now turning into programmers.  Programmers of the data center.  The guy that manages the virtual environment, the guy who adds VLANs to switches, or the guy who creates another storage LUN: they’re all being told to automate and program what they do.

The group now treats the IT infrastructure like an application that is constantly adding features and doing bug fixes.

The programming of the IT infrastructure isn’t done in compiled languages like Java, C, or C++.  It’s done in interpreted languages like Python, Ruby, Bash, PowerShell, etc.  But the languages alone don’t get you there.  You need a framework.  This is where things like Puppet or Chef come into play.  In fact, you can even look at it like you’re programming a data center operating system.  This is where OpenStack provides you a framework to develop your data center operating system.  It’s analogous to the Web Application development world.  Twitter was originally developed in Ruby using a framework called Ruby on Rails.  (Twitter has since moved off Ruby on Rails).

Making this shift gives you unprecedented speed, agility, and standardization.  Those that don’t do it will find their constituents looking elsewhere for IT services that can be delivered faster and cheaper.

The IT assembly line

It’s hard for people to think of their IT professionals as assembly line workers.  After all, they are doing complex things like installing servers, configuring networks, and updating firmware.  These are CCIEs, VCPs, and Storage Gurus.  But that’s actually what people in the trenches are:  Workers of the virtual Assembly line.  IT managers should look at the way work enters the assembly line, understand the bottlenecks, and track how long it takes to get things through the line.  Naturally, there are exceptions that crop up.  But for the most part, the work required to deliver applications to the business consists of repetitive tasks.  They’re just complicated, multi-step, repetitive tasks.

To start with, we need to look at the common requests that come in:  Creating new servers, deploying new applications, delivering a new test environment.  Whatever it is, management really needs to understand how it gets done, and look at it like the manufacturing foreman sitting above the plant, looking down and watching a physical product make its way through.  Observe which processes are in place, where they are being side stepped, or where they don’t exist at all.

As an example, consider all the steps required to deploy a server.  It may look something like the flowchart below:

That sure looks like an assembly line to me.  If you can view work that enters the infrastructure like an assembly line, you can start measuring how long it takes for certain activities to get done.  Then you can figure out ways to optimize.
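
To make the measuring part concrete, here is a small sketch that adds up the lead time of a deployment workflow and points at the slowest step.  The step names and hours are made-up illustration values, not measurements from any real shop.

```python
# Sketch: treat a server deployment as an assembly line and find
# the bottleneck. Step names and hours are made-up values.

from dataclasses import dataclass


@dataclass
class Step:
    name: str
    work_hours: float  # hands-on time spent doing the task
    wait_hours: float  # time the request sits in somebody's queue

    @property
    def lead_time(self):
        return self.work_hours + self.wait_hours


def summarize(steps):
    """Return (total lead time in hours, name of the slowest step)."""
    total = sum(step.lead_time for step in steps)
    bottleneck = max(steps, key=lambda step: step.lead_time)
    return total, bottleneck.name


DEPLOY_SERVER = [
    Step("Request approved", 0.5, 16.0),
    Step("VLAN created", 0.25, 24.0),
    Step("LUN carved", 0.5, 8.0),
    Step("VM provisioned", 1.0, 2.0),
]

if __name__ == "__main__":
    total, slowest = summarize(DEPLOY_SERVER)
    print(f"lead time: {total} hours, bottleneck: {slowest}")
```

Splitting hands-on time from queue time is a design choice for the example: in most shops the waiting is where the hours go, and it is invisible unless you track it separately.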

Standardization of the Infrastructure

Manufacturing lines optimize throughput by standardizing processes and equipment.  When I hear VMware tell everybody that “the hardware doesn’t matter”, I take exception.  It matters.  A lot.  Just like your virtualization software matters.  Cisco and other hardware vendors come at it from the opposite direction and say “the hypervisor doesn’t matter, we’ll support them all”.  What all parties are really telling you is that they want you to standardize on them.  All parties are trying to prove their value in a private cloud situation.

What an organization will standardize on depends on a lot of things: Budget, skill set of Admins, Relationship with vendors and consultants, etc.  In short, when considering the holy trinity of the data center: Servers, Storage, & Networking it usually gets into a religious discussion.

But whatever you do, the infrastructure needs to be robust.  This is why Converged Infrastructures like Vblocks, FlexPods, and other reference architectures have become popular.  The “One-Piece-At-A-Time” accidental/cobbled architecture is not a good play.

Consider the analogy that a virtualized workload is cargo on a Semi Truck.  Do you want that truck running over a solid 6-lane government highway like I-5, or do you want that stuff traveling at 60mph down a rickety bridge?



Similarly, if your virtualization team doesn’t have strong Linux skills, you probably don’t want them running OpenStack on KVM.  That’s why VMware and Hyper-V are so popular.  They’re a lot easier for most people’s skill level.

What to Standardize On?

While the choice of infrastructure standardization is a religious one, there are role models we can look to when deciding.  Start out by looking at the big boys, or the people you aspire to be when you grow up.  Who are the big boys that are running a world class IT as a service infrastructure?  AWS, RackSpace, Yahoo, Google, Microsoft, Facebook, right?

What are they standardizing on?  Chances are it’s not what your organization is doing.  Instead of VMware, Cisco, IBM, HP, Dell, EMC, NetApp, etc, they’re using open source, building their own servers, and using their own distributed filesystems.  They do this because they have a large investment in their DevOps team that is able to put these things together.

A State organization that has already standardized on a FlexPod or Vblock with VMware is not going to throw away what they’ve done and start over just so they can match what the big boys do.  However, as they move forward, perhaps they can make future decisions based on emulating these guys.

Standardize Processes

The missing part is standardizing the processes once the infrastructure is in place.  Standardization is tedious because it involves looking at every detail of how things are done.  One of my customers has a repository of documentation they use every time they need to do something to their infrastructure.  For example, 2 weeks ago we added new blade servers to the UCS.  He pulled out the document and we walked through it.  There were still things we modified in the documentation, but for the most part the steps were exact.

Unfortunately, this was only one part of the process.  The Networking team had their own way of keeping notes (or not at all) on how to do things.  So the processes were documented in separate places.  What the IT manager needs to do is make sure they understand how the processes (or work centers) are put together and how long each one takes.

The manager should be able to have their own master process plan to be able to track work through the system.  (The system being the different individuals doing the work).  This is what is meant by “work flow”.  Even if they just do this by hand or as is commonly done with a Gantt chart, there should be some understanding.

Each job that comes in should get its own workflow, or Gantt Chart, and be entered into something like a Kanban board.  Once you understand this for the common requests, you can see how many one-offs there are.

Whether these requests are for public cloud or private cloud, there is still a workflow.  It is an iterative process that may not be complete the first few times it is done, but over time will become better.  There is a great book called “The Phoenix Project” that talks about how the IT staff starts to standardize and work together between development and operations to get their processes better.  These ideas are based on an earlier business classic called “The Goal”.

Automate the Processes

Once the processes are known we turn our assembly line into programmers of the processes.  I used to work as a consulting engineer to help deploy High Performance Computing clusters.  On several occasions the RFPs required that the cluster be able to be deployed from scratch in less than 1 hour.  From bare metal, to running jobs.  We created scripts that would go through and deploy the OS, customize the user libraries, and even set up a job queuing system.  It was pretty amazing to see 1,200 bare metal rack mount servers do that.  When we would leave, if the customer had problems with a server then they could replace it, plug it in, and walk away.  The system would self provision.

While that was a complicated process and still is, it is still simpler than what virtualization has done to the management of the data center.  We never had to mess with the network once it was set up.  Workflows for a new development environment are pretty common and require provisioning several VMs with private networks and their own storage.  However, the same method of scripting the infrastructure can still be applied.  It just needs to be orchestrated.

Automate and Orchestrate with a Framework

Back when we did HPC systems, we used an open source management tool called xCAT.  That was the framework by which we managed the datacenter.  The tool had capabilities but really what it gave us was a framework to insert our customizations or our processes that were specific for each site.  The tool was an enabler of the solution, not the solution itself.

Today there are lots of “enterprise” private cloud management tools.  In fact, any company that wants to sell a “Private Cloud”  will have its own tool.  VMware vCloud Director, HP Cloud System, IBM Cloudburst, Cisco UCS Director, etc.  All of these products, regardless of how they are sold should be regarded as frameworks for automating your processes.

At a recent VMUG, the presenter asked “How many people are using vCloud Director or any other cloud orchestration tool?”  Nobody raised their hand.  Based on what I’ve seen, it’s because most organizations haven’t yet standardized their IT processes.  There is no need for orchestration if you don’t know what you’re orchestrating.

Usually each framework will come with a part or all of what Cisco calls the “10 domains of cloud” which may include: A self service portal, chargeback/showback, service catalog, security, etc.  If you are using a public cloud, you are using their framework.

Once you select one, you’ll need to get the operations teams (network, storage, compute, virtualization) to sign off and use the tool.  It’s not just a server thing.  Each part of the assembly line needs to use it.

Once the individual components are entered into the framework, then the orchestration comes into play.  To start with, codify the most common workloads:  Creating a VLAN, Carving out a LUN, Provisioning a VM, etc.

To orchestrate means to arrange or control the elements of, as to achieve a desired overall effect.  With the Framework, we are looking to automate all of the components to deliver a self service model to our end customer.

Self Service and Chargeback

Once we have the processes codified in the framework, we can now present a catalog to our users.  With a self service portal we recommend it not being completely automated to start out with.  With some frameworks, as a workload moves through the automated assembly line, it can send an email to the correct IT department to validate whether a workflow can move through.  So for example, if the user as part of the workflow wants a new VLAN for their VM environment, the networking administrator will receive an email and will be able to approve or deny.  This way, the workflow is monitored, the end requester knows where they are in the queue, and  once it is approved, it gets created automatically, then gets passed along to the next item in the assembly line.
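
The approval gate described above can be sketched in a few lines.  This is an illustration only: the step names, the owning teams, and the approve callback (standing in for the email round trip) are all hypothetical and not tied to any particular cloud framework.

```python
# Sketch: a workflow where some steps wait for a human approval.
# Step names, owners, and the approve callback are hypothetical.


def run_workflow(steps, approve):
    """Run steps in order and return the names of the completed ones.

    steps   -- list of (name, owner, needs_approval) tuples
    approve -- callable(name, owner) returning True or False, standing
               in for the email sent to the owning team
    The workflow stops at the first denied step.
    """
    done = []
    for name, owner, needs_approval in steps:
        if needs_approval and not approve(name, owner):
            break
        done.append(name)
    return done


NEW_DEV_ENV = [
    ("create VLAN", "network", True),
    ("carve LUN", "storage", True),
    ("provision VM", "virtualization", False),
]

if __name__ == "__main__":
    # Everybody approves: the whole line runs.
    print(run_workflow(NEW_DEV_ENV, lambda name, owner: True))
    # Storage denies: the request stops after the VLAN is created.
    print(run_workflow(NEW_DEV_ENV, lambda name, owner: owner != "storage"))
```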

For chargeback, the recommendation is to keep the menu small, and the price simple.

Security all throughout then Monitor, Rinse, and Repeat

More workflows will come into the system and the catalog will continuously need updates and revisions.  This is the programmable data center.  Iterations should be checked into a code repository, similar to how application developers use version control systems to store code updates.  You will have to do bug fixes and patch up any exposed holes.  With virtualization comes the ability to integrate more software security services like the ASA 1000v, or the VSG.

Action Items

  • Realize that your IT infrastructure is a collection of APIs waiting to be harnessed and programmed.  Challenge the people you work with to learn to use those APIs to automate their respective areas of expertise.
  • Optimize the assembly line by understanding the workflows.  Any manufacturing manager can tell you the throughput of the system.  An IT manager should be able to tell you the same thing about their system.  Start by understanding the individual components, how long it takes, and where the bottlenecks in the system are.
  • Standardize your infrastructure with a solid architecture.  Converged architectures are popular for a reason.  Don’t reinvent the wheel.
  • Standardizing processes is the hardest part.  Start with the most common.  These are usually documented.  Take the documentation and think how you would change it into code.
  • Program the DataCenter using a Framework.  Most of the work will have to be done in house or with service contracts.  The framework could be something like a vendor’s cloud software or something free like OpenStack.


UCS Reverse Path Forwarding and Deja-Vu checks

UCS Fabric Interconnects are almost always run in end-host mode.  At this point in the story there really aren’t many reasons to use switch-mode on the Fabric Interconnects.

Two checks, or features that make End Host Mode possible are Reverse Path Forwarding (RPF) checks and Deja-Vu checks.

RPF and Deja-Vu (from

Reverse Path Forwarding Checks

Each server in the chassis is pinned dynamically (or you can set up pin groups and do it statically, but I don’t recommend that) to an uplink on Fabric Interconnect A and Fabric Interconnect B.  Let’s say you have 2 uplinks on port 31 and 32 of your Fabric Interconnect.  Server 1/1 (chassis 1 / blade 1)  may be pinned to port 31.  If a unicast packet is received for server 1/1 on uplink port 31, it will go through.  But if that same packet destined for server 1/1 is received on port 32, it will be dropped.  That’s because RPF checks to see if the destination for the unicast is actually forwarding its uplink traffic through that link.

Deja Vu Checks

The other check is called “Deja-Vu”.  The Cisco documentation says: “Server traffic received on any uplink port, except its pinned uplink port is dropped”.  That sounds a lot like RPF.  Another presentation from Cisco Live states it this way: “Packet with source MAC belonging to a server received on an uplink port is dropped”.

An example to clear it up

VM A on server 1/1 wants to talk to VM B located somewhere else.  The Fabric Interconnects in this case are connected to a single Nexus 5500 switch.  The VM is pinned to one of the VNICs and that VNIC is pinned to go out port 31 of Fabric Interconnect A.  So what happens?

First the VM will send an ARP request.  An ARP request basically says:  I know the IP address but I want the MAC address.  (Obviously, this is in the same Layer 2 VLAN and subnet).  If Fabric Interconnect A doesn’t find the IP/MAC association in its CAM table, then it will not flood the server ports down stream.  That is something a switch would do.  The Fabric Interconnect is different.  The reason the Fabric Interconnect doesn’t send a broadcast down its server ports is because it is a source of truth and knows everyone connected on its server ports.

What it will do instead is forward the ARP request (unknown unicast) up the designated uplink (port 31).  Now the Nexus switch is a switch.  (And a very good one at that).  It will say:  “Hey, I don’t have a CAM table entry for VM B IP/MAC so I will do what we switches do best:  Flood all the ports!” (except the port that the unknown unicast/ARP request came in on).

Remember Fabric Interconnect A port 32 is connected to this same switch as port 31 where the unknown unicast (ARP request) went out.  The Nexus 5500 will send this unknown unicast to port 32 just like every other port.  But port 32 says:  Wait a minute, the source address originated from me.  Deja-vu!  So he drops the packet.

Fabric Interconnect B has two ports 31 and 32 that will also receive the unknown unicast.  If VM B is pinned to a VNIC that is pinned to port 31 on Fabric Interconnect B, he will say:  I got this!  And the packet will go through.  Port 32, however on FI-B will look at the destination MAC and say:  This is not pinned to me, so I’ll drop the packet.  That is the RPF check.

To sum it up

Deja-Vu check:  don’t receive a packet from the upstream switch that originated from me.

Reverse Path Forward Check:  don’t receive a packet if there’s no server pinned to this uplink.
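
Here is a minimal sketch of the two checks as a single decision function for unicast frames arriving on an uplink.  It is a simplification for illustration, not how the Fabric Interconnect is actually implemented: the MAC addresses and the pinning table are made up, and only the port numbers follow the example above.

```python
# Sketch of the two end-host-mode checks on frames arriving on an uplink.
# MAC addresses and the pinning table are illustrative values.


def uplink_accepts(frame, uplink, pinning):
    """Decide whether a unicast frame received on `uplink` is forwarded.

    frame   -- dict with "src" and "dst" MAC addresses
    uplink  -- the uplink port number the frame arrived on
    pinning -- dict mapping each local server MAC to its pinned uplink
    """
    # Deja-Vu check: a frame whose source is one of our own servers
    # has come back from the upstream switch, so drop it.
    if frame["src"] in pinning:
        return False
    # RPF check: only accept traffic for a server on the uplink
    # that the server is pinned to.
    return pinning.get(frame["dst"]) == uplink


VM_A = "00:25:b5:00:00:0a"
VM_B = "00:25:b5:00:00:0b"
PINNING = {VM_A: 31}  # VM A's vNIC is pinned to uplink port 31

if __name__ == "__main__":
    reply = {"src": VM_B, "dst": VM_A}
    print(uplink_accepts(reply, 31, PINNING))  # pinned uplink: forwarded
    print(uplink_accepts(reply, 32, PINNING))  # wrong uplink: RPF drop
    echo = {"src": VM_A, "dst": VM_B}
    print(uplink_accepts(echo, 32, PINNING))   # our own frame: Deja-Vu drop
```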

Backing up UCS

Backing up UCS can be a little confusing, especially since it presents you with a few options.  What you may be expecting is something simple like a one-click “Back it up” button.  But in fact, that is not the case.  The nice thing is that there are lots of different things you can do with backup files.
From the Admin Tab under All in UCS Manager, under the general tab, you select “Backup Configuration”

But now, we have a few choices as to how we set this up.  Now you create a backup operation

Then you are presented the below screen and now things get a little bit complicated.

Let’s go through some of these seemingly confusing options:

Admin State

This is a bit confusing.  But here’s how to think about it:  If you want to run the backup now, right this second, when you click “OK” and don’t want to wait, select “Enabled”.  Most of the time this is what you want.  If instead, you just want to save this backup operation, so that you can click it on the Backup operations list and do it, then set disabled.


There are 4 different configurations that can be backed up by UCS.  All of them deal with data that lives in the Fabric Interconnect.  They are illustrated in the diagram below


The brim of the triangle is the Full State.  This is a binary file that can be used on any system to restore the settings that this Fabric Interconnect has.  It’s different from all the other types.  It’s the only one that can be used for system restore.  This is usually fun to back up off your own system.  I haven’t tried putting it into the platform emulator yet, but it might be fun to try.

The three other backups are just XML files.  They’re useful for importing into other systems.  The “All Configuration” is just a fancy way of saying “System Configuration” and “Logical Configuration”.  It does both.

The System Configuration is user names, roles, and locales.  This is useful if you are installing another UCS somewhere and you want to keep the same users and locales (if you are using some type of multi-tenancy), but in that case, why aren’t you using UCS Central?  Try it, it’s free for up to 5 domains.  And you can do global service profiles.

The Logical Configuration is all the pools and policies, service profiles, service profile templates you would expect to be backed up.  This is pretty good to put inside the emulator to fool around with different settings you are using.  Or, if you don’t have your UCS yet and you’re waiting to order it, then you can just create the pools and policies in the emulator.  Then when the real thing comes, import the logical configuration in and you are ready to rock.

The tricky button that shows up when you select the All Configuration or the Logical Configuration is the label:  Preserve Identities. This is only on logical and all configurations because it has to do with making service profiles that are already mapped to pools retain their mapping.  This is good if you’re going to move some service profiles from one fabric interconnect domain to another and want to keep the same setup.  Otherwise, it doesn’t really matter to keep those identities.
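
To keep the four types straight, here they are as a small lookup table.  This is just a summary of the descriptions above in code form; the keys and field names are labels chosen for this sketch, not necessarily the identifiers UCS Manager uses.

```python
# Sketch: the four backup types described above as a lookup table.
# Keys and field names are descriptive labels chosen for this example,
# not necessarily the identifiers UCS Manager uses internally.

BACKUP_TYPES = {
    "full state": {
        "format": "binary",
        "system_restore": True,
        "preserve_identities_option": False,
        "covers": ["snapshot of the whole system"],
    },
    "all configuration": {
        "format": "xml",
        "system_restore": False,
        "preserve_identities_option": True,
        "covers": ["system configuration", "logical configuration"],
    },
    "system configuration": {
        "format": "xml",
        "system_restore": False,
        "preserve_identities_option": False,
        "covers": ["user names", "roles", "locales"],
    },
    "logical configuration": {
        "format": "xml",
        "system_restore": False,
        "preserve_identities_option": True,
        "covers": ["pools", "policies", "service profiles",
                   "service profile templates"],
    },
}


def types_where(**wanted):
    """Return the backup type names whose fields match every filter."""
    return [
        name
        for name, props in BACKUP_TYPES.items()
        if all(props[key] == value for key, value in wanted.items())
    ]


if __name__ == "__main__":
    print(types_where(system_restore=True))
    print(types_where(preserve_identities_option=True))
```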

The other options presented for how you want to back up the system are pretty self-explanatory.  You can either back this up to your local machine or some other machine that has another service running like SSH, TFTP, etc.

After you’ve created a backup operation, the nice thing is that it saves it for you in a backup operations list.  When you want to actually do it, just select it, then hit admin enable and it will perform the backup.

Performing Routine Periodic Backups

But wait you say, what if I want it to periodically backup itself?

Well, that’s where you move to the next tab which is the Policy Backup & Export

Here you have the option of backing up just the binary system restore button, or the all-configuration.  The all configuration is good for backing up XML files just in case some administrator accidentally changes a bunch of configs on you.

Here you can see my XML and binary files will be backed up every day.  (That may be a little more than you need, as things don’t usually change so much in most environments, but hey, now you have it, use it.)

When it saves to those remote files you’ll get a timestamp on the name:


So that’s backing up the system and all the ways it can be done.  There are a few nerd knobs, but I wanted to make sure I understood it.

The last thing to cover is import operations.  It’s important to understand that you can do two different types:  A merge or a replace.  With merge, if you have a MAC pool called A and it has 30 MACs already, a merge will add the new MACs to it.  (So if there are 20 in the import, you will now have 50).  With replace, you’ll now just have 20.  You can only merge XML files.
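
The difference between the two modes is easy to show with sets.  This sketch models a MAC pool as a plain set of addresses, which is a simplification of what the XML import really operates on; the addresses are made up, and treating an address present in both pools as a single entry is an assumption of the model.

```python
# Sketch: merge vs. replace when importing a MAC pool.
# A pool is modeled as a set of MAC strings (a simplification);
# the addresses are made-up example values.


def import_mac_pool(existing, imported, mode):
    """Return the pool contents after importing with the given mode."""
    existing, imported = set(existing), set(imported)
    if mode == "merge":
        return existing | imported  # keep what is there, add the new ones
    if mode == "replace":
        return imported             # only the imported addresses survive
    raise ValueError(f"unknown import mode: {mode!r}")


def mac_block(start, count):
    """Build `count` consecutive example MAC addresses."""
    return {f"00:25:b5:00:00:{i:02x}" for i in range(start, start + count)}


if __name__ == "__main__":
    pool_a = mac_block(0, 30)     # pool A already has 30 MACs
    incoming = mac_block(30, 20)  # the import carries 20 different MACs
    print(len(import_mac_pool(pool_a, incoming, "merge")))
    print(len(import_mac_pool(pool_a, incoming, "replace")))
```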

Lastly, all of this information is found in the latest UCS GUI Configuration Guide.  It was nice to gain a more solid understanding of it.  Backing up is something I go over briefly in some of the tech days I do, but this fleshes it out a little better if there are any further questions.

Thanks for reading!



Cisco UCS East-West Traffic Performance.

The worst thing you can do in tech is claim something positive or negative about some technology without anything to back it up.  Ever since UCS was first brought to market, other blade vendors have been quick to point out any flaw they can find.  This is mostly because their market share of the x86 blade space has been threatened and in some cases (IBM & Dell) surpassed by UCS.

One of the claims that I’ve heard while presenting UCS is that a major flaw in the architecture makes switching between two blades inferior to the legacy architectures that other hardware vendors use.  You see, (they told me) in order for one UCS blade to communicate to another UCS blade you have to leave the chassis, go into the Fabric Interconnects (that could be all the way at the top of rack, or even in another rack), and then come back into the chassis.  This must take an eternity.

Network traffic from one blade to another in the same chassis is called “East-West” traffic because the traffic doesn’t leave the chassis (picture it going sideways), whereas “North-South” traffic is network traffic that leaves the chassis and goes out to some other end point that doesn’t reside in the chassis.  The widely held belief was that UCS was at a huge disadvantage here.

After all, every other blade chassis on the market has network switches that sit inside the chassis and *must* be able to perform faster than UCS.  For a while now, I’ve wondered how much latency that adds.  Because, frankly, I thought the same way they did.  Surely the internal wires must be faster than twinax cables.

But science, that pesky disprover of legacy traditions and beliefs, has finally come to settle the argument.  And in fact has turned the argument on its head.  The east-west traffic inside UCS is faster than the legacy chassis.

The full blog can be read here.  There’s a link to a few great papers on this site that show how the measurements were done.

Plus one for the scientific method!

Hacking UCS Manager to get pictures

I was reading the API for UCS Manager the other day (hey, everybody has a hobby, right?) and I found a pretty cool place where the Java UCS Manager downloads the picture files.  I still haven’t found all the files (like the Fabric Interconnects, the Chassis, and the IOMs) but most of the server models are found this way.  Substitute your UCS Manager IP address into the script below and it will download the pictures of the blades.  I wish I had known this before I gathered pictures for UCS Tech Specs, as these are great pictures.

# Placeholder address: replace with your UCS Manager IP address
IP=192.0.2.10
wget http://$IP/blade/B230.png
wget http://$IP/blade/B440.png
wget http://$IP/blade/Blade_full_width_front.png
wget http://$IP/blade/Blade_half_width_front.png
wget http://$IP/blade/Blade_half_width_front_marin.png
wget http://$IP/blade/SfBlade.png
wget http://$IP/blade/sequoia_front.png
wget http://$IP/blade/sequoia_top.png
wget http://$IP/blade/silver_creek_front.png
wget http://$IP/blade/silver_creek_top.png
wget http://$IP/blade/ucs_b200_m3_front.png
wget http://$IP/blade/ucs_b200_m3_top.png
wget http://$IP/fi/switch_psu_DC.png
wget http://$IP/rack/Alameda_1_front.png
wget http://$IP/rack/Alameda_1_top.png
wget http://$IP/rack/Alameda_2_front.png
wget http://$IP/rack/Alameda_2_top.png
wget http://$IP/rack/Alpine_M2.png
wget http://$IP/rack/Alpine_M2_front.png
wget http://$IP/rack/C220M3_front_small.png
wget http://$IP/rack/C220M3_top.png
wget http://$IP/rack/C420_front.png
wget http://$IP/rack/C420_internal.png
wget http://$IP/rack/SD1_Gen2_front.png
wget http://$IP/rack/SD1_Gen2_internal.png
wget http://$IP/rack/san_mateo_front.png
wget http://$IP/rack/san_mateo_internal.png
wget http://$IP/rack/sl2_front.png
wget http://$IP/rack/sl2_top.png
wget http://$IP/rack/st_louis_1u_front.png
wget http://$IP/rack/st_louis_1u_top.png
wget http://$IP/rack/st_louis_2u_front.png
wget http://$IP/rack/st_louis_2u_top.png

VIFS in a UCS environment

First of all, if you stumbled upon this page you may be asking:  “What is a VIF?”  A VIF is a virtual interface.  In UCS, it’s a virtual NIC.
Let’s first examine a standard rack server.  Usually you have 2 Ethernet ports on the motherboard itself.  Nowadays, recent servers like the C240 M3 have 4 x 1GbE onboard interfaces.  Some servers even have 2x10GbE onboard NICs.  That’s all well and good and easy to understand because you can see it physically.
Now let’s look at a UCS blade.  You can’t really see the interfaces because there are no RJ-45 cables that connect to the server.  It’s all internal.  If you could see it physically, then you’d see that you could add up to 8x10Gb physical NICs per half width blade.  Just like a rack mount server comes with a fixed number of PCI slots, a blade has built in limits as well.  But Cisco blades work a little differently.  Really, there are 2 sides:  Side A and Side B, each with up to 4x10GbE physical connections.  And those 4x10GbE are port channeled together, so it looks like one big pipe depending on what cards you put in there.
With these two big pipes (each anywhere from 10Gb to 40Gb) we create virtual interfaces that are presented to the operating system.  That’s what a VIF is.  These VIFs can be used for some really interesting things.
VIF Use Cases
  1. It can be used to present NICs to the operating system.  This makes it so that the operating system thinks it has a TON of real NICs.  The most I’ve ever seen though is 8 NICs and 2 Fibre Channel adapters.  (Did I mention that Fibre Channel counts as a VIF?)  So 10 is probably the most you would use with this configuration.
  2. It can be used to directly attach virtual machines with a UCS DVS.  This is also one version of VM-FEX.  Here, UCS Manager acts as the virtual supervisor and the VMs get real hardware for their NICs.  They can do vMotion and all that good stuff and remain consistent.  I don’t see too many people using this, but the performance is supposed to be really good.
  3. It can be used for VMware DirectPath IO.  This is where you tie the VM directly to the hardware using the VMware DirectPath IO bypass method.  (Not the same as the UCS Distributed Virtual Switch I mentioned above.)  Typically you cannot do vMotion when you do VMware DirectPath IO.  The advantage UCS has is that with UCS, you can!
  4. USNIC (future!!!)  User-space NIC is where we can present one of these virtual interfaces directly to user space and create a low latency connection in our application.  This is something that will be enabled in the future on UCS, but it means we dynamically create these and can hopefully get latencies around 2-3 microseconds.  This is great for HPC apps and I can’t wait to get performance data on this.
  5. USNIC in VMs.  (future!!!)  This is where a user space application running in a VM will have the same latency as a physical machine.  That’s right.  This is where we really get VMs doing HPC low latency connections.
So now that we know the use cases, how can you tell how many virtual interfaces or VIFs you have for each server?  Well, it depends on the hardware and the software.  You see, they all allow for growth, but some instances have limitations.  So that’s what I’m hoping to explain below.
UCS Manager Limitations and Operating Systems Limitations
For 2.1 this is found here.  For other versions of UCS Manager, just search for “UCS 2.x configuration limits”.
The Maximum VIFS per UCS domain today is 2,000

The document above also shows that for ESX 5.1 it’s 116 per host.  The document references UPT and PTS.
UPT – Universal Pass-Through (this is configured in VMware with DirectPath IO, use case 3 as I mentioned above)
PTS – Pass-Through Switching (this is UCS DVS, or use case 2 as I mentioned above)
Fabric Interconnect VIF Perspective
Let’s look at it from a hardware perspective.  The ASICs used on the Fabric Interconnects determine the limits as well.
The UCS Fabric Interconnect 6248 uses the “Carmel” Unified Port Controller.  There is 1 “Carmel” port ASIC for every 8 ports.  So ports 1-8 are part of the first Carmel ASIC, etc.  In general, you want the FEX (or IO Module) connected to the same Carmel.
Each Carmel ASIC allows 4096 VIFs which are equally divided among all 8 switch ports.  Therefore, 512 VIFs per port.  Since one of those VIFs is dedicated to the CIMC, that gives 511 VIFs per port.  Consider that there are 8 slots in each chassis, so you would further divide that up between the 8 blade slots, so that’s 64 max in each slot.  Some are reserved, so it ends up being 63 VIFs per slot for each uplink.  That’s why the equation ends up being 63*n – 2, where n is the number of uplinks per FEX (2 are used for management).
Cisco Fabric Interconnect 6200
Uplinks per FEX | Number of VIFs per slot
1 | 61
2 | 124
4 | 250
8 | 502
The 6100 uses the Gatos port controller ASIC.  There are 4 ports managed per Gatos ASIC.
Each Gatos ASIC allows 512 VIFs, or 128 VIFs per port (512 VIFs per ASIC / 4 ports).  Each of those 4 ports gets divided by the 8 slots.  So, 128 / 8 = 16.  However, some of those are reserved, so it ends up being only 15 VIFs per slot for each uplink.  That’s why the equation of VIFs per server is 15*n – 2, where n is the number of uplinks per FEX (the 2 are used for management).
Cisco Fabric Interconnect 6100
Uplinks per FEX | Number of VIFs per slot
1 | 13
2 | 28
4 | 58
8 | 118 (obviously requires 2208)
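The per-slot numbers in the two tables above can be reproduced with a few lines of shell arithmetic.  This is only a sketch of the math as described in this post, not anything UCS Manager exposes: the function names per_uplink_share and vifs_per_slot are made up for illustration, and treating “some are reserved” as exactly one VIF per share is an assumption read off the 64 to 63 and 16 to 15 steps.

```shell
# per_uplink_share ASIC_VIFS PORTS_PER_ASIC SLOTS
# VIFs that one FEX uplink contributes to one blade slot.
per_uplink_share() {
  local asic_vifs=$1 ports=$2 slots=$3
  # Split the ASIC's VIFs across its ports, then across the blade slots,
  # and hold one back as reserved (assumption: exactly one per share).
  echo $(( asic_vifs / ports / slots - 1 ))
}

# vifs_per_slot SHARE UPLINKS
# Usable VIFs per slot for a given number of uplinks per FEX.
vifs_per_slot() {
  local share=$1 uplinks=$2
  # Two VIFs are used for management.
  echo $(( share * uplinks - 2 ))
}

share_6200=$(per_uplink_share 4096 8 8)   # Carmel: 4096 VIFs, 8 ports per ASIC
share_6100=$(per_uplink_share 512 4 8)    # Gatos: 512 VIFs, 4 ports per ASIC

for n in 1 2 4 8; do
  printf '%s uplinks: 6200 -> %s, 6100 -> %s\n' \
    "$n" "$(vifs_per_slot "$share_6200" "$n")" "$(vifs_per_slot "$share_6100" "$n")"
done
```

Running it prints one line per uplink count, which makes it easy to compare against the tables or to plug in a different uplink count.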
VIFs from the Mezz Card Perspective
The M81KR card supports up to 128 VIFs.  So you can see from above that with the 6100 and 2104/2204/2208 it’s not the bottleneck.
The VIC 1280 which can be placed into the M1 and M2 servers can do up to 256 VIFs.
Hopefully that clarified VIFs a little and where the bottlenecks are.  It’s important to note as well that I/O modules don’t limit VIFs.  They’re just passthrough devices.

Fusion IO: Software Defined Storage

Originally Posted Dec 14, 2012

This week I was very privileged to go to Salt Lake City to Fusion IO headquarters and get a deep dive on their technology and how it differs from competitors in the high speed, low latency storage market.  (Which is really starting to just be the general storage market these days instead of something niche.)  It was super neat to go there with a bunch of my buddies from Cisco and I can’t thank them enough for having us and treating us so well.  This meeting was brought about primarily because Fusion IO and Cisco have introduced a new Mezzanine Fusion IO card for the B series blades.  Specifically:  The B200 M3 and B420 M3.  The card is the same as the ioDrive2 but in Cisco blade server form factor.  This first version is a 758GB drive.  We had a great time and learned a ton.

Fusion IO’s main product is branded as ioMemory.  It’s marketed as a new memory tier.  The co-founder, David Flynn, had the idea of taking flash memory, putting it on a PCI card, slapping a basic controller on it and putting it into the server.  The server would then see this as a hard drive.  By not using the legacy protocols of SAS or SATA and using their own protocol with software, they were able to get IO latency down from milliseconds to microseconds.  Couple this with flash memory and it translates to more IOPS, which means applications that normally have to wait for disk reads and writes can do them orders of magnitude faster.  One of the examples they cited said that customers were getting 20 times the performance with these drives compared to using standard disk drive arrays.  (Not 20% better, but 20 x)  The linked Wikipedia article shows that the fastest 15k RPM hard drive will only get around 200 IOPS.  Compare that to a Fusion IO card, which that same article shows at 140,000 IOPS.  (My notes also say that they are getting 500,000 IOPS, I’m not sure which is correct but the idea is that it’s blazing fast.)

If you aren’t familiar with the state of the data center, and I commented on this in my VMworld 2012 post, storage is one of the biggest problems.  Numerous blogs, articles, and talks show that storage is the biggest bottleneck and the largest expense in the data center.  Duncan Epping commented at VMworld on the topic of performance that “The network is usually blamed but storage is usually the problem”.  There is a famine happening in our data centers.  Applications are starved for data.  They are waiting and waiting to get their data from disks.  Applications today are like a bunch of hungry children at home whose moms went to the store to get food and are all stuck in traffic with other moms taking a long time to make that round trip.  Storage IO performance has not kept up with the spectacular rate that processing power has improved over the last decade or so.

What we have been doing for the last 10 years (me personally and others) is designing storage systems that we think will meet the performance requirements.  When we get up and running we soon find that the storage system doesn’t meet the performance needs so we throw more disks at it until it does.  Soon we have tons more capacity than we need and a bigger footprint.  I’m not alone in this.  This is standard practice.  Commenting on this, Jim Dawson, the Vice President of worldwide sales, wrote for all of us to see:  DFP = RIP.  This he said means:  Disks for performance is dead.  He also mentioned that his customers when he was at 3PAR were adding so many disks for performance that they asked him to make smaller disk sizes because they didn’t need capacity, they needed performance.

Flash Memory to the rescue

The reason you are probably hearing so much about flash memory now and not before is because the price of flash memory has fallen below the price of DRAM (the kind of memory that forgets everything that was in it when you pull the power).  Flash memory, specifically NAND flash, is the flash that’s used in Fusion IO, SSDs, SSD arrays, and pretty much everything you see out there that’s called flash storage.  This type of memory doesn’t forget which bits were flipped to ones or zeros when you pull the power.  NAND flash is the building block for nearly all the fast storage you’ve been hearing about.  From people making USB thumb drives, SSDs, and PCI cards to vendors like Violin or Texas Memory Systems (now IBM) that make arrays with them using their own controllers, they’re all using NAND flash.

The difference is how the flash is accessed.  SSDs go through the SAS or SATA controllers that add significant overhead.  That makes them slower since those are legacy protocols designed for hard drive technology.  But if you have one in your MacBook Pro like I hope to have soon, then you are not complaining and it’s just fine.  Most of the flash storage solutions out there are based on using SAS/SATA protocols to access flash storage: Nimble Storage, Whiptail, etc.  It’s simpler to develop because the protocols are already defined and they can concentrate on value add at the top, like putting more protocols or better management tools in it.

Fusion IO has two advantages over these technologies.  First, since they are on the PCI bus, they are closer to the processor so it’s much faster.  Second, they don’t have the overhead of a controller translating older protocols.  There’s a driver that sits on the OS that manages it all.  Since they don’t go through the standard protocols they can also add better monitoring tools and even add more on top of that to innovate cool solutions.  (ioTurbine is an example of this that I’ll get to in a minute)

Fusion IO secret sauce

The ioDrive2 card is the main product.  It’s a PCIe card with a bunch of 25nm NAND flash chips on it.  We had this amazing Fusion IO engineer named Bob Wood come in and talk to us about how it works.  He schooled us so hard I thought I was back in college.  We were worried we were going to get more marketing but in the words of @ciscoservergeek: Our expectations were amazingly surpassed.

Flash memory has what’s called an Erase Block.  This is the smallest unit that can be erased (writes happen in smaller pages inside it, but a page can’t be rewritten until its whole block has been erased).  As flash cells get smaller, having as few as 3 electrons leave or somehow get disturbed can flip a bit in the erase block and make it wrong.  The controller software is therefore always checking to make sure things are still the way they should be.

A standard Fusion IO card is built with about 20% spare capacity that is used when erase blocks get contaminated or flipped too many times.  Bob equated it to standing on top of a mountain and being struck by lightning.  There’s only so many times you can be struck by lightning and still go on.  (Apparently NAND flash can handle it more than humans).  When one of these erase blocks is retired, the card draws from the 20% pool.  In addition, other erase blocks are reserved for features to handle more error checking.  More official information on this “Adaptive Flashback” is here.

I asked then:  So if I have a 750GB card, do I only get to see 600GB of space?  No, the 20% overhead plus other reserved pools is in addition to the 750GB, so you will see that much capacity.  I imagine that the raw capacity is probably from 900GB to 1TB.
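As a rough sanity check on that guess, here is the arithmetic for the 20% spare pool alone.  This is a back-of-the-envelope sketch, not a Fusion IO figure: it is not clear from the talk whether the 20% is measured against the usable capacity or the raw capacity, so both readings are shown, and the other reserved pools are left out because no size was given for them.  The variable names are just for illustration.

```shell
usable_gb=750    # capacity the OS sees
spare_pct=20     # spare pool size quoted in the talk

# Reading 1: the spare pool is 20% on top of the usable capacity.
raw_if_on_top=$(( usable_gb * (100 + spare_pct) / 100 ))

# Reading 2: the spare pool is 20% of the raw capacity
# (integer division, so the result is rounded down).
raw_if_of_raw=$(( usable_gb * 100 / (100 - spare_pct) ))

echo "raw capacity: ${raw_if_on_top}GB to ${raw_if_of_raw}GB before other reserved pools"
```

Either reading lands at or above 900GB, so once the extra error checking pools are added the 900GB to 1TB range looks plausible.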

Bob told us that the design of the card is the classic engineering tradeoff of finding the ultimate efficiency.  You have to worry about which NAND flash you use, multiple suppliers, price/performance, how much you can fix in software, how much error checking you need vs. speed, capacity vs. features, etc.  It sounded like a fun multivariable calculus problem.

The other thing that was cleared up for me was the nature of the product.  Does Fusion IO make one product that’s a hard drive and another one that’s a memory cache?  No.  Physically, it’s one product.  But you can license software to give it more features.  You’ll hear messaging of ioTurbine and DirectCache from them.  Those marketing terms describe software functions you can put on top of the ioDrive2 by licensing software.  ioTurbine is for VMs and DirectCache is for bare metal.  It essentially makes the card act as a memory cache for the VM or physical machine.

And this is where I suspect Fusion IO will continue to innovate: Software on the NAND flash.  How to make it more useful and do more things.

Fusion IO Tradeoffs

Like every technology, there are tradeoffs and no single technology is going to solve all your data center needs.  Isn’t that why we pay architects so much money?  To gather all these great technologies and choose the best ones to meet the needs?  Anyway, here are some tradeoffs:

Price: It’s no mystery that Fusion IO drives aren’t super cheap.  You can buy at least 2 very nice servers for the price of the card, but that may not solve your IO problem.  But if you look at it as buying Fusion IO instead of some souped-up disk array, then it might actually be cheaper.  In fact, they showed case studies where it was over 70% cheaper than getting big storage arrays.

Redundancy and HA: If you have one card in the server, that’s a single point of failure.  Now granted there are no moving parts, so the MTBF goes up, but still you are putting lots of eggs in one basket.  If you have a modern application where redundancy is in the software then this isn’t going to be a problem for you.  For the legacy apps run in most data centers, Fusion IO talked to us about several different solutions you could use to do HA.  A lot of this sounded like what we used to do with xCAT to make it HA.  We’d use DRBD and Steeleye and those were the same things we were told about by Fusion IO.

Now there’s no reason you can’t buy two or more of these cards and put them in the same server and then just use software to RAID them together, but you’re not going to be able to do that in a B200 M3.  Furthermore, you’ll want to sync blocks between drives.  Fusion IO recognizes that people want this and that’s why ioN is a product that I think we’ll see lots more from.  (more on that in a second)

Capacity vs. Performance: A 750GB drive is not too far away from the 1TB drives I can put in my servers.  Fusion IO told us about an Oracle survey where 56% of the big data clusters had less than 5TB of capacity.  That doesn’t sound like big data does it?  But big data isn’t really so much about the size of the file as it is about gaining insight into lots of transactions and data points where each individual record can be quite small.  And in that game, performance is everything.  So even though you can’t get as much capacity on the Fusion IO drives, you can hopefully get the working set on there.  They showed examples where entire databases were run off the cards.  They also showed that in tiered storage designs the cards form yet another (or alternative?) tier by keeping the most recently used data closer to the processor.

Shared Storage is still in vogue: Most of the customers I work with have a shared SAN that all the servers have access to.  Fusion IO cards are directly attached to individual servers.  Fusion IO addresses this with its ioN product which is essentially a shared block storage device created with standard servers and Fusion IO cards.  ioN then presents itself as an iSCSI or Fibre Channel Target.  It can be used in conjunction with a SAN as a storage accelerator.

The trends we have been hearing about lately show that distributed storage in commodity servers is the future.  Indeed, Gary, one of the presenters, mentioned that as well.  That would work very well for Fusion IO.  But this requires software.  Software Defined Storage.  (see what I did there?)  Something like Hadoop, Lustre, or GPFS NSD could work on this today but probably not in the way people want for generic applications.  ioN right now only supports up to 3 servers.  (Sounds like VMware’s VSA doesn’t it?)  I think this technology shows great promise, but it’s not going to be able to replace the SAN in the data center right now.


Fusion IO is having tremendous success in the marketplace.  I like the Cisco and Fusion IO partnership because it adds to Cisco’s storage portfolio partnerships and gives Cisco UCS users more options.

The thing that got me most excited was the ioN product.  By allowing the common man to build his own Violin Memory / Texas Memory Systems style array with commodity servers, we’re getting more choices in how we do our storage.  It still has a bit to go before it can really replace a traditional storage array.  It doesn’t have snapshotting, de-duplication, and all those other cool features that your traditional storage has.  But just imagine:

– What if you could add SSDs and Spinning drives to your commodity servers along with the Fusion IO cards and ioN allowed you to use that as well?

– What if you then had software that could do the auto tiering, putting the most used data on the fastest Fusion IO cards?

– What if you added Atlantis iLIO for the deduplication in software to get that feature into ioN?

All of this points to one trend:  Software is the data center king.  It’s got to have a good hardware design underneath, but when it comes down to it, Fusion IO is faster because its software is more efficient.

TL;DR on the TL;DR

Software defined everything is the king: but even the king needs a solid hardware architecture.


Cisco @ SC’12

originally posted on November 16th, 2012.  Restored from Backup

I just got back from SC’12 in Salt Lake City and it was as usual a fantastic event.  One of my favorite times of the year.  Most of this has to do with the fact that I’ve been going to this conference for almost 10 years so I’ve met a lot of great people.  Perhaps the saddest part about the conference was that I didn’t get to see nearly half of them.  I spent way too much time in the Hilton lounge on the 18th floor away from the maddening crowd doing my day job.  Oh well, it was still fun!

Cisco had a respectable booth this year.  Many people were surprised to see our rack mount servers and asked:  “What is Cisco doing here?”  A few years ago it made more sense because Cisco had acquired TopSpin and was a top InfiniBand vendor.  It wasn’t too much longer before Cisco shut down its InfiniBand business and went back to a converged network strategy for the data center that is based on Ethernet.  So why would it be back at SC’12?  We do sell servers now, so that’s different.  But what would be compelling about Cisco UCS for an HPC solution?

It turns out a small team at Cisco including the great Jeff Squyres has been hard at work perfecting an ultra low latency solution for Ethernet.  Jeff has been involved in MPI longer than he’d care to admit (even from way back when HPC clusters solely used gigabit Ethernet).  He’s currently Cisco’s representative to the Open MPI project, which is what most people use today for HPC workloads.  MPI is a library which applications use to run across distributed systems.  I used to use these libraries a lot.  Usually, when you compile an application like HPL, WRF, Pallas, etc., you point it to the MPI libraries you want to use.  Then when the application runs it hooks into the APIs of those libraries and message passing happens.  Yay!

What Jeff and team found was that due to the virtualization capabilities of the Cisco VICs (Virtual Interface Cards), they actually lend themselves quite well to offloading MPI tasks.  Before we get into that, let’s make sure we understand the capabilities of the Cisco VIC.

Cisco VIC 1225

You can look at the specs here.  It looks like a dual port 10GbE card.  But what’s special about it is that it’s a CNA (converged network adapter) on steroids.  The original intent was for virtualization.  You can take this card and create 256 virtual interfaces.  If this card were plugged into a Cisco Nexus switch, then the network administrator could potentially see 256 interfaces that he could apply policies to.  What this means with VMware is that you can use hypervisor bypass (called DirectPath I/O) to get better performance and network visibility.  The Virtual Machine gets what it thinks is its own physical link.  There have been several versions of this card.  In rack mount servers, the P81E was the precursor.  In the blades, the M81KR is the old model while the VIC 1240 and the VIC 1280 are the current incarnations.

HPC on the VIC 1225

Now that you know what the VIC 1225 does in the VMware use case, you can think about it a different way.  Instead, think of all those VMs as just an application.  After all, that’s all an Operating System is.  It’s just an application that runs other applications.  And when we run DirectPath I/O we’re basically just bypassing the local switch and giving the VM its own access to the I/O.  From the hypervisor’s point of view, that particular lane is just running in user space.  Well, that’s exactly what we do with HPC.

The idea is that when an MPI application first launches and starts to use the VIC for message passing it has to start up following the blue arrows in the diagram above.  Once it hits the USNIC plugin to libibverbs it initiates its “bootstrap” to set up the connection or to use one of the virtualized NICs.  To do this it has to go to the kernel level and use the standard TCP stack, which takes so much time and adds latency.

Once this bootstrapped configuration is complete, the plugin then sets up a queue pair with the NIC and follows the red line, bypassing the kernel and OS stack.  This offloads CPU cycles and aids tremendously in cutting down the latency.

Jeff reports that they are able to get 1.7 microseconds when two of our C220 servers are connected back to back running a 19 byte packet.  That packet is an L2 frame: 1 byte payload and 18 bytes for the frame (6 bytes for source MAC, 6 bytes for destination MAC, 2 for Ethertype, 4 for CRC).  1.7 microseconds back to back is pretty impressive.  To give a comparison, InfiniBand back to back connection latencies can range from 500-600 nanoseconds to around 1.29 microseconds.

When you couple this card with the new Cisco Nexus 3548 switch which delivers about 190 nanoseconds port to port, then the latencies we have observed will now come out to around 2.3-2.4 microseconds.  YMMV.  Here’s how it breaks down:

– With back-to-back Lexingtons (VIC 1225), HRT pingpong latency is 1.7us
– With N3548 algo boost, add 190ns latency
– Prototype Open MPI USNIC support adds 300-400ns (Jeff thinks we can reduce this number over time)
So the MPI HRT pingpong latency over N3548 is somewhere around 2.3us
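The breakdown above is easy to add up by hand, but here is the same sum as a small shell sketch.  Everything is kept in nanoseconds so plain integer arithmetic works; the variable names and the to_us helper are made up for this example, and the inputs are just the figures quoted in the list.

```shell
# All values in nanoseconds so plain integer arithmetic is enough.
nic_ns=1700        # back-to-back VIC 1225 pingpong latency
switch_ns=190      # Nexus 3548 port to port
mpi_low_ns=300     # prototype Open MPI USNIC overhead, low end
mpi_high_ns=400    # prototype Open MPI USNIC overhead, high end

# to_us NS: print nanoseconds as microseconds, truncated to two decimals.
to_us() {
  printf '%d.%02d\n' $(( $1 / 1000 )) $(( $1 % 1000 / 10 ))
}

total_low_ns=$(( nic_ns + switch_ns + mpi_low_ns ))
total_high_ns=$(( nic_ns + switch_ns + mpi_high_ns ))

echo "low:  $(to_us "$total_low_ns") us"
echo "high: $(to_us "$total_high_ns") us"
```

With the quoted inputs the high end of the range comes out just under 2.3us, which lines up with the summary figure.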

For many HPC applications that’s compelling enough, especially for standard Ethernet.

The final benefit is the management capability that you get with UCS Manager.  You can manage these rack mount environments the same way you manage blades.  You can give them service profiles and you keep the settings the same using policies and pooled resources.  It’s very easy.  While UCS Manager works today with the rack mount servers, in order to use the USNIC capabilities with the ultra-low latency, you’ll have to wait until sometime in 2013.

The bottom line now is cost.  How much will it cost for Cisco gear compared to using standard rack mount servers with InfiniBand?   I’m willing to bet that Cisco in HPC is going to be price competitive, but the real savings are in operational costs:  You can get nearly the same benefits of InfiniBand without using a separate network.  That’s one less network to configure, manage, and maintain.  That’s one less vendor to have to deal with.

If you’re looking for a simple to manage HPC departmental cluster, this may just be the solution you’ve been waiting for.