I just got back from SC’12 in Salt Lake City and, as usual, it was a fantastic event. It’s one of my favorite times of the year. Most of this has to do with the fact that I’ve been going to this conference for almost 10 years, so I’ve met a lot of great people. Perhaps the saddest part about the conference was that I didn’t get to see nearly half of them. I spent way too much time in the Hilton lounge on the 18th floor, away from the madding crowd, doing my day job. Oh well, it was still fun!
Cisco had a respectable booth this year. Many people were surprised to see our rack mount servers and asked: “What is Cisco doing here?” A few years ago it made more sense because Cisco had acquired TopSpin and was a top InfiniBand vendor. It wasn’t too much longer before Cisco shut down its InfiniBand business and went back to a converged network strategy for the data center that is based on Ethernet. So why would it be back at SC’12? We do sell servers now, so that’s different. But what would be compelling about Cisco UCS for an HPC solution?
It turns out a small team at Cisco, including the great Jeff Squyres, has been hard at work perfecting an ultra low latency solution for Ethernet. Jeff has been involved in MPI longer than he’d care to admit (even from way back when HPC clusters solely used gigabit Ethernet). He’s currently Cisco’s representative to the Open MPI project, which is what most people use today for HPC workloads. MPI (Message Passing Interface) is the message passing standard, implemented by libraries such as Open MPI, that applications use to run across distributed systems. I used to use these libraries a lot. Usually, when you compile an application like HPL, WRF, Pallas, etc., you point it to the MPI libraries you want to use. Then when the application runs it hooks into the APIs of those libraries and message passing happens. Yay!
What Jeff and team found was that due to the virtualization capabilities of the Cisco VICs (Virtual Interface Cards), they actually lend themselves quite well to offloading MPI tasks. Before we get into that, let’s make sure we understand the capabilities of the Cisco VIC.
Cisco VIC 1225
You can look at the specs here. It looks like a dual port 10GbE card. But what’s special about it is that it’s a CNA (converged network adapter) on steroids. The original intent was for virtualization. You can take this card and create 256 virtual interfaces. If this card were plugged into a Cisco Nexus switch, then the network administrator could potentially see 256 interfaces that he could apply policies to. What this means with VMware is that you can use hypervisor bypass (called DirectPath I/O) to get better performance and network visibility. The Virtual Machine gets what it thinks is its own physical link. There have been several versions of this card. In rack mount servers, the P81E was the precursor. In the blades, the M81KR is the old model while the VIC 1240 and the VIC 1280 are the current incarnations.
HPC on the VIC 1225
Now that you know what the VIC 1225 does in the VMware use case, you can think about it a different way. Instead, think of each of those VMs as just an application. After all, that’s all an Operating System is. It’s just an application that runs other applications. And when we run DirectPath I/O we’re basically just bypassing the local switch and giving the VM its own access to the I/O. From the hypervisor’s point of view, that particular lane is effectively just running in user space. Well, that’s exactly what we do with HPC.
The idea is that when an MPI application first launches and starts to use the VIC for message passing, it has to start up following the blue arrows in the diagram above. Once it hits the USNIC plugin to libibverbs, it initiates its “bootstrap” to set up the connection, or to use one of the virtualized NICs. To do this it has to go to the kernel level and use the standard TCP stack, which takes so much time and adds so much latency.
Once this bootstrapped configuration is complete, the plugin then sets up a queue pair with the NIC and follows the red line, bypassing the kernel and OS stack. This offloads CPU cycles and aids tremendously in cutting down the latency.
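To get a feel for what the kernel path costs, here is a minimal sketch (my own illustration, not Cisco’s or Open MPI’s benchmark) that times a 1-byte ping-pong over a local socket pair and reports half the round-trip time. Every send and receive here is a system call into the kernel, which is exactly the trip the red line avoids. The function names `echo_server` and `half_round_trip_us` are made up for this example, and the number you get depends entirely on your machine and on Python’s own overhead.

```python
import socket
import threading
import time


def echo_server(conn, rounds):
    # Echo each 1-byte message straight back to the sender.
    for _ in range(rounds):
        data = conn.recv(1)
        if not data:
            break
        conn.sendall(data)


def half_round_trip_us(rounds=10000):
    # socketpair() gives two connected sockets on one machine;
    # every send/recv below still goes through the kernel.
    a, b = socket.socketpair()
    worker = threading.Thread(target=echo_server, args=(b, rounds))
    worker.start()
    start = time.perf_counter()
    for _ in range(rounds):
        a.sendall(b"x")   # ping
        a.recv(1)         # pong
    elapsed = time.perf_counter() - start
    worker.join()
    a.close()
    b.close()
    # One round trip is two one-way hops, so halve the per-round time.
    return elapsed / rounds / 2 * 1e6


if __name__ == "__main__":
    print(f"half round trip: {half_round_trip_us():.2f} us")
```

This never leaves the host, so it says nothing about the wire or the NIC. It only shows the measurement pattern (send, wait for the echo, halve the round trip) and the per-message kernel overhead that a user-space queue pair is meant to remove.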
Jeff reports that they are able to get 1.7 microseconds when two of our C220 servers are connected back to back running a 19-byte packet. That packet is an L2 frame: 1 byte of payload and 18 bytes for the frame (6 bytes for source MAC, 6 bytes for destination MAC, 2 for Ethertype, 4 for CRC). 1.7 microseconds back to back is pretty impressive. To give a comparison, InfiniBand back-to-back connection latencies can range from 500-600 nanoseconds to around 1.29 microseconds.
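The 19-byte figure is just the field sizes added up. A quick sketch of that arithmetic, using only the sizes quoted above (the names `ETHERNET_FIELDS` and `frame_size` are mine, purely for illustration):

```python
# Field sizes in bytes, as listed in the paragraph above.
ETHERNET_FIELDS = {
    "source MAC": 6,
    "destination MAC": 6,
    "Ethertype": 2,
    "CRC": 4,
}


def frame_size(payload_bytes):
    # Total L2 frame size = fixed framing overhead + payload.
    return sum(ETHERNET_FIELDS.values()) + payload_bytes


overhead = sum(ETHERNET_FIELDS.values())
print(f"framing overhead: {overhead} bytes")
print(f"frame with 1-byte payload: {frame_size(1)} bytes")
```

In other words, the test message is almost all framing, which makes it a measure of per-message latency rather than bandwidth.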
When you couple this card with the new Cisco Nexus 3548 switch, which delivers about 190 nanoseconds port to port, the latencies we have observed come out to around 2.3-2.4 microseconds. YMMV. Here’s how it breaks down:
- With back-to-back Lexingtons (VIC 1225), half-round-trip (HRT) pingpong latency is 1.7us
- With N3548 algo boost, add 190ns latency
- Prototype Open MPI USNIC support adds 300-400ns (Jeff thinks we can reduce this number over time)
So the MPI HRT pingpong latency over N3548 is somewhere around 2.3us
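If you want to check the sum yourself, here is a small sketch that adds up the three components from the breakdown above. The constant and function names are mine, and the only inputs are the numbers already quoted; working in nanoseconds keeps it to integer math.

```python
# All figures in nanoseconds, taken from the breakdown above.
BACK_TO_BACK_NS = 1700        # VIC 1225 to VIC 1225, no switch
SWITCH_NS = 190               # Nexus 3548 port to port
MPI_OVERHEAD_NS = (300, 400)  # prototype Open MPI USNIC support (low, high)


def mpi_latency_range_ns():
    # Add the fixed components to the low and high MPI overhead estimates.
    base = BACK_TO_BACK_NS + SWITCH_NS
    return base + MPI_OVERHEAD_NS[0], base + MPI_OVERHEAD_NS[1]


low_ns, high_ns = mpi_latency_range_ns()
print(f"{low_ns / 1000:.2f} to {high_ns / 1000:.2f} us")
# prints "2.19 to 2.29 us"
```

The straight sum lands at roughly 2.2 to 2.3us, so the “around 2.3us” figure sits at the top of that range.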
For many HPC applications that’s compelling enough, especially since it runs over standard Ethernet.
The final benefit is the management capability that you get with UCS Manager. You can manage these rack mount environments the same way you manage blades. You can give them service profiles and you keep the settings the same using policies and pooled resources. It’s very easy. While UCS Manager works today with the rack mount servers, in order to use the USNIC capabilities with the ultra-low latency, you’ll have to wait until sometime in 2013.
The bottom line now is cost. How much will it cost for Cisco gear compared to using standard rack mount servers with InfiniBand? I’m willing to bet that Cisco in HPC is going to be price competitive, but the real savings are in operational costs: You can get nearly the same benefits of InfiniBand without using a separate network. That’s one less network to configure, manage, and maintain. That’s one less vendor to have to deal with.
If you’re looking for a simple-to-manage HPC departmental cluster, this may just be the solution you’ve been waiting for.