{"id":666,"date":"2012-12-17T23:10:44","date_gmt":"2012-12-18T05:10:44","guid":{"rendered":"http:\/\/benincosa.com\/blog\/?p=666"},"modified":"2014-11-19T11:24:33","modified_gmt":"2014-11-19T17:24:33","slug":"cisco-sc12","status":"publish","type":"post","link":"https:\/\/benincosa.com\/?p=666","title":{"rendered":"Cisco @ SC&#8217;12"},"content":{"rendered":"<div>originally posted on November 16th, 2012. \u00a0Restored from Backup<\/div>\n<p>I just got back from\u00a0<strong><a href=\"sc12.supercomputing.org\">SC\u201912<\/a><\/strong> in Salt Lake City and it was as usual a fantastic event.\u00a0\u00a0One of my favorite times of the year.\u00a0\u00a0Most of this has to do with the fact that I\u2019ve been going to this conference for almost 10 years so I\u2019ve met a lot of great people.\u00a0\u00a0Perhaps the saddest part about the conference was that I didn\u2019t get to see nearly half of them.\u00a0\u00a0I spent way too much time in the Hilton lounge on the 18th floor away from the maddening crowd doing my day job.\u00a0\u00a0Oh well, it was still fun!<\/p>\n<p>Cisco had a respectable booth this year.\u00a0\u00a0Many people were surprised to see our rack mount servers and asked:\u00a0\u00a0\u201dWhat is Cisco doing here?\u201d.\u00a0\u00a0A few years ago it made more sense because Cisco had acquired TopSpin and was a top InfiniBand vendor.\u00a0\u00a0It wasn\u2019t too much longer before Cisco shut down its InfiniBand business and went back to a converged network strategy for the data center that is based on Ethernet.\u00a0\u00a0So why would it be back at\u00a0<strong>SC\u201912<\/strong>? We do sell servers now, so that\u2019s different.\u00a0\u00a0But what would be compelling about Cisco UCS for an HPC solution?<\/p>\n<p>It turns out a small team at Cisco including the great\u00a0<a href=\"http:\/\/blogs.cisco.com\/author\/JeffSquyres\/\">Jeff Squyres<\/a> has been hard at work perfecting an ultra low latency solution for Ethernet.\u00a0\u00a0Jeff has been involved in MPI longer than he\u2019d care to admit (even from way back when HPC clusters solely used gigabit Ethernet).\u00a0\u00a0He\u2019s currently Cisco\u2019s representative to the Open MPI project, which is what most people use today for HPC workloads.\u00a0\u00a0<a href=\"http:\/\/en.wikipedia.org\/wiki\/Message_Passing_Interface\">MPI is a library<\/a> which applications use to run across distributed systems.\u00a0\u00a0I used to use these libraries a lot.\u00a0\u00a0Usually what you do is when you compile an application like HPL, WRF, Pallas, etc. 
<p>What Jeff and team found was that, thanks to the virtualization capabilities of the Cisco VICs (Virtual Interface Cards), these adapters actually lend themselves quite well to offloading MPI communication. Before we get into that, let\u2019s make sure we understand the capabilities of the Cisco VIC.<\/p>\n<p><strong>Cisco VIC 1225<\/strong><\/p>\n<p>You can look at the specs <a href=\"http:\/\/www.cisco.com\/en\/US\/prod\/collateral\/modules\/ps10277\/ps12571\/data_sheet_c78-708295.html\">here<\/a>. It looks like a dual-port 10GbE card, but what\u2019s special about it is that it\u2019s a CNA (converged network adapter) on steroids. The original intent was virtualization. You can take this card and create 256 virtual interfaces. If this card were plugged into a Cisco Nexus switch, the network administrator could potentially see 256 interfaces and apply policies to each of them. What this means with VMware is that you can use hypervisor bypass (called <a href=\"http:\/\/blogs.vmware.com\/performance\/2010\/12\/performance-and-use-cases-of-vmware-directpath-io-for-networking.html\">DirectPath I\/O<\/a>) to get better performance and network visibility. The virtual machine gets what it thinks is its own physical link. There have been several versions of this card. In rack mount servers, the P81E was the precursor. In the blades, the M81KR is the old model, while the VIC 1240 and the VIC 1280 are the current incarnations.<\/p>\n<p><strong>HPC on the VIC 1225<\/strong><\/p>\n<p>Now that you know what the VIC 1225 does in the VMware use case, you can think about it a different way. Instead, think of all those VMs as just applications. After all, that\u2019s all an operating system is: an application that runs other applications. And when we run DirectPath I\/O we\u2019re basically bypassing the hypervisor\u2019s virtual switch and giving the VM its own direct access to the I\/O. From the hypervisor\u2019s perspective, that particular I\/O lane is effectively running in user space. Well, that\u2019s exactly what we do with HPC.<\/p>\n<p><a href=\"http:\/\/benincosa.com\/blog\/wp-content\/uploads\/2012\/12\/Screen-Shot-2012-11-16-at-3.00.45-PM.png\"><img decoding=\"async\" class=\"aligncenter size-full wp-image-667\" title=\"Screen Shot 2012-11-16 at 3.00.45 PM\" src=\"http:\/\/benincosa.com\/blog\/wp-content\/uploads\/2012\/12\/Screen-Shot-2012-11-16-at-3.00.45-PM.png\" alt=\"\" width=\"500\" \/><\/a><\/p>\n<p>The idea is that when an MPI application first launches and starts to use the VIC for message passing, it has to start up following the blue arrows in the diagram above. Once it hits the USNIC plugin to libibverbs, it initiates its \u201cbootstrap\u201d to set up the connection on one of the virtualized NICs. To do this it has to drop down to the kernel and use the standard TCP stack, which takes time and is exactly what kills the latency.<\/p>\n<p>Once this bootstrap is complete, the plugin sets up a queue pair with the NIC and follows the red line, bypassing the kernel and the OS network stack. This frees up CPU cycles and cuts the latency down tremendously.<\/p>\n
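<p>For the curious, here\u2019s a rough sketch of what \u201csetting up a queue pair with the NIC\u201d looks like through the libibverbs API. This is my own illustration of the generic verbs pattern, not Cisco\u2019s actual USNIC plugin code; the point is that once the queue pair exists, sends and receives are posted to it straight from user space with no kernel involvement on the fast path.<\/p>\n<pre>\/* Generic libibverbs queue-pair setup (illustration only; error\n   handling and QP state transitions are omitted for brevity). *\/\n#include &lt;infiniband\/verbs.h&gt;\n#include &lt;stdio.h&gt;\n\nint main(void)\n{\n    int num;\n    struct ibv_device **devs = ibv_get_device_list(&amp;num);      \/* enumerate verbs-capable NICs *\/\n    if (!devs || num == 0) {\n        fprintf(stderr, \"no verbs devices found\\n\");\n        return 1;\n    }\n\n    struct ibv_context *ctx = ibv_open_device(devs[0]);         \/* open the NIC *\/\n    struct ibv_pd *pd = ibv_alloc_pd(ctx);                      \/* protection domain *\/\n    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);  \/* completion queue *\/\n\n    struct ibv_qp_init_attr attr = {\n        .send_cq = cq,\n        .recv_cq = cq,\n        .cap     = { .max_send_wr = 64, .max_recv_wr = 64,\n                     .max_send_sge = 1, .max_recv_sge = 1 },\n        .qp_type = IBV_QPT_UD,                                   \/* datagram-style queue pair *\/\n    };\n    struct ibv_qp *qp = ibv_create_qp(pd, &amp;attr);            \/* the queue pair itself *\/\n    printf(\"created QP number %u\\n\", qp ? qp-&gt;qp_num : 0);\n\n    \/* From here: register buffers with ibv_reg_mr(), post work requests with\n       ibv_post_send()\/ibv_post_recv(), and reap completions with ibv_poll_cq(),\n       all without a system call on the fast path. *\/\n\n    if (qp) ibv_destroy_qp(qp);\n    if (cq) ibv_destroy_cq(cq);\n    if (pd) ibv_dealloc_pd(pd);\n    if (ctx) ibv_close_device(ctx);\n    ibv_free_device_list(devs);\n    return 0;\n}<\/pre>\n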
<p>Jeff reports that they are able to get <strong>1.7 microseconds<\/strong> when two of our C220 servers are connected back to back, passing a 19-byte packet. That packet is a minimal L2 frame: 1 byte of payload and 18 bytes of framing (6 bytes for the source MAC, 6 bytes for the destination MAC, 2 for the Ethertype, 4 for the CRC). 1.7 microseconds back to back is pretty impressive. For comparison, InfiniBand back-to-back latencies range from about 500-600 nanoseconds to around <a href=\"http:\/\/en.wikipedia.org\/wiki\/InfiniBand\">1.29<\/a> microseconds.<\/p>\n<p>When you couple this card with the new <a href=\"http:\/\/www.cisco.com\/en\/US\/products\/ps12581\/index.html\">Cisco Nexus 3548 switch<\/a>, which delivers about 190 nanoseconds port to port, the latencies we have observed come out to around 2.3-2.4 microseconds. YMMV. Here\u2019s how it breaks down:<\/p>\n<p>&#8211; With back-to-back Lexingtons (VIC 1225), HRT pingpong latency is 1.7us<br \/>\n&#8211; With N3548 Algo Boost, add 190ns of latency<br \/>\n&#8211; Prototype Open MPI USNIC support adds 300-400ns (Jeff thinks we can reduce this number over time)<br \/>\nSo the MPI HRT pingpong latency over the N3548 is somewhere around 2.3us<\/p>\n<p>For many HPC applications, that\u2019s compelling enough, especially since it runs over standard Ethernet.<\/p>\n<p>The final benefit is the management capability that you get with UCS Manager. You can manage these rack mount environments the same way you manage blades. You can give them service profiles and keep their settings consistent using policies and pooled resources. It\u2019s very easy. While UCS Manager works with the rack mount servers today, to use the ultra-low-latency USNIC capabilities you\u2019ll have to wait until sometime in 2013.<\/p>\n<p>The bottom line now is cost. How much will Cisco gear cost compared to standard rack mount servers with InfiniBand? I\u2019m willing to bet that Cisco in HPC is going to be price competitive, but the real savings are in operational costs: you can get nearly the same benefits as InfiniBand without running a separate network. That\u2019s one less network to configure, manage, and maintain. That\u2019s one less vendor to have to deal with.<\/p>\n<p>If you\u2019re looking for a simple-to-manage HPC departmental cluster, this may just be the solution you\u2019ve been waiting for.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>originally posted on November 16th, 2012. 
\u00a0Restored from Backup I just got back from\u00a0SC\u201912 in Salt Lake City and it was as usual a fantastic event.\u00a0\u00a0One of my favorite times of the year.\u00a0\u00a0Most of this has to do with the fact that I\u2019ve been going to this conference for almost 10 years so I\u2019ve met&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[990,157,992],"tags":[159,160,161,158],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/posts\/666"}],"collection":[{"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/benincosa.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=666"}],"version-history":[{"count":2,"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/posts\/666\/revisions"}],"predecessor-version":[{"id":674,"href":"https:\/\/benincosa.com\/index.php?rest_route=\/wp\/v2\/posts\/666\/revisions\/674"}],"wp:attachment":[{"href":"https:\/\/benincosa.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=666"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/benincosa.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=666"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/benincosa.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=666"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}