This is part 2 of my two-part series on monitoring UCS. Part 1 dealt with analyzing the data that UCS Manager already collects and displays for you. This part focuses on alerting. In particular, our objective is to raise a warning when bandwidth utilization goes above 80% and a critical alert when it goes above 90%.
Once again I will be following the slides presented by Steve McQuerry at Cisco Live (session ID BRKCOM-2004) in San Diego earlier this year. You can get them by visiting http://ciscolive365.com (login required).
First… some math
We will assume our links are simple 10GbE links. If we want to alert at 80% and 90%, then we are looking to monitor for bandwidth hitting 8Gbps and 9Gbps. Easy math, right? Unfortunately, UCS reports new bytes collected every 30 seconds, so we need to convert Gbps into bytes per 30 seconds and monitor for that number.
The math is still simple but the concept of converting units can be a little frustrating. Here is how we do it:
x Gbps * (30 seconds) * (1,000,000,000 bits / 1 Gb) * (1 byte / 8 bits) = x * 3,750,000,000 bytes
Or you could argue there are 1,073,741,824 bits per gigabit, in which case you would have:
x Gbps * (30 seconds) * (1,073,741,824 bits / 1 Gb) * (1 byte / 8 bits) = x * 4,026,531,840 bytes
I’ve seen it both ways and I’m not going to argue about it. To be consistent with the previous post, I’ll use 4,026,531,840 as my multiplier: multiply the expected Gbps by 4,026,531,840.
Here’s a table that converts the common speeds we’ll be interested in:

Gbps     Bytes per 30 seconds (x * 4,026,531,840)
7.5      30,198,988,800
8        32,212,254,720
8.5      34,225,520,640
9        36,238,786,560
10       40,265,318,400
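If you’d rather not do the multiplication by hand, here’s a minimal Python sketch (my own, not from the session slides) that reproduces the table:

# Convert a target rate in Gbps into the byte delta UCS reports per
# 30-second collection interval, using the 2^30 bits-per-gigabit
# convention (1,073,741,824) chosen above.
BITS_PER_GIGABIT = 1073741824
INTERVAL_SECONDS = 30

def gbps_to_bytes_per_interval(gbps):
    return int(gbps * INTERVAL_SECONDS * BITS_PER_GIGABIT / 8)

for rate in (7.5, 8, 8.5, 9, 10):
    print(f"{rate:>4} Gbps -> {gbps_to_bytes_per_interval(rate):>14,} bytes / 30 s")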
Now that we know what we are looking for, let’s create some alerts. There are three hotspots to consider in UCS: the bits leaving the server adapter, the FEX to Fabric Interconnect links, and the Fabric Interconnect to upstream switch links. Let’s start by looking at the server adapter.
Step 1: Create the Threshold Policies
From the LAN tab, filter by Policies and navigate to Threshold Policies.
Right-click Threshold Policies and select “Create Threshold Policy”. We’re going to create a new Threshold Policy and call it 10Gb-Policy.
Select ‘Next’ and add a Stat Class. We’re going to add Vnic Stats:
The next screen is for creating our definitions. We’re going to create two definitions: one for Rx Bytes Delta and one for Tx Bytes Delta. We’ll create a major event (when network bandwidth hits 90% of 10Gbps) and a minor event (when network bandwidth hits 80% of 10Gbps). We also need to put in a value for when the alarm will clear. We can use 85% for the major alarm and 75% for the minor alarm. This means that if network bandwidth hits 80%, we’ll trigger a warning, and that minor alarm won’t go away until network bandwidth drops to 75%. Similarly, if network bandwidth hits 90%, we’ll trigger an alert that won’t subside until utilization goes below 85%, or 8.5Gbps in this case.
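To make the trigger/clear behavior concrete, here’s a toy Python sketch of an alarm with separate rising and falling values (my illustration of the concept, not how UCSM implements it internally):

# Toy model of a threshold alarm with hysteresis: it raises at the
# "up" value and only clears once the stat falls back below "down".
class ThresholdAlarm:
    def __init__(self, name, up, down):
        self.name, self.up, self.down = name, up, down
        self.raised = False

    def update(self, delta):
        if not self.raised and delta >= self.up:
            self.raised = True
            print(f"{self.name} raised at {delta:,} bytes / 30 s")
        elif self.raised and delta < self.down:
            self.raised = False
            print(f"{self.name} cleared at {delta:,} bytes / 30 s")

# Minor alarm: up at 8 Gbps, down at 7.5 Gbps (values from the table above).
minor = ThresholdAlarm("minor", 32212254720, 30198988800)
for delta in (20_000_000_000, 33_000_000_000, 31_000_000_000, 29_000_000_000):
    minor.update(delta)

Note how 31,000,000,000 does not clear the alarm even though it is below the 8Gbps trigger; the alarm stays raised until the delta drops below the 7.5Gbps clear value.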
Using our table from above, we now fill in the blanks for the Tx Delta: the minor alarm goes up at 32,212,254,720 (8Gbps) and down at 30,198,988,800 (7.5Gbps), and the major alarm goes up at 36,238,786,560 (9Gbps) and down at 34,225,520,640 (8.5Gbps):
After saving this off, we also need to do the same for the Rx Delta. It should look identical to the Tx Delta, with the Property Type being the only difference. When we’re done we have a nice Threshold Policy:
Step 2: Associate the Threshold Policy to a vNIC Template
Since we use vNIC templates, we only need to modify the templates we are using to include our new 10Gb-Policy. If you don’t use templates, you’ll have to go modify every vNIC on every service profile.
From the LAN tab, filter by Policies, open the vNIC Templates, and select the vNIC template you used on your virtual machines. Change the Stats Threshold Policy to match the 10Gb-Policy we just created and save changes:
Do this for all vNIC templates. If you configured them as updating templates (hopefully), then you shouldn’t have to do anything else and they’ll all be monitored.
Step 3: Repeat for Uplinks
From the LAN tab, filter by LAN Cloud. Apply the same steps from Step 1 to the default threshold policy. You should have etherRxStats and etherTxStats when you are done. This will be applied to the uplinks, provided they are single links rather than port channels. To cover a port channel, simply click on the port channel and edit its policy there.
Step 4: Repeat for FEX connections
From the LAN tab, filter by Internal LAN and add to the default policy (you won’t be able to create a new policy here). Use the same values as in the previous step.
Good! That was a lot of typing. You are now ready to be alerted!
To see if this really works we used the iperf benchmark. (For Windows, you can use jperf.) In my lab I created two Red Hat Linux VMs named iperf1 and iperf2 and placed them on two different vSphere ESXi hosts, with an anti-affinity rule so they would not be migrated to the same host. The hosts were located at chassis 1 blade 1 and chassis 2 blade 1. We made the traffic leave the Fabric Interconnects by tying one VM to a vNIC on the A side and the other VM to a vNIC on the B side. This looks similar to the logical diagram below:
On iperf1 I ran:
[root@iperf1 ~]# iperf -s -f m
That is the server. Then on the other host I ran:
[root@iperf2 ~]# while iperf -c 192.168.50.151; do true; done
It wasn’t long before we saw errors going all the way up through the stack:
Looks like our alerting works!
In this post we showed how to get alerts when bandwidth gets too high. We used a constant of 4,026,531,840 to multiply by the number of gigabits per second we are interested in monitoring. We created threshold policies on the NICs, the FEXes, and the Fabric Interconnect uplinks. We then tested to confirm that errors were generated all the way through when the bandwidth got too high.
Hopefully this helps you get a better idea of what is happening inside your UCS. Now you can decide whether you really need all those uplinks or not. If not, then you can use those ports for other things.
I want to mention that we only focused on the Ethernet side of things here. The Fibre Channel network follows a very similar process. When troubleshooting suspected bandwidth issues, be sure to examine your Fibre Channel traffic as well.
Finally, I want to thank Steve McQuerry (the coolest last name any database guru could ever have) for helping me understand how UCS monitoring and alerting works. He’s written some great slides, given great presentations, and has some other things in the works.
Whenever we discuss monitoring systems, we usually need to start by understanding what we mean by monitoring. There are usually two related definitions. On one hand, monitoring means looking at data: gaining visibility into what is happening on the system and being able to analyze it. Monitoring also means alerting: let me know when something happens, so I can respond to the event in some way.
UCS can do both definitions of monitoring. And since monitoring has two parts, this blog will have two parts. In this part (part 1) we’ll examine how to look at UCS and understand what is happening in the system. The next post (part 2) will talk about how to be alerted.
Let’s examine the data by answering one of the most common questions we run across with UCS: how many connections do you need from the Fabric Extenders (aka FEX, aka IO Module, aka 2104/2204/2208) to the Fabric Interconnects? Mostly what I see is 2 to 4 connections per FEX to Fabric Interconnect. But it would be great if you could determine how much bandwidth is actually being used and scientifically decide whether you need more or fewer cables. It turns out you can, free of charge, with UCS Manager. Since we are trying to answer this question, we’ll be focusing on monitoring the network in UCS. Keep in mind, however, that you can also monitor the power consumption, temperature, and error statistics of many of the other components.
Answering this question takes a little math and a little bit of poking around. Steve McQuerry presented at Cisco Live (session ID BRKCOM-2004) in San Diego earlier this year. This post is based on some of his slides, which you can get at http://ciscolive365.com (free login required), but my math is daringly original, so please let me know if I’ve made errors.
Let’s first look at how UCS collects data. In UCS Manager, navigate to the Admin tab and filter by Stats Management. From here you will see the collection policies. By default, each collection policy has a collection interval of 1 minute and a reporting interval of 15 minutes.
So what does that actually mean?
Collection Interval: How often the data will be collected. We are encouraged to change the collection interval to 30 seconds to get more granular data. This means that every 30 seconds the device will be queried by the UCSM subprocess responsible for gathering statistics from the underlying NX-OS.
Reporting Interval: How often data will be stored in UCS Manager. While we set the collection interval to 30 seconds, the reporting interval is how often the data is stored in UCS Manager. So we might take our first sample at 9:11AM, the next at 9:26AM, and then every 15 minutes after that. UCS can only hold up to 5 of these records, which at a 15-minute reporting interval is barely over an hour of history. That alone should tell you that UCS is not good for long-term trend analysis; it is recommended that another monitoring solution be used for greater detail.
Cisco recommends that you change the collection interval to 30 seconds for the things you’re interested in. The reporting interval doesn’t really matter for what we’re doing here.
Examining FEX bandwidth
I have a first-generation IOM, so traffic is not trunked from blade to Fabric Interconnect; it follows a defined path based on the number of uplinks. (See this great post for information on how it’s connected internally: http://jeremywaldrop.wordpress.com/2010/06/30/cisco-ucs-ethernet-frame-flows/)
I have 2 chassis, each connected with 2 ports. Ports 1 & 2 connect to chassis 2 and ports 3 & 4 connect to chassis 1. (Yes, this is not good form, but hey, I inherited this lab, so that’s just the way it is and I haven’t bothered to fix it.) To see how your chassis are connected to the Fabric Interconnect, click on the Equipment tab, select the chassis, and then select Hybrid Display from the work pane.
That should tell you how the connections are made from FEX to Fabric Interconnect.
Now let’s look at one of the FEX uplinks. Navigate to the Equipment tab, filter by Fabric Interconnects, and look at the server ports connected to Fabric Interconnect A:
Select the first port and let’s look at the Statistics tab in the work pane:
To measure bandwidth, we are interested in the delta of Total Bytes received (Rx) and transmitted (Tx) on each of the FEX uplinks. This particular uplink shows Rx and Tx Total Bytes deltas of 837,101 and 691,921 bytes respectively.
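As an aside: if you’d rather pull these counters programmatically, Cisco now has a Python SDK for UCS Manager (ucsmsdk, which appeared well after this post was written) that can query the same statistics objects over the XML API. A rough sketch; the class and property names (EtherRxStats, total_bytes_delta) and the host/credentials are my assumptions, so verify them against your own system:

# Query the Ethernet Rx statistics objects from UCS Manager.
from ucsmsdk.ucshandle import UcsHandle

handle = UcsHandle("ucsm.example.com", "admin", "password")  # hypothetical host and credentials
handle.login()
for stats in handle.query_classid("EtherRxStats"):
    print(stats.dn, stats.total_bytes_delta)
handle.logout()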
We typically measure I/O in Gbps, Mbps, or Kbps, so we need to translate these numbers. This is where the math comes in. First, remember that our collection interval is 30 seconds; the number reported is x bytes per 30 seconds. To get bytes per second, divide by 30. From there, do the kind of unit conversion you may have learned in physics class. Here are the formulas for Gbps and Mbps:
Bytes to Gbps from a 30-second collection interval:
(x bytes / 30 seconds) * (8 bits / 1 byte) * (1 Gb / 1,073,741,824 bits)
= x * 0.000000000248 Gbps
** Note: You could argue that there are only 1,000,000,000 bits in a gigabit; go ahead and use that if it makes you more comfortable.
Bytes to Mbps from a 30-second collection interval (probably easier to work in Mbps):
(x bytes / 30 seconds) * (8 bits / 1 byte) * (1 Mb / 1,048,576 bits)
= x * 0.000000254 Mbps
Just looking at those formulas (or multipliers as they really are), there are some simple rules we can follow:
Rule 1: If the delta is not a 10-digit number or greater, then you are not even doing a gigabit per second on a 10 Gigabit link.
Rule 2: If the delta is not a 7-digit number or greater, then you are not even doing a megabit per second on a 10 Gigabit link.
Armed with this knowledge, we do our math:
Rx: 837,101 * 0.000000254 ≈ 0.213 Mbps = 213 kbps
Tx: 691,921 * 0.000000254 ≈ 0.176 Mbps = 176 kbps
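If you check links often, a couple of helper functions beat a calculator. Here’s a quick sketch using the binary multipliers derived above:

# Turn a 30-second Total Bytes delta from the statistics tab into Mbps or Gbps.
def delta_to_mbps(delta_bytes, interval=30):
    return delta_bytes / interval * 8 / 1048576

def delta_to_gbps(delta_bytes, interval=30):
    return delta_bytes / interval * 8 / 1073741824

print(f"Rx: {delta_to_mbps(837101):.3f} Mbps")  # ~0.213 Mbps
print(f"Tx: {delta_to_mbps(691921):.3f} Mbps")  # ~0.176 Mbps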
Not a lot going on in this link, is there?
After looking at the rest of the links on the system, they were all in the same six-figure range with one exception: one link (Fabric B, port 1) had an Rx delta of 13,082,674 and a Tx delta of 3,241,484, which works out to about 3.3 Mbps and 823 kbps.
Now, how can I find out which server is generating all that traffic? (Let’s just suppose, for pedagogical purposes, that 3.3 Mbps is a lot.)
Examining Server vNIC bandwidth
Since I have 2 cables per FEX, I know that Fabric B uplink 1 is connected to all the B-side uplinks on odd slots (remember this post?), and all the even slots are connected to the second uplink. So this has to be blade 1, 3, 5, or 7, and I just have to check which Service Profiles are in those slots. From the Equipment tab I determine that I have:
Slot 1: ESXi-1000v-02
Slot 3: Empty
Slot 5: CIAC-ESXi4.1-02
Slot 7: Empty
That leaves only two servers to check. On each server I have assigned a LAN connectivity policy, so I know which vNIC is going out the B side. From here it’s just a matter of finding the chatty one. Here’s how I found my Most Chatty Server Port (MCSP):
Since I’ve labeled them, it’s pretty obvious which ones go out the B side. Click on each vNIC and, from the work pane, select Statistics. We expand the statistics and see a familiar screen, but this time we look under vNIC stats:
After examining each of them, I can see that the chatty interface is my NFSB vNIC. It’s doing a lot of work, and it accounts for most of the change in the deltas. This is one of the reasons I recommend creating more than just the two default vNICs on UCS: you get to see in hardware what is happening. We found our most chatty server port and gained a lot of insight into what this idle system is doing.
If you did not find any chatty activity in the vNICs, it might be the Fibre Channel. Remember, we are doing FCoE from the adapter to the Fabric Interconnects, so try checking the counters there.
Examining UCS Uplink Bandwidth
To finish off this post, let’s look at the uplinks coming out of the Fabric Interconnect. This works differently depending on whether you have a port channel or standard uplinks. For a port channel, go to the LAN tab, select the port channel from the LAN Cloud, and look at the statistics there.
If you do not have a port channel configured, you can do it from the Equipment tab like we did before with the server ports (aka FI-to-FEX ports). From the Equipment tab, filter by Fabric Interconnects and select the uplink ports:
From here, look at the Rx and Tx Total Bytes deltas to get an idea of how things are changing. Pretty simple, right? Just look for deltas of 10 digits or more to find hot spots.
Part 1 Summary
The purpose of this post was to help you understand what total network traffic looks like inside your UCS environment. There are three spots to consider when analyzing traffic patterns: the server adapters, the FEX uplinks, and the Fabric Interconnect uplinks. Knowing how to read the statistics and make sense of them can help you quickly find hot spots. The basic rule is that any Total Bytes Rx or Tx delta with 10 or more digits is worth looking at; multiply it by 0.000000000248 to get the Gbps.
It is worth pointing out that you can also select the ‘Chart’ option under any of the statistics tools to see a trend. When dealing with Rx and Tx deltas, you’ll have to modify the range of the scale, otherwise it will seem that there is no data.
Lastly, for long-term analysis a different tool is needed. UCSM only gives you a brief snapshot, as there is not room to store it all in UCS Manager. Open source tools like Cacti, Nagios, Zenoss, and Zabbix can help here. SolarWinds is also a popular commercial product for performance tracking.
In my next post, I’ll talk about monitoring thresholds so that you can have UCS generate an alarm if network traffic gets too high.
Credits: Steve McQuerry, Craig Schaff, David Nguyen, and Dan Hanson. Thanks guys!
Create Device Aliases
device-alias database
device-alias name <alias> pwwn <pwwn>
exit
device-alias commit
end
Create a new zone
config
zone name <zonename> vsan <xxx>
member device-alias <alias>
exit
Create a new Zoneset
config
zoneset name <zonesetname> vsan <vsan>
member <zonename>
exit
zoneset activate name <zonesetname> vsan <vsan>
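Putting the three steps together, a complete session looks something like this (the switch name, alias esx01-hba1, the pwwn, zone esx01-to-array, zoneset fabric-a, and VSAN 100 are all made-up values for illustration):

MDS-A# config
MDS-A(config)# device-alias database
MDS-A(config-device-alias-db)# device-alias name esx01-hba1 pwwn 21:00:00:e0:8b:05:05:04
MDS-A(config-device-alias-db)# exit
MDS-A(config)# device-alias commit
MDS-A(config)# zone name esx01-to-array vsan 100
MDS-A(config-zone)# member device-alias esx01-hba1
MDS-A(config-zone)# exit
MDS-A(config)# zoneset name fabric-a vsan 100
MDS-A(config-zoneset)# member esx01-to-array
MDS-A(config-zoneset)# exit
MDS-A(config)# zoneset activate name fabric-a vsan 100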
The default timeout for failover of a UCS Fabric Interconnect is 5 seconds. Want to change that? Check this out.
If you fail over the primary Fabric Interconnect (which UCS Manager runs on), you’ll be logged out of UCS Manager. No worries: just wait 5 seconds and log back in, and you’ll be on the new primary.
When you fail them both over to test, make sure HA is back up and running before failing the second one over. Just log in via SSH:
connect local-mgmt
show cluster state

A: UP, PRIMARY
B: UP, SUBORDINATE

HA READY
This will tell you that the cluster is ready. At this point you should be able to unplug one of the Fabric Interconnects to test that failover works.
When they come back online, you may want to change which Fabric Interconnect is the primary. To do this, once again SSH into the Fabric Interconnect:
cluster lead a
Once, we didn’t let HA get ready and we had to run cluster force primary to make the subordinate (which hadn’t been synced yet) become the primary.
There’s a lot of collateral about why the Nexus 1000v would be a good thing to have in your virtual environment. When I talk to people about it one of my first questions is:
“Who manages the virtual networking environment in the data center?”
Most of the time it’s the virtual machine administrators. It’s usually not the networking team. Typically the network team stops at the physical access layer, and anything on the server is the responsibility of the server administrator (who is also the VM administrator).
If the shop is big enough, the second question I usually ask is:
“Would you like the networking team to manage the virtual networking environment?”
Most of the time this is greeted enthusiastically. After all, the networking team has to troubleshoot the VMware environment anyway. Why not just give them control of it? That’s one less problem the virtual administrative team has to deal with.
That’s one of the best benefits of the Nexus 1000v: those old lines of demarcation are back. And the cool thing? Network visibility is back, with a consistent command line.
Here’s a video I made to show this line of demarcation in action:
Unfortunately, I’m not very coherent in the video but I hope you get the idea. Also, sorry for the command line not being visible while I’m typing. Hopefully I’ll get better with this in time.
Layer 3 mode is the recommended way to configure VSM to VEM communication in the Nexus 1000v. Layer 3 mode keeps things simple and easier to troubleshoot.
I kept my design very simple. There’s one VLAN (509) that I run my ESXi hosts on. The IP addresses are 192.168.40.xxx. Just to give you an example:
ESXi Host1: 192.168.40.101
ESXi Host2: 192.168.40.102
Using this I had a simple uplink port-profile defined:
nexus1000v(config-port-prof)# show port-profile name uplink

port-profile uplink
 type: Ethernet
 description:
 status: enabled
 max-ports: 32
 min-ports: 1
 inherit:
 config attributes:
  switchport mode trunk
  switchport trunk allowed vlan 1,501,506,509-510,714,3967
  channel-group auto mode on mac-pinning
  no shutdown
 evaluated config attributes:
  switchport mode trunk
  switchport trunk allowed vlan 1,501,506,509-510,714,3967
  channel-group auto mode on mac-pinning
  no shutdown
 assigned interfaces:
  port-channel1
  port-channel2
  Ethernet3/1
  Ethernet3/2
  Ethernet4/1
  Ethernet4/2
 port-group: uplink
 system vlans: 1,501,506,509-510,714
 capability l3control: no
 capability iscsi-multipath: no
 capability vxlan: no
 capability l3-vn-service: no
 port-profile role: none
 port-binding: static
And a simple management port-profile:
nexus1000v(config-port-prof)# show port-profile name management

port-profile management
 type: Vethernet
 description:
 status: enabled
 max-ports: 32
 min-ports: 1
 inherit:
 config attributes:
  switchport mode access
  switchport access vlan 509
  no shutdown
 evaluated config attributes:
  switchport mode access
  switchport access vlan 509
  no shutdown
 assigned interfaces:
  Vethernet1
  Vethernet4
 port-group: management
 system vlans: 509
 capability l3control: yes
 capability iscsi-multipath: no
 capability vxlan: no
 capability l3-vn-service: no
 port-profile role: none
 port-binding: static
I had everything set up right… or so I thought. The only problem (before; not in the output above) was that I couldn’t see my VEMs! They were all hooked up in vCenter and I was even running traffic through them. But no VEMs:
nexus1000v(config)# show module vem
No Virtual Ethernet Modules found.
I finally stumbled upon this nice document and realized I hadn’t enabled l3control. Doing that:
nexus1000v(config-port-prof)# capability l3control
And Bam! Everything worked:
nexus1000v(config-port-prof)# show module vem
Mod  Ports  Module-Type                       Model              Status
---  -----  --------------------------------  -----------------  ------------
3    248    Virtual Ethernet Module           NA                 ok
4    248    Virtual Ethernet Module           NA                 ok

Mod  Sw                 Hw
---  -----------------  ------------------------------------------------
3    4.2(1)SV1(5.1a)    VMware ESXi 5.0.0 Releasebuild-469512 (3.0)
4    4.2(1)SV1(5.1a)    VMware ESXi 5.0.0 Releasebuild-469512 (3.0)

Mod  MAC-Address(es)                          Serial-Num
---  ---------------------------------------  ----------
3    02-00-0c-00-03-00 to 02-00-0c-00-03-80   NA
4    02-00-0c-00-04-00 to 02-00-0c-00-04-80   NA

Mod  Server-IP        Server-UUID                           Server-Name
---  ---------------  ------------------------------------  ---------------
3    192.168.40.101   00000000-0000-0000-cafe-00000000000f  192.168.40.101
4    192.168.40.102   00000000-0000-0000-cafe-00000000000e  192.168.40.102
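For anyone reproducing this, the lines that mattered in the management port-profile (pulled from the evaluated config shown earlier; VLAN 509 is specific to my lab) boil down to:

port-profile type vethernet management
  capability l3control
  switchport mode access
  switchport access vlan 509
  system vlan 509
  no shutdown
  state enabled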
The CCIE Data Center exam was announced in March of this year. The list of topics is quite comprehensive. I for one was stoked to see it announced as I wasn’t even thinking about doing a CCIE until this came up.
After some prodding from my teammates, I signed up for the beta written exam and took it today: 120 questions covering UCS, the Nexus 7000, 5000, and 1000v, and MDS. I don’t know my result because the exam is in beta and they won’t give out scores until after the beta period ends.
My overall feeling about the written exam in its current incarnation is that it is passable. The UCS material I know pretty well; the other topics could use some work. But having taken it (and after all, it’s only $50), I think I’m ready to get serious and go for the CCIE. I’m setting a timeline of Fall 2013 to have it passed. Guess we’ll see.
My day job is to be an advocate for Cisco UCS in my customers’ data centers. It’s a great gig; it’s much easier to back a product when you actually believe in it. I thought I’d write down some of the ideas that I talk about with my customers on this blog.
Rainbows In the Data Center
Rack mount servers are still all the rage in many organizations. And so is Gigabit Ethernet. VMware best practices suggest that you have separate networks for management, vMotion, I/O, and VM traffic. Using NIC Teaming you get something that looks like this beautiful picture:
(source: not sure; some Cisco person’s PowerPoint I stole)
You’ll notice that people color code these cables so they can tell which network goes to what. The result is a beautiful rainbow flowing out of each server. Then, Rainbow Brite and her sprite buddy Wink and her stallion Starlite can aid you in managing this big mess.
The account team at Cisco responsible for selling you switches loves this because for every cable you buy, you need to connect it to a switch port. This is why we tell everybody that going with UCS is a strategic decision. Do you want to continue investing in lots of Gigabit Ethernet Switches or consolidate with 10GbE? UCS gets rid of the rainbows. Yes, rainbows are pretty, but you don’t want them to get out of hand. Rainbows do strange things to people (especially if you have more than one). Sometimes they can be just too much, and too intense.
UCS gets you instant 10GbE as well as consolidation. I did a comparison for a customer who was looking to buy a “pod” of rack mount servers and compared it to the equivalent UCS. Each “pod” consisted of 44 2-socket servers, and each server required 6 Gigabit Ethernet ports as well as 2 HBAs to connect to the SAN. The comparison was pretty eye-opening. By going with a UCS strategy, the following benefits were realized:
- 10% reduction in acquisition cost (some of this had to do with not having to buy new network switches)
- 57% reduction in physical rack space
- A dramatic difference in Ethernet cables required: 28 compared to 308
- A dramatic difference in Fibre Channel ports required: 4 compared to 88
Think this can happen with legacy non-UCS blades? Not so much. There are still savings, but nothing as dramatic. So if you like dealing with more infrastructure, stringing cables, and configuring port policies on your network switches for every server in your environment, UCS may not be for you.
UCS is the new Catalyst 6000
I get the chance to walk into the belly of many data centers. One common feature that we see there is the venerable Cisco Catalyst 6000 switch. Cisco has been milking this baby since 1999.
What makes this thing so successful? Removable line cards and supervisor modules. As people have migrated from Fast Ethernet to Gigabit Ethernet to 10 Gigabit Ethernet, they’ve just upgraded the line cards or supervisor modules. They made the strategic decision years ago to go with this platform and it’s working out great.
UCS Fabric Interconnects have a similar value proposition in the compute space. You buy them as part of your strategy and then adding blades is just like adding remote line cards instead of fixed line cards in a 6509. Going with this strategy provides several benefits:
- A cost-effective way of getting servers online. These Cisco servers are extremely price competitive. Once you have the infrastructure, adding blades is so cost compelling it’s hard to see the rationale for not going with another blade when you have the chance. This isn’t to say you’re necessarily locked into Cisco; you still have options. It’s just that the other options are not as attractive any more.
- Let’s suppose that 5 years from now Cisco decides it wants to start selling some esoteric micro servers in a new chassis or something. (Full disclosure: I have no idea if they are planning this and have seen nothing on the roadmap). Let’s suppose that this new chassis has 100 slots for these servers. If you bought any other blade, you’d need to throw out the old architecture and buy the new chassis. With UCS, you just buy the new chassis and add these fun esoteric servers. The Fabric Interconnects and the Fabric Extenders on the back will still work the same way. In essence: You’ve future proofed your architecture.
So what if everything goes 100GbE? Fine: swap out the Fabric Interconnects just like you swap out supervisor modules on the 6500. The architecture is brilliant. What do you do with competitor solutions? Throw them out and start over. The Fabric Interconnect architecture is something you can keep building on as time goes on.
The Soul of a Server
In addition to the nice architecture of UCS, another compelling feature is how we manage UCS servers. The idea is a bit different from how you did things in the past. Back in the day, when you wanted to set up a new server, you would plug it in, hook it up to a crash cart, turn it on, press F2, and do cool things like:
- Tune BIOS settings
- Set boot order
- Program iSCSI interfaces
- Configure RAID
Then you might go through and do several updates. Of course you wrote it all down, right? It’s not hard for these things to get out of sync. And guess what causes problems in application performance? When the infrastructure you thought was homogeneous is not homogeneous. This is a pain and takes a lot more time than people readily admit.
With UCS we do it differently. Gone are the days of pressing F2; I’ve never pressed F2 while a machine was booting on UCS blades. Here’s the new way: there is a place in UCS Manager where all your wildest dreams and fantasies can come true. This is where we logically define what we want our servers to look like. We go through and say to ourselves: in my fantasy world, if I could have a server, I’d want it to PXE boot, then boot to hard drive. I’d want its BIOS settings to have hyperthreading enabled. I’d also like its firmware level to be 2.0(2), and I’d like its RAID set to RAID 1 mirroring.
That’s exactly how you do it. You define a template that has all the characteristics of the server you want and then spawn instances of it (called Service Profiles). You then take those spawns, or service profiles, and have them possess the hardware of the physical blades you assign them to. It’s like you create the soul of the server and then give that soul a body.
This is pretty cool, because now if you want to change something, you change it at the template and it can in turn update all its spawns. You can create multiple templates for different types of servers, each optimized for the application that the server supports. So you might have a service profile template for ESXi, a template for Oracle, a template for Windows bare metal, or a template for RHEV. You are in the business of managing server souls. It’s more noble.
These are just a few of the many benefits of UCS that I thought I’d write down. There are always situations where other products may be more applicable, but UCS is definitely one to check out. And just in case you missed it: UCS is the third best-selling x86 blade server worldwide, after HP (1) and IBM (2). In the US, UCS is ranked #2 behind HP. Not bad for a server that’s only been on the scene since 2009.
One of the cool things UCS allows you to do is create a place where users from different organizations can go to configure their own pools of resources. It’s a common goal for many organizations to reduce duplication while allowing agility and flexibility. The oft-discussed multi-tenant solution can actually become a reality with UCS in the form of Role Based Access Control (RBAC).
Let’s suppose that a local county has decided it wants to consolidate its IT infrastructure into its IT department, as opposed to every department having its own IT instance. It can start slowly with, say, one or two organizations, like the Department of Superior Courts and the Department of Executive Services.
Here’s how the main IT organization might configure RBAC for the Superior Courts and the Department of Executive Services.
1. Create suborganizations
Log in as admin and navigate to the Servers tab. From there you can expand the Service Profiles and see “root” and “Sub-Organizations”. Right-click on “root” and add an organization:
2. Create Locale
A locale in UCSM is designed to reflect a user’s place in an organization. By default, all users are at the ‘root’ locale, but since we are creating sub-organizations, we want each one to use its own resources and not modify resources that exist at the root level or in other organizations.
Navigate to the Admin tab in the navigation pane, filter by User Management, expand User Services, and right-click Locales.
Next, to assign the organization, we just expand the Organizations menu and drag the Superior_Court into the pane on the right.
3. Create a User for the Organization
Now let’s create a user called sc-admin that has all the rights in the Superior_Court locale but can’t change things in the root locale or any other locale.
On the navigation pane in the same place you were on the previous step, right click Locally Authenticated Users and select ‘Create User’.
The first fields are pretty self-explanatory. We created the user and password and left out some of the other information. The important part is that the locale is set to Superior_Court; this confines the powers of this user to Superior_Court. We can then select all the roles except the following:
- aaa: Authentication, Authorization, and Accounting. This can only be granted in the root locale.
- admin: This can only be granted in the root locale.
- operations: This can only be granted in the root locale.
Now then… What can sc-admin do?
If you now log in as sc-admin, you can see that he can create service profiles, pools, and policies, but only in his Superior_Court sub-organization. If sc-admin tries to create a resource in the root organization, he is blocked because all of the options are greyed out:
Here’s what else he can do:
- He can create sub-organizations within his own sub-organization.
- He can create VLANs in the LAN cloud and enable and disable network ports on the Fabric Interconnects (because he was given the network privilege; take it away if you don’t want this).
- He can create VSANs and enable and disable FC interfaces (take away the storage privilege if you don’t want him to do this).
An interesting scenario I ran across: if you remove a role from a user while that user is still logged in, it doesn’t seem to take effect until the next login. For example, I disabled sc-admin’s network role and he was still able to create VLANs and turn ports off and on. When I logged him out and back in again, the role acted as it should.
One of the disadvantages of disabling the network role is that sc-admin can’t create vNIC templates. This is something we might want to allow him to do in his own org. We can change this by creating a new role under User Management called Network_SP. For this role, we check only:
- Service Profile Network
- Service Profile Network-Policy
- Service Profile Qos
- Service Profile Qos Policy
Now sc-admin can create vNIC templates in his own sub-org, but he isn’t allowed to create external VLANs or disable/enable ports on the Fabric Interconnect. For this to take effect, have sc-admin log out and log back in after you apply the role.
You can do something very similar on the Storage tab to allow a sub-org to create and modify its own vHBA templates without being able to disable FC ports on the Fabric Interconnects.
Once this is in place, you can repeat the operation for the Department of Executive Services. As other departments join the consolidated data center, their users are simply added to locales and given roles.
I’ve been going a little app crazy to start the year, and I’m very pleased with the results. With the help of others, I’ve released updates to the two Cisco-based apps, UCS Tech Specs and FlexPod Tech Specs. And I’ve finally released the xCAT iOS client! Hurray!
I’ve been doing all this for the past several months in those precious moments between when the kids go to bed and when I drift off to sleep. Lucky for me, my wife has enough interesting projects going on in her life that she doesn’t miss me… too much! Don’t get me wrong: we still find time to go out and have a great time. And for those times when my day job also becomes my night job, you can see why it takes so long for many of these projects to get done. Whew!
There are also many other projects cooking. My coworker Tige Phillips at Cisco and I are slowly creating SiMU HD, an iPad version of SiMU Pro for managing UCS systems. Well, I should restate that: he’s doing most of the work and I’m lending a hand!
I’ve also thought about starting a little game development. How about a game for managing clusters? A game for managing UCS that gives you prizes for learning certain cool features? Ha! Yes, I have a lot of bad ideas! Hope you have a great February!