UCS Monitoring Part 2: Alerts for Bandwidth

This is part 2 in my two-part series on monitoring UCS.  Part one dealt with analyzing data and making sense of what UCS Manager already collects and displays for you.  This part focuses on alerting.  In particular, our objective is to generate a warning when bandwidth utilization goes above 80% and a critical alert when it goes above 90%.

Once again I will be following the slides presented by Steve McQuerry at Cisco Live in San Diego earlier this year (session ID BRKCOM-2004).  You can get those too by visiting http://ciscolive365.com (login required).

First… some math

We will assume our links are simple 10GbE links.  If we want to catch 80% and 90% utilization, we are looking to alert when bandwidth hits 8Gbps and 9Gbps.  Easy math, right?  Unfortunately, UCS reports the number of new bytes collected every 30 seconds, so we need to convert Gbps into bytes per 30 seconds and monitor for that number.

The math is still simple but the concept of converting units can be a little frustrating.  Here is how we do it:

x Gbps * (30 seconds) * (1,000,000,000 bits / 1 Gb) * (1 byte / 8 bits) = x * 3,750,000,000 bytes per 30 seconds

Or you could argue there are 1,073,741,824 (2^30) bits per gigabit, in which case you would have:

x Gbps * (30 seconds) * (1,073,741,824 bits / 1 Gb) * (1 byte / 8 bits) = x * 4,026,531,840 bytes per 30 seconds

I’ve seen it both ways and I’m not going to argue for either one.  To be consistent with the previous post I’ll use 4,026,531,840 as my multiplier: multiply the expected Gbps by 4,026,531,840 to get bytes per 30 seconds.

Here’s a table that takes the common speeds we’ll be interested in and converts them:

Bytes/30 seconds multiplier: 4,026,531,840

Speed in Gbps    Bytes / 30 seconds
1                4,026,531,840
5                20,132,659,200
7.5              30,198,988,800
8                32,212,254,720
8.5              34,225,520,640
9                36,238,786,560
10               40,265,318,400
16               64,424,509,440
18               72,477,573,120
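If you want to double-check the table or compute the value for a different link speed, a few lines of Python will do it.  This is just a throwaway helper of my own, not anything UCS provides:

    # Convert a link speed in Gbps to the "bytes per 30-second sample"
    # value that UCS compares against.
    BITS_PER_GIGABIT = 1_073_741_824   # 2**30, the multiplier used in this series
    INTERVAL_SECONDS = 30
    BITS_PER_BYTE = 8

    def bytes_per_interval(gbps: float) -> int:
        """Bytes seen in one 30-second sample at a sustained rate of gbps."""
        return int(gbps * INTERVAL_SECONDS * BITS_PER_GIGABIT / BITS_PER_BYTE)

    if __name__ == "__main__":
        for speed in (1, 5, 7.5, 8, 8.5, 9, 10, 16, 18):
            print(f"{speed:>5} Gbps -> {bytes_per_interval(speed):,} bytes / 30 s")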

Creating Alerts

Now that we know what we are looking for, let’s create some alerts.  There are three hotspots to consider in UCS: the traffic leaving the server adapter, the FEX to Fabric Interconnect links, and the Fabric Interconnect to upstream switch links.  Let’s start by looking at the server adapter.

Step 1: Create the Threshold Policies

From the LAN tab, filter by Policies and navigate to Threshold Policies.

Right-click Threshold Policies and select “Create Threshold Policy”.  We’re going to create a new Threshold Policy and call it 10Gb-Policy.

Select ‘Next’ and add a Stat Class.  We’re going to add Vnic Stats:


The next screen is for creating our definitions.  We’re going to create two definitions: one for Rx Bytes Delta and one for Tx Bytes Delta.  We’ll create a major event (when network bandwidth hits 90% of 10Gbps) and a minor event (when network bandwidth hits 80% of 10Gbps).  We also need to put in a value for when each alarm will clear; we can use 85% for the major alarm and 75% for the minor alarm.  This means that if network bandwidth hits 80%, we’ll trigger a warning, and that minor alarm won’t go away until network bandwidth drops back to 75%.  Similarly, if network bandwidth hits 90% we’ll trigger a critical alert, and it won’t subside until network bandwidth utilization goes below 85%, or 8.5Gbps in this case.
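To make that set/clear behavior concrete, here is a small sketch of how such a threshold pair reacts to successive 30-second samples.  This is just my own illustration of the concept, not anything UCS actually runs; the two levels are the 80%/75% numbers for a 10GbE link:

    # Minor alarm: set at 8 Gbps, clear at 7.5 Gbps (bytes per 30-second sample)
    SET_LEVEL = 32_212_254_720
    CLEAR_LEVEL = 30_198_988_800

    def next_state(alarm_raised: bool, delta_bytes: int) -> bool:
        """Return whether the alarm is raised after this 30-second sample."""
        if not alarm_raised and delta_bytes >= SET_LEVEL:
            return True             # crossed the "up" threshold: raise the alarm
        if alarm_raised and delta_bytes <= CLEAR_LEVEL:
            return False            # dropped to the "down" threshold: clear it
        return alarm_raised         # otherwise nothing changes

    # Samples at roughly 7, 8.2, 7.8, and 7.4 Gbps: the alarm raises on the
    # 8.2 Gbps sample, stays raised at 7.8 Gbps, and clears at 7.4 Gbps.
    state = False
    for sample in (28_185_722_880, 33_017_561_088, 31_406_948_352, 29_796_335_616):
        state = next_state(state, sample)
        print(f"{sample:,} bytes/30s -> alarm {'raised' if state else 'clear'}")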

Using our table from above we now fill in the blanks for the Tx Delta:
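For our 10GbE policy that means the minor alarm is set at 32,212,254,720 (8 Gbps) and clears at 30,198,988,800 (7.5 Gbps), while the major alarm is set at 36,238,786,560 (9 Gbps) and clears at 34,225,520,640 (8.5 Gbps).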


After saving this off, we also need to do the same for the Rx Delta.  It should look identical to the Tx Delta, with the Property Type being the only difference.  When we’re done we have a nice Threshold Policy:

Step 2:  Associate the Threshold Policy to a vNIC Template

Since we use vNIC templates, we only need to modify the templates we’re actually using to include our new 10Gb-Policy.  If you don’t use templates, you’ll have to go modify every vNIC on every service profile.

From the LAN tab, filter by Policies, open vNIC Templates, and select the vNIC template you used for your virtual machines.  Change the Stats Threshold Policy to the 10Gb-Policy we just created and save your changes:

Do this for all vNIC templates.  If you configured them as updating templates (hopefully you did), then you shouldn’t have to do anything else and they’ll all be monitored.

Step 3:  Repeat for Uplinks

From the LAN tab, filter by LAN Cloud and add the same definitions you created in Step 1 to the default threshold policy.  You should have etherRxStats and etherTxStats when you are done.  This covers the uplinks, but only individual links, not port channels.  To cover a port channel, simply click on the port channel and edit its threshold policy there.

Step 4:  Repeat for FEX connections

From the LAN tab, filter by Internal LAN and add the same definitions to the default policy (you won’t be able to create a new policy here).  Use the same values as in the previous step.

Good!  That was a lot of typing.  You are now ready to be alerted!

Testing Alerts

To see if this really works we used the iperf benchmark.  (For the Windows operating system you can use jperf.)  In my lab I created two Red Hat Linux VMs named iperf1 and iperf2 and loaded them onto two different vSphere ESXi hosts, with an anti-affinity rule so that they would not be migrated to the same host.  The hosts were located at chassis 1 blade 1 and chassis 2 blade 1.  We made the traffic leave the Fabric Interconnects by tying one VM to a vNIC on the A side and the other VM to a vNIC on the B side.  This looks similar to the logical diagram below:

On iperf1 I ran iperf as the server, and then on the other host I ran the client pointed back at iperf1; both commands are sketched below.

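A minimal pair of commands would look something like this, assuming classic iperf2; the parallel stream count and duration are just illustrative values picked to push a 10GbE link hard:

    # on iperf1 (the server side)
    iperf -s

    # on iperf2 (the client side): 8 parallel streams for 5 minutes
    iperf -c iperf1 -P 8 -t 300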
It wasn’t long before we saw errors going all the way up through the stack:

Looks like our alerting works!

Conclusion

In this post we showed how to get alerts when bandwidth gets too high.  We multiplied the Gigabits per second we are interested in monitoring by a constant of 4,026,531,840 to get bytes per 30-second interval.  We created threshold policies on the vNICs, the FEX connections, and the Fabric Interconnect uplinks.  We then tested to see that errors were generated all the way through when the bandwidth got too high.

Hopefully this helps you get a better idea of what is happening inside your UCS.  Now you can decide whether you really need all those uplinks or not.  If not, then you can use those ports for other things.

I want to mention here that we only focused on the Ethernet side of things.  The Fibre Channel network follows a very similar process.  When troubleshooting suspected bandwidth issues, be sure to examine your Fibre Channel traffic as well.

Finally, I want to thank Steve McQuerry (the coolest last name any database guru could ever have) for helping me understand how UCS monitoring and alerting works.  He’s written some great slides, given great presentations, and has some other things in the works.