
UCS Monitoring Part 2: Alerts for Bandwidth

This is part 2 in my two-part series on monitoring UCS.  Part one dealt with analyzing data and making sense of what UCS Manager already collects and displays for you.  This part focuses on alerting.  In particular, our objective is to raise a warning when bandwidth utilization goes above 80% and a critical alert when it goes above 90%.

Once again I will be following the slides Steve McQuerry presented at Cisco Live in San Diego earlier this year (session ID BRKCOM-2004).  You can get them too by visiting http://ciscolive365.com (login required).

First… some math

We will assume our links are simple 10GbE links.  If we alert at 80% and 90%, then we are looking to monitor when bandwidth hits 8Gbps and 9Gbps.  Easy math, right?  Unfortunately, UCS reports the new bytes collected every 30 seconds, so we need to convert Gbps into bytes per 30 seconds and monitor for that number.

The math is still simple but the concept of converting units can be a little frustrating.  Here is how we do it:

x Gbps * (30 seconds) * (1,000,000,000 bits / 1 Gb) * (1 byte / 8 bits) ≈ x * 3,750,000,000 bytes per 30 seconds

Or you could argue there are 1,073,741,824 (2^30) bits per gigabit, in which case you would have:

x Gbps * (30 seconds) * (1,073,741,824 bits / 1 Gb) * (1 byte / 8 bits) ≈ x * 4,026,531,840 bytes per 30 seconds

I’ve seen it both ways and I’m not going to argue for either one.  To be consistent with the previous post I’ll use 4,026,531,840 as my multiplier: just multiply the expected speed in Gbps by 4,026,531,840.

Here’s a table that converts the common speeds we’ll be interested in, using the 4,026,531,840 bytes-per-30-seconds multiplier:

Speed in Gbps    Bytes / 30 seconds
1                4,026,531,840
5                20,132,659,200
7.5              30,198,988,800
8                32,212,254,720
8.5              34,225,520,640
9                36,238,786,560
10               40,265,318,400
16               64,424,509,440
18               72,477,573,120
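
If you’d rather script this than punch a calculator, here is a quick Python sketch (nothing UCS-specific, just the multiplication above) that reproduces the table:

# Convert a speed in Gbps into the equivalent "bytes per 30-second delta"
# that UCS Manager reports, using 1,073,741,824 (2^30) bits per gigabit.
BITS_PER_GIGABIT = 1073741824   # swap in 1000000000 if you prefer decimal gigabits
INTERVAL_SECONDS = 30
BITS_PER_BYTE = 8

def gbps_to_bytes_per_interval(gbps):
    return round(gbps * INTERVAL_SECONDS * BITS_PER_GIGABIT / BITS_PER_BYTE)

for speed in (1, 5, 7.5, 8, 8.5, 9, 10, 16, 18):
    print(f"{speed:>4} Gbps -> {gbps_to_bytes_per_interval(speed):,} bytes / 30 seconds")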

Creating Alerts

Now that we know what we are looking for, let’s create some alerts.  There are three hotspots to consider in UCS: the server adapters, the FEX to Fabric Interconnect links, and the Fabric Interconnect to upstream switch links.  Let’s start by looking at the server adapter.

Step 1: Create the Threshold Policies

From the LAN tab, filter by Policies and navigate to Threshold Policies.

Right-click Threshold Policies and select “Create Threshold Policy”.  We’re going to create a new threshold policy and call it 10Gb-Policy.

Select ‘Next’ and add a Stat Class.  We’re going to add Vnic Stats:

 

The next screen is for creating our definitions.  We’re going to create two definitions: one for Rx Bytes Delta and one for Tx Bytes Delta.  We’ll create a major event (when network bandwidth hits 90% of 10Gbps) and a minor event (when network bandwidth hits 80% of 10Gbps).  We also need to enter a value for when each alarm will clear: we’ll use 85% for the major alarm and 75% for the minor alarm.  This means that if network bandwidth hits 80%, we’ll trigger a warning, and that minor alarm won’t go away until bandwidth drops back below 75%.  Similarly, if network bandwidth hits 90%, we’ll trigger a critical alert that won’t subside until utilization goes below 85%, or 8.5Gbps in this case.
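
To make those four numbers concrete for a 10GbE link, here is a throwaway Python sketch (just the percentages above times our multiplier; the labels are mine, not UCSM field names) that prints the up/down values we’ll be typing in:

# Up/down trigger values for a 10GbE link, as bytes per 30-second delta.
# 4,026,531,840 is the per-Gbps multiplier worked out earlier in the post.
BYTES_PER_GBPS = 4026531840
LINK_SPEED_GBPS = 10

levels = [
    ("minor - up (80%)",   0.80),
    ("minor - down (75%)", 0.75),
    ("major - up (90%)",   0.90),
    ("major - down (85%)", 0.85),
]

for label, fraction in levels:
    print(f"{label}: {round(LINK_SPEED_GBPS * fraction * BYTES_PER_GBPS):,}")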

Using our table from above we now fill in the blanks for the Tx Delta:

 

We also need to do this for the Rx Delta after saving this off.  This should look identical to the Tx Delta with the Property Type being the only difference.  When we’re done we have a nice Threshold Policy:

Step 2:  Associate the Threshold Policy to a vNIC Template

Since we use templates for LAN connectivity, we only need to modify the vNIC templates we are using to include our new 10Gb-Policy.  If you don’t use templates, you’ll have to modify every vNIC on every service profile.

From the LAN tab, filter by Policies, open vNIC Templates, and select the vNIC template you used for your virtual machines.  Change the Stats Threshold Policy to the 10Gb-Policy we just created and save changes:

Do this for all vNIC templates.  If you configured them as updating templates (hopefully), then you shouldn’t have to do anything else and all of the vNICs will be monitored.

Step 3:  Repeat for Uplinks

From the LAN tab, filter by LAN Cloud and add the same stat class definitions to the default threshold policy that you created in step 1.  You should have etherRxStats and etherTxStats when you are done.  This policy applies to individual uplinks only, not to port channels; to cover a port channel, simply click on the port channel and edit its threshold policy there.

Step 4:  Repeat for FEX connections

From the LAN tab, filter by Internal LAN and add the same definitions to the default threshold policy (you won’t be able to create a new policy here).  Use the same values as in the previous step.

Good!  That was a lot of typing.  You are now ready to be alerted!

Testing Alerts

To see if this really works we used the iperf benchmark (on Windows you can use jperf).  In my lab I created two Red Hat Linux VMs named iperf1 and iperf2 and placed them on two different vSphere ESXi hosts, with an anti-affinity rule so that they would not be migrated to the same host.  The hosts were located at chassis 1 blade 1 and chassis 2 blade 1.  We forced the traffic to leave the Fabric Interconnects by tying one VM to a vNIC on the A side and the other VM to a vNIC on the B side.  This looks similar to the logical diagram below:

On iperf1 I ran:

[root@iperf1 ~]# iperf -s -f m

That is the server.  Then on the other host I ran:

[root@iperf2 ~]# while iperf -c 192.168.50.151; do true; done

It wasn’t long before we saw errors going all the way up through the stack:

Looks like our alerting works!

Conclusion

In this post we showed how to get alerts when bandwidth gets too high.  We used a constant of 4,026,531,840 to convert the Gigabits per second we are interested in monitoring into a bytes-per-30-seconds threshold.  We created threshold policies on the vNICs, the FEX uplinks, and the Fabric Interconnect uplinks.  We then tested to confirm that alerts were generated all the way through when the bandwidth got too high.

Hopefully this helps you get a better idea of what is happening inside your UCS.  Now you can decide whether you really need all those uplinks or not.  If not, then you can use those ports for other things.

I want to mention here that we only focused on the Ethernet side of things.  The Fibre Channel network follows a very similar process.  When troubleshooting suspected bandwidth issues, be sure to examine your Fibre Channel traffic as well.

Finally, I want to thank Steve McQuerry (the coolest last name any database guru could ever have) for helping me understand how UCS monitoring and alerting works.  He’s written some great slides, given great presentations, and has some other things in the works.

UCS Monitoring Part 1: Collecting and Analyzing UCSM Data

Whenever we discuss monitoring systems, we usually need to start by understanding what we mean by monitoring.  Usually it has two related definitions: on one hand, monitoring means looking at data, gaining visibility into what is happening on the system, and being able to analyze it.  Monitoring also means alerting: let me know when something happens so I can respond to the event in some way.

UCS can do both kinds of monitoring.  And since monitoring has two parts, this blog will have two parts.  In this part (part 1) we’ll examine how to look at UCS and understand what is happening in the system.  The next post (part 2) will talk about how to be alerted.

Let’s examine the data by answering one of the most common questions we run across with UCS: how many connections do you need from the Fabric Extenders (aka FEX, aka IO Module, aka 2104/2204/2208) to the Fabric Interconnects?  Mostly what I see is 2 to 4 connections from each FEX to its Fabric Interconnect.  But it would be great if you could determine how much bandwidth is actually being used and scientifically decide whether you need more or fewer cables.  It turns out you can, free of charge, with UCS Manager.  Since we are trying to answer this question, we’ll be focusing on monitoring the network in UCS.  Keep in mind, however, that you can also monitor the power consumption, temperature, and error statistics of many of the other components.

Answering it takes a little math and a little bit of poking around.  Steve McQuerry presented session BRKCOM-2004 at Cisco Live in San Diego earlier this year.  My blog is based off some of his slides, which you can get at http://ciscolive365.com (free login required), but my math is daringly original, so please let me know if I’ve made errors.

Let’s first look and see how UCS collects data.  In UCS Manager, navigate to the Admin tab, then filter by Stats Management.  From here you will see the collection policies.  By default each collection policy has a collection interval of 1 minute and a reporting interval of 15 minutes.

So what does that actually mean?

Collection Interval:  How often the data will be collected.  We are encouraged to change the collection interval to 30 seconds to get more granular data.  This means that every 30 seconds the device will be queried by the UCSM subprocess responsible for gathering statistics from the underlying NX-OS.

Reporting Interval: How often collected data is stored in UCS Manager.  With a 30-second collection interval and the default 15-minute reporting interval, we might store our first record at 9:11 AM, the next at 9:26, and then every 15 minutes after that.  UCS can only hold up to 5 of these records.  That alone should tell you that UCS is not good for long-term trend analysis; another monitoring solution is recommended if you need more history or detail.

Cisco recommends that you change the collection interval to 30 seconds for the things you’re interested in.  The reporting interval doesn’t really matter for what we’re doing here.

Examining FEX bandwidth

I have a first-generation IOM, so traffic is not spread across the uplinks from blade to Fabric Interconnect; it follows a defined (statically pinned) path based on the number of uplinks.  (See this great post for information on how it’s connected internally: http://jeremywaldrop.wordpress.com/2010/06/30/cisco-ucs-ethernet-frame-flows/)

I have 2 chassis, each connected with 2 ports.  Ports 1 & 2 connect to chassis 2 and ports 3 & 4 connect to chassis 1.  (Yes, this is not good form, but hey, I inherited this lab, so that’s just the way it is and I haven’t bothered to fix it.)  To see how your chassis are connected to the Fabric Interconnect, click on the Equipment tab, select the chassis, and then select Hybrid Display from the work pane.

That should tell you how the connections are made from FEX to Fabric Interconnect.

Now let’s look at one of the FEX uplinks.  Navigate to the Equipment tab, filter by Fabric Interconnects, and look at the server ports that are connected to Fabric Interconnect A:

Select the first port and let’s look at the Statistics tab in the work pane:

To measure bandwidth, we are interested in the delta of total bytes received (Rx) and transmitted (Tx) on each of the FEX uplinks.  This particular uplink shows deltas of 837,101 bytes received and 691,921 bytes transmitted.

We typically measure I/O in Gbps, Mbps, or Kbps, so we need to translate these numbers.  This is where the math comes in.  First, remember that our collection interval is 30 seconds, which means the number reported is x bytes per 30 seconds.  To get bytes per second, just divide that number by 30.  From there, do the kind of unit-conversion multiplication you may have learned in your physics class.  Here are the formulas for Gbps and Mbps:

Bytes to Gbps from a 30-second collection period:

(x bytes / 30 seconds) * (8 bits / 1 byte) * (1 Gb / 1,073,741,824 bits) ≈ x * 0.000000000248 Gbps

** Note:  You could argue that there are only 1,000,000,000 bits in a Gigabit; go ahead and use that if it makes you more comfortable.

Bytes to Mbps from a 30-second collection period (it is probably easier to work in Mbps):

(x bytes / 30 seconds) * (8 bits / 1 byte) * (1 Mb / 1,048,576 bits) ≈ x * 0.000000254 Mbps

Just looking at those formulas (or multipliers as they really are), there are some simple rules we can follow:

Rule 1:  If the delta is not at least a 10-digit number, you are not even doing a Gigabit per second on a 10 Gigabit link.

Rule 2:  If the delta is not at least a 7-digit number, you are not even doing a Megabit per second on a 10 Gigabit link.

Armed with this knowledge, we do our math:

Rx: 837,101 * 0.000000254 = .212 Mbps = 212 kbps

Tx: 691,921 * 0.000000254 = .1757 Mbps = 175.7 kbps
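
If you would rather script the conversion than eyeball multipliers, here is a small Python sketch of the same arithmetic (nothing pulled from UCS; the two deltas are typed in by hand from the numbers above):

# Turn a 30-second Total Bytes delta from UCSM into Mbps or Gbps,
# using binary megabits/gigabits to match the formulas above.
INTERVAL_SECONDS = 30
BITS_PER_BYTE = 8

def delta_to_mbps(byte_delta):
    return byte_delta * BITS_PER_BYTE / INTERVAL_SECONDS / 1048576

def delta_to_gbps(byte_delta):
    # Rule 1 restated: a delta needs 10 or more digits before this even
    # approaches 1 Gbps (the 1 Gbps delta is 4,026,531,840 bytes).
    return byte_delta * BITS_PER_BYTE / INTERVAL_SECONDS / 1073741824

print(f"Rx: {delta_to_mbps(837101):.3f} Mbps")               # roughly 0.21 Mbps
print(f"Tx: {delta_to_mbps(691921):.3f} Mbps")               # roughly 0.18 Mbps
print(f"1 Gbps check: {delta_to_gbps(4026531840):.3f} Gbps")  # prints 1.000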

Not a lot going on in this link is there?

After looking at the rest of the links on the system, I found they were all in the same 6-figure range with one exception: one link (Fabric B, port 1) had an Rx delta of 13,082,674 and a Tx delta of 3,241,484, which works out to roughly 3.3 Mbps and 823 kbps.

Now, how can I find out which server is generating all that traffic?  (Let’s just suppose that 3.3 Mbps is a lot, for pedagogical purposes.)

Examining Server vNIC bandwidth

Since I have 2 cables per FEX, I know that Fabric B uplink 1 carries the B-side traffic of the blades in the odd-numbered slots, while the even-numbered slots are pinned to the second uplink (remember this post?).  A quick sketch of that pinning arithmetic follows the slot list below.  So the chatty blade has to be either blade 1, 3, 5, or 7, and all I have to do is check which Service Profiles are in those slots.  From the Equipment tab I determine that I have:

Slot 1: ESXi-1000v-02 -> Slot 1

Slot 3: Empty

Slot 5: CIAC-ESXi4.1-02

Slot 7: Empty
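
Here is a tiny Python sketch of the pinning rule described above; the formula is simply my reading of the static pinning behavior for a two-uplink Gen 1 IOM, not something queried from UCS:

# Static pinning on a Gen 1 IOM (2104) with 2 uplinks: blades in odd slots
# pin to IOM uplink 1, blades in even slots pin to IOM uplink 2.
def pinned_uplink(slot, uplinks=2):
    return (slot - 1) % uplinks + 1

for slot in range(1, 9):
    print(f"Slot {slot} -> IOM uplink {pinned_uplink(slot)}")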

I only have to check two servers.  Each server has an assigned LAN connectivity policy, so I know which vNIC goes out the B side.  From here it’s just a matter of finding the chatty one.  Here’s how I found my Most Chatty Server Port (MCSP):

From the Servers tab, navigate to the service profile of the machine.  I have 6 vNICs in each one:

Since I’ve labeled them, it’s pretty obvious which ones go out the B side.  Click on each vNIC and, from the work pane, select Statistics.  Expand the statistics and you’ll see a familiar screen, but this time we look under vNIC stats:

 

After examining each of them I can see that the chatty interface is my NFSB vNIC.  It’s doing a lot of work and accounts for most of the change in the deltas.  This is one of the reasons I recommend creating more than just the two default vNICs on UCS: you get to see in hardware what each kind of traffic is doing.  We found our most chatty server port and gained a lot of insight into what this otherwise idle system is doing.

If you did not find any chatty activity on the vNICs, it might be the Fibre Channel traffic.  Remember, we are doing FCoE from the adapter to the Fabric Interconnects, so try checking those counters as well.

Examining UCS Uplink Bandwidth

To finish off this post, let’s look at the uplinks coming out of the Fabric Interconnect.  This works differently depending on whether you have a port channel or standard uplinks.  For a port channel, go to the LAN tab, select the port channel from the LAN Cloud, and look at the statistics there.

If you do not have a port channel configured, you can do it from the Equipment tab like we did before with the server ports (aka the FI-to-FEX ports).  From the Equipment tab, filter by Fabric Interconnects and select the uplink ports:

From here, look at the Rx and Tx Total Bytes deltas to get an idea of how things are changing.  Pretty simple, right?  Just look for deltas of 10 or more digits to find hot spots.

Part 1 Summary

The purpose of this post was to help you understand what total network traffic looks like inside your UCS environment.  There are three spots to consider when examining traffic patterns: the server adapters, the FEX uplinks, and the Fabric Interconnect uplinks.  Knowing how to read the statistics and make sense of them can help you quickly find hot spots.  The basic rule is that any delta in Total Bytes Rx or Tx that has 10 or more digits is worth looking at; multiply it by 0.000000000248 to get the Gbps.

It is worth pointing out that you can also select the ‘Chart’ option in any of the statistics views to see a trend.  When dealing with Rx and Tx deltas, you’ll have to modify the range of the scale, otherwise it will seem that there is no data.

Lastly, for long-term analysis a different tool is needed; UCSM only gives you a brief snapshot, as there is not room to store it all in UCS Manager.  Open source tools like Cacti, Nagios, Zenoss, and Zabbix can help with this, and SolarWinds is a popular commercial product for performance tracking.

In my next post, I’ll talk about monitoring thresholds so that you can have UCS generate an alarm if network traffic gets too high.

Credits:  Steve McQuerry, Craig Schaff, David Nguyen, and Dan Hanson.  Thanks guys!