UCS Monitoring Part 1: Collecting and Analyzing UCSM Data

Whenever we discuss monitoring systems, we need to start by agreeing on what we mean by monitoring. Usually it has two related definitions. On one hand, monitoring means looking at data: gaining visibility into what is happening on the system and being able to analyze it. On the other hand, monitoring means alerting: let me know when something happens, so that I can respond to the event in some way.

UCS can do both kinds of monitoring. And since monitoring has two parts, this blog will have two parts. In this part (part 1) we’ll examine how to look at UCS and understand what is happening in the system. The next post (part 2) will talk about how to be alerted.

Let’s examine the data by answering one of the most common questions we run across with UCS: How many connections do you need from the Fabric Extenders (aka FEX, aka IO Module, aka 2104/2204/2208) to the Fabric Interconnects? Mostly what I see is 2 to 4 connections per FEX to Fabric Interconnect. But it would be great if you could measure how much bandwidth is actually being used and decide scientifically whether you need more or fewer cables. It turns out you can, free of charge, with UCS Manager. Since we are trying to answer this question, we’ll be focusing on monitoring the network in UCS. Keep in mind, however, that you can also monitor the power consumption, temperature, and error statistics of many of the other components.

Answering this question takes a little math and a little bit of poking around. Steve McQuerry presented on this at Cisco Live (session ID BRKCOM-2004) in San Diego earlier this year. This post is based on some of his slides, which you can get at http://ciscolive365.com (free login required), but my math is daringly original, so please let me know if I’ve made errors.

Let’s first look at how UCS collects data. In UCS Manager, navigate to Admin, then filter by Stats Management. From here you will see the collection policies. By default each collection policy has a collection interval of 1 minute and a reporting interval of 15 minutes.

So what does that actually mean?

Collection Interval: How often the data will be collected. We are encouraged to change the collection interval to 30 seconds to get more granular data. This means that every 30 seconds, the device will be queried by the UCSM subprocess responsible for gathering statistics from the underlying NXOS.

Reporting Interval: How often the collected data is stored as a record in UCS Manager. So while we collect every 30 seconds, a record is only stored once per reporting interval. With the default of 15 minutes, we might get our first record at 9:11 AM, the next at 9:26, and then one every 15 minutes after that. UCS can only hold up to 5 of these records. That alone should tell you that UCS is not good for long term trend analysis. It is recommended that another monitoring solution be used for greater detail.
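To make the two intervals concrete, here is a minimal Python sketch of the arithmetic. It is only a model of the numbers quoted above (30 second collection, 15 minute reporting, 5 stored records), not anything UCS Manager runs; the function names are made up for illustration, and it assumes each stored record covers exactly one reporting interval.

```python
from collections import deque

COLLECTION_INTERVAL_S = 30    # recommended collection interval (seconds)
REPORTING_INTERVAL_MIN = 15   # default reporting interval (minutes)
MAX_RECORDS = 5               # records UCS Manager keeps, per the text above


def samples_per_report(collection_s=COLLECTION_INTERVAL_S,
                       reporting_min=REPORTING_INTERVAL_MIN):
    """How many collections happen inside one reporting interval."""
    return reporting_min * 60 // collection_s


def retention_minutes(reporting_min=REPORTING_INTERVAL_MIN,
                      max_records=MAX_RECORDS):
    """Rough span of history covered, assuming one record per interval."""
    return reporting_min * max_records


# A bounded queue behaves like the record store: old records fall off.
history = deque(maxlen=MAX_RECORDS)
for report_number in range(1, 9):   # pretend 8 reports have been written
    history.append(report_number)

print(samples_per_report())   # collections per stored record
print(retention_minutes())    # minutes of history you can look back on
print(list(history))          # only the 5 most recent reports remain
```

Under these assumptions you get a little over an hour of history, which is why a separate tool is needed for trending.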

Cisco recommends that you change the collection interval to 30 seconds for the things you’re interested in. The reporting interval doesn’t really matter for what we’re doing here.

Examining FEX bandwidth

I have a first generation IOM, so the traffic is not trunked from blade to Fabric Interconnect. It follows a defined path based on the number of uplinks. (See this great post: http://jeremywaldrop.wordpress.com/2010/06/30/cisco-ucs-ethernet-frame-flows/ for information on how it’s connected internally.)

I have 2 chassis, each connected with 2 ports. Ports 1 & 2 connect to chassis 2 and ports 3 & 4 connect to chassis 1. (Yes, this is not good form, but hey, I inherited this lab, so that’s just the way it is and I haven’t bothered to fix it.) To see how your chassis are connected to the Fabric Interconnect, click on the Equipment tab, select the Chassis, and then select Hybrid Display from the work pane.

That should tell you how the connections are made from FEX to Fabric Interconnect.

Now let’s look at one of the FEX uplinks. Navigate to the Equipment tab, filter by Fabric Interconnects and look at the server ports that are connected to Fabric Interconnect A:

Select the first port and let’s look at the Statistics tab in the work pane:

To measure bandwidth, we are interested in the delta of total bytes received (Rx) and transmitted (Tx) on each of the FEX uplinks. This particular uplink for Received and Transmitted Total Bytes shows 837,101 and 691,921 bytes respectively.

We typically measure I/O in Gbps, Mbps, or Kbps, so we need to translate these numbers. This is where the math comes in. First, remember that our collection interval is 30 seconds. That means the number reported is x bytes in 30 seconds. To get bytes per second, just divide that number by 30. From there, do the kind of unit conversion you may have learned in physics class. Here are the formulas for Gbps and Mbps:

Bytes to Gbps from 30 second interval collection period

(x bytes / 30 seconds) * (8 bits / 1 byte) * (1 Gb / 1,073,741,824 bits)

= x * 0.000000000248 Gbps

** Note: You could argue that there are only 1 billion (1,000,000,000) bits in a Gigabit. Go ahead and use that if it makes you more comfortable.

Bytes to Mbps from 30 second interval collection period

Probably easier to do this in Mbps:

(x bytes / 30 seconds) * (8 bits / 1 byte) * (1 Mb / 1,048,576 bits)

= x * 0.000000254 Mbps

Just looking at those formulas (or multipliers as they really are), there are some simple rules we can follow:

Rule 1: If the delta is not a 10 digit number or greater then you are not even doing a Gigabit per second on a 10 Gigabit link.

Rule 2: If the delta is not a 7 digit number or greater then you are not even doing a Megabit per second on a 10 Gigabit link.
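The two rules can be checked by turning the formulas around and asking how many bytes must show up in one 30 second interval to reach 1 Mbps or 1 Gbps. Here is a small Python sketch of that check; the function name is my own, and it uses the same binary definitions of Mb and Gb as the formulas above.

```python
INTERVAL_S = 30        # collection interval in seconds
BITS_PER_MB = 2 ** 20  # 1,048,576 bits, as in the Mbps formula
BITS_PER_GB = 2 ** 30  # 1,073,741,824 bits, as in the Gbps formula


def min_delta_bytes(target_bits_per_second, interval_s=INTERVAL_S):
    """Smallest byte delta over one interval that reaches the target rate."""
    return target_bits_per_second * interval_s // 8


one_mbps = min_delta_bytes(BITS_PER_MB)
one_gbps = min_delta_bytes(BITS_PER_GB)

print(f"1 Mbps needs a delta of {one_mbps:,} bytes ({len(str(one_mbps))} digits)")
print(f"1 Gbps needs a delta of {one_gbps:,} bytes ({len(str(one_gbps))} digits)")
```

Note that the digit count is a quick screen, not an exact test: a 7 digit delta below roughly 3.9 million bytes is still under 1 Mbps, and a 10 digit delta below roughly 4 billion bytes is still under 1 Gbps.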

Armed with this knowledge, we do our math:

Rx: 837,101 * 0.000000254 = 0.212 Mbps = 212 kbps

Tx: 691,921 * 0.000000254 = 0.1757 Mbps = 175.7 kbps

Not a lot going on in this link is there?
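If you would rather not do this by hand for every port, the same conversion fits in a few lines of Python. This is just a sketch of the formulas above: the function names are my own, the interval defaults to the 30 second collection interval, and the deltas are the two values read off the Statistics tab.

```python
INTERVAL_S = 30  # collection interval in seconds


def delta_to_mbps(delta_bytes, interval_s=INTERVAL_S):
    """Convert a Total Bytes delta to Mbps (1 Mb = 1,048,576 bits)."""
    return delta_bytes / interval_s * 8 / 2 ** 20


def delta_to_gbps(delta_bytes, interval_s=INTERVAL_S):
    """Convert a Total Bytes delta to Gbps (1 Gb = 1,073,741,824 bits)."""
    return delta_bytes / interval_s * 8 / 2 ** 30


rx_mbps = delta_to_mbps(837_101)
tx_mbps = delta_to_mbps(691_921)

print(f"Rx: {rx_mbps:.3f} Mbps")
print(f"Tx: {tx_mbps:.3f} Mbps")
```

Because it uses the exact divisor rather than the rounded 0.000000254 multiplier, the last digit can differ slightly from the hand calculation.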

After looking at the rest of the links on the system, they were all in the same 6 figure range with one exception: One link (Fabric B, port 1) had Rx at 13,082,674 and Tx at 3,241,484, which is about 3.3 Mbps and 823 kbps.

Now, how can I find out what server is generating all that traffic? (Let’s just suppose that 3.3 Mbps is a lot for pedagogical purposes.)

Examining Server vNIC bandwidth

Since I have 2 cables per FEX, I know that Fabric B uplink 1 carries the B-side traffic for all the odd slots. (Remember this post?)

All the even slots are connected to the 2nd uplink. So this has to be either blade 1, 3, 5, or 7. What I have to do is check which Service Profiles are in those slots. From the Equipment tab I determine that I have:

Slot 1: ESXi-1000v-02

Slot 3: Empty

Slot 5: CIAC-ESXi4.1-02

Slot 7: Empty
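The odd/even split above can be written down as a tiny lookup. The Python sketch below assumes the first generation IOM pins blade slots to uplinks round-robin. Only the 2 uplink case (odd slots on uplink 1, even slots on uplink 2) is what I described above; the 1 and 4 uplink cases are my assumption that the same pattern holds. The function names are made up for illustration.

```python
def fex_uplink_for_slot(slot, uplinks):
    """Return the FEX uplink (1-based) a blade slot is assumed to be pinned to."""
    if uplinks not in (1, 2, 4):
        raise ValueError("this sketch only models 1, 2, or 4 uplinks")
    if not 1 <= slot <= 8:
        raise ValueError("a chassis has blade slots 1 through 8")
    # Round-robin pinning: with 2 uplinks, odd slots -> 1 and even slots -> 2.
    return (slot - 1) % uplinks + 1


def slots_on_uplink(uplink, uplinks):
    """List the blade slots that share a given FEX uplink."""
    return [s for s in range(1, 9) if fex_uplink_for_slot(s, uplinks) == uplink]


print(slots_on_uplink(1, 2))  # candidate slots behind uplink 1 with 2 cables
print(slots_on_uplink(2, 2))  # candidate slots behind uplink 2 with 2 cables
```

With 2 cables this narrows a busy uplink 1 down to slots 1, 3, 5, and 7, which is the list checked above.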

I only have to check 2 servers. On each server I have assigned a LAN connectivity connection, so I know which vNIC is going out the B side. From here it’s just a matter of finding the chatty one. Here’s how I found my Most Chatty Server Port (MCSP):

From the Servers tab, navigate to the service profile of the machine. I have 6 vNICs in each one:

Since I’ve labeled them, it’s pretty obvious which ones go out the B side. Click on each vNIC and, from the work pane, select Statistics. We expand the statistics and see a familiar screen. But this time, we look under vNIC stats:

After examining each of them I can see that the chatty interface is my NFSB vNIC. It’s doing a lot of work, and it accounts for most of the change in deltas. This is one of the reasons I recommend configuring more than just the two default vNICs on UCS: you get to see in hardware what is happening. We found our most chatty server port and gained a lot of insight as to what this idle system is doing.

If you did not find any chatty activity in the vNICs, the traffic might be Fibre Channel. Remember, we are doing FCoE from the adapter to the Fabric Interconnects. Try checking the counters on the vHBAs as well.

Examining UCS Uplink Bandwidth

To finish off this post, let’s look at the uplinks coming out of the Fabric Interconnect. This works differently depending on whether you have a Port-Channel or standard uplinks. For a Port-Channel, you would go to the LAN tab, select the port-channel from the LAN cloud and then look at the statistics there.

If you do not have a port-channel configured, you can do it from the Equipment tab like we did before with the Server Ports (aka: FI to FEX ports). From the equipment tab, filter by Fabric Interconnect and select the uplink ports:

From here, look at the Rx and Tx total bytes delta to get an idea of how things are changing. Pretty simple, right? Just look for deltas of 10 digits or more to find hot spots.
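Once you have copied the deltas for a handful of ports, a short script can do the hot spot screening for you. The Python sketch below is one way to do it. Everything in it is illustrative: the port labels are made up, the function names are my own, and since I did not list any uplink numbers, the sample input reuses the two server-port deltas quoted earlier in this post.

```python
INTERVAL_S = 30          # collection interval in seconds
BITS_PER_GB = 2 ** 30    # same Gb definition as the formula earlier


def delta_to_gbps(delta_bytes, interval_s=INTERVAL_S):
    """Convert a Total Bytes delta to Gbps."""
    return delta_bytes * 8 / interval_s / BITS_PER_GB


def hot_spots(deltas, threshold_gbps=1.0):
    """Return (port, peak Gbps) pairs at or above the threshold, busiest first.

    deltas maps a port label to a (rx_delta_bytes, tx_delta_bytes) tuple.
    """
    hot = []
    for port, (rx, tx) in deltas.items():
        peak = max(delta_to_gbps(rx), delta_to_gbps(tx))
        if peak >= threshold_gbps:
            hot.append((port, peak))
    return sorted(hot, key=lambda item: item[1], reverse=True)


# Hypothetical labels; deltas are the Fabric A and Fabric B port 1 values above.
sample = {
    "A/1": (837_101, 691_921),
    "B/1": (13_082_674, 3_241_484),
}

print(hot_spots(sample))                        # nothing is near 1 Gbps
print(hot_spots(sample, threshold_gbps=0.001))  # lower the bar to find the chatty one
```

On an idle lab like mine nothing crosses the 1 Gbps line, so lowering the threshold is the only way to make a port stand out.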

Part 1 Summary

The purpose of this post was to help you understand what total network traffic looks like inside your UCS environment. There are 3 spots to consider when understanding traffic patterns: the server adapters, the FEX, and the uplinks. Knowing how to read the statistics and make sense of them can help you quickly find hot spots. The basic rule is that any delta in the Total Bytes Rx or Tx that has 10 or more digits is worth looking at and multiplying by 0.000000000248 to get the rate in Gbps.

It is worth pointing out that you can also select the ‘Chart’ option under any of the statistics tools to see a trend. When dealing with Rx and Tx deltas, you’ll have to modify the range of the scale; otherwise it will seem that there is no data.

Lastly, for long term analysis a different tool is needed. UCSM only gives you a brief snapshot, as there is not room to store it all in UCS Manager. Open source tools like Cacti, Nagios, Zenoss, and Zabbix can help do this. SolarWinds is also a popular commercial product that helps with performance tracking.

In my next post, I’ll talk about monitoring thresholds so that you can have UCS generate an alarm if network traffic gets too high.

Credits: Steve McQuerry, Craig Schaff, David Nguyen, and Dan Hanson. Thanks guys!