Another Reason for xCAT

Another testament to the power of xCAT was shown to me today.  We had a machine with an amber light on it, meaning:  “Something is wrong with this server”.  The system engineer came out and reseated everything.  Then they replaced the entire system board, thinking that would help.  When that didn’t solve it, they replaced the power supplies.  When that didn’t solve it either, I finally said:  Ok, let me take a look.  This kind of grasping at straws is quite frustrating when managing hardware.

I ran the xCAT reventlog command and cleared the hardware log.  Then we ran it again after the amber light turned back on.  We then got the following:

So I said:  Replace that fan.  They did.  Problem solved.  Then I looked at my syslog and found:

In other words, xCAT had already notified us that there was a problem!

Most system management tools focus on deploying hardware and ignore the very real problem of managing that hardware afterward.  xCAT does this on multiple levels and is extremely helpful for debugging hardware errors.  That SNMP trap had come in as an SEL (System Event Log) entry.  xCAT has built-in functionality to decode IPMI events into meaningful messages, like:  There’s a problem with the fan.
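Once events are decoded into readable text, even a tiny script can point at the likely culprit before anyone swaps a part.  Here is a minimal sketch of that idea in Python.  It is not part of xCAT: the function name, the keyword map, and the sample lines are all hypothetical, and real event log text will look different on your hardware.

```python
import re
from collections import Counter

# Hypothetical keyword map, made up for illustration.
# It is not taken from xCAT or from the IPMI specification.
COMPONENT_PATTERNS = {
    "fan": re.compile(r"\bfan", re.IGNORECASE),
    "power supply": re.compile(r"\b(power supply|psu)\b", re.IGNORECASE),
    "temperature": re.compile(r"\b(temp|temperature|thermal)\b", re.IGNORECASE),
    "memory": re.compile(r"\b(dimm|memory)\b", re.IGNORECASE),
}


def suspect_components(event_lines):
    """Count how many decoded event lines mention each component.

    Returns a list of (component, count) pairs, most frequent first.
    """
    counts = Counter()
    for line in event_lines:
        for component, pattern in COMPONENT_PATTERNS.items():
            if pattern.search(line):
                counts[component] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Made-up sample lines, NOT real xCAT output.
    sample = [
        "node01: Fan 3 speed below threshold",
        "node01: Fan 3 failure",
        "node01: Power Supply 1 redundancy lost",
    ]
    for component, count in suspect_components(sample):
        print(f"{component}: {count} event(s)")
```

You could feed it whatever decoded event text you have saved to a file.  The point is only that a readable event log makes the first question (“which part?”) cheap to answer.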

Had we turned to xCAT before the system engineers got on site, we would have saved 1-2 man-days.  In addition, we would have saved a perfectly good planar board.  Hardware vendors gain a lot by using xCAT.  Had I not been busy with other things, or had someone else known the power of xCAT, those savings alone could have come to about $5,000.  This is just one case, and there are others as well.

Our job at Sumavi is to make the power of xCAT easier for everyone to package, harness, and digest.  It’s a difficult task and one that we’re constantly working on, but we’re getting much better at it.
