Category Archives: xCAT

UCS emulator on ESXi

Recently we put the UCS emulator on our ESXi server to kick the tires and check out the API.  The emulator can be downloaded from Cisco (you need to register to download it).

The emulator is meant to be used with VMware Player out of the box.  That was fine, but we wanted to take a look at it together, so we used xCAT to provision a stateless ESXi server.  Now you may happen to have VMware vCenter running, in which case it would be very easy to import the image into ESXi with the GUI.  But we are a bit more fearless in that respect, not to mention short on hardware.  So we just provisioned our ESXi server and didn’t bother installing a license or adding it to vCenter.  This was just for a quick test, so no, I don’t feel unethical about it.

But the problem is:  How to import the file?  And how to tell what was going on?

No problem.  We just used some ESXi vim-cmd-fu to make it happen.  Here’s what we did:

1.  Make a new vmfs for VMs to live on.

We skipped this step by just using xCAT to provision another VM on the ESXi host.  We used ‘datastore1’ as the storage and easily made it.

2.  scp the untarred file to the esxi server:

scp -r UCSPE/ node001:/vmfs/volumes/datastore1/

3.  Make the disk files work under ESXi

There are three disks in the emulator.  We just copied them:

cd /vmfs/volumes/datastore1/UCSPE/
vmkfstools -i UCSPEa-s001.vmdk -d zeroedthick UCSPEa-s0011.vmdk
vmkfstools -i UCSPEb-s001.vmdk -d zeroedthick UCSPEb-s0011.vmdk
vmkfstools -i UCSPEc-s001.vmdk -d zeroedthick UCSPEc-s0011.vmdk
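The three clone commands differ only by one letter, so a loop can generate them.  This is just a sketch; clone_cmds is a made-up helper name, and it only prints the commands so you can look them over first:

```shell
#!/bin/sh
# clone_cmds (hypothetical helper): print the three vmkfstools clone
# commands from this step, one per emulator disk (a, b, c).
# Nothing is executed here; the commands are only echoed.
clone_cmds() {
    for d in a b c; do
        echo "vmkfstools -i UCSPE${d}-s001.vmdk -d zeroedthick UCSPE${d}-s0011.vmdk"
    done
}

clone_cmds
```

If the output looks right, running `clone_cmds | sh` from inside /vmfs/volumes/datastore1/UCSPE/ should do the same thing as typing the three lines by hand.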

4. Change the UCSPE.vmx file

We need to point to the new disks now. There were three disk entries we had to change:

scsi0:1.present = "TRUE"
scsi0:1.fileName = "UCSPEa-s0011.vmdk"
scsi0:2.present = "TRUE"
scsi0:2.fileName = "UCSPEb-s0011.vmdk"
scsi0:3.present = "TRUE"
scsi0:3.fileName = "UCSPEc-s0011.vmdk"

You also need to change the network.  By default, we have the “VM Network” that we can communicate on:

ethernet0.networkName = "VM Network"

For good measure, add the VNC port so you can watch what’s going on:

remotedisplay.vnc.enabled = "TRUE"
remotedisplay.vnc.port = "5900"

And that should be it for the .vmx file.
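If you would rather not hand-edit the file, the same changes can be scripted.  The following is a minimal sketch, not what we actually ran: set_vmx_key and patch_vmx are hypothetical helper names, and it assumes the .vmx file uses the plain `key = "value"` layout shown above.

```shell
#!/bin/sh
# set_vmx_key FILE KEY VALUE (hypothetical helper)
# Rewrites KEY's value if the key is already in FILE, otherwise appends it.
set_vmx_key() {
    file=$1 key=$2 value=$3
    if grep -q "^${key} = " "$file"; then
        # Write to a temp file and move it back instead of relying on sed -i.
        sed "s|^${key} = .*|${key} = \"${value}\"|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s = "%s"\n' "$key" "$value" >> "$file"
    fi
}

# patch_vmx FILE (hypothetical helper): apply the edits from this step.
patch_vmx() {
    vmx=$1
    set_vmx_key "$vmx" "scsi0:1.fileName" "UCSPEa-s0011.vmdk"
    set_vmx_key "$vmx" "scsi0:2.fileName" "UCSPEb-s0011.vmdk"
    set_vmx_key "$vmx" "scsi0:3.fileName" "UCSPEc-s0011.vmdk"
    set_vmx_key "$vmx" "ethernet0.networkName" "VM Network"
    set_vmx_key "$vmx" "remotedisplay.vnc.enabled" "TRUE"
    set_vmx_key "$vmx" "remotedisplay.vnc.port" "5900"
}

# Example (run on the ESXi host, after making a backup copy):
#   patch_vmx /vmfs/volumes/datastore1/UCSPE/UCSPE.vmx
```

Keeping the key/value logic in one small function means a missing key gets appended rather than silently skipped.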

5.  Power it on

This is the easy part:

Register the VM:

vim-cmd solo/registervm /vmfs/volumes/datastore1/UCSPE/UCSPE.vmx UCSPE

Check the ID:

vim-cmd vmsvc/getallvms

My ID for this machine is 16.  So let’s power it on:

vim-cmd vmsvc/power.on 16
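Since the ID will be different on your host, you can pull it out of the getallvms listing instead of reading it by eye.  A small sketch, with the caveat that vmid_by_name is a made-up helper and that it assumes the listing has one header line followed by rows that start with the Vmid and then the VM name (names with spaces would break it):

```shell
#!/bin/sh
# vmid_by_name NAME (hypothetical helper)
# Reads `vim-cmd vmsvc/getallvms` output on stdin and prints the Vmid of
# the first VM whose name column equals NAME.
# Assumption: line 1 is a header, and each row starts "<Vmid> <Name> ...".
vmid_by_name() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1; exit }'
}

# Example (on the ESXi host):
#   id=$(vim-cmd vmsvc/getallvms | vmid_by_name UCSPE)
#   [ -n "$id" ] && vim-cmd vmsvc/power.on "$id"
```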

If there are problems (like failures), you usually have to trawl through the /var/log/messages file.  Hopefully I didn’t forget anything and this just worked for you!

Now that you have it up, you should be able to watch it on VNC and then eventually log in using the userid cliuser and password cliuser.  Since we had a DHCP server, we watched the logs to see what IP address it came up with.  Then we grabbed the MAC address and put it in our DHCP config statically (using xCAT of course!)

I’m very happy Cisco made this available.  In the past while doing testing on IBM or HP blades, we had to actually have live hardware to test on.  With the UCS Manager outside and on a VM, we can do a lot of testing with no hardware at all.  Then once the real stuff comes:  Look out!  We’re ready for it!

Dell IPMI issues with xCAT

One of our customers had some Dell R410 machines that were humming along just nicely.  One day something happened and all of a sudden xCAT rpower stopped working.  Was it that we updated the firmware?  Was it new xCAT code?  We couldn’t figure it out.

Our first epiphany came when we realized that ipmitool worked just fine with lan (IPMI 1.5) and lanplus (IPMI 2.0).  ipmitool works?  Why didn’t xCAT?  It turns out that there was a problem authenticating.  In fact, if we used lanplus, we didn’t even have to enter the correct password and we could turn machines off and on!

# ipmitool -I lanplus -U user1 -P thispasswordisbogus -H node001-drac -C 0 mc info

Wow!  That’s a major security violation.  We alerted Dell.  But that still didn’t get xCAT’s rpower working.  As a temp solution, we modified /opt/xcat/lib/perl/xCAT/ to not try IPMI 2.0 and instead use only IPMI 1.5.  This worked fine for some things, but the great necessities of rcons, reventlog, and rinv would have taken more time to get working… and after all this was a temporary patch right?

So today I woke up determined to resolve the issue once and for all.  Working with Dell support (who were very nice and eager to help us) I figured out that there was an IPMI encryption key that was set to some random 40 character hexadecimal string.  How it was set, I still don’t know.  Viewing it in the iDRAC looked like this:

We instead cleared that key and set it to ‘00’  (it was required to be an even number of hexadecimal characters).  Doing this solved our IPMI issue.  xCAT rpower then worked without a problem.  Dell then gave us a way to run this via the command line:

racadm  -r drac-comp036 -u user1 -p asdfasdf config -g cfgIpmiLan -o cfgIpmiEncryptionKey 0000000000000000000000000000000000000000000000

With that, order was restored.
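If you have to push a zeroed key to a lot of DRACs, it helps to generate and sanity-check the key string rather than counting zeros by eye.  This is only a sketch: zero_key and is_valid_ipmi_key are hypothetical helpers, and the only rule they encode is the one we ran into (hexadecimal characters, even length).

```shell
#!/bin/sh
# zero_key N (hypothetical helper): print a key made of N zeros.
zero_key() {
    printf "%0${1}d\n" 0
}

# is_valid_ipmi_key KEY (hypothetical helper): succeed only if KEY is
# non-empty, all hexadecimal, and has an even number of characters.
is_valid_ipmi_key() {
    case $1 in
        ''|*[!0-9A-Fa-f]*) return 1 ;;
    esac
    [ $(( ${#1} % 2 )) -eq 0 ]
}

# Example: build a 40-character key and check it before handing it to racadm.
key=$(zero_key 40)
is_valid_ipmi_key "$key" && echo "key looks sane: $key"
```

The resulting string would go where the long run of zeros sits in the racadm command above.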

This also shows one great thing about xCAT:  The IPMI packets coming back were not authenticating correctly.  There was a problem with the way the challenges were coming back.  ipmitool seemed to be very forgiving about that and not care.  xCAT didn’t like it at all and would not let it pass.  We view this more as an ipmitool bug than an xCAT bug.  Wouldn’t you rather know about a potential security problem?

Debugging Syncfiles in xCAT

I’ve had issues lately where syncfiles hangs on post install forever.  In fact it even kills the installer.  To fix it, I usually just do:

nodeset <node> boot

Then run:

rpower <node> boot

Next, you log into the node when it boots and create this little script (borrowed from /xcatpost/startsyncfiles.awk)

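#!/usr/bin/awk -f
# Needs gawk: the |& coprocess operator is a gawk extension.
BEGIN {
#  if (ENVIRON["USEOPENSSLFORXCAT"]) {
#      server = "openssl s_client -quiet -connect " ENVIRON["XCATSERVER"] " 2> /dev/null"
#  } else {
#      server = "/inet/tcp/0/"
#  }

  # Put your management server's address after -connect, inside the quotes.
  server = "openssl s_client -quiet -connect "

  quit = "no"
  exitcode = 1

  # Send the syncfiles request to xcatd.
  print "<xcatrequest>" |& server
  print "   <command>syncfiles</command>" |& server
  print "</xcatrequest>" |& server

  # Read the reply until the server says it is done.
  while (server |& getline) {
    if (match($0,"<serverdone>")) {
      quit = "yes"
    }
    if (match($0,"<errorcode>") || match($0,"<error>")) {
      exitcode = 0
    }

    if (match($0,"</xcatresponse>") && match(quit,"yes")) {
      close(server)
      exit exitcode
    }
  }
}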

Then you just run that code and it will start syncfiles the way the postscript does.  The thing to watch for here is that IP address I put in.  You’ll have to put your management server’s IP address there.  You can also sub other commands instead of syncfiles here as well.  Like xcatlog etc.

One issue I noticed while doing this:  updatenode -F worked, and if I installed the node with rinstall <node> -o rhels5.5 -a x86_64 -p foo it would get the correct syncfiles.  However, if I put that image foo inside the osimage table and ran the install, it wouldn’t get the syncfiles right unless I also added the syncfile to the osimage table.

Another Reason for xCAT

Another testament to the power of xCAT was shown to me today.  We had a machine with an amber light on it, meaning:  “Something is wrong with this server”.  The system engineer came out and reseated everything.  Then they went through and replaced the entire system board thinking that would help.  When that didn’t solve it, they replaced the power supplies.  When that didn’t solve it, I finally said:  Ok, let me take a look.  This grasping at straws in the dark is quite frustrating when managing hardware.

I ran the xCAT reventlog command and cleared the hardware log.  Then we ran it again after the amber light turned back on.  We then got the following:

# reventlog n316
n316: 04/11/2011 06:57:59 Event Logging Disabled, Log Area Reset/Cleared (SEL Fullness)
n316: 04/11/2011 06:58:05 System Firmware Progress, Unspecified (Progress)
n316: 04/11/2011 07:01:02 Fan, Lower Critical - going low (Fan 3B Tach reading 0 RPM with threshold 1872 RPM)

So I said:  Replace that fan.  They did.  Problem solved.  Then I looked at my syslog and found:

Apr 11 07:01:19 xcat1 xCATMon Event: SNMP CRITICAL received from hs316-imm(UDP: []:623). CRITICAL: Fan, Lower Critical - going low (Sensor 0x45)

In other words, xCAT had already notified us that there was a problem!
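On a big cluster nobody is going to read every event log line, so it is handy to boil reventlog output down to the alarming entries.  A rough sketch follows; critical_events is a hypothetical helper, and matching on the word “critical” is an assumption based on the sample output above, not a documented xCAT format.

```shell
#!/bin/sh
# critical_events (hypothetical helper)
# Reads `reventlog <noderange>` output on stdin, prints every line that
# mentions "critical" (any case), then a per-node count of those lines.
# Assumption: each line starts with "<node>: ".
critical_events() {
    awk -F': ' '
        tolower($0) ~ /critical/ { count[$1]++; print }
        END { for (n in count) printf "%s: %d critical event(s)\n", n, count[n] }
    '
}

# Example (on the management node):
#   reventlog n316 | critical_events
```

Fed the three lines above, this would print only the fan line followed by a one-line summary for n316.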

Most system management tools focus on deploying hardware and ignore the very real problem of managing your hardware as well.  xCAT does this on multiple levels and is extremely helpful for debugging hardware errors.  That SNMP trap had come as an SEL.  xCAT has built in functionality to decode IPMI events into real meaningful messages.  Like:  There’s a problem with the fan.

Had we turned to xCAT before the system engineers got on site, we would have saved 1-2 man days.  In addition, we would have saved a perfectly good planar board.  Hardware vendors gain a lot by using xCAT.  Had I not been busy with other things, or had someone else known the power of xCAT, those savings alone would have come to about $5,000.  This is just one case and there are others as well.

Our job at Sumavi is to make the power of xCAT easier for everyone to package, harness, and digest.  It’s a difficult task and one that we’re constantly working on, but we’re getting much better at it.

The xCAT 2010 Year in Review

DISCLAIMER:  I’m not the official gatekeeper of xCAT, and I don’t work for IBM. But as a large contributor to the xCAT project, as well as a user, I figured I had just as much authorization to write about what happened to xCAT in 2010 as anyone else does.  And the best part is:  I’m not tied to a large corporation that monitors what I can or can not say.  So I can say, and will say anything I want to :-).  So anyway, what I say here represents my opinions and views and don’t represent the opinions or views of the priests at IBM nor of the greater xCAT community.  So tell your lawyers to go away.

So if nothing else, this is to give you something to read while you’re recovering from your New Year’s Eve party.  I present:  The xCAT 2010 Year in Review:

To start off with, I just want to thank all the users and developers that I’ve been able to work with this year. It’s been nothing short of amazing.  We’ve been working on tough problems.  The xCAT user base has some of the most talented, passionate, and dedicated people in the industry.  (And in most cases they are smarter than us developers.)  It was great to work with all of them this year.  I am thankful for your insights and criticism.  We hope people become more critical of xCAT and don’t sugar coat anything.  I didn’t work with one person this year on xCAT problems that I didn’t walk away from thinking: “Hmm, that <dude|lady> is pretty smart”.  So the talent inside the xCAT users group is fantastic.  You can tell that by the types of comments that come into the mailing list.  They’re much different than what you might see on other open source projects.  So the biggest story of 2010 for xCAT:  The users moved mountains, and helped make xCAT better.

Enough flattery.  As far as releases go, xCAT went from the 2.3.4 release in March to 2.5.1, which was released Dec 11th.  It’s nice to see the releases coming so rapidly.  I don’t anticipate this will be the case in 2011.  I think we’ll see slower major release cycles, and minor release cycles will just coincide with OS updates (VMware’s updates, RHEL6 stuff, SLES, etc.)  Most of the reason I think things may slow down is because xCAT is pretty feature complete.  It can do lots of things.  It could use lots of clean up work, but nobody seems to be interested in doing that.  Most developers I think just accept that no error messages are printed when XYZ happens, and look at it as a learning opportunity for you to become acquainted with the xCAT code.  That’s too bad in my opinion.

Technology Highlights

There were lots of features added to xCAT that made its awesomeness ooze even more than ever.  As usual, I skip all the awesomeness that was added to xCAT’s IBM AIX and SystemP functionality, since I don’t ever work on that aspect of xCAT, and usually just complain about the table pollution the AIX tables cause. (You know who you are: ppc, ppcdirect, pchcp, nimimage, etc.) However, from what little I do know they’ve been pretty busy with it this year and I am glad to work with them on some small scale on this great product.

VMware Support – VMware ESX support has been in xCAT for a long time.  But this year we added better stateless ESXi support (The first ever) and also ESXi kickstart support with ESXi 4.1 (I think we’re the only project that has that).  We also added cloning, thin or thick, and lots of other cool features to support VMware virtual machines.

On the KVM front, we did the same if not better.  So if you’re running VMware or KVM on your nodes, xCAT doesn’t care, it looks the same.

imgimport/imgexport – We started the ability to import or export images (stateless and stateful) into xCAT.   This is still not as mainstream as I’d like it to be, but we hope to add more to this in 2011.  I still dream of supporting some type of physical machine image store.

Dynamic DNS – Dynamic DNS was added to xCAT, though I don’t think it is used very often.  It assigns discovered nodes static IP addresses so that you don’t need to predefine them in /etc/hosts.  This is great and a step further toward making automagic discovery even more magical.

Statelite – xCAT has had stateless (RAM root in tmpfs) support since 2005.  It’s also had a type of hybrid with NFS.  But now we’ve made statelite:  This allows stateless, with NFS root, with a real hard drive for stateful files.  It’s wild.  Give it a try.  It probably offers the most extreme way to manage a system.  The coolness of the hierarchical tree support is frightening.

Sumavisor – This isn’t xCAT proper, but a nice Web Interface to xCAT has finally been added by Sumavi.  (More on that in the next section)

Non Technological Highlights

I think a big highlight is the emergence of a company that is dedicated to bringing xCAT to the masses and is not afraid to invest in it.  This is my company, Sumavi, founded Feb 2010 :-).  We have done very well this year with service engagements with some big time accounts.  In addition we’ve made some great partners, turned out some solid documentation and made a really nice GUI front end to xCAT that we’ve branded ‘The Sumavisor’.  And it’s not just a GUI.  It does much more than that, enhancing xCAT to make it look polished and add control and insight that you can’t get on the command line.  It also does a lot to cut the learning curve down and adds commercial support to all xCAT installations regardless of what hardware they are running on.

As far as the community goes, I had a rant on xCAT’s documentation problem.  Others at IBM have attempted to make the xCAT documentation more usable.  You can see it here.  It’s definitely better than it was and is a step in the right direction, but I feel like the Sumavi documentation is much more usable.  But you tell me.

In other news, we rolled out xCAT in banks, credit card companies, and the usual blend of government and university accounts.  But the most exciting is to see xCAT venture into corporations that are not focused on HPC.  This year we installed the Lego Universe MMOG environment using xCAT.  It was an all Windows environment too!  We did some cool stuff all over the place.  Even while not at IBM I was able to talk to developers at IBM and from all over the world.

In addition, we really broke out and started supporting hardware from Dell, HP, and other whiteboxes.  I even started developing the Cisco UCS xCAT module.  I haven’t finished it yet… I’ll wait for Cisco to cough up some support dollars for that.  You hearing me Cisco?

Predictions for 2011

In 2011 I hope we will continue to see xCAT do more outside of HPC.  I hope to get into more cloud deployments.  We’ve already done a bunch, but we’d like to have more packaged products.  I see us coming out with EC2 support (based on nothing but the Amazon APIs; sorry, but we heard too many complaints about the open source version of Eucalyptus).  I also see more appliance-based models, like Hadoop.

As more people want to drive xCAT, a Web Services API is in the works.  Right now you can perform xCAT calls via XML messages to port 3001 with xCAT, but this needs to be more robust.  We’ve done that with the Sumavisor, but there is more that needs to be done.  Hopefully that will be out in the first half of the year.

And finally, the biggest thing about cloud is that it’s all about the applications.  How will we deal with making applications more agile?  I see this as a major focus for our group.  Creating virtual machines, etc. is great, but how do we help, or is it even our role to help, with the creation of the contents of those machines?  We seem to be in that world already.  But where, if at all, should we draw the line between xCAT and things like RightScale?  Where, if at all, should we draw the line between xCAT and Chef, Puppet, etc.?

I don’t know yet and I can’t wait to find out.

Anyway, I hope you had a wonderful 2010 and I hope 2011 is just as wild for you!

ESXi 4.1 and HP BL460c G6 with Mezzanine card

Had an issue where I would install CentOS on these HP blades and I would be able to see 16 nics.  But when I installed ESXi 4.1 I only saw 8 nics.  16 is the right number because each flexNIC has 4 vNics.  So with 4 of these, I wanted to see some serious bandwidth.  After fumbling around we finally came to the conclusion that the be2net driver was not loaded on the hypervisor.

My Mezzanine card is a HP NC550m Dual Port Flex-10 10GbE BL-c Adapter.  My HP rep said that these were not going to be supported by HP on ESXi 4.1 until November and that I could drop back to 4.0 or he could try to get me some beta code.

I found that you can just download the driver here.  I tried a similar route by installing the hp-esxi4.1uX-bundle from HP’s website but that just gave me stuff I didn’t need (like iLO drivers).

The link above is an ISO image.  The easiest way for me to install it on a running machine was to open the ISO on a linux machine and then copy the files to the ESX hosts:

# mkdir foo
# mount vmware-esx-drivers-net-be2net_400.2.102.440.0-1vm* foo -o loop
# cd foo/offline-bundle
# scp vhost001:/

Then you just need to install it.  The only problem with this is that it involves entering maintenance mode and then a reboot.  Is this Windows XP or something?  We’re just talking about a driver here…

Anyway, SSH to the ESXi 4.1 (or use VUG if you want to pay $500 bucks instead).  Since I use xCAT, I have passwordless SSH set up:

# vim-cmd hostsvc/maintenance_mode_enter
# esxupdate update --bundle /
# vim-cmd hostsvc/maintenance_mode_exit
# reboot; exit

After the node reboots you can run:

esxcfg-nics -l

and you’ll be able to see all 16 nics.
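When you are doing this on more than a couple of hosts, a quick count beats eyeballing the list.  A tiny sketch, assuming each NIC row in the esxcfg-nics listing starts with a vmnicN name (count_vmnics is a made-up helper):

```shell
#!/bin/sh
# count_vmnics (hypothetical helper)
# Counts the rows that start with "vmnic<number>" in `esxcfg-nics -l`
# output read from stdin; the header line is not counted.
count_vmnics() {
    grep -c '^vmnic[0-9]'
}

# Example (on the ESXi host):
#   [ "$(esxcfg-nics -l | count_vmnics)" -eq 16 ] && echo "all 16 nics present"
```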

Hope that saves you time as it took me a while to figure this out…

My next post will talk about how to integrate this into the kickstart file so you don’t have to do any after-the-install junk.

Using the HP Array Configuration Utility CLI for Linux

This week I took part in an installation where we got a large amount of HP BL460c G6 blades sent to us.  One of the daunting tasks was to configure the RAID.  The normal thing I see is people waiting for the BIOS to pop up, press F8 or some other trickery of keystrokes to finally get to the RAID menu and configure it.  I’m cool doing this one time.  I might even do this two times.  But at some point a man has got to define a limit to doing mundane repetitive tasks that are better done by computers.

A good guy I know is a dude named Johnny.  He pointed me to this link of the hpacucli.  I still don’t know how he found the link.  His google-fu is better than mine I suppose.

This program can be installed on a Linux machine and then the RAID can be configured.  But you’re telling me:  ‘Chicken and Egg problem!’ How do you run a program on the OS to configure the RAID when you need an OS installed on the RAID to run the program?  Simple:  You netboot the machines with a stateless image so that the OS is in memory and doesn’t require hard drives.  Too bad for you that you probably don’t have xCAT.  Cause I do, and I use it without reservation.  And since I have it, it took me 5 minutes to create a stateless image that booted up on the servers. (I’ll tell you how to do that at the end of this little writeup).

Once the machine booted up I ran the command to get the status:

# hpacucli ctrl slot=0 logicaldrive all show status

Probably nothing happening since I haven’t done anything yet.  So I took a look at the physical drives:

# hpacucli ctrl slot=0 pd all show              

Smart Array P410i in Slot 0 (Embedded)


 physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
 physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)

So then I just made a RAID1 on those disks:

# hpacucli ctrl slot=0 create type=ld drives=1I:1:1,1I:1:2 raid=1

I rebooted the blade into ESXi 4.1 kickstart and all was bliss.  But then I got even more gnarly.  I didn’t want to log into each blade and run that command.  So I used xCAT’s psh to update them all:

# psh vhost004-vhost048 'hpacucli ctrl slot=0 create type=ld drives=1I:1:1,1I:1:2 raid=1'

Boom!  Instant RAID.  Now check them:

# psh vhost003-vhost048 'hpacucli ctrl slot=0 logicaldrive all show status'
vhost003:    logicaldrive 1 (136.7 GB, RAID 1): OK
vhost005:    logicaldrive 1 (136.7 GB, RAID 1): OK
vhost004:    logicaldrive 1 (136.7 GB, RAID 1): OK
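With 40-odd blades the interesting lines are the ones that are not OK.  Here is a small filter for that, offered as a sketch: raid_not_ok is a hypothetical helper, and it assumes the status is the last word on each logicaldrive line, as in the output above.  Note that a blade that prints nothing at all (no logical drive yet) will not show up, so compare against your node list too.

```shell
#!/bin/sh
# raid_not_ok (hypothetical helper)
# Reads `psh <noderange> 'hpacucli ... logicaldrive all show status'`
# output on stdin and prints every logicaldrive line whose last word is
# not "OK".  Exits 1 if any such line was found, 0 otherwise.
raid_not_ok() {
    awk '/logicaldrive/ && $NF != "OK" { print; bad = 1 } END { exit bad }'
}

# Example (on the management node):
#   psh vhost003-vhost048 'hpacucli ctrl slot=0 logicaldrive all show status' | raid_not_ok \
#       && echo "all logical drives OK"
```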

I’ve used this technique with IBM blades in the past as well.  Now all my blades are installed with ESXi 4.1 and I didn’t have to wait through any nasty BIOS boot up menus.  I’ve also automated this in the past by sticking this script in the

xCAT image creation for HP 460c G6

This is fairly easy.  First, create or modify the /opt/xcat/share/xcat/netboot/centos/compute.pkglist so that it looks like this:


Next, run ‘genimage’.  The trick with the HP Blades is to add the ‘bnx2x’ driver.  Once you’re done with this, install the hpacucli RPM in the stateless image:

# rpm -Uivh hpacucli-8.60-8.0.noarch.rpm -r /install/netboot/centos5.5/x86_64/compute/rootimg

Once this is done, run:

# packimage -p compute -a x86_64 -o centos5.5

Then a simple:

# nodeset vhost001 netboot=centos5.5-x86_64-compute
# rpower vhost001 boot

will install the nodes to this image.  That’s it, then you can run the commands above to get the RAID set up.

Bonus points:  Then install ESXi 4.1 with xCAT.

ImageX Windows 2008 with vCenter Server

With xCAT I used the imagex capabilities to clone a machine (virtual machine) with vCenter on it and now I’m installing that captured image to another virtual machine.  One reason I do it this way as opposed to creating a VM Template is that now I’m able to deploy to physical and virtual servers.  In addition I can deploy to KVM VMs and VMware VMs.

Anyway, when I restarted the cloned disk, I couldn’t start vCenter Server on it. This is something that I assume happens with all Windows images that you run sysprep on, regardless of whether or not it’s xCAT induced.

There were a lot of error messages as I trawled through the logs and I searched on all of them:

“Windows could not start the VMware VirtualCenter Server on Local Computer.  For more information, review the System Event Log.”

“the system cannot find the path specified. c:\Program Files (x86)\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\master.mdf” (odd to me, since the file did actually exist!)

“could not start the SQL Server (SQLEXP_VIM) on Local Computer”

I tried all these google terms with little success:

“Virtual Center will not start after reboot”

Finally, I found this post that made sense to me, so I gave it a shot.  Here’s exactly what needs to be done:

Step 1

Log into Windows Server 2008 machine and go to ‘Services’

Step 2

Right click on ‘SQL Server (SQLEXP_VIM)’  and click on Properties

Step 3

Under ‘Log On’ check the Local System account and also check ‘Allow service to interact with desktop’

Step 4

Start the ‘SQL Server (SQLEXP_VIM)’ service.  It should start up without errors.

Step 5

Do the same procedure (Steps 2-4) to the ‘VMware VirtualCenter Server’ service.  This should now start up and you should be able to connect to Virtual Center.

Well, that wasted about 4 hours of my time, but it’s nice to have a happy ending.

State of xCAT on HP Blades

I had the opportunity this week to test drive xCAT on HP blades. I had a c7000 chassis with some spiffy BL460c G6s. The configuration is very straight forward. We’ve updated the xCAT Install Guide to include how to configure the blades and I think we’ll be doing a lot more.
Currently on these blades the following seems to work well:

  • getmacs
  • rinv

rpower works but there are some glitches where it doesn’t return status correctly.  We’ll be fixing that to make sure it does.  rpower <noderange> boot (which we rely on a lot) is non functional.  (Mostly I think because rpower off and on don’t work all the time as expected.)

rvitals is not set up either.

It’s been good to see how xCAT is able to function on many vendors’ platforms.  I think one of the things that makes it uniquely positioned among data center management solutions is that it is able to excel in heterogeneous environments.  I hope this also dispels any myths that xCAT is an IBM product.  While its legacy is IBM, it has evolved into an open source project that can be used by many organizations desiring data center management without vendor hardware lock-in.

The xCAT Documentation Problem

Ever since 2001 I remember hearing the same thing about xCAT: “The documentation needs improvement”, “There is a steep learning curve”. Back in those days the quick xCAT-mini-howto was a great way to get going. The problem is that it left out a lot of the important details like service nodes, offloading tftp servers, etc. For most questions you really just had to read the code to figure out the solution.  But it was a great way to start.

Later there came a redbook and that was great, but now it’s far too outdated to be helpful. For the most part, we kept writing a bunch of mini howtos for individual implementations. Looking back, this is something we should have taken more seriously.  I think the project would be much farther along had we done something.

When xCAT 2 was created in 2007 there was a discussion about what to do about the documentation. The decision (which I don’t agree with) was to write the docs in OpenOffice and then create PDFs out of them. Others (me) thought it would be nice just to create a general wiki but the other camp didn’t like that you couldn’t make PDFs out of them.  (Yes, I know you can, but it’s not so obvious on SourceForge how to do it). So the xCAT documentation was created in OpenOffice and distributed with the xCAT rpms. You can see it all here. In addition, another IBM redbook was written (and then quickly outdated), and we in the field tried updating the xCAT wiki, but the camp was divided and this probably caused more confusion than good.

So now we have a problem. It’s the problem we’ve always had.  There is a lot of documentation, but it’s all over the place. There is no single place to get all the information.  In fact, we even made a document to help you find the right document.  It’s called the ‘Top Document’.  I’m sorry, but that is just pathetic.

For any configurable software to really succeed there needs to be an easy-to-follow, well thought-out, quality documentation system. I think many developers feel that documentation is beneath them: “The documentation is in the code”. They are too busy adding new features to worry about the common mundane problems of telling people what they did.  This kind of thinking limits the install base.  I am frankly more concerned that nobody has taken this very seriously before.  It is the number one thing I hear from xCAT users:  Make a good doc!

So I started thinking about what kind of documentation I would like for xCAT. I started writing a book and thought about publishing it but I could never find the time to finish it. In addition things changed quickly and I didn’t make it a priority.  In addition, how would I distribute it?  How would I update it since it would be obsoleted so quickly?  So I made a list of requirements:
- It must be maintainable very quickly. Wikis are great because people can edit them right away. Firing up an OpenOffice doc, editing, then converting to PDF, then checking back in is too archaic and takes too much time. An agile development environment needs to have an agile documentation environment. So any docs are out.  Some wikis are too restrictive.  I don’t like the one on SourceForge because I like a project to have its own look and feel.

- It must be social. People should be able to comment on the documentation. Readers can add lots of good information and tips for other readers that would save us a lot of work. (They can also call us out if things don’t make sense.) The PHP and MySQL sites were great examples of this where people would give tips on how to use certain functions. Often times I would find the answers for what I was looking for in the readers comments.

- It must have a consistent format and be easy to follow. The problem with the PDF approach is that as it grew, nobody would want to combine them and so we had a proliferation of PDF documents.  (Hence the reason for the Top Document.) We need a way to reorganize, add, expand and be very flexible.  In addition we need to have a consistent format and be easy on the eyes :-)

So I looked at documentation projects I liked. I already mentioned PHP and MySQL. jQuery was another one that I thought was excellent. They are all web based. This makes it better: Now people can just use a search engine to find the answers they need! Something you can not get in a static document living in a repository.  But what about the people installing xCAT that don’t have access to internet?  Fine make it so we have a PDF that can be generated on the fly.

I also looked at commercial products.  This just seemed kind of wrong to do for an open source project.  They don’t lend themselves to collaboration and I just didn’t get a good vibe about them.  Plus, I hate opening up other applications.  I want to do it all through the browser!

So web based authoring seemed the way to go based on all I saw. But what platform? I liked the navigation of Red Hat’s documentation but I didn’t want to be forced to use Linux nor DocBook. Too much work to update.  Plus, there was no commenting system. Drupal and others seemed like a bit of an overkill as did Word Press.  I would have to spend more time learning how to customize them than just writing something from scratch.  At We don’t have a lot of control over that environment because its hosted by sourceforge. And logging in takes forever and is quite frustrating to me.  So I had to find another place to host it.

So here is my solution: is a rails application. We made it so that we could extend the Sumavisor (which is also a rails application) for our future evil plans. Why not just create a quick documentation model? I sat down for 3 days and wrote one out, tried to stylize it and even integrated Disqus (a la jQuery) to have a commenting system.
The system still has a few problems, but was easier to write than I thought. (yes, you elite programmers are shocked that it took me 3 days to do, but hey, I had to deploy it as well as do my day job) I then started adding the content of my book and so far I am very pleased. The navigation is easier, I can change things around quicker and I have to say it is very agile.
I am hopeful that in the coming days we will solve the xCAT Documentation problem. I am hoping that more people will use xCAT because they will see how easy it is to use. I am hoping that people will be able to use search engines to find answers to xCAT problems. The mailing list has been a great help to this, but it’s time to get more modern.  I am hoping to have something mostly done by mid October.

For any feature-rich software like xCAT, it is imperative to have a good documentation system.  It can be one of the things that sets you apart from the crowd.  It’s frustrating not being able to get things to work.  I’ve downloaded many packages and discarded them after not being able to get them to run.  It’s also important to use the collective wisdom of the community.  But the community needs an initial structure to build on.

Take a look at the docs we’ve started and let me know your thoughts.  Like I said, we’re looking to have something usable by mid October.  You can look at them here.