How to Ruin Your Weekend and Other Hazards of Mis-Configured HA

NOTE: This was originally posted in October, 2009, and may not be a problem any more with current versions of XenServer, as some of the more recent comments would tend to verify - but we will keep the post active for historical purposes. (added by blog administrator, March 16, 2012)

The Level 1 HA (High Availability) feature that comes with Citrix Essentials for XenServer may be one of the best ways to crash your whole virtual infrastructure if you don’t understand how it works and don’t design in an appropriate level of redundancy. This of course will lead to hours of downtime, unhappy management, possible data loss, and lots of extra work for you (most likely on a weekend).

The basics -
HA is designed to monitor the XenServer virtualization environment. When HA is enabled, the administrator can specify which virtual machines (VMs) need to be automatically restarted if the host server they’re running on should fail. If there is a failure of a host server, HA should then automatically restart its designated guest VMs on another host in the XenServer “resource pool.” Note that the HA function does not “live migrate” the guest VMs, because when a host fails the VMs on that host also fail. Rather, it selects another host server and restarts the VMs on that host. For all of this to happen correctly, Citrix’s HA requires two things to be true at all times:

  1. Each XenServer must be able to communicate with its peers in the pool.
  2. Each XenServer in the pool requires access at all times to the HA heartbeat disk, which is shared by all the XenServers in the pool.

If either of these two items is not true for any given XenServer in the pool, that server will “fence.” The short definition of “fencing” is that the XenServer suspects – although it’s not absolutely sure – that it is experiencing some kind of failure, so to protect against possible data corruption it shuts itself down – essentially sacrificing itself to protect the data – until a human comes along and sorts things out. If the fenced server is in a correctly configured HA pool, guest VMs that were configured for HA restart will be restarted on a surviving XenServer.
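
The two rules above can be sketched as a small decision function. This is only a model of the behaviour as this post describes it, not XenServer's actual HA daemon logic; the `HostView` class and its field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostView:
    """What one pool member can currently observe (hypothetical model)."""
    name: str
    reachable_peers: int          # peers answering over the management network
    heartbeat_disk_visible: bool  # can this host still see the heartbeat disk?


def should_fence(view: HostView) -> bool:
    """Apply the two rules as stated in this post: fence if either one fails."""
    can_talk_to_pool = view.reachable_peers > 0
    return not (can_talk_to_pool and view.heartbeat_disk_visible)


# Three example observations: healthy, cut off from peers, cut off from storage.
examples = [
    HostView("xs1", reachable_peers=2, heartbeat_disk_visible=True),
    HostView("xs2", reachable_peers=0, heartbeat_disk_visible=True),
    HostView("xs3", reachable_peers=2, heartbeat_disk_visible=False),
]
for view in examples:
    print(view.name, "fences" if should_fence(view) else "keeps running")
```

Note that the function has no way to tell "I am broken" from "everyone else is broken," which is exactly why the scenarios below turn out the way they do.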

Considerations -
So… you have two XenServers all set up and all your VMs configured just the way you like them, and you decide to turn on HA. Everything appears to be working until one of the hosts suffers a failure and goes offline. (Murphy’s Law says this will happen on a Saturday evening right before your BBQ party is starting.) With HA enabled, you would expect, based on the whole “High Availability” concept, that everything would be OK. Critical VMs should get restarted on the other host and you should be able to deal with the failed host on Monday.

Oh, but wait, remember HA rule #1? The XenServer host that is still running suddenly does not have any peers to talk to. It no longer knows whether or not it’s healthy so, in the interest of protecting your data from corruption, it does what it’s designed to do: it fences, and now both of your XenServers are down. They may try to reboot, but you are now in an endless loop of fencing, and to break out of it you’re going to have to know how to use the “xe host-emergency-ha-disable force=true” command. (And if you don’t understand that last sentence, you’re in for a long weekend.)

This results in a situation that we in IT refer to as “not good,” with a chance of “career altering,” and you’re going to miss your BBQ party.

Here’s another scenario that will spoil your party: What if both XenServers are actually healthy, and all the virtual servers are up and functioning, but the network link for the management communications between the XenServers fails? Again, each XenServer would think it was stranded from the pool and fence itself in an attempt to correct the issue. With both servers fencing, this would again create an endless loop of server fencing. In essence, one server would start to come back online and would still not see the other XenServer and would fence again, and so on, and so on.

So for those reasons a two-XenServer pool cannot successfully run HA! Just don’t do it - even though you can configure HA on a two-server pool the result can be disastrous and ruin your weekend…not to mention your next performance review.
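
To make the two-host arithmetic concrete, here is a toy simulation of the peer rule as described in this post. It is a sketch under that assumption only (the function name and host names are invented, and real XenServer versions may behave differently, as some commenters report): a live host keeps running only if at least one other live host is reachable over the management network.

```python
def surviving_hosts(hosts, failed_hosts=(), management_link_up=True):
    """Return the hosts still running after the post's peer rule is applied.

    A live host keeps running only if it can reach at least one other live
    host over the management network; otherwise it fences.
    """
    alive = [h for h in hosts if h not in failed_hosts]
    if not management_link_up:
        return []  # nobody can reach anybody, so every live host fences
    # With fewer than two live hosts, the lone survivor has no peer and fences.
    return alive if len(alive) >= 2 else []


two_host_pool = ["xs1", "xs2"]
three_host_pool = ["xs1", "xs2", "xs3"]

# One host dies: the two-host pool loses everything, the three-host pool copes.
print(surviving_hosts(two_host_pool, failed_hosts=["xs1"]))
print(surviving_hosts(three_host_pool, failed_hosts=["xs1"]))

# Management link dies while every host is healthy: the whole pool fences.
print(surviving_hosts(two_host_pool, management_link_up=False))
```

Under this model, a two-host pool can never survive the loss of one member, because the survivor is by definition alone.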

Well, what about HA in a three node XenServer pool? Based upon the previously described scenarios, you now have a valid “pool,” in which HA will function. So you configure and enable HA, and when you test the HA functionality by killing one of the XenServers, everything works like it is supposed to. The guest VMs are restarted on the surviving XenServer hosts and you’re happy that everything is working correctly.

But here is another “gotcha!” If you have only one Ethernet interface per XenServer assigned to management, and they’re all plugged into one switch, what happens if the management link fails because a NIC fails – or even worse, the switch fails? If it’s just a NIC in one server, then that XenServer will fence – not too bad but still not what you want. If you were using a different set of NICs (as you always should) for the guest VMs to communicate with the rest of the world, then the guests on that server were probably up and working just fine until the server fenced. Sure, the critical ones will restart on the remaining servers, but you’ve lost a third of the resources in your pool unnecessarily.

Now let’s consider what would happen if the switch should fail and you had only single management ports on each XenServer all plugged into just that one switch. If this happens, it may be time to dust off the old resume, because you have just lost your entire XenServer pool. Why? Because when the switch went down, all the XenServers lost communication with one another, and each assumed that, because it was suddenly isolated from the pool, it must be experiencing some kind of failure. Therefore the whole pool fenced.

Conclusions -
Citrix’s HA does not work in a two host pool, period. With a pool of three or more XenServers you’ll be OK if you design the infrastructure correctly so that there is no single point of failure in your peer communications. How? Simply by bonding together two NICs, dedicating them to the management communication function, and then splitting the bonded pairs between two separate Ethernet switches. That way you’re protected against both a NIC failure and a switch failure.
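
A quick way to sanity-check a management-network design on paper is to knock out each component in turn and see whether any host ends up with no working path. The sketch below does that for two hypothetical layouts. The host, NIC, and switch names are invented, and the model only looks at each host's own uplinks (it does not model inter-switch links or the storage side).

```python
def isolated_hosts(host_uplinks, failed):
    """Hosts with no working management path once `failed` components are down.

    host_uplinks maps host name -> list of (nic, switch) pairs; a bond is
    simply more than one pair. A path works if neither its NIC nor its
    switch is in `failed`.
    """
    failed = set(failed)
    isolated = []
    for host, uplinks in host_uplinks.items():
        has_path = any(nic not in failed and switch not in failed
                       for nic, switch in uplinks)
        if not has_path:
            isolated.append(host)
    return isolated


def single_points_of_failure(host_uplinks):
    """Components whose failure alone isolates at least one host."""
    components = set()
    for uplinks in host_uplinks.values():
        for nic, switch in uplinks:
            components.update((nic, switch))
    return sorted(c for c in components if isolated_hosts(host_uplinks, [c]))


# One management NIC per host, all on one switch (the "gotcha" layout).
single_switch = {
    "xs1": [("xs1-eth0", "sw-a")],
    "xs2": [("xs2-eth0", "sw-a")],
    "xs3": [("xs3-eth0", "sw-a")],
}

# Two bonded NICs per host, split across two switches.
bonded = {
    "xs1": [("xs1-eth0", "sw-a"), ("xs1-eth1", "sw-b")],
    "xs2": [("xs2-eth0", "sw-a"), ("xs2-eth1", "sw-b")],
    "xs3": [("xs3-eth0", "sw-a"), ("xs3-eth1", "sw-b")],
}

print(single_points_of_failure(single_switch))  # every NIC and the switch
print(single_points_of_failure(bonded))         # nothing
print(isolated_hosts(single_switch, ["sw-a"]))  # the whole pool
```

In the single-switch layout every NIC and the switch itself show up as single points of failure; in the bonded layout no single component does.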

But you’re not out of the woods yet! Don’t forget HA rule #2: servers need to see the HA heartbeat disk. This is equally important, and you must consider the topology of that side of the network (iSCSI, Fibre Channel, etc.) and be sure it is also redundant. And if you’re using iSCSI multi-pathing (e.g., with a pair of mirrored DataCore iSCSI SAN nodes), be sure to manually bump up the HA timeout interval so that if one of the SAN nodes should fail, the multi-pathing function has time to fail over to the other node before the XenServers all conclude that the HA heartbeat disk is gone. Otherwise, again, they will all fence. Our testing indicates that a two-minute timeout appears to provide an adequate margin of safety. The default setting of 30 seconds is definitely too short. Unfortunately, this setting does not appear to be persistent, so if you turn HA off and then back on, you’ll need to manually reset the timeout interval again. (This is probably a job for Workflow Studio, but we just haven’t had time to work through the process yet.)
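
The timeout reasoning boils down to one comparison: the storage path has to come back before the HA timeout runs out. Here is a minimal sketch of that check. The 30-second and 120-second figures come from this post; the 50-second failover time is only an illustrative value (one reader reports an outage of about that length in the comments), and the function names are invented.

```python
DEFAULT_HA_TIMEOUT_S = 30    # default mentioned in this post
TESTED_HA_TIMEOUT_S = 120    # value our testing suggested


def safety_margin(failover_seconds, ha_timeout_seconds):
    """Seconds to spare between path recovery and the HA timeout expiring."""
    return ha_timeout_seconds - failover_seconds


def heartbeat_survives(failover_seconds, ha_timeout_seconds):
    """True if the heartbeat disk is visible again before hosts would fence."""
    return safety_margin(failover_seconds, ha_timeout_seconds) > 0


example_failover_s = 50  # illustrative only; measure your own SAN failover

for timeout in (DEFAULT_HA_TIMEOUT_S, TESTED_HA_TIMEOUT_S):
    verdict = "OK" if heartbeat_survives(example_failover_s, timeout) else "pool fences"
    print(f"timeout={timeout}s margin={safety_margin(example_failover_s, timeout)}s -> {verdict}")
```

As far as we know, the timeout is supplied at the moment HA is enabled, in the form `xe pool-ha-enable heartbeat-sr-uuids=<sr-uuid> ha-config:timeout=120`, which would also explain why it has to be given again each time HA is re-enabled. Treat that syntax as something to verify against the documentation for your XenServer version.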

NO Single Points of Failure
HA will do a fine job of protecting you, if you build the network correctly. So make sure you’ve built in enough redundancy that you have no single point of failure, and enjoy your BBQ.

P.S.: If you can’t justify more than two XenServers, but you still have one or more critical guests that need to be highly available, there is a solution: Marathon Technologies’ everRun VM. But that’s another post for another day.

18 Thoughts on “How to Ruin Your Weekend and Other Hazards of Mis-Configured HA”

  1. I can confirm this is still an issue in 6.2.x. We had three clusters, all of which ran fine for quite some time, then bam! Each one took a tumble over a two-month period. For obvious reasons HA never really worked, so we just wound up disabling it. This article helped me pinpoint the issue, so I appreciate you writing it! Now to decide what exactly to do about it.

  2. Andrew Debbink on May 24, 2013 at 1:37 pm said:

    Unfortunately, the current release of XenServer still has an issue with High Availability in a two-host server pool. The Citrix XenServer 6.0/6.1 Administrative Guides note that “Citrix recommends that you enable HA only in pools that contain at least 3 XenServer hosts. For details on how the HA feature behaves when the heartbeat is lost between two hosts in a pool, see the Citrix Knowledge Base article CTX129721.” I have confirmed this with Citrix support and tested this after hours on our two-host pool. When the network heartbeat over the primary management interface is interrupted with only two hosts in an HA-enabled pool, one particular host will always fence and reboot, regardless of which host failed. The host with the higher UUID will always fence and reboot.

  3. I think it’s great that this post continues to generate this much interest. Please note, however, that the original post dates from October, 2009. We do not believe that there is as much risk today with the current generation of XenServer, and your most recent comments tend to bear that out. Thank you all for posting.

    • Actually no, the current generation of XenServer still has the issue.

      “Further to your query, issue described in CTX129721 is applicable to 6.0 and 6.1” - Citrix Technical Support (May 4, 2013)


  4. Ellis on March 16, 2012 at 3:50 am said:

    I may be in fantasy land here, but if the two hosts lose management interface connectivity to one another but have a separate path to the Heartbeat SR which stays up, can the two hosts not “figure it out” from the updates made to the Heartbeat SR file(s)? Surely they can work out which host still has network connectivity and which is the most appropriate server to fence?

    Also, I’ve been running HA in two-server pools for a couple of years using XS5.5, XS5.6 and XS6 and they have always acted as expected when a server goes awry. Is the behaviour different in XenServer 6? I think the article may need some more details to avoid becoming misleading and being accused of pushing everRun.

  5. Tomislav on March 15, 2012 at 10:33 am said:

    Great post, but I have one question/objection about the configuration with two hosts.
    If I use bonded NICs, two switches, and multipathing across two FC HBAs, why shouldn’t I run HA with two XenServers? I don’t see the point of failure.

  6. Excellent explanation, I never actually realized that XenServer would ‘fence’ to protect its data. Yeah, it happened to me too when a SAN controller did a failover. Three-server setup with XS5.0, using HBAs. The only one that remained online was the pool master; the other two restarted, kicking off VMs and all users at 8:17 in the morning. Ouch! The heartbeat link was gone for about 50 seconds due to the failover. Pretty short time on HA timeout too, 30 seconds is not much. Very interesting article, kudos!

  7. This is a fantastic explanation of host self-fencing! Thanks!

  8. First, sorry for my poor English. Our Citrix server farm (three servers) went into emergency mode when the UPS kicked in (!!!) after a brownout in our district. After returning to the office, I checked the servers. All were running (including our heartbeat/NAS), although not reachable via XenCenter. Ping was OK, but PuTTY was a mess with lousy response times, and xsconsole showed no NICs. You may ask about the VMs: well, they were running. I was really surprised. OK, Sherlock mode on, and I found the reason. All three servers were plugged into the same switch. The switch was temporarily out of order (for a few seconds, IMHO) until it returned to normal operation. Those few seconds were enough that none of the servers could find its heartbeat on the iSCSI/NAS, and they all fenced (emergency mode). I have now spread the three servers across different switches and added a separate UPS for the switch and the NAS (to keep the heartbeat up). By the way, the servers are operated in a server room which is protected by a huge UPS. Scary (…)
    Last but not least: if one of the servers with HA disabled (forced) needs to be brought back into operation in the pool, don’t search for an xe command. Just disable HA for the pool and enable it again, and the missing slave rejoins the pool. That’s all.

  9. Nice write-up (though it could be misleading in some areas until you get to the comments)… It may however be a good idea to indicate within the write-up that the aforementioned is no longer an issue, especially since you have confirmed this in your lab.

    What version of XenServer are you testing with?

  10. @Oliver - Thanks for your input. Your comment prompted us to go back and re-test with the latest version of XenServer. Although our testing is not yet complete, it does appear that some of the conditions we described in this blog post no longer cause both XenServers to fence. Once our testing is complete, we’ll be posting an update on our findings.

  11. Oliver on August 30, 2010 at 4:11 am said:

    I don’t understand this. I have two XenServers in a pool and I have no problems. I tested it. First I removed the first server from the network, and the protected machine started on the other server; then I tested it with the other server, also no problem. I have no fencing……

  12. Pingback: XenServer Host Is In Emergency Mode | Moose Logic Blog

  13. Thanks for this post! I have a three XenServer pool and had enabled HA tonight around midnight whilst doing other sorts of network maint… recabling a few things… I must’ve bumped a cable and suddenly both servers at the building I was in rebooted. I thought perhaps there was a power issue in the rack or that I’d been a klutz… then I looked in XenCenter and all three servers had ‘fenced’. Ugh. At least I found this out late at night and read your post. I’m going to look into getting enough redundancy with switches and NIC bonding so that I can have HA enabled.

  14. Frost Hon on January 26, 2010 at 3:18 am said:

    Excellent post, I really love how you explain everything in detail. We are testing our HA as well and came across your article.

  15. Jonathan Snyder on November 3, 2009 at 5:43 pm said:

    Very interesting. I have four xenservers, two in pool A, and two in pool B. It’s a long story why. At any rate, I have been planning my configuration for HA lately (albeit slowly) and am glad I ran across this article. It just may have saved my butt.
