Monthly Archives: May 2011

It’s Been a Cloud-y Week

No, I’m not talking about the weather here in San Francisco – that’s actually been pretty good. It’s just that everywhere you look here at the Citrix Summit / Synergy conference, the talk is all about clouds - public clouds, private clouds, even personal clouds, which, according to Mark Templeton’s keynote on Wednesday, refer to all your personal stuff:

  • My Devices – of which we have an increasing number
  • My Preferences – which we want to be persistent across all of our devices
  • My Data – which we want to get to from wherever we happen to be
  • My Life – which increasingly overlaps with…
  • My Work – which I want to use My Devices to perform, and which I want to reflect My Preferences, and which produces Work Data that is often all jumbled up with My Data (and that can open up a whole new world of problems, from security of business-proprietary information to regulatory compliance).

These five things overlap in very fluid and complex ways, and although I’ve never heard them referred to as a “personal cloud” before, we do need to think about all of them and all of the ways they interact with each other. So if creating yet another cloud definition helps us do that, I guess I’m OK with that, as long as nobody asks me to build one.

But lest I be accused of inconsistency, let me quickly recap the cloud concerns that I shared in a post about a month ago, hard on the heels of the big Amazon EC2 outage:

  1. We have to be clear in our definition of terms. If “cloud” can simply mean anything you want it to mean, then it means nothing.
  2. I’m worried that too many people are running to embrace the public cloud computing model while not doing enough due diligence first:
    1. What, exactly, does your cloud provider’s SLA say?
    2. What is their track record in living up to it?
    3. How well will they communicate with you if problems crop up?
    4. How are you ensuring that your data is protected in the event that the unthinkable happens, there’s a cloud outage, and you can’t get to it?
    5. What is your business continuity plan in the event of a cloud outage? Have you planned ahead and designed resiliency into the way you use the cloud?
    6. Never forget that, no matter what they tell you, nobody cares as much about your stuff as you do. It’s your stuff. It’s your responsibility to take care of it. You can’t just throw it into the cloud and never think about it again.

Having said that, and in an attempt to adhere to point #1 above, I will henceforth stick to the definitions of cloud computing set forth in the draft of Special Publication 800-145, released by the National Institute of Standards and Technology in January of this year, and I promise to tell you if and when I deviate from those definitions. The following are the essential characteristics of cloud computing as defined in that draft document:

  • On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service’s provider.
  • Broad network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
  • Resource pooling. The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, network bandwidth, and virtual machines.
  • Rapid elasticity. Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out, and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
  • Measured Service. Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.

If you read through those points a couple of times and give them a moment’s thought, a few things should become obvious.

First, most of the chunks of infrastructure that are being called “private clouds” aren’t – at least by the definition above. Standing up a XenApp or XenDesktop infrastructure, or even a mixed environment of both, does not mean that you have a private cloud, even if you access it from the Internet. Virtualizing a majority, or even all, of your servers doesn’t mean you have a private cloud.

Second, very few Small & Medium Enterprises can actually justify the investment required to build a true private cloud as defined above, although some of the technologies that are used to build public and private clouds (such as virtualization, support for broad network access, and some level of user self-service provisioning) will certainly trickle down into SME data centers. Instead, some will find that it makes sense to move some services into public clouds, or to leverage public clouds to scale out or scale in to address their elasticity needs. And some will decide that they simply don’t want to be in the IT infrastructure business anymore, and move all of their computing into a public cloud. And that’s not a bad thing, as long as they pay attention to my point #2 above. If that’s the way you feel, we want to help you do it safely, and in a way that meets your business needs. That’s one reason why I’ve been here all week.

So stay tuned, because we’ll definitely be writing more about the things we’ve learned here, and how you can apply them to make your business better.

IntelliCache and the IOPS Problem

If you’ve been following this blog for any length of time, you know that we’ve written extensively about XenDesktop, and spent a lot of time on best practices and problems to avoid. And one of the biggest problems to avoid is poor storage design resulting in poor VDI performance.

In a nutshell, the problem is that a Windows desktop OS uses disk far differently than a Windows server OS. Thanks to the way Windows uses the swap file, disk writes outnumber disk reads by about 2 to 1. You can build your virtual desktop infrastructure on the latest and greatest server hardware, with tons of processing power and insanely huge amounts of RAM, but if all of the disk I/O for all of those virtual desktops is hitting your SAN, you’ve got a scalability problem on your hands.
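
To put some rough numbers on that, here’s a back-of-the-envelope sketch of the load a pool of desktops generates. The 500-desktop count and the 10-IOPS-per-desktop figure are illustrative assumptions, not benchmarks:

```python
# Back-of-the-envelope sketch (not a sizing tool): estimate the aggregate
# IOPS a pool of virtual desktops pushes at shared storage, assuming the
# roughly 2:1 write-to-read mix described above. Per-desktop IOPS is an
# illustrative assumption, not a benchmark.

def aggregate_iops(desktops, iops_per_desktop, write_fraction=2 / 3):
    """Split the total desktop I/O load into read and write IOPS."""
    total = desktops * iops_per_desktop
    writes = total * write_fraction      # ~2 of every 3 I/Os are writes
    reads = total - writes
    return total, reads, writes

total, reads, writes = aggregate_iops(desktops=500, iops_per_desktop=10)
print(f"Total: {total:.0f} IOPS (reads: {reads:.0f}, writes: {writes:.0f})")
# If every one of those I/Os lands on the SAN, that's the steady-state
# load your shared storage has to absorb for this one desktop pool.
```

Roughly two-thirds of that load is writes, which is exactly the part that punishes shared storage the hardest.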

Provisioning Services (“PVS”) can help to mitigate this in two ways (assuming for sake of argument that you’re provisioning multiple virtual systems from a common, read-only image): First, if you build your Provisioning Servers correctly, you should be able to serve up most of the OS read operations from the Provisioning Server’s own cache memory. Second, you can use the virtualization host’s local disk storage as the required “write cache” - because all of those write operations have to go somewhere while the virtual system is running.
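
Continuing the same illustrative numbers, here’s a rough sketch of where that I/O ends up once PVS is in the picture. The 90% read cache-hit rate is purely an assumption for illustration; real hit rates depend on how much RAM your Provisioning Servers have.

```python
# Rough sketch of where the I/O lands when PVS streams a common read-only
# image, reusing the illustrative 500-desktop / 10-IOPS numbers above.
# The read cache-hit rate is an assumption, not a measured value.

TOTAL_IOPS = 500 * 10                          # same illustrative load
READS = TOTAL_IOPS / 3                         # ~1:2 read-to-write mix
WRITES = TOTAL_IOPS - READS
PVS_CACHE_HIT_RATE = 0.90                      # assumed, depends on PVS RAM

reads_from_pvs_ram = READS * PVS_CACHE_HIT_RATE    # served from server memory
reads_from_storage = READS - reads_from_pvs_ram    # still hit vDisk storage
writes_to_local_cache = WRITES                     # host-local write cache

print(f"Reads served from PVS cache memory: {reads_from_pvs_ram:.0f} IOPS")
print(f"Reads still hitting vDisk storage:  {reads_from_storage:.0f} IOPS")
print(f"Writes landing on host-local cache: {writes_to_local_cache:.0f} IOPS")
```

The point isn’t the exact numbers; it’s that most of the read traffic never touches a disk at all, and the write traffic stays off the SAN entirely.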

But XenDesktop 5 introduced a new way to provision desktops called “Machine Creation Services” (“MCS”). We wrote about this in the April edition of our Moose Views newsletter, so if you’re not familiar with all the pros and cons of MCS vs. PVS, I’d encourage you to take a brief time out and read that article. Suffice it to say that, despite all the advantages of MCS, the biggest downside of using MCS to provision pooled desktops was that all of the IOPS hit your SAN storage, which limited the scalability of an MCS-provisioned VDI deployment.

But all of that just changed with the release of XenDesktop 5 Service Pack 1, which was made available for download a week ago (May 13). With SP1, XenDesktop 5 is now able to take advantage of the “IntelliCache” feature that was introduced as part of XenServer v5.6 Service Pack 2. Using MCS with the combination of XenDesktop 5 SP1 and XenServer 5.6 SP2…

  • The first time a virtual desktop is booted on a given XenServer, the boot image is cached on that XenServer’s local storage.
  • Subsequent virtual desktops booted on that same XenServer will boot and run from that locally cached image.
  • You can use the XenServer’s local storage for the write cache as well.

The bottom line is that you can move as much as 90% of the IOPS off of the SAN and onto local XenServer storage, removing nearly all of the scalability limitations from an MCS-provisioned environment.

With most of the IOPS for running VMs taking place on local storage, it’s pretty straightforward to figure out how many VMs you can expect to support on a given virtualization host. Dan Feller’s blog post does a great job of walking through the process of calculating the functional IOPS that your local XenServer storage repository should be able to support, and inferring from that number how many light, normal, or power users you should be able to support as a result.
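
If you want to play with that kind of calculation yourself, here’s a simplified sketch of the math. The RAID write penalties, the raw per-disk IOPS, and the per-user IOPS figures for light, normal, and power users are illustrative assumptions; Dan’s post walks through the real numbers.

```python
# Simplified functional-IOPS sketch for a local storage repository.
# All figures below (per-disk IOPS, RAID penalties, per-user IOPS) are
# illustrative assumptions; use your own hardware specs for real sizing.

RAID_WRITE_PENALTY = {"RAID0": 1, "RAID1": 2, "RAID10": 2, "RAID5": 4}

def functional_iops(disks, iops_per_disk, raid, write_fraction=2 / 3):
    """Usable IOPS after the RAID write penalty is applied to the write share."""
    raw = disks * iops_per_disk
    penalty = RAID_WRITE_PENALTY[raid]
    read_fraction = 1 - write_fraction
    return raw * read_fraction + (raw * write_fraction) / penalty

def supported_users(func_iops, iops_per_user):
    """How many desktops of a given workload profile fit in that budget."""
    return int(func_iops // iops_per_user)

# Example: four local 15k SAS drives (~175 raw IOPS each, assumed) in RAID 10
fiops = functional_iops(disks=4, iops_per_disk=175, raid="RAID10")
print(f"Functional IOPS: {fiops:.0f}")
for profile, per_user in {"light": 5, "normal": 10, "power": 20}.items():
    print(f"  {profile:>6} users supported: {supported_users(fiops, per_user)}")
```

The exact per-user IOPS figures are debatable; the point is that the budget comes from local disks you control, not from the SAN.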

This also means that using XenServer as the hypervisor for your XenDesktop 5 deployment is going to yield a significant performance advantage over any other hypervisor, unless or until the other guys come out with similar local caching features. So, if you’re a VMware shop, my advice is this: Go ahead and virtualize all of the supporting XenDesktop server components on your vSphere infrastructure. Run your XenDesktop 5 VMs on XenServer hosts, and just don’t tell anyone! If you’re asked, just say, “Oh, yeah, these are my XenDesktop host systems - they’re completely separate from our vSphere infrastructure, because we don’t need the (insert favorite vSphere feature) function for these systems.” Your infrastructure will run better, and no one will know but you…

High Availability vs. Fault Tolerance

Many times, terms like “High Availability” and “Fault Tolerance” get thrown around as though they were the same thing. In fact, “fault tolerant” can mean different things to different people - and much like “portal” or “cloud,” it’s important to be clear about exactly what someone means when they use it.

As part of our continuing efforts to guide you through the jargon jungle, we would like to discuss redundancy, fault tolerance, failover, and high availability, and we’d like to add one more term: continuous availability.

Our friends at Marathon Technologies shared the following graphic, which shows how IDC classifies the levels of availability:

[Graphic: The Availability Pyramid, illustrating IDC’s levels of availability]

Redundancy is simply a way of saying that you are duplicating critical components in an attempt to eliminate single points of failure. Multiple power supplies, hot-plug disk drive arrays, multi-pathing with additional switches, and even duplicate servers are all part of building redundant systems.

Unfortunately, there are some failures, particularly if we’re talking about server hardware, that can take a system down regardless of how much you’ve tried to make it redundant. You can build a server with redundant hot-plug power supplies and redundant hot-plug disk drives, and still have the system go down if the motherboard fails - not likely, but still possible. And if it does happen, the server is down. That’s why IDC classifies this as “Availability Level 1” (“AL1” on the graphic)…just one level above no protection at all.

The next step up is some kind of failover solution. If a server experiences a catastrophic failure, the workloads are “failed over” to a system that is capable of supporting them. Depending on those workloads, and on what kind of failover solution you have, that process can take anywhere from minutes to hours. If you’re at “AL2,” and you’ve replicated your data using, say, SAN replication or some kind of server-to-server replication, it could take a considerable amount of time to actually get things running again. If your servers are virtualized, with multiple virtualization hosts running against a shared storage repository, you may be able to configure your virtualization infrastructure to automatically restart a critical workload on a surviving host if the host it was running on experiences a catastrophic failure - meaning that your critical system is back up and on-line in the amount of time it takes the system to reboot, typically 5 to 10 minutes.

If you’re using clustering technology, your cluster may be able to fail over in a matter of seconds (“AL3” on the graphic). Microsoft server clustering is a classic example of this. Of course, it means that your application has to be cluster-aware, you have to be running Windows Enterprise Edition, and you may have to purchase multiple licenses for your application as well. And managing a cluster is not trivial, particularly when you’ve fixed whatever failed and it’s time to unwind all the stuff that happened when you failed over. Your application is also unavailable during whatever interval of time the cluster needs to detect the failure and complete the failover process.

You could argue that a failover of 5 minutes or less equals a highly available system, and indeed there are probably many cases where you wouldn’t need anything better than that. But it is not truly fault tolerant. It’s probably not good enough if you are, say, running a security application that controls smart-card access to secured areas in an airport, or a video surveillance system that is sufficiently critical that you can’t afford a 5-minute gap in your video record, or a process control system where a five-minute halt means you’ve lost the integrity of your work in process and potentially have to discard thousands of dollars’ worth of raw material and lose thousands more in lost productivity while you clean out your assembly line and restart it.

That brings us to the concept of continuous availability. This is the highest level of availability, and what we consider to be true fault tolerance. Instead of simply failing workloads over, this level allows for continuous processing without disruption of access to those workloads. Since there is no disruption in service, there is no data loss, no loss of productivity, and no waiting for your systems to restart your workloads.

So all this leads to the question of what your business needs.

Do you have applications that are critical to your organization? If those applications go down, how long could you afford to be without access to them? If they go down, how much data can you afford to lose? 5 minutes’ worth? An hour’s worth? And, most importantly, what does it cost you if that application is unavailable for a period of time? Do you know, or can you calculate it?

This is another way of asking what your requirements are for “RTO” (“Recovery Time Objective” - i.e., when a system goes down, how long do you have before you must be back up?) and “RPO” (“Recovery Point Objective” - i.e., when you do get the system back up, how much data is it OK to have lost in the process?). We’ve discussed these concepts in previous posts. These are questions that only you can answer, and the answers are significantly different depending on your business model. If you’re a small business, and your accounting server goes down, and all it means is that you have to wait until tomorrow to enter today’s transactions, that’s a far different situation from a major bank that is processing millions of dollars in credit card transactions.
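
If that last question stumps you, a crude model like the one below can at least frame the discussion. Every figure in it (revenue per hour, headcount, loaded labor cost, outage durations) is a placeholder you would replace with your own numbers.

```python
# Crude downtime-cost model to help frame the RTO discussion. Every figure
# here is a placeholder assumption; plug in your own revenue, labor, and
# outage-duration numbers.

def downtime_cost(outage_minutes, revenue_per_hour,
                  affected_employees, loaded_cost_per_hour):
    """Lost revenue plus the cost of idled staff for the outage window."""
    hours = outage_minutes / 60
    lost_revenue = revenue_per_hour * hours
    idle_labor = affected_employees * loaded_cost_per_hour * hours
    return lost_revenue + idle_labor

# Compare a cluster-style failover (a few minutes) with a restore-from-
# replica recovery (a few hours) for the same hypothetical business.
for label, minutes in [("AL3-style failover", 5), ("AL2-style recovery", 240)]:
    cost = downtime_cost(minutes, revenue_per_hour=10_000,
                         affected_employees=50, loaded_cost_per_hour=40)
    print(f"{label:>20}: ~${cost:,.0f} for {minutes} minutes down")
```

Even a rough number like that makes it much easier to decide which availability level is actually worth paying for.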

If you can satisfy your business needs by deploying one of the lower levels of availability, great! Just don’t settle for an AL1 or even an AL3 solution if what your business truly demands is continuous availability.