Tag Archives: Business Continuity

InterOp Las Vegas 2014

I spent the day at InterOp Las Vegas 2014 and for me it was “Geek Heaven.” InterOp has always been one of my favorite technology trade shows, and this year didn’t disappoint. I have been told I don’t appear “geeky,” but I am indeed a true geek through and through! Walking the aisles for hours absorbing all the latest technology is pretty good entertainment for any geek. Today I was impressed with technology from many vendors, but there were a few standouts. Pogo Linux, in partnership with SanDisk, is building some very impressive performance into several Storage Area Network products, including Nexenta and OS Nexus QuantaStor. If you are in the market for a very high performance “Software Defined Storage” solution, these are excellent choices that will outperform the big boys at a fraction of the cost. I spent some time in the SUSE Linux booth chatting about their OpenStack Cloud products and promised to spin up a test of their solution when I return home. Stratus Technologies was another vendor whose products I believe are continuing to evolve with their recently acquired everRun High Availability/Fault Tolerant server technology. I have worked with everRun since the ’90s and it’s always been a very impressive way to keep workloads running even when multiple components or even a whole server fails. This is done by using a slick mirrored server instance with two Virtual Machines running software in “lockstep,” which means the server Operating Systems and application software are running simultaneously on both servers. If one fails, the surviving host simply keeps humming along like nothing happened. Of course, top vendors Citrix, Cisco, HP, Dell, and many others were there in force. Interop was founded 26 years ago as a workshop on TCP/IP and has a rich history as a conference to raise the skills of technology professionals.

The Red Cross Wants to Help You with DR Planning


A few days ago, I spotted a headline in the local morning paper: “SBA Partners with the Red Cross to Promote Disaster Planning.” We’ve written some posts in the past that dealt with the importance of DR planning, and how to go about it, so this piqued my curiosity enough that I visited the Red Cross “Ready Rating” Web site. I was sufficiently impressed with what I found there that I wanted to share it with you.

Membership in the Ready Rating program is free. All you have to do to become a member is to sign up and take the on-line self-assessment, which will help you determine your current level of preparedness. And I’m talking about overall business preparedness, not just IT preparedness. The assessment rates you on your responses to questions dealing with things like:

  • Have you conducted a “hazard vulnerability assessment,” including identifying appropriate emergency responders (e.g., police, fire, etc.) in your area and, if necessary, obtaining agreements with them?
  • Have you developed a written emergency response plan?
  • Has that plan been communicated to employees, families, clients, media representatives, etc.?
  • Have you developed a “continuity of operations plan?”
  • Have you trained your people on what to do in an emergency?
  • Do you conduct regular drills and exercises?

That last point is more important than you might think. It’s not easy to think clearly when you’re in the middle of an earthquake, or when you’re trying to find the exit when the building is on fire and there’s smoke everywhere. The best way to ensure that everyone does what they’re supposed to do is to drill until the response is automatic. It’s why we had fire drills when we were in elementary school. It’s still effective now that we’re all grown up.

Once you become a member, your membership will automatically renew from year to year, as long as you take the self-assessment annually and can show that your score has improved from the prior year. (Once your score reaches a certain threshold, you’re only required to maintain that level to retain your membership.)

So, why should you be concerned about this? It’s hard to imagine that, after the tsunami in Japan and the flooding and tornadoes here at home, there’s anyone out there who still doesn’t get it. But, just in case, consider these points taken from the “Emergency Fast Facts” document in the members’ area:

  • Only 2 in 10 Americans feel prepared for a catastrophic event.
  • Close to 60% of Americans are wholly unprepared for a disaster of any kind.
  • 54% of Americans don’t prepare because they believe a disaster will not affect them - although 51% of Americans have experienced at least one emergency situation where they lost utilities for at least three days, had to evacuate and could not return home, could not communicate with family members, or had to provide first aid to others.
  • 94% of small business owners believe that a disaster could seriously disrupt their business within the next two years.
  • 15 - 40% of small businesses fail following a natural or man-made disaster.

If you’re not certain how to even get started, they can help there as well. Here’s a screen capture showing a partial list of the resources available in the members’ area:


And speaking of getting started, check this out: Just about everything I’ve ever read about disaster preparedness talks about the importance of having a “72-hour kit” - something that you can quickly grab and take with you that contains everything you need to survive for three days. Well, for those of you who haven’t got the time to scrounge up all of the recommended items and pack them up, you may find the solution at your local Costco. Here’s what I spotted on my most recent trip:

Yep, it’s a pre-packaged 3-day survival kit. The cost at my local store (in Woodinville, WA, if you’re curious) was $69.95. That, in my opinion, is a pretty good deal.

So, if you haven’t started planning yet, consider this your call to action. Don’t end up as a statistic. You can do this.

High Availability vs. Fault Tolerance

Many times, terms like “High Availability” and “Fault Tolerance” get thrown around as though they were the same thing. In fact, the term “fault tolerant” can mean different things to different people - and much like the terms “portal,” or “cloud,” it’s important to be clear about exactly what someone means by the term “fault tolerant.”

As part of our continuing efforts to guide you through the jargon jungle, we would like to discuss redundancy, fault tolerance, failover, and high availability, and we’d like to add one more term: continuous availability.

Our friends at Marathon Technologies shared the following graphic, which shows how IDC classifies the levels of availability:

Graphic of Availability Levels

The Availability Pyramid

Redundancy is simply a way of saying that you are duplicating critical components in an attempt to eliminate single points of failure. Multiple power supplies, hot-plug disk drive arrays, multi-pathing with additional switches, and even duplicate servers are all part of building redundant systems.

Unfortunately, there are some failures, particularly if we’re talking about server hardware, that can take a system down regardless of how much you’ve tried to make it redundant. You can build a server with redundant hot-plug power supplies and redundant hot-plug disk drives, and still have the system go down if the motherboard fails - not likely, but still possible. And if it does happen, the server is down. That’s why IDC classifies this as “Availability Level 1” (“AL1” on the graphic)…just one level above no protection at all.

The next step up is some kind of failover solution. If a server experiences a catastrophic failure, the workloads are “failed over” to a system that is capable of supporting those workloads. Depending on those workloads, and what kind of failover solution you have, that process can take anywhere from minutes to hours. If you’re at “AL2,” and you’ve replicated your data using, say, SAN replication or some kind of server-to-server replication, it could take a considerable amount of time to actually get things running again. If your servers are virtualized, with multiple virtualization hosts running against a shared storage repository, you may be able to configure your virtualization infrastructure to automatically restart a critical workload on a surviving host if the host it was running on experiences a catastrophic failure - meaning that your critical system is back up and on-line in the amount of time it takes the system to reboot - typically 5 to 10 minutes.

If you’re using clustering technology, your cluster may be able to fail over in a matter of seconds (“AL3” on the graphic). Microsoft server clustering is a classic example of this. Of course, it means that your application has to be cluster-aware, you have to be running Windows Server Enterprise Edition, and you may have to purchase multiple licenses for your application as well. And managing a cluster is not trivial, particularly when you’ve fixed whatever failed and it’s time to unwind all the stuff that happened when you failed over. And your application was still unavailable during whatever interval of time was required for the cluster to detect the failure and complete the failover process.

You could argue that a failover of 5 minutes or less equals a highly available system, and indeed there are probably many cases where you wouldn’t need anything better than that. But it is not truly fault tolerant. It’s probably not good enough if you are, say, running a security application that’s controlling the smart-card access to secured areas in an airport, or a video surveillance system that is sufficiently critical that you can’t afford to have a 5-minute gap in your video record, or a process control system where a 5-minute halt means you’ve lost the integrity of your work in process and potentially have to discard thousands of dollars’ worth of raw material and lose thousands more in lost productivity while you clean out your assembly line and restart it.

That brings us to the concept of continuous availability. This is the highest level of availability, and what we consider to be true fault tolerance. Instead of simply failing workloads over, this level allows for continuous processing without disruption of access to those workloads. Since there is no disruption in service there is no data loss, no loss of productivity and no waiting for your systems to restart your workloads.
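
To make the comparison a little more concrete, here is a minimal Python sketch of the options described above as a simple lookup. The outage figures are rough assumptions drawn from the ranges mentioned in this post (hours for replication-based failover, about 10 minutes for a restart on a surviving host, seconds for a cluster, none for continuous availability). They are not IDC definitions, and the names `Option` and `cheapest_option` are made up for illustration. Plain component redundancy is left out because it provides no recovery path when the server itself fails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    worst_case_outage_s: float  # rough worst-case interruption, in seconds


# Ordered from least to most protective (and, generally, least to most costly).
# All figures are illustrative assumptions, not vendor or IDC numbers.
OPTIONS = [
    Option("Failover with replicated data", 4 * 3600),
    Option("Automatic VM restart on a surviving host", 10 * 60),
    Option("Cluster failover", 30),
    Option("Continuous availability (fault tolerant)", 0),
]


def cheapest_option(tolerable_outage_s: float) -> Option:
    """Return the least protective option whose worst-case outage still fits."""
    for option in OPTIONS:
        if option.worst_case_outage_s <= tolerable_outage_s:
            return option
    # Only reachable for negative input, since the last option has zero outage.
    raise ValueError("tolerable outage must be non-negative")


# A business that can live with a 15-minute interruption:
print(cheapest_option(15 * 60).name)
# A business that cannot tolerate any interruption:
print(cheapest_option(0).name)
```

The point of the ordering is the same one made in the text: start from the least expensive option and move up only as far as your tolerance for an interruption forces you to.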

So all this leads to the question of what your business needs.

Do you have applications that are critical to your organization? If those applications go down how long could you afford to be without access to them? If those applications go down how much data can you afford to lose? 5 minutes? An hour? And, most importantly, what does it cost you if that application is unavailable for a period of time? Do you know, or can you calculate it?

This is another way to ask what the requirements are for your “RTO” (“Recovery Time Objective” - i.e., how long, when a system goes down, do you have before you must be back up) and “RPO” (“Recovery Point Objective” - i.e., when you do get the system back up, how much data is it OK to have lost in the process). We’ve discussed these concepts in previous posts. These are questions that only you can answer, and the answers are significantly different depending on your business model. If you’re a small business, and your accounting server goes down, and all it means is that you have to wait until tomorrow to enter today’s transactions, it’s a far different situation from a major bank that is processing millions of dollars in credit card transactions.
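
If you have never tried to put a number on what an outage costs, a back-of-the-envelope calculation is a reasonable place to start. The Python sketch below is one simple way to do it, assuming the cost is just lost revenue plus wages paid to people who can’t work. The function name, the cost model, and the sample figures are all hypothetical; a real estimate would also consider things like penalties, lost customers, and recovery labor.

```python
def downtime_cost(hours_down: float,
                  revenue_per_hour: float,
                  idle_staff: int,
                  loaded_hourly_wage: float) -> float:
    """Rough outage cost: lost revenue plus wages paid to idle staff."""
    return hours_down * (revenue_per_hour + idle_staff * loaded_hourly_wage)


# Hypothetical example: a 4-hour outage, $2,500/hour in revenue,
# and 20 employees at a loaded cost of $40/hour who cannot work.
cost = downtime_cost(hours_down=4, revenue_per_hour=2500,
                     idle_staff=20, loaded_hourly_wage=40)
print(f"Estimated cost of the outage: ${cost:,.0f}")  # prints $13,200
```

Running the same calculation for a few different outage lengths gives you a feel for how much it is worth spending to shorten your recovery time.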

If you can satisfy your business needs by deploying one of the lower levels of availability, great! Just don’t settle for an AL1 or even an AL3 solution if what your business truly demands is continuous availability.

How’s That “Cloud” Thing Working For You?

Color me skeptical when it comes to the “cloud computing” craze. Well, OK, maybe my skepticism isn’t so much about cloud computing per se as it is about the way people seem to think it is the ultimate answer to Life, the Universe, and Everything (shameless Douglas Adams reference). In part, that’s because I’ve been around IT long enough that I’ve seen previous incarnations of this concept come and go. Application Service Providers were supposed to take the world by storm a decade ago. Didn’t happen. The idea came back around as “Software as a Service” (or, as Microsoft preferred to frame it, “Software + Services”). Now it’s cloud computing. In all of its incarnations, the bottom line is that you’re putting your critical applications and data on someone else’s hardware, and sometimes even renting their Operating Systems to run it on and their software to manage it. And whenever you do that, there is an associated risk – as several users of Amazon’s EC2 service discovered just last week.

I have no doubt that the forensic analysis of what happened and why will drag on for a long time. Justin Santa Barbara had an interesting blog post last Thursday (April 21) that discussed how the design of Amazon Web Services (AWS), and its segmentation into Regions and Availability Zones, is supposed to protect you against precisely the kind of failure that occurred last week…except that it didn’t.

Phil Wainewright has an interesting post over at ZDNet.com on the “Seven lessons to learn from Amazon’s outage.” The first two points he makes are particularly important: First, “Read your cloud provider’s SLA very carefully” – because it appears that, despite the considerable pain some of Amazon’s customers were feeling, the SLA was not breached, legally speaking. Second, “Don’t take your provider’s assurances for granted” – for reasons that should be obvious.

Wainewright’s final point, though, may be the most disturbing, because it focuses on Amazon’s “lack of transparency.” He quotes BigDoor CEO Keith Smith as saying, “If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner.” This was echoed in Santa Barbara’s blog post where, in discussing customers’ options for failing over to a different cloud, he observes, “Perhaps they would have started that process had AWS communicated at the start that it would have been such a big outage, but AWS communication is – frankly – abysmal other than their PR.” The transparency issue was also echoed by Andrew Hickey in an article posted April 26 on CRN.com.

CRN also wrote about “lessons learned,” although they came up with 10 of them. Their first point is that “Cloud outages are going to happen…and if you can’t stand the outage, get out of the cloud.” They go on to talk about not putting “Blind Trust” in the cloud, and to point out that management and maintenance are still required – “it’s not a ‘set it and forget it’ environment.”

And it’s not like this is the first time people have been affected by a failure in the cloud:

  • Amazon had a significant outage of their S3 online storage service back in July, 2008. Their northern Virginia data center was affected by a lightning strike in July of 2009, and another power issue affected “some instances in its US-EAST-1 availability zone” in December of 2009.
  • Gmail experienced a system-wide outage for a period of time in August, 2008, then was down again for over 1 ½ hours in September, 2009.
  • The Microsoft/Danger outage in October, 2009, caused a lot of T-Mobile customers to lose personal information that was stored on their Sidekick devices, including contacts, calendar entries, to-do lists, and photos.
  • In January, 2010, failure of a UPS took several hundred servers offline for hours at a Rackspace data center in London. (Rackspace also had a couple of service-affecting failures in their Dallas area data center in 2009.)
  • Salesforce.com users have suffered repeatedly from service outages over the last several years.

This takes me back to a comment made by one of our former customers, who was the CIO of a local insurance company, and who later joined our engineering team for a while. Speaking of the ASPs of a decade ago, he stated, “I wouldn’t trust my critical data to any of them – because I don’t believe that any of them care as much about my data as I do. And until they can convince me that they do, and show me the processes and procedures they have in place to protect it, they’re not getting my data!”

Don’t get me wrong – the “Cloud” (however you choose to define it…and that’s part of the problem) has its place. Cloud services are becoming more affordable, and more reliable. But, as one solution provider quoted in the CRN “lessons learned” article put it, “Just because I can move it into the cloud, that doesn’t mean I can ignore it. It still needs to be managed. It still needs to be maintained.” Never forget that it’s your data, and no one cares about it as much as you do, no matter what they tell you. Forrester analyst Rachel Dines may have said it best in her blog entry from last week: “ASSUME NOTHING. Your cloud provider isn’t in charge of your disaster recovery plan, YOU ARE!” (She also lists several really good questions you should ask your cloud provider.)

Cloud technologies can solve specific problems for you, and can provide some additional, and valuable, tools for your IT toolbox. But you dare not assume that all of your problems will automagically disappear just because you put all your stuff in the cloud. It’s still your stuff, and ultimately your responsibility.

BC, DR, BIA - What does it mean???

Most companies instinctively know that they need to be prepared for an event that will compromise business operations, but it’s often difficult to know where to begin.  We hear a lot of acronyms: “BC” (Business Continuity), “DR” (Disaster Recovery), “BIA” (Business Impact Analysis), “RA” (Risk Assessment), but not a lot of guidance on exactly what those things are, or how to figure out what is right for any particular business.

Many companies we meet with today are not really sure what components to implement or what to prioritize.  So what is the default reaction?  “Back up my Servers!  Just get the stuff off-site and I will be OK.”   Unfortunately, this can leave you with a false sense of security.  So let’s stop and take a moment to understand these acronyms that are tossed out at us.

BIA (Business Impact Analysis)
BIA is a process through which a business gains an understanding, from a financial perspective, of how and what to recover once a disruptive business event occurs. This is one of the more critical steps and should be done early on, as it directly impacts BC and DR. If you’re not sure how to get started, get out a blank sheet of paper, and start listing everything you can think of that could possibly disrupt your business. Once you have your list, rank each item on a scale of 1 - 3 for how likely it is to happen, and again for how severely it would impact your business if it did. This will give you some idea of what you need to worry about first (the items that were ranked #1 in both categories). Congratulations! You just performed a Risk Assessment!
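
If you would rather use a keyboard than a sheet of paper, the same exercise can be sketched in a few lines of Python. This is only an illustration: the risks listed are made-up examples, and sorting by the sum of the two ranks (with ties broken by impact) is one assumed way to order the list, not a prescribed method.

```python
from typing import NamedTuple


class Risk(NamedTuple):
    name: str
    likelihood: int  # 1 = most likely, 3 = least likely
    impact: int      # 1 = most severe, 3 = least severe


# Hypothetical entries; replace them with the list from your own sheet of paper.
RISKS = [
    Risk("Office printer failure", 2, 3),
    Risk("Building fire", 3, 1),
    Risk("Server hardware failure", 1, 1),
    Risk("Extended power outage", 1, 2),
]


def prioritize(risks):
    """Sort so that items ranked #1 in both categories come first."""
    return sorted(risks, key=lambda r: (r.likelihood + r.impact, r.impact, r.likelihood))


def top_priority(risks):
    """Return only the items ranked #1 for both likelihood and impact."""
    return [r for r in risks if r.likelihood == 1 and r.impact == 1]


for risk in prioritize(RISKS):
    print(f"{risk.likelihood}/{risk.impact}  {risk.name}")
```

The items returned by `top_priority` are the ones to plan for first. The rest of the sorted list gives you a reasonable order for everything else.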

Now, before we go much further, you need to think about two more acronyms: “RTO” and “RPO.” RTO is the “Recovery Time Objective.” If one of those disruptive events occurs, how much time can pass before you have to be up and running again? An hour? A half day? A couple of days? It depends on your business, doesn’t it? I can’t tell you what’s right for you - only you can decide. RPO is the “Recovery Point Objective.” Once you’re back up, how much data is it OK to have lost in the recovery process? If you have to roll back to last night’s backup, is that OK? How about last Friday’s backup? Of course, if you’re Bank of America and you’re processing millions of dollars’ worth of credit card transactions, the answer to both RTO and RPO is “zero!” You can’t afford to be down at all, nor can you afford to lose any data in the recovery process. But, once again, most of our businesses don’t need quite that level of protection. Just be aware that the closer to zero you need those numbers to be, the more complex and expensive the solution is going to be!
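
One quick sanity check is to compare your current backup routine against the numbers you just picked. The Python sketch below assumes a simple model: the worst-case data loss equals the time between backups (a failure just before the next backup loses a full interval of work), and the recovery time is however long a restore takes. The function name and the sample figures are hypothetical.

```python
from datetime import timedelta
from typing import List


def plan_gaps(backup_interval: timedelta, restore_time: timedelta,
              rpo: timedelta, rto: timedelta) -> List[str]:
    """List the ways a backup routine falls short of the stated objectives."""
    gaps = []
    # A failure just before the next backup loses a full interval of work.
    worst_case_loss = backup_interval
    if worst_case_loss > rpo:
        gaps.append(f"RPO missed: could lose up to {worst_case_loss}, objective is {rpo}")
    if restore_time > rto:
        gaps.append(f"RTO missed: restore takes {restore_time}, objective is {rto}")
    return gaps


# Nightly backups with a 6-hour restore, measured against
# a 4-hour RPO and an 8-hour RTO (all figures are made up).
for gap in plan_gaps(backup_interval=timedelta(hours=24),
                     restore_time=timedelta(hours=6),
                     rpo=timedelta(hours=4),
                     rto=timedelta(hours=8)):
    print(gap)
```

In this example the restore is fast enough, but rolling back to last night’s backup could lose far more data than the objective allows. That is exactly the kind of gap that “just back up my servers” tends to hide.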

BC (Business Continuity)
Business Continuity planning is the process through which a business develops a specific plan to assure survivability in the event of a disruptive business event: fire, earthquake, terrorist events, etc.  Ideally, that plan should encompass everything on the list you created - but if that’s too daunting, start with a plan that addresses the top-ranked items. Then revise the plan as time and resources allow to include items that were, say, ranked #1 in one category and #2 in the other, and so forth. Your plan should detail specifically how you are going to meet the RTO and RPO you decided on earlier.

And don’t forget the human factor. You can put together a great plan for how you’re going to replicate data off to another site where you can have critical systems up and running within a couple of hours of your primary facility turning into a smoking hole in the ground. But where are your employees going to report for work? Where will key management team members convene to deal with the crisis and its aftermath? How are they going to get there if transportation systems are disrupted, and how will they communicate if telephone lines are jammed?

DR (Disaster Recovery)
Disaster recovery is the process or action a business takes to bring the business back to a basic functioning entity after a disruptive business event. Note that BC and DR are complementary: BC addresses how you’re going to continue to operate in the face of a disruptive event; DR addresses how you get back to normal operation again.

Most small businesses think of disasters as events that are not likely to affect them. Their concept of “disaster” is that of a rare act of God or a terrorist attack. But in reality, there are many other things that would qualify as a “disruptive business event”: fire, long-term power loss, network security breach, swine flu pandemic, and in the case of one of my clients, a fire in the power vault of a building that crippled the building for three days. It is imperative not to overlook some of the simpler events that can stop us from conducting our business.

Finally, it is important to actually budget some money for these activities. Don’t try to justify this with a classic Return on Investment calculation, because you can’t. Something bad may never happen to your business…or it could happen tomorrow. If it never happens, then the only return you’ll get on your investment is peace of mind (or regulatory compliance, if you’re in a business that is required to have these plans in place). Instead, think of the expense the way you think of an insurance premium, because, just like an insurance premium, it’s money you’re paying to protect against a possible future loss.