Amazon Downtime – Designing for Failure

If the closely scrutinized public cloud service outages tell us anything, it is that it's time to think differently about databases in the cloud. Very differently.

When it comes to disaster recovery for databases, public cloud customers need three things:

  1. Safe Data Guarantees
    To have live, fully up-to-date and fully consistent copies of all your databases in a location of your own choice. That might be your corporate datacenter, a portable USB drive or an archive facility in a bunker under Nebraska. It might be more than one location.
  2. Continuity of Service
    To have a database system that runs concurrently in multiple datacenters and/or cloud availability zones with guarantees of consistency in all locations, and resilience to failure of any of those locations.
  3. Capacity Recovery
    To have the ability to add computers to a running database system to rebuild capacity that may have been lost due to a datacenter or region going down. And to have a database that can restart rapidly in a new location from a Safe Copy of the database (see point 1), should all datacenters fail.

Those are the primary requirements for business continuity at a database level on public clouds. Current database systems don't come close to delivering this because they are designed for large dedicated servers in locked-down datacenters, not the fluid and unpredictable world of public clouds. We need a Big Idea on how to build databases differently so that they can deliver these capabilities.
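
To make the first requirement a little more concrete, here is a minimal sketch in Python of a job that checks whether an off-cloud safe copy is fully caught up with the cloud primary. The DB-API-style connections, the replication_log table and the change_seq column are hypothetical placeholders rather than any particular product's schema; a real deployment would use whatever replication metadata your database actually exposes.

    # Sketch: verify that an off-cloud "safe copy" is caught up with the cloud primary.
    # Assumes DB-API-style (PEP 249) connections and a hypothetical replication_log
    # table whose change_seq column increases with every applied change.
    import time

    def replication_position(conn):
        """Return the latest change sequence number applied at this location."""
        cur = conn.cursor()
        cur.execute("SELECT max(change_seq) FROM replication_log")
        (seq,) = cur.fetchone()
        return seq or 0

    def safe_copy_is_current(primary_conn, replica_conn, max_lag=0):
        """True if the off-cloud copy has applied everything the primary has."""
        lag = replication_position(primary_conn) - replication_position(replica_conn)
        return lag <= max_lag

    def monitor(primary_conn, replica_conn, interval_seconds=60):
        """Warn whenever the safe copy falls behind; hook this into real alerting."""
        while True:
            if not safe_copy_is_current(primary_conn, replica_conn):
                print("WARNING: off-cloud safe copy is behind the cloud primary")
            time.sleep(interval_seconds)

The specific check matters less than the principle: the safe copy lives in a location you choose, and its currency is continuously verified rather than assumed.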

Amazon has been a great pioneer of what we now call Public Cloud services, and IaaS in particular. They have shown the way in delivering pay-by-the-drink computing capacity, storage services, data management services and a lot more. And in support of this they have built a vast infrastructure with sophisticated resource-sharing, load-balancing and high-availability mechanisms. The Amazon public cloud is significantly larger and more sophisticated than the offerings of the many other vendors. Significant credit goes to the onetime online bookseller for their leadership in the cloud computing revolution.

And yet …

One of the things that has become apparent is that system failures in the mega-datacenters of the cloud can be catastrophic, in the original sense of the word. Last April we saw how router problems can lead to network overload from remedial re-routing of traffic, which in turn can cause software latency problems (apparently in the EBS subsystem), which can lead application load balancers to favor a subset of machines, which can create system overload that brings down datacenters in whole or in part: http://www.syracuse.com/news/index.ssf/2011/04/amazon_failure_takes_down_site_1.html

Thunderstorms have been responsible for power loss on multiple occasions, including in Dublin, Ireland, in August of last year:

http://www.datacenterknowledge.com/archives/2011/08/07/lightning-in-dublin-knocks-amazon-microsoft-data-centers-offline/

And most recently, just last week, we saw service-affecting power outages in Virginia:

http://www.huffingtonpost.com/2012/07/02/amazon-power-outage-cloud-computing_n_1642700.html

The Huffington Post article concludes: “Most people who are savvy about the cloud understand it’s not infallible.”

In all of this, the natural question for any customer of Amazon Cloud Services is how to plan for these failures. No matter how good the failover models are, and no matter how much better Amazon and other cloud vendors get at building failsafe systems (and of course they will get better over time), customers need to have business continuity plans in place. These plans must assume that (a small failover sketch follows this list):

a) The cloud services may go down at any time, potentially for hours and perhaps longer.

b) Access to your cloud databases could be blocked.

c) You do not have strong guarantees against permanent loss of data.
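
The practical consequence of (a) and (b) is that applications need a way to reach a database somewhere other than the cloud when the cloud endpoint stops answering. Here is a minimal sketch, with hypothetical endpoint names, of an application-side reachability check that routes traffic to an off-cloud replica when the cloud primary cannot be reached; a real system would add authentication, staleness checks and alerting on top of this.

    # Sketch: pick the first reachable database endpoint, falling back from the
    # cloud primary to an off-cloud replica. Hostnames are hypothetical examples.
    import socket

    ENDPOINTS = [
        ("cloud-primary.example.com", 5432),   # cloud region (may be down)
        ("dr-replica.mycorp.internal", 5432),  # off-cloud safe copy
    ]

    def first_reachable(endpoints, timeout=2.0):
        """Return the first endpoint that accepts a TCP connection, else None."""
        for host, port in endpoints:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return (host, port)
            except OSError:
                continue
        return None

    if __name__ == "__main__":
        target = first_reachable(ENDPOINTS)
        if target is None:
            print("No database endpoint reachable; fall back to the offline contingency plan")
        else:
            print("Routing traffic to %s:%d" % target)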

As we build systems for these shared cloud environments, and as we think about the new generation of database systems needed to support them, it is worth standing back and asking the obvious question.

What if we could build a database management system that naturally supports the three big business continuity requirements of:

  1. Safe Data Guarantees
  2. Continuity of Service, and
  3. Capacity Recovery?

Can such a system be built? Is it even possible? There is good news: NuoDB has that Big Idea. When we talk about 100% uptime, this is exactly what we're referring to. The new age of web-scale data management needs a new architecture. If you're interested in what the 21st-century database will look like, I invite you to download the latest beta of NuoDB.

http://www.nuodb.com/download
