Continuous Availability and the Art of Failing Gracefully
When Google went down for five minutes back in 2013, the online community panicked. All of Google’s services and applications were gone. Twitter lit up like a Christmas tree, with tweets ranging from anger and frustration to threats of apocalyptic doom. Five minutes offline, and Google sent the Internet packing.
Since Google became the powerhouse it is, this was only the second time it had crashed. That’s some pretty impressive performance. But still, no matter how well-built any machine is, it won’t last forever. This is particularly true with perpetually running hardware and software. It is only a matter of time before some component or service fails, and without the right safeguards in place, the results can be disastrous. While it’s commonly accepted that some downtime will occur eventually, data loss or complete loss of services can ruin reputations and even destroy companies.
Planned and Unplanned Disruption
Of course, planned outages can be just as disruptive. System maintenance, operating system upgrades, hardware replacement – these necessary operations are usually scheduled to occur when users will be least affected, but as global and distributed as our environment has become, there really is no convenient time for a system to be down. Also, there’s no guarantee that just because the downtime is scheduled, it won’t have a damaging impact.
To mitigate the amount of havoc a failure can wreak, some common techniques are often applied to traditional client-server architectures. Replication, both of servers and application data, is widely used to prevent data loss, increase uptime, and in some situations improve performance. Some systems employ a hot standby, where data from the primary database is backed up to a secondary database. These techniques support what’s known as high availability, and they have been embraced by the industry as being “good enough.” Nevertheless, when a system fails over to a hot standby, it usually takes minutes before it is back in action, and even when service doesn’t disappear entirely, users will experience degraded service for a while.
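The failover pattern described above can be sketched in a few lines. This is a toy illustration, not any particular product’s behavior: the `Database` class, retry counts, and timings are all made up for the example. The point is the gap it exposes: while the client retries the failed primary before giving up, users sit in exactly the degraded-service window described above.

```python
import time

class Database:
    """Toy stand-in for a database connection (illustrative only)."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def query(self, sql):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}: result of {sql!r}"

def query_with_failover(primary, standby, sql, retries=2, delay=0.0):
    """Try the primary first; fail over to the hot standby on error.

    The retry loop is the user-visible failover window -- in real
    deployments it is seconds to minutes, not the instant shown here.
    """
    for _ in range(retries):
        try:
            return primary.query(sql)
        except ConnectionError:
            time.sleep(delay)  # back off before retrying, then fail over
    return standby.query(sql)

primary = Database("primary", healthy=False)  # simulate a primary outage
standby = Database("standby")
print(query_with_failover(primary, standby, "SELECT 1"))
# prints the standby's result, because the primary is down
```

Note that the standby contributes nothing until the primary fails; it is the idle, costly redundancy the rest of this article argues against.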
It’s About Service
Service level expectations are changing. It seems unfair, in a way, but ever since the advent of cloud computing, users have compared the performance of cloud services with what their IT department could provide. Whereas downtime used to be an annoyance we accepted, it no longer seems so acceptable. Yes, even Google goes down at times, but only for five minutes, and only once every few years.
And now, the prevalence of real-time transactions and streaming data has ushered in an always-on environment. When we consider competitive industries such as telecoms, cable and media streaming, any blip of poor service leads to angry customer complaints, and any persistence of poor service leads to customer churn. Business time means real time. It requires more than high availability; it requires continuous availability.
IT shops realize that fail-over capabilities are necessary, and most data centers are awash with redundant resources (servers, storage, databases, power supplies, etc.). From a business perspective, however, delivering bulletproof, iron-clad applications is both costly and time-consuming. Backup components and resources are expensive, and they spend most of their time doing nothing while consuming power and cooling. And traditional architectures, even when specifically designed for fail-over, do not always recover swiftly.
To deliver continuous availability in a way that best supports the business, modern software architectures must be designed to be not just resilient, but effective with resource management and operational intelligence. Instead of the oft-employed redundancy model of keeping a rarely needed copy of everything, why not run multiple active components in multiple locations?
Nowadays it’s possible. And when software is deployed in such a distributed manner, each node is capable of any activity that any other node can perform, and each node is actively primed and ready to do so. If you build services on top of a system that allows for resource monitoring and failure prevention, then you can further increase uptime, performance and availability – in fact, you can deliver continuous availability.
Consider this example. A company is running a distributed database across four data centers in the U.S., one in each time zone. If the east coast data center experiences a power outage, requests to the database would be sent to the nearest working database, in this case, the one in the central time zone. Despite the unavoidable failure, there is no interruption of operational availability because the other databases are active. What we are describing is a truly distributed architecture and it’s real.
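The routing decision in that example can be sketched simply. Everything here is hypothetical: the data center names, their “distance” metric (time-zone offset from the client stands in for network latency), and the outage flag are invented for illustration. What the sketch shows is the active-active principle: every node can serve any request, so losing one simply narrows the choice.

```python
# Four active data centers, one per U.S. time zone, as in the example.
# "active" would come from health checks in a real system; here it is
# hard-coded to simulate the east coast power outage.
DATA_CENTERS = {
    "eastern":  {"utc_offset": -5, "active": False},  # power outage
    "central":  {"utc_offset": -6, "active": True},
    "mountain": {"utc_offset": -7, "active": True},
    "pacific":  {"utc_offset": -8, "active": True},
}

def route(client_utc_offset):
    """Pick the nearest *active* data center for a client.

    Only a total loss of every node is an outage; any single failure
    just redirects traffic, with no interruption of availability.
    """
    active = {name: dc for name, dc in DATA_CENTERS.items() if dc["active"]}
    if not active:
        raise RuntimeError("no active data centers")
    return min(active, key=lambda n: abs(active[n]["utc_offset"] - client_utc_offset))

print(route(-5))
# an east-coast client (UTC-5) lands on "central", the nearest active node
```

Contrast this with the hot-standby model: here the "backup" capacity in the other three data centers is serving its own local traffic all along, not idling.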
For those who want it, continuous availability, even for complex database applications, is feasible.
Dr. Robin Bloor is the Co-founder and Chief Analyst of The Bloor Group. He has more than 30 years of experience in the world of data and information management. He is the creator of the Information-Oriented Architecture, which is to data what the SOA is to services. He is the author of several books, including The Electronic B@zaar: From the Silk Road to the eRoad, a book on e-commerce, and three IT books in the Dummies series on SOA, Service Management and the Cloud.