
Different Views on Availability

One of the interesting things that happens when you work on a distributed system is that you start to see things from different perspectives. We'll get to that in a minute. By way of background, let's talk a little about how we got to where we are today.

The early client-server databases were designed to run on a single machine. This made a lot of sense given the challenges and resource limitations at the time but has (at least) two significant drawbacks. First, if you want to support more users or increase your transaction rates you need a bigger system (that's vertical scale). Second, the single server is also a single point of failure, leaving the system vulnerable to down-time and data-loss and making system upgrade disruptive.

These two ideas are directly related. If you could change from traditional models to scale-out architectures you might get away from the single point of failure and introduce a way to replicate durable state. If you could design with redundancy from the start then you'd probably get some form of scale-out capability for free. As it turns out, however, trying to inject either goal into traditional architectures is hard, typically resulting in at least one of high latencies, limitations on the transaction model, unresolved conflicts, or windows for data loss.

The thing is, an operational database is the core of most running systems. It has to be available for everything else around it to function correctly. Even if you don't care about scaling out your transaction rates you do care about getting a response when you ask for data. As an industry, therefore, we've been really good at finding pragmatic ways to increase the availability of databases. If you want to hear my take on these approaches and how NuoDB is significantly different check out the whitepaper on continuous availability that we published last week.

Ok, now back on that theme of seeing things from different perspectives, I've come to believe that when we talk about "database availability" we're conflating several issues. At the very least, there is a question of data being available that's separate from the availability of the service that gives you access to that data. Many of us are used to equating a database with the disk where data is stored, or assuming a database is transient with data available only when the service is running. As new architectures evolve with distribution & redundancy at their core it's interesting to step back and think about these questions from a different perspective.

Data Availability

In a traditional RDBMS data availability relies on having access to the disk where the database is made durable. As long as the disk (or disk array, or NAS etc.) is available then the data is available to the software and can therefore be made available to applications. If the disk fails the data is no longer available and you may have no choice but to recover from a backup or previous snapshot.

Following that view, it makes sense to maintain a replica of the data. Essentially, every time you update the primary state on disk make the same change to a copy at another location. Now if you "lose the disk" you'll have a window of time where you can't get at your data but eventually you can get back to an available state by switching to use the replica as the new primary. The catch, of course, is whether the replica has all of the data in the correct state. Synchronous replication guarantees a lock-step copy but at the cost of additional latency for each transaction. That may be acceptable in tightly-coupled networks but if you're replicating to a remote data center those WAN latencies are probably unacceptable.
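The latency trade-off above can be sketched in a few lines. This is a hypothetical toy model, not any real database's replication protocol: a synchronous commit blocks until every replica has applied the update, while an asynchronous commit acknowledges immediately and leaves a window where the replica may be behind.

```python
import time

class Replica:
    """A toy replica with simulated network latency."""
    def __init__(self, network_delay):
        self.network_delay = network_delay  # seconds of simulated round trip
        self.log = []

    def apply(self, update):
        time.sleep(self.network_delay)  # pay the network cost per update
        self.log.append(update)

def commit_sync(primary_log, replicas, update):
    """Synchronous: the commit waits for every replica to acknowledge,
    so the caller's latency includes the slowest replica's round trip."""
    primary_log.append(update)
    for r in replicas:
        r.apply(update)  # blocks until the copy is applied
    return "committed"

def commit_async(primary_log, pending, update):
    """Asynchronous: acknowledge immediately and ship the update later.
    Fast, but a primary failure now can lose everything still pending."""
    primary_log.append(update)
    pending.append(update)  # replicated in the background
    return "committed"
```

With a WAN round trip of, say, 80ms, `commit_sync` adds that cost to every transaction, which is exactly why cross-data-center synchronous replication is often ruled out.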

In a distributed deployment, by contrast, there are likely to be many users updating disjoint data at the same time. For instance, in social applications there are natural clusters of activity, and if the users are geographically distributed then the access patterns will also be distributed. Assuming these scenarios it's natural to split storage from a single location to multiple, disjoint locations. Now, failure of a single disk can make a subset of the overall data unavailable. In some systems that may be unacceptable but others could tolerate this if they don't need access to that information at a given point in time.
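The partial-availability behavior described above is easy to see in a sketch. Everything here is hypothetical (the class and function names are made up for illustration): keys are hash-partitioned across disjoint storage locations, and the failure of one location makes only that subset of keys unavailable.

```python
# Hash-partition keys across disjoint storage locations. Losing one
# location loses access to only the keys it owns.
NUM_SHARDS = 4

def shard_for(key):
    return hash(key) % NUM_SHARDS  # which location owns this key

class ShardedStore:
    def __init__(self):
        self.shards = {i: {} for i in range(NUM_SHARDS)}
        self.down = set()  # simulated failed locations

    def put(self, key, value):
        s = shard_for(key)
        if s in self.down:
            raise IOError(f"shard {s} unavailable")
        self.shards[s][key] = value

    def get(self, key):
        s = shard_for(key)
        if s in self.down:
            raise IOError(f"shard {s} unavailable")
        return self.shards[s].get(key)
```

Whether an application can tolerate this depends on its access pattern: if the unavailable shard holds data nobody currently needs, the system looks fully available from where those users sit.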

Recently I was talking with some database operators. They described a model where two data centers were both actively updating a database, but where they knew that many updates were safe to make locally as long as some global state was set between the two data centers. In the NuoDB architecture this is pretty easy to address, so we were talking through their abstract requirements and experiences. The conversation kept coming back to the assumption that as long as the global part was written to disk in both locations then it would be available to local services. I had to keep arguing that replication of data isn't the same as data availability, especially in the case where the local operations can run in-memory even if the durable service in the data center fails. Old habits die hard.

So what's the point? In the data management architectures we're building today we need to look at the rules a little differently. Increasingly we're focusing on in-memory capability, asynchronous communications and global scale. Durability is not the same as availability. Replicating durable data may give you a heightened feeling of safety in the case of catastrophic failures, and it's definitely an important element of any system. Data could still be available, however, even when you can't get to its durable form at-rest. What's critical is to understand the data availability model first and then use that to define the availability of your data service as a whole.

Service Availability

Knowing that all of your data is available to the database is great. If that database isn't completely available to the applications that rely on it then that's not so great. In other words, if the service isn't available then the data also isn't available.

Again, starting with a single host running a database server, availability is a pretty simple thing to think about. The service is either there or it's not. As a service scales out, it may be addressing one or more key capabilities. A replicated service, for instance, may only be available as a hot-standby or as a source for read operations. A service that scales by sharding the underlying data, or by limiting which activities it will accept at any given endpoint, may at any given time be providing only partial availability to the data as a whole while appearing fully available as a service.
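The hot-standby case above can be sketched as a simple router. This is a hypothetical illustration (the names are invented, and replication is modeled as a synchronous copy): when the primary fails, the standby keeps serving reads, so the service degrades to read-only rather than disappearing entirely.

```python
class Node:
    """A toy database node holding an in-memory copy of the data."""
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}

class Router:
    """Sends writes to the primary and falls back to the standby for
    reads, so a primary failure yields partial (read-only) availability."""
    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def write(self, key, value):
        if not self.primary.alive:
            raise RuntimeError("service degraded: writes unavailable")
        self.primary.data[key] = value
        self.standby.data[key] = value  # keep the standby in lock step

    def read(self, key):
        node = self.primary if self.primary.alive else self.standby
        if not node.alive:
            raise RuntimeError("service unavailable")
        return node.data.get(key)
```

From the application's point of view the service is still "up" after a primary failure, but only for a subset of its functionality, which is exactly the distinction between partial and complete availability.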

There are plenty of different ways that people think about service availability. It goes without saying that topics like CAP factor heavily into it. I wrote about this a while back so I won't re-hash my thoughts here. I will say that, in practice, what I believe most people want is a system that survives failures by providing complete availability, even if at reduced capacity. By complete availability I mean no loss of functionality and complete availability of the underlying data.

The traditional coupling of disk & service in the database community makes it harder to think about availability this way. Making it harder still are all the models for consistency, rules around visibility and requirements of durability. What we're doing here at NuoDB is teasing apart where the data is available and how the service uses that to provide availability to applications.

Conclusion

This post came out of several recent discussions. Architectures and software are changing at a rapid pace, and global cloud capacity is helping to drive it. Familiar notions of what makes data or a service available are nothing like what they were even five years ago. Take this short entry not as a treatise on the subject but as a challenge to go off and think a little about what challenges you're looking at today. How would you like your systems to evolve and what are the trade-offs that you need to face? Whether you're building something brand new or modernizing legacy systems these are the questions you should be able to answer, or at least have some theories about. Fun times! 
