Can We Finally Find the Database Holy Grail? Part 2
A truly distributed DBMS with ACID transactional guarantees could address the key pain points in modern database management.
In Part One of this three part series, I talked about distributed transactional databases being the Holy Grail of database systems. Among other things, the promise of such systems is to provide on-demand capacity, continuous availability and geographically distributed operation. But the historical approaches to building distributed transactional databases have involved unacceptable trade-offs, and as a result general purpose database systems predicated on a single-server architecture have dominated the industry for decades.
In our quest for the Holy Grail of databases we acknowledge that there are folks who have given up on the highly desirable characteristic of transactional consistency in favor of a distributed operation. That is a trade-off that may be attractive if you can’t find a way to scale-out transactions, but it’s a drastic choice that moves a lot of complexity and cost up the application stack. If there is a way to scale out transactional databases, then that is clearly a much better outcome. Our Holy Grail discussion is specifically about distributed transactional databases because if such a thing can be built without a substantial downside then no one would want any other distributed data store.
Is there a way to build one of these things? I am excited by the potential of a new design that we call a Durable Distributed Cache, but I’d like to first lay out the three categories of designs that have been proposed historically.
The idea of Shared-Disk designs is fairly self-explanatory. You have several machines in a cluster all loading and storing pages of data from the same shared disk subsystem. A good example of a Shared-Disk system is Oracle Real Application Clusters. The database system tracks updated pages, writing them to disk, tracks “stale” pages, invalidating them, and implements some kind of a network lock manager that coordinates concurrency conflicts on a distributed basis. Per our example above you need to be very aware of the nature of your workload.
Workloads that require minimal coordination can scale fairly well on Shared-Disk systems. In the extreme case, consider an all-read workload. This is likely to be I/O-bound before it is limited by inter-machine coordination delays. The I/O-bound issue can become a problem, because you don’t increase I/O when you add more machines. If your database performance is limited by disk I/O and not by computational bandwidth then there is no point adding more machines.
But coordination is often required between machines, and in these cases Shared-Disk systems are often limited by lock-based coordination protocols. The core database management system (DBMS) design is typically built around the idea of efficiently managing tightly coupled memory and disk pages. Consequently, the introduction of synchronous distributed locking to guarantee page ownership introduces both latency problems and thrashing problems that can easily bring these systems to their knees. Careful allocation of transactions to machines to maximize affinity can ameliorate these issues to some extent, but this increases the fragility of the systems and adds substantially to long-term DBA expenses.
Shared-Disk systems are at their best for workloads that are highly read-dominated, or in which the data can effectively be partitioned by machine.
These are both arguments in favor of Shared-Nothing designs. Shared-Nothing is really a cute way of saying “partitioning.” In other words you essentially run multiple databases, carefully arranging that part of the dataset that resides in each partition. To take a trivial example of a database containing all US citizens, you might put everyone with last names starting with A-L in one partition and everyone else in the other partition. Or you could put all male citizens in one and all female citizens in the other. For each incoming transaction you detect which data it needs to access and then send it to the relevant partition.
You might ask why, if you are essentially just running multiple databases with the data split between them, then why is it a DBMS issue at all? Surely the application can just do that itself, using any traditional client/server database system? The answer is that of course you can do that, and it is called “Sharding.” Big Internet applications, like Facebook, do this, and in some cases they run many thousands of shards. Shared-Nothing distributed databases try to do this transparently for you, in particular striving to help you with the really hard problems of how to optimally partition the data, how to perform transactions efficiently when they touch data in multiple partitions, how to deliver transactional semantics across partitions, how to rebalance the partitions as the database grows, and so on.
Once again, partitioned stores can be very effective for certain kinds of workloads but you have to be very careful to arrange your queries and data layout to make sure it’s optimized. For some cases there is no good way to partition the data. In all cases there is a limit to how many partitions you can profitably create, which means there is a limit to the degree that the system can scale out. If you are running a workload that has any appreciable proportion of cross-partition transactions, you will most likely need specialized low-latency interconnects between your servers. And the partitioning model is not very dynamic, so it does not address the desire for on-demand capacity management.
One last comment on partitioned stores: They are not all disk-based. In-memory systems typically also use partitioning in order to exploit multiple database servers in parallel.
Synchronous Commit Databases
There is a third traditional way of building distributed databases. You could call it the obvious way, the brute force way or, more technically, Synchronous Commit (Google calls it “Synchronous Replication”).
In a transactional DBMS, an application makes changes to the data before committing or rolling back the transaction. If the transaction is successfully committed then the state of the database has been (atomically) updated with the full set of committed changes. If the commit fails, then the canonical state of the database is unchanged. And if the application performs a rollback then the uncommitted changes are discarded. The Synchronous Commit approach has been adopted in various forms over the years, not least in the form of Two Phase Commit (2PC) protocols.
The idea of Synchronous Commit is simply that when an application makes a request for the DBMS to commit a transaction, the DBMS performs the commit in multiple locations before returning success. The best modern example of Synchronous Commit is Google F1. F1 is the database system that Google uses to support Google AdWords, and which will support many more Google applications in due course. Self-evidently it works for Google, and for a very demanding application. Google specifically notes that they could not address the AdWords challenge using NoSQL technologies or MySQL approaches.
Although Google has made it work, the disadvantage of Synchronous Commit is stated here in the Google F1 White Paper: “Synchronous replication implies higher commit latency, but we mitigate that latency by using a hierarchical schema model with structured data types and through smart application design.” Also F1 relies on state-of-the-art Google networking to reduce inter-node latency, and the existence of atomic clocks on every participating machine in order to support a common notion of true time. The obvious challenge with Synchronous Commit is latency (in other words delays). There are some applications for which throughput is more important than transaction latency but for most modern applications the reverse is true.
A less obvious challenge relating to Synchronous Commit is managing error and failure conditions. Designing distributed synchronous commit protocols is relatively straightforward if you can assume 100% reliability. It is recovery detection and recovery procedures that are the hard part of the problem, and the designer is generally left with a trade-off between maintaining transactional guarantees for a large number of recovery scenarios, avoiding significant performance degradation, and limiting implementation complexity. In consequence of the latency issues and failure management challenges the Synchronous Commit approach has had limited success historically.
It is clear that a truly distributed DBMS with ACID transactional guarantees could address the key pain points in modern database management, notably on-demand capacity, continuous availability and geo-distributed operation. The three traditional designs unfortunately deliver much less than those promised advantages and involve costs, complexities and/or functional limitations that limit their usefulness and general applicability.
In my next post, I aim to introduce a new design approach, one that we call Durable Distributed Cache. By stepping back and rethinking database design from the ground up Jim Starkey has come up with an innovative solution that makes very different trade-offs. The net effect is a system that scales-out/in dynamically on commodity machines and virtual machines, has no single point of failure, and delivers full ACID transactional semantics.