Sizing Up the Distributed Database
To many long-time enterprise administrators, the idea of a distributed database is unnerving at best. Instead of having all your data, the database platform and the supporting hardware in one place where you can keep an eye on them, the architecture is spread across multiple locations, almost always on a heterogeneous footprint.
But as the broader data environment becomes more distributed, it is only natural that the database should as well. And as it turns out, the enterprise has virtually nothing to lose with distributed architectures and a whole lot of scalability and flexibility to gain.
BuildFax’s Joe Masters Emison noted recently that if your applications have been sitting atop the same platform for 10 years or more, it’s time to get into the new decade – because you can be sure your users certainly have. In this day and age, downtime is not acceptable, and neither are latency, unavailability, weak performance or the many other detriments that are considered normal in most legacy environments. With Big Data on the way, the enterprise needs to start providing solutions that go right to the heart of the business model, and the fact is that simply expanding yesterday’s data infrastructure, reliable as it may be, is not enough to meet the challenge, at least not within a reasonable budget.
Of course, scale is not the only benefit that distribution brings to the table. Security and redundancy are also enhanced by advanced replication and the elimination of the single point of failure. This gives the enterprise the ability to maintain master data at the highest level of security while providing multiple versions to numerous points in the cluster in case key capabilities go down. As well, the ability to access data from numerous locations enhances productivity, particularly in high-traffic environments, as does the ability to harness multiple resources in parallel for heavy workloads.
A good example of a distributed architecture in action is Google’s new Mesa warehouse, says Enterprise Tech’s Timothy Prickett Morgan. The system is spread across tens of thousands of servers around the world and is dedicated exclusively to processing data for the AdSense, AdWords and DoubleClick businesses. It runs on the Colossus distributed file system and the Bigtable data store, supplemented by MapReduce and the Paxos synchronization protocol. Performance is measured in millions of row updates per second and billions of queries per day. To avoid downtime, Mesa has a unique versioning system that matches versions to queries so data can be streamed in continuously while preventing partially finished updates from reaching an application.
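The versioning idea can be sketched in a few lines. The toy class below is an assumption-laden simplification, not Mesa’s actual protocol: updates are staged under a new version number, and queries only ever see data at or below the latest committed version, so a half-finished batch can never reach an application.

```python
class VersionedTable:
    """Toy multi-versioned table: queries read only committed versions."""

    def __init__(self):
        self.committed_version = 0
        self.rows = {}       # key -> list of (version, value) pairs
        self.pending = None  # (version, {key: value}) batch being loaded

    def begin_batch(self):
        # Stage incoming updates under the next version number.
        self.pending = (self.committed_version + 1, {})

    def stage(self, key, value):
        _version, batch = self.pending
        batch[key] = value   # staged rows are invisible to readers

    def commit_batch(self):
        version, batch = self.pending
        for key, value in batch.items():
            self.rows.setdefault(key, []).append((version, value))
        # Bumping the committed version makes the whole batch
        # visible atomically -- never a partial update.
        self.committed_version = version
        self.pending = None

    def read(self, key):
        # Serve the newest value at or below the committed version.
        visible = [v for ver, v in self.rows.get(key, [])
                   if ver <= self.committed_version]
        return visible[-1] if visible else None
```

A query running while a batch is streaming in simply reads against the previous committed version; only `commit_batch` flips visibility.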
Of course, building a distributed database architecture is one thing; operating it is quite another. Many leading platforms these days have all manner of query optimization, update automation and other tools to simplify management, but operators will still need to familiarize themselves with techniques like the “two-phase commit,” preferred by Oracle and others as a means to prevent data corruption should a server crash in the middle of a transaction. As well, the process of linking tables across multiple sites tends to be complex and time-consuming.
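A minimal sketch of the two-phase commit idea follows. The `Participant` interface here is hypothetical, and real implementations add durable logs, timeouts and crash recovery, but the core logic is the same: every node must vote yes in the prepare phase before any node is told to commit.

```python
class Participant:
    """One node holding part of a distributed transaction (hypothetical)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator asks every participant to prepare.
    if all(p.prepare() for p in participants):
        # Phase 2 (completion): unanimous yes, so commit everywhere.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote aborts the whole transaction, so a crash or failure
    # mid-transaction never leaves a partial update behind.
    for p in participants:
        p.abort()
    return False
```

The point of the protocol is that no participant commits until all have promised they can, which is exactly the guard against mid-transaction crashes the article describes.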
Most of these issues crop up when trying to force-fit distributed architectures onto today’s centralized infrastructure, however. Newer platforms like NuoDB are being designed around advanced memory-centric infrastructure that enables rapid and extensive scalability and unique caching opportunities that avoid the need for synchronous commit schemes and complicated resource-sharing architectures. The Durable Distributed Cache (DDC) platform, for instance, enables broad distribution on commodity hardware through the use of in-memory container objects (Atoms) that are treated as peers by the management stack. In this way, all objects can be shared equally and database environments can be established with no single point of failure and full support for ACID semantics and other tools.
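To make the peer-cache idea concrete, here is a loose illustration. The code is a toy assumption, not NuoDB’s implementation; it only shows the structural point that when cached objects are replicated across peers, any surviving node can serve a request, so no single node is a point of failure.

```python
class Node:
    """One peer in the cluster, with its own in-memory object cache."""

    def __init__(self, name):
        self.name = name
        self.atoms = {}  # object id -> cached in-memory object


class PeerCache:
    """Toy cluster where cached objects are shared equally among peers."""

    def __init__(self, nodes):
        self.nodes = nodes

    def put(self, atom_id, obj, replicas=2):
        # Cache the object on several peers for redundancy.
        for node in self.nodes[:replicas]:
            node.atoms[atom_id] = obj

    def get(self, atom_id):
        # Any peer holding the object can serve it, so losing one
        # node's cache does not make the object unavailable.
        for node in self.nodes:
            if atom_id in node.atoms:
                return node.atoms[atom_id]
        return None  # a real system would fall back to durable storage
```

Even after one node drops its copy, `get` succeeds from a surviving replica, which is the no-single-point-of-failure property described above.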
To paraphrase John Donne, “No datacenter is an island.” While containing the database within a single computing environment may have served the needs of past generations, today’s loads will quickly swamp even the largest facility.
Distributing virtual architectures across disparate infrastructure is not only fashionable these days but vital. And nowhere is this more critical than in database environments about to run headfirst into Big Data analytics.