CTO Seth Proctor explains how NuoDB provides customers with a highly available DBMS.
(Seth): Hi, I’m Seth Proctor. I’m the CTO here at NuoDB and today we're going to talk about high availability. So, last time, we talked about replication. We talked about why replication is an important topic, but why when you have to explicitly manage replication, it’s difficult, it’s error-prone, and some of the things that we do inside of our database to make it easy to scale out without having to think about all of the usual quirks around replication.
Today, we’re going to talk about something sort of related. We’re going to talk about high availability. So, high availability is something that everyone wants. It’s basically how do I make sure that my database is always available? How do I make sure that my service-level agreements are met? How do I feel confident and sleep at night, knowing that my database is there for my applications? And part of what we’ve tried to do with NuoDB is provide a really simple set of models for giving you the degree of high availability that you want. So, let’s walk through a few examples.
First, here’s a picture I drew back in our first video, and it’s a simple four-host deployment and what I showed you in that video was I showed how easy and quick it is to scale across four hosts in this kind of model.
What we’ve really done here is we provided complete redundancy. We’ve given you the ability to lose any one of these four hosts and you still have access to your transactional processing, and you still have access to your full database. And so you’ve got a very simple model for high availability with very little work. To do this in a traditional database would require deploying on multiple machines, explicit replication, and writing your application to understand that model. In NuoDB, this is just a single logical database. So what’s nice is I can cause one of these hosts to fail. The database keeps running. In fact, not only can I cause one of these hosts to fail and still have access to my database, but if the storage manager fails, that storage manager could be writing to something like, say, S3, and if it’s writing to S3, that’s a network-accessible service, and so now, I can bring up another host anywhere in the network and I can get back to that data very easily. So that really gives you high availability. Not only that, but S3, itself, is a replicated service, and so you’re getting extra layers of durability and protection there.
So with very little effort, I’ve been able to deploy a fully redundant system and I’ve been able to store my data into something that makes it easy to get back to it, even when a particular host fails.
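To make the "lose any one of these four hosts" claim concrete, here is a minimal sketch in Python. It is not NuoDB code. It assumes the four hosts are split into two transaction engines and two storage managers (the transcript does not spell out the layout), and it assumes the database counts as available while at least one of each is still running. The names `Host`, `is_available`, and `survives_any_single_failure` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    role: str  # "TE" (transaction engine) or "SM" (storage manager)


def is_available(hosts):
    """Assumed rule: at least one TE and one SM must be running."""
    roles = {h.role for h in hosts}
    return {"TE", "SM"} <= roles


def survives_any_single_failure(hosts):
    """Remove each host in turn and check the rest still form a database."""
    return all(
        is_available([h for h in hosts if h is not failed])
        for failed in hosts
    )


# Assumed layout for the four-host picture: two TEs and two SMs.
DEPLOYMENT = [
    Host("host1", "TE"),
    Host("host2", "TE"),
    Host("host3", "SM"),
    Host("host4", "SM"),
]

print(survives_any_single_failure(DEPLOYMENT))  # True
```

Under those assumptions, dropping to three hosts with only one storage manager would no longer pass the check, which is the sense in which the four-host layout is fully redundant.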
So as we talked about in previous videos, part of what NuoDB gives you is this scale-out model, whether you’re in a single data center or whether you’re actually scaling across multiple data centers, something we call here georedundancy or geodistribution.
What that means, of course, is that not only can you lose a single host in a database and stay active, but in fact, if you need to, you can configure your database across multiple data centers, and if you lose a complete data center, you still have your database available. Obviously, you’re running with less transactional throughput because now, you’ve lost some of your hosts, but your database is still available to you, and in a lot of cases, that’s what’s most important, right, and then eventually, when your data center comes back online, you get access to those resources again.
Now, suppose your data center doesn’t come back online, and you need to react to this failure in some way? That’s kind of the third thing that NuoDB makes really easy for you and gives you that’s very hard in a traditional database.
So, let’s take a simple example here. Let’s say we’ve got a database and, as part of the database, we’ve got a transaction engine and a storage manager running. And let’s say that transaction engine fails. So part of what NuoDB provides you is an automation model through kind of a simple management agent that’s running on every host, and so that lets you monitor what’s going on and gives you another kind of way of dealing with high availability.
So, the first couple of examples we talked about were preprovisioning. We’re letting you assume that there will be failure and having things online already so that when there’s failure, you have enough resources kind of already running, and that’s a very traditional model.
But something we can give you that is not part of a traditional HA model is being able to recognize that failure and very quickly bring new resources online to compensate, and that might be that you already had a new host running and you just started a transaction engine on that host. In NuoDB, the time it takes to send a message to that host to say bring that transaction engine online and for it to start actually participating in the database is measured on the order of tens of milliseconds. So, that’s a very fast process. It’s something that you can do very quickly, and you can reactively get it back to the levels of service you need.
You can even automate bringing that host, itself, online. So, you can bring a new instance online. Once it’s available, you can bring a transaction engine online. Obviously, that means that in the time that you’re waiting for that host to come back, you’ve lost a certain amount of transactional throughput, but that’s a choice that you can make. You can decide how much you want to preprovision resources and how much you’d rather wait for failure and then pay, in terms of downtime, until you can bring back online the resources you need to kind of pick up the slack and get going. And that’s a choice that we give you in NuoDB that you wouldn’t get in a traditional database.
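The choice described here, between starting an engine on a host that is already running and waiting for a new host to come online, can be sketched as a small planning function. This is an illustration only: `plan_recovery` and the action names are hypothetical and do not correspond to any NuoDB management API.

```python
def plan_recovery(running_tes, desired_tes, standby_hosts):
    """Return the actions needed to get back to the desired number of
    transaction engines after a failure.

    running_tes   -- hosts whose transaction engine is still running
    desired_tes   -- how many transaction engines the service level needs
    standby_hosts -- preprovisioned hosts with no engine started yet
    """
    missing = max(0, desired_tes - len(running_tes))
    spares = list(standby_hosts)
    actions = []
    for _ in range(missing):
        if spares:
            # Fast path: a host is already up, so only the engine starts.
            actions.append(("start_te", spares.pop(0)))
        else:
            # Slow path: a new host has to be brought online first, so
            # throughput stays reduced until it is available.
            actions.append(("provision_host_then_start_te", None))
    return actions


# One engine failed, and a standby host was preprovisioned.
print(plan_recovery(["host1"], 2, ["host5"]))
# One engine failed, and nothing was preprovisioned.
print(plan_recovery(["host1"], 2, []))
```

How many hosts to keep in `standby_hosts` is exactly the trade-off from the talk: more standby hosts cost more up front, fewer mean a longer period of reduced throughput after a failure.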
So, those are three examples of kind of traditional HA problems, when people think about there will be failure, especially in a cloud environment, and how do you deal with it.
But, high availability is an issue also when you think about planned downtime. So if you’re going to upgrade your hardware, or your operating system, or NuoDB, or if you’ve been running in one set of machines, or one data center, and you want to move where your database is running, traditionally, that would involve planned downtime. You’d get the email sent around saying, “Hey, between 2:00am and 4:00am this morning, the service will be unavailable because the database is down.” And in modern cloud environments, that’s just unacceptable. So, what NuoDB provides you is a very simple way to maintain 100% availability, 100% uptime, when doing things like upgrades and migrations, and here’s a simple version of how that works.
Suppose you have a transaction engine running on this host. And what you want to do is you want to do an upgrade to this host. So, what you do is you bring a second host online and you start a transaction engine on that host. At this point, you can shut down the original transaction engine because you’ve now provisioned enough transactional throughput. You’ve given yourself the same level of availability that you already had. Now, you can upgrade this host, and if you want, you can now bring the transaction engine back online here.
Now note, if all you were doing was trying to actually move your database, you can just shut down this host because you’ve done it; you’ve moved your database, no loss in availability, no downtime. If what you wanted to do was do an upgrade to this host, you start the transaction engine again, you shut down your new host, and you’re back in the same configuration, but your software has been upgraded. And, let’s say you have 10 hosts in your database, doing a rolling upgrade through all of those hosts is just a matter of repeating this process.
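The upgrade steps just described can be written out as a loop. The sketch below is a simulation in plain Python, not NuoDB tooling: `rolling_upgrade`, the version strings, and the spare host name are all hypothetical, and each "start" or "shut down" is just a set operation. It records the lowest number of running transaction engines seen, to show that capacity never drops below the starting level.

```python
def rolling_upgrade(versions, spare, new_version):
    """Simulate a rolling upgrade of every host in `versions`.

    versions    -- dict mapping host name to its current software version
    spare       -- name of one extra host used as temporary capacity
    new_version -- version each host should end up on

    Returns (versions, running, min_running), where min_running is the
    smallest number of running transaction engines observed.
    """
    versions = dict(versions)
    running = set(versions)          # every host starts with a running TE
    min_running = len(running)
    for host in list(versions):
        running.add(spare)           # 1. start a TE on the second host
        running.remove(host)         # 2. shut down the original TE
        min_running = min(min_running, len(running))
        versions[host] = new_version # 3. upgrade the original host
        running.add(host)            # 4. start its TE again
        running.remove(spare)        # 5. shut down the second host
        min_running = min(min_running, len(running))
    return versions, running, min_running


versions, running, min_running = rolling_upgrade(
    {"host1": "1.0", "host2": "1.0", "host3": "1.0"}, "spare", "2.0"
)
print(versions, sorted(running), min_running)
```

For a migration rather than an upgrade, the loop would stop after step 2 for each host, leaving the engines running in their new location.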
And so unlike a traditional database where you really have to bring down the software, you have to have real downtime in order to do an upgrade or do a migration, or you have to somewhat explicitly replicate your data, hope that you’ve got a correct consistent snapshot and pause all activity until you can bring clients on to the new database, NuoDB really gives you the ability to do a full live upgrade or migration with no loss of availability.
So in a nutshell, these are kind of the two aspects of high availability that we think we really help you with when you’re thinking about operations: high availability in the face of expected failures and high availability, given that you’re going to want to do upgrades and you’re going to want to do migrations.
I hope this has been kind of useful in helping you understand what you really should expect from your database. I hope you’ll walk away from this asking yourself, “Are you getting high availability, true high availability, from your database?” And, I hope you’ll go try us out and give us some feedback on what you experienced. So until next time, my name is Seth Proctor, and I hope you enjoyed this.