CTO Seth Proctor walks through a database sharding example and explains why it's just not necessary to shard a NuoDB database.
(Seth Proctor): Hi, I’m Seth Proctor. I’m the CTO at NuoDB. It probably says that, like, right here. And today, I’m going to be talking to you about a problem you’ve probably come across if you’ve tried to build databases in the real world. And that problem is sharding. I’m going to talk a little bit about what sharding is, why people do it today, and why, here at NuoDB, we think it’s a frustrating thing to have to work with. And we’re trying to free you from having to ever implement another sharding system again.
WHAT IS SHARDING?
So, what is sharding? Sharding is, basically, a technique where you can scale out a database by segmenting your data across multiple databases. So, for instance, let’s say you’ve got a database running. And it’s got all of your users in it, last names A through Z. And it’s running great. And then, you start having so much load on it that it’s overwhelmed. It can’t handle all the requests. So, one thing you might do is, you might say, “Let’s split this into two separate databases. And in this database, we’ll put everyone with last name A through L. And in this database, we’ll put everyone with last name M through Z.”
So, what I’ve just done is, I’ve gone from a single database to two different databases. And that means I’ve got two machines. I’ve got twice the memory. I’ve got twice the CPU. And that’s great, because now I can handle twice the number of client connections. The problem, of course, is that I’ve now got two separate databases. Now, why did we have to do this to begin with? Traditionally, SQL databases were designed to be scale-up databases. And what that really means is, they were designed to go faster or handle more throughput by moving to a larger piece of hardware. And if you have the budget for that, and the patience, that’s great.
But most people, today, don’t want to scale systems that way. It’s very expensive. It’s very costly in terms of downtime. And it means always having to provision and buy your own systems. And what most people want to do is, they want to be able to scale out. They want to be able to go from one system to two systems to some large number of systems. And they want to be able to bring online many smaller machines, not replace one machine with a bigger machine. And that’s really the problem we have today.
What are the problems with sharding?
So, the way people want to architect systems today is with this kind of scale-out model. If you’ve worked on Amazon or Google Compute Engine or any other kind of private-cloud or public-cloud infrastructure, this is what you’ve done. You’ve started with one host, or a small number of hosts, and as you’ve needed capacity, you’ve brought it online. And, typically, people want to be able to do that on demand. You don’t necessarily know when there are going to be spikes. Sometimes you have the luxury of being able to plan ahead. You might know that, during the holiday season, your website is going to get slammed. But, often, these things are surprises. And so, you want to be able to bring new capacity online as you need it.
That’s one of the challenges with this kind of approach -- is that it’s fairly hard to shard on demand -- to figure out how to break something apart as you need it. Another real problem here is that what I’ve just done is I’ve split my dataset between two different databases. And while that helped me to scale out, it means that, if I want to do a query that involves someone on this database and someone on this database, I no longer have an ACID transaction that can span those two sets of data.
And so, the scope of my transaction is limited either to this database or this database. And so, what I’m really saying is that the programming model -- the way you think about how you can access your data consistently, is now very closely tied to your deployment model: where your data lives; how you’re sharding out; how many hosts you’ve brought online.
This makes it difficult for developers, because developers have to think very carefully about their application logic, how it relates to their infrastructure, and how they expect that infrastructure to scale. It means that, from your operations point of view, you have to think very carefully about how you’re sharding, how you’re scaling out, and whether it’s going to have adverse effects on your application, or whether your application can even support the kinds of scale that you’re trying to reach. It’s also fragile. It also means that you might have components failing, and you might lose pieces of your data when you really want to think about this as a logical database, but you can’t.
One common problem to addressing that is replication -- is saying that, “I might, actually, bring online two databases that support N through Z, for example. And I’ll do my updates here. And then I’ll replicate updates over here.” And so, that helps me, in terms of high availability. That helps me in terms of additional parallelism. But it also means, now, that I have differences in times when this sees a version of something and this sees a version of something. So, now, my application logic has become even more complicated, trying to understand where those time lapses might be, and what the real consistency model is in failure.
So, sharding is a really good approach, from the point of view of there was a practical problem that needed to be solved, because traditional SQL databases couldn’t scale out. And this was a way of trying to get that scale-out behavior with a certain amount of fragility, a certain amount of difficulty to developers, and a certain amount of difficulty to operations.
How does NuoDB avoid sharding?
So, what have we done here at NuoDB? Well, here at NuoDB, we’ve started with an entirely new architecture for the relational database. And that architecture is designed around the idea that this is the scale-out model people want. They want to be able to start with the database on one machine; scale out as they need additional capacity, as they need additional cache, additional parallelism, support for more clients. And that replication of data shouldn’t be at the same tier as how you scale out in terms of processing power. It should be a separate concern.
And so, what we’ve done is, we’ve provided a system that scales horizontally; that is always consistent; that replicates data correctly and consistently; and in the face of failure, always fails correctly. So, you don’t lose data in any way, but where you don’t have to think about these problems -- where you still get this view here. You still get the view of a single logical database, no matter how many hosts you’re running on, no matter how many processes you have going, no matter how many datacenters you’re running in. You’re still programming to a single logical database. You’re still working as if there’s a single logical database. Which means that your developers are free to write their applications. And your operations people are free to scale the database however they need to, to take advantage of their resources. And when you need to bring things online on demand, or when you no longer need additional resources and you want to shed them, you can do that at no effect to the application itself.
So, that’s a very, very quick view into what sharding is. I apologize for the things that I missed. I know that this isn’t a complete discussion. And that’s why I would invite you to go to our website. We have a number of articles where we’ve written in much greater depth, both at our corporate website and on our technical blog. We’ve talked about what sharding is. We’ve talked about what the features are that we provide, that we think really address these problems. And what we think people want out of a twenty-first century database, to be able to exploit scale and to be more effective as developers and as systems administrators.
Again, my name is Seth Proctor. And thank you, very much, for watching.