Understanding the Whens and Whys of Database Sharding

Database sharding is an imperfect solution to a seemingly insurmountable problem. Let’s take a look at why.

Sharding is a technique where you scale out a database by segmenting your data across multiple databases. Let’s assume for a moment that you have a running database with all of your users in it, last names A through Z. It’s been performing well, but as you grow, the database grows as well. Soon, the load overwhelms the database such that it can’t handle all the requests. 

Limitations of Scaling Up Your Database

Traditional SQL databases were designed to scale up: to go faster or handle more throughput, you move to a larger, more powerful piece of hardware. If you have the budget and patience for that, it can work well. Unfortunately, the cost and downtime required typically make this solution very unattractive.

What most people want to do is be able to scale out versus up. Rather than replacing one big machine with an even bigger machine, they want to be able to elastically scale from one server to two servers to X number of servers on demand. That’s the modern idea of scalability. 

If you’ve worked with Amazon Web Services or Google Compute Engine or any other kind of public-cloud infrastructure, or any private cloud, this is what you’ve done. You’ve started with one host, or a small number of hosts, and as you’ve needed capacity, you’ve brought it online. Realistically, you cannot always know when there will be spikes in demand. You may be able to predict that a particular holiday will bring heavy traffic and plan accordingly, but just as often you’ll be surprised by a sudden surge. You want to be able to bring new capacity online as you need it, and do it in a hurry.

Is Sharding the Solution?

A common solution to this problem has been to split the database into two: Database 1 will include everyone with last names A through L, and Database 2 will include last names M through Z. And with that decision, we’ve begun what’s known as “database sharding”.
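To make the split concrete, here is a minimal sketch of the routing logic that decision implies. The shard names and the alphabetic ranges are hypothetical stand-ins for whatever your environment uses; real deployments often key on a hash of the shard key rather than a raw alphabetic range.

```python
# Hypothetical shard map: Database 1 owns A-L, Database 2 owns M-Z.
SHARDS = {
    "db1": ("A", "L"),
    "db2": ("M", "Z"),
}

def shard_for(last_name: str) -> str:
    """Return the name of the shard that owns this last name."""
    initial = last_name.strip().upper()[0]
    for shard, (lo, hi) in SHARDS.items():
        if lo <= initial <= hi:
            return shard
    raise ValueError(f"No shard covers last name {last_name!r}")

print(shard_for("Garcia"))   # -> db1
print(shard_for("Nguyen"))   # -> db2
```

Every read and write in the application now has to pass through something like this lookup before it can even open a connection.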

Let’s look at the advantages. Having gone from a single database to two different databases, we now have two machines with twice the memory resources, twice the CPU power, and twice the disk IO throughput. The result? Now we can handle twice the number of client connections. 

The problem, of course, is that now we have two separate databases. 

The Challenges with Database Sharding

A sharded database typically requires significant changes to the way your application works, so that your application knows which database a given piece of data resides in. Let’s look at what this looks like in practice.

Splitting our dataset between two or more databases has helped us scale out, but what happens when we want to run a query that involves someone in Database 1 and someone in Database 2? We no longer have a single transaction that can span those two sets of data. The scope of a transaction is limited to either Database 1 or Database 2; there is no intersection. The programming model is now very closely tied to our deployment model: where our data lives, how it is sharded out, and how many hosts we’ve brought online.
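The sketch below shows what a cross-shard read looks like once the data is split, using two in-memory SQLite databases as hypothetical stand-ins for the shards. Each query runs in its own database, so the fan-out and the merge both happen in application code, and there is no single transaction covering both results.

```python
import sqlite3

def connect_shards():
    """Create two in-memory shards with a tiny users table for illustration."""
    shards = []
    for rows in (
        [("Adams", 120), ("Lopez", 80)],      # shard 1: last names A-L
        [("Martin", 95), ("Zhang", 200)],     # shard 2: last names M-Z
    ):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE users (last_name TEXT, balance INTEGER)")
        conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
        conn.commit()
        shards.append(conn)
    return shards

def total_balance(shards):
    """Fan the query out to every shard and merge the results in the app."""
    totals = []
    for conn in shards:
        (partial,) = conn.execute("SELECT SUM(balance) FROM users").fetchone()
        totals.append(partial or 0)
    return sum(totals)

print(total_balance(connect_shards()))  # -> 495
```

A simple sum is easy enough to merge by hand, but joins, ordering, and pagination across shards all push the same burden onto the application.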

This makes life difficult for developers, who have to think very carefully about their application logic, how to best join data across shards, how sharding relates to their infrastructure, and how they expect that infrastructure to scale. From an operations point of view, you have to think very carefully about how you’re sharding, how you’re scaling out, and whether it will have adverse effects on your application, or whether your application can even support the kind of scale you’re trying to reach. The setup is also fragile: components can fail and pieces of your data can be lost. You really want to reason about this as a single logical database, but you can’t.

As you can see, database sharding partially addresses a very challenging problem: the need to scale to meet demand. But as outlined above, this approach also introduces fragility, development complexity, and operational difficulty. In addition, it is fairly difficult to shard on demand, which means this imperfect solution still doesn’t address the need to add capacity during unforeseen peaks.

In my next post, I’ll discuss an alternative to sharding. In the meantime, take a look at this video that highlights how to build a scalable application without sharding.
