NuoDB Founder & Executive Chairman Barry Morris presents on NuoDB at Carnegie Mellon University.
Andy Pavlo: All right, let’s go! Today, we’re really excited to have a guest speaker. I’ve known Barry for a while. Barry is the executive chairman of NuoDB, but is still heavily involved in the company. Every time I go visit there, Barry’s there. I met Barry when I was in grad school at Brown, back before it was called NuoDB. It was called NimbusDB, and I actually found the NimbusDB M&Ms in my office…
Barry Morris: Oh, wow.
Andy: So, I invited Barry to come give a talk when I was in grad school, because we didn’t actually understand what NuoDB was. And when he came, everything really clicked for me, and I understood what they were doing and why it was cool. So, I’m really happy to have Barry come here now to tell you guys about a cool system that has all of the things we talk about in a semester that’s actually running in the wild and solving real problems. So, Barry’s going to talk, but feel free to stop him and ask questions as you go along, okay?
Barry Morris: Great.
Andy: Thanks, guys.
Barry Morris: Thanks, Andy, and good to be here, guys. So, yeah, so as Andy says, I’ve been involved in databases for a long time, and the story here is a little bit boring. So I’ll tell it quickly, which is that I was kind of retired and done with running companies and starting companies and all that sort of stuff, happily running a school and a -- energy company and whatever else. And Jim Starkey, my co-founder, came to me and he said, “Barry, I’ve solved this holy grail problem to do with databases.” And I said, “What is it?” He said, “I can do efficient distributed transactions. In fact, I can do those dynamically. I can do that in a sort of a scale-out fashion.” And I said, “No, you can’t. I know enough about this stuff to know that there’s lots of people that have tried to solve that problem. There aren’t really that many solutions. All the solutions are highly compromised. Sorry, Jim, that’s not true.”
And he spent a few weeks persuading me and talking to me about all the stuff that I’m going to talk to you about today. And I said, “My word, that’s quite amazing what you just described. That’s the most elegant thing I’ve seen in a long time.” I shut down all the other things I was doing, jumped in. We took this thing, we patented what I’m about to talk to you about, which, by the way, we got in less than a year with no office actions. They said, “There’s no prior art. No one’s ever done anything like this before.” And we now have backers which include the founder and CEO of Ingres, who’s also the founder and CEO of the Postgres company, Illustra, the founder and CEO of Informix, all of these other investors, the former CEO of Sybase -- pretty much all of the relational database company big guys other than Larry Ellison, and I don’t think I’m going to be able to get him onboard. So, this is a big thing. All of those people went through this same thing I went through, which is at first, they said, “No, no, no, no, no, no, you’re either fooling us or you’re fooling yourself or something.” And it turns out that what Jim’s come up with is a really interesting set of trade-offs around distributed transactions. No one else is doing it this way.
Where we are today is we are displacing the major relational database companies in major corporations around the world: the big banks, the big stock exchanges, lots of the big technology companies are ripping out the relational database products and replacing them with a cloud-style relational database, which is what this is. We call it a durable distributed cache. I just had a conversation with Andy. He likes to think of it as a shared-disk system. We probably disagree on that. Not going to show you the demo. Let me just jump straight into the question of what this is all about. And by the way, this presentation’s not a super academic presentation, so hopefully it’s relatively lightweight, but I think it will be quite challenging for you technically. So, we’ll take it step-by-step. The first point is just what are we talking about? We’re talking about a database system where you add nodes, it goes faster. And you take nodes away and it doesn’t go as fast. Simple as that, okay? That’s really hard to do with a relational database system. What you’re looking at is something that’s going from -- it’s actually quite an old chart, but it makes the point -- one node to two nodes to four nodes to eight nodes to sixteen nodes. And it goes from, like, 60,000 transactions per second to 800,000, something like that. Fairly linear scale for this particular workload. Your mileage will differ. Down the bottom left, you’ll see the latency, database-level latency. It’s running at a lot less than a millisecond in terms of latency. That’s because, as you will see, it’s basically an in-memory system.
And so, this is what it’s about. By the way, you could take away any of those nodes and it’s just going to go slower. And I really mean any of those nodes, including that first one or the first four or whatever. These are peer-to-peer nodes. You add them, you take them away, it goes faster, it goes slower, it’s resilient to failure, and so on. That’s what elastic SQL is. When you look today, you’re seeing more and more other companies starting to try and build this. You’ll be familiar with Google Cloud Spanner. It’s a version of this using a very different approach, much more of a kind of heavy-handed approach to try and get there. There’s Cockroach and there’s others that are now trying to get into this space. We’ve been out there for a few years, and we’ve been very successful. So, I put this up here just as a kind of a point, which is I find sometimes it’s quite hard to explain the system to database people. They’re the worst people to try and explain the system to, because they already know everything.
Barry Morris: And I think of it -- I don’t -- I wasn’t there, but I’m guessing that the guys that first introduced jet engines came along and they said, “I’ve got this engine and it’s phenomenal. It can fly at 60,000 feet without a turbocharger, and it can fly faster than the speed of sound, and it can accelerate a 300-ton jet liner to take-off speed in 40 seconds.” And the other guys go, “No, you can’t. You know, to do that, your valves would float or your piston frequency would be too high, or you’re going to blow your bearings or something.” And the jet engine guy goes, “No, no, no, you don’t understand. We don’t have valves, we don’t have pistons, we don’t have any of that stuff, okay?” Yeah, we do have some of the things that are similar. In this case, for example, we do have a compression phase and we do have an exhaust phase and all the other things that you have in a traditional engine. But it’s different. And it achieves the same end, but it’s different. And so, why do I say this? Because I really want us today -- I want you to, for the first few minutes, to forget everything that Andy’s told you, okay? Much as it will be useful later in the presentation, please work with me on what this thing really is. By the end of it, trust me, you will understand it, and then you’ll have much smarter questions. But do ask questions along the way, because you will have some. But I think what you’re going to find with a lot of it: I’m going to say, “Hold on to that question.”
So, what are the fundamental principles in the system or the fundamental components in the system? Two things: things we call engines and things that we call atoms. If you don’t understand those, there’s not much point listening to the rest of what I’ve got to say. So, let’s try and figure out what those are. Engines, very simply, are processes, Linux processes, if you want to think of it that way. An engine is a Linux process. A database consists of some number of those things running. They’re all peers of each other. There’s no central anything, there’s no supervisor, there’s no authority anywhere. There’s no central name server. It’s just a bunch of processes, peer processes that are in some sense collaborating with each other, okay? When customers think about it, they think, well, what we’re talking about is a machine or a VM or a container or something. Yeah, but what it really is, is these processes. You could run them all on one machine. Not a good idea. You can run them on lots of machines; you can split them between some number of machines. Engines are just -- the nodes in the system are these engines that are basically processes. They’re very asynchronous, both internally and externally, and very highly multi-threaded. They’re very, very busy things, these engines, and you’ll see kind of how they fit in a moment.
When I say they’re peer-to-peer, they’re also autonomous. So, unlike a traditional top-down kind of design, these aren’t systems that somebody says these are the things that must all join up to be a database. Like, when you think about partitioning a traditional database, you think, well, I’ve got four machines and I need to divide the database between the four machines -- that’s a top-down process. This is a bottom-up thing. These things can just join in and leave, right? An engine can join in the database any time it wants, it can leave any time it wants. We can put some constraints on that for management purposes, but architecturally, engines are dynamic. They come in, they go out, that’s how we scale out. That’s the whole point about scaling in and out. And then, what they do is that they delegate as much as they possibly can to the next thing I’m going to tell you about, which is called atoms. And so, that’s a sort of a -- it’s basically an actor pattern. But let’s put those two things together in a minute.
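The peer model just described -- a database as nothing more than the set of engine processes currently participating, with no central authority, joining and leaving at will -- can be sketched roughly like this. All class and method names here are illustrative assumptions, not NuoDB’s actual internals.

```python
# Toy sketch of the bottom-up peer model: a "database" is only the
# membership set of engines, with no supervisor or central name server.

class Engine:
    def __init__(self, name):
        self.name = name
        self.atoms = {}                  # atom_id -> local replica

class Database:
    """No central anything: engines join and leave autonomously."""
    def __init__(self):
        self.engines = {}

    def join(self, engine):              # an engine may join at any time...
        self.engines[engine.name] = engine

    def leave(self, name):               # ...and leave at any time
        self.engines.pop(name, None)

db = Database()
db.join(Engine("engine-1"))
db.join(Engine("engine-2"))
db.leave("engine-1")                     # scale in: the database keeps going
```

Scaling out is just `join`; scaling in is just `leave` -- there is no top-down step that divides the database between a fixed set of machines.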
So, atoms. The most important thing to understand about this architecture is it’s all built around these things, which are basically container objects: smart objects that contain data. Any data in the system. Not just your user data, your tables and your fields and everything else, but your indexes, your meta data, your -- everything. Everything that’s persistent state or even, in some cases, not persistent state -- is one of these containers. The containers are intelligent in multiple ways. They can certainly serialize themselves to disk, they can serialize themselves to networks, they can replicate themselves, they can keep their replicas up to date. They can do all sorts of things. But there’s all these kinds of atoms. And so, just to talk about why that’s very, very important -- and you’ve probably heard this a zillion times in the context of Docker, but I’m going to say it again: there’s nothing interesting about shipping containers. They’re just big, steel boxes, okay? What’s interesting is that the fact that they’re standardized means we can have a vastly more efficient ecosystem around those containers, okay? So, it’s not the container that’s changed the world’s shipping industry. It’s the fact that these containers can be loaded and unloaded from ships very, very fast, that they can get onto trains and out of the port as quickly as possible, that they can be handled in these highly efficient ways. And we’ve now built our world’s ports and our cranes and our trucks and our trains and our -- everything else to be around these containers. None of those systems need to understand what’s in the container, obviously. That’s the point. And doing that as an internal architecture of a database is radical. One of the biggest breakthroughs that Jim came up with was to say let’s get away from thinking about tables in memory or tables on disk or anything like that.
These are just container objects, and all of the interesting things that happen in this system are really about these containers, what we call atoms. You’ll hear me talk about atoms quite a lot.
So, what are atoms? This is not supposed to be exhaustive. I’m certainly not going to go through and explain what every one of them is. But just so you can get a feeling for it, there are a variety of kinds of atoms. All of them, obviously, are atoms. They’re all handled by the system in the same way. They’re instantiated, they’re deleted, they’re copied, they’re stored, whatever else, in the same way. There are various kinds of atoms. If you look at this kind of dark blue one in the bottom, it says what it is. That’s where your user data would be, for example. But there’s some more exotic ones. If you look at the ones that are called Catalogs, these are the atoms that are used to find other atoms in the system, okay? So, atoms are used as the mechanism for doing, essentially, a name service or directory service. I just kind of put that up so you can understand: everything, literally everything in this system is atoms. The maintenance of consistency between all those atoms is done in the same way, whether they’re user data atoms or index atoms or any other kind of atoms. Everything’s an atom, and therefore, the whole system turns into, basically, an atom processing system. At a deep level, NuoDB is just a bunch of these engines that are loading and saving and instantiating atoms.
So, as we build up -- and that’s what we’re trying to do here, so that there’s an understanding of how it all works. I just want to talk about atom lifecycles, and to do that, I’m going to, for a moment, say to you please forget everything you know about persistence. We’re going to talk only in memory, and we’re going to talk only read only, because it’s easier to understand that step before we try and understand the really hard stuff. So, what you’ve got here is five engines. You’ve got a bunch of atoms. There’s only three in each one, but could be any number. And these are -- think of these as in-memory instantiated objects, container objects, okay? That’s basically what they are. And some of them are replicated. You’ll see -- if you look around, you’ll see that, you know, I don’t know if I can pick one out there, but there’ll be some atoms that -- number 12, for example, is in both engine one and engine two. There’s nothing about the system that requires these things to be sort of singletons, right; they can be replicated across the system, and we’ll find out more about how that works in a minute. So, you’ve got engines contain atoms, and now we start getting into some of the kind of slightly strange stuff. Traditionally, in a database, if you’re running Postgres or MySQL or Oracle or any of these things, if you’re wanting to load data into DRAM, you’re loading it off the disk, right? That’s how it works. It’s caching stuff off the disk and doing stuff with it.
Here’s the example. Engine one decides I need atom number 56. It’s loading it from someone else’s memory, right? It does the same for atom number 91. What we’re talking about here is that these engines are capable of loading atoms from any other engine. In-memory. This is all in-memory, right? And so, atoms can arbitrarily load data from any other -- engines can load data from any other engines at any time. And that, of course, could be 20 or 30 or 100 of these engines with these various objects.
So, what you’re starting to see here is something that feels like an in memory distributed cache, okay? The ability to be able to load data between them. I think I mentioned to you, and it’s sort of implicit here: all atoms have a name space and they’re numbered in that name space, and they’re unique. But it’s not the instance that’s unique. It’s the content that’s unique. It’s the atom itself. It could have multiple replicas. So, you can do this kind of loading with -- well, I want to just make a couple of points. First is, when I say loading, it’s not moving that atom from one engine to another. What it’s doing is creating an instance of a replica of that atom, okay? So, it’s basically saying, okay, you’ve got that atom over there, atom number 23. I want atom number 23. I’m going to construct it here, and it’s going to be a peer of the one that you’ve got, okay? It’s not being moved. There’s no sort of transactional moving of atoms around. What happens is, I just create another one and it’s got the same state. I recognize that instantly for people like yourselves, you’re starting to think, well, how do I keep them up-to-date with each other? And I’m not going to get into kind of all sorts of transactional issues and two-phase commits and, you know, whatever, but we’ll deal with that, right? For the moment, it’s read only. That’s why we’re talking about it like that.
So, probably enough on that. I think one point is, again, just to say when you do replicate an atom, you’re not secondary to that atom, right? It’s not like there’s a superior relationship between the two atoms. The other thing that’s kind of interesting is that you can drop atoms. Remember I said that these engines are autonomous? They can, at any time, drop anything they want. The engines don’t have to be the same executable, by the way. There’s not -- anything that behaves according to these network protocols can take part and has to behave according to these rules. One of the rules is we don’t care what you do with these atoms. You can load them, you can drop them, you can do whatever you want, as long as you tell us which ones you’ve got so that other people can load them whenever they want. That’s -- you can -- immediately, there are some questions about how do we ensure that we’re not dropping valuable data on the floor, and we’ll come to that in a minute. But the question -- the main point here to understand is that these are these kinds of autonomous peers that are joining together, taking part, and acting as autonomously as possible in the system.
There are exceptions (audio dropout) to being able to drop things on the ground, and I’ll get to that in a minute. It’s quite important, really. In fact, what it is, is that some of these engines contract with the rest of the system. They commit to the rest of the system that they’re not going to drop atoms, okay? Those are special ones, and we’ll talk about those ones in a minute. But the general rule is you can load stuff and you can drop stuff whenever you want. It is, basically, a sort of a cache. We’ll talk about it as this durable distributed cache. And so, what you’re seeing here -- we’ve talked about how these atoms could be moved between these in-memory nodes, how they can be dropped in these in-memory nodes. You’ve got cache algorithms. The cache algorithms don’t have to be the same on all the nodes. They’re local, and that’s probably implicit.
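The load/drop rules just described -- any engine may replicate an atom from any peer that holds it, and may drop its replica freely unless it has contracted never to drop -- might look like this in miniature. The names and the `durable` flag are assumptions made for the sketch, not NuoDB API.

```python
# Sketch of the cache rules: replication instead of movement, and free
# dropping for ordinary engines but never for the "contracted" ones.

class Engine:
    def __init__(self, name, durable=False):
        self.name = name
        self.durable = durable        # contracted never to drop atoms
        self.atoms = {}               # atom_id -> state

    def load_from(self, peer, atom_id):
        # Replication, not movement: the peer keeps its own replica.
        self.atoms[atom_id] = peer.atoms[atom_id]

    def drop(self, atom_id):
        if self.durable:
            raise RuntimeError("this engine contracted never to drop atoms")
        self.atoms.pop(atom_id, None)

sm = Engine("engine-5", durable=True)     # the "I'll always have it" engine
sm.atoms[62] = "row data..."
te = Engine("engine-2")
te.load_from(sm, 62)                      # both now hold replicas of atom 62
te.drop(62)                               # an ordinary engine may drop freely
```

Because the durable engine keeps everything, the ordinary engines can run whatever local cache-eviction policy they like, which is exactly why the policies don’t have to match across nodes.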
Andy: Allow me to interject at this point and say the spoilers… It’s not a one-to-one mapping, right?
Barry Morris: One way to think about it is that atoms are pages, and that can lead you down an unfortunate track, which is that atoms are much more live than that, as you will see in a minute. An atom is an identifier and a state and a queue of deltas, right? A page doesn’t have that sort of live-ness about it. So, we’ll get onto that. But, yes, if you want to, at a very -- kind of a basic level think of it as atoms are a bit like pages -- they are, if you think of pages as being a lot more intelligent. So, most people think of databases as being about storage. Jim Starkey, my co-founder, regards this as the single biggest mistake that people make when they design database systems. And so, what they do is they start with storage. They say it’s a storage-centric transactional system, and then once you do that, you can’t scale out, because now you’re locked into some kind of a single storage mechanism. And so, what -- so far, we’ve talked about something that behaves entirely in memory, and the question is -- but then, you’ve still got to do storage, don’t you? And the answer is yes, obviously.
And so, imagine for a moment that there’s one of these engines that we’ve been looking at on a screen that says the following three things. One, “Guys, I have all the atoms. If you can’t find the atom anywhere else, you can always come to me because I’ve got all the atoms,” okay? So, one of these engines says, “I’ve got all the atoms, guys.” It also says, “I’ll always have all the atoms. I’m going to keep them,” right? So, also, “Therefore, you’re safe to drop them if you want to, because, worst case, I’ve got them,” right? It also says, “I’m going to undertake to keep them, even under circumstances of power failure or whatever else that we normally think of transactional systems being able to survive.”
So, those three things -- it does say, by the way, “I’m probably a bit slower,” right? And as we’re sitting here thinking, well, how does he do that? Well, probably sticking it on disk or something, right? And that’s why he’s probably a bit slower. “But other than that, I’m just like the rest of you. The only difference between me and you guys is I’ve got all the atoms, I’ll always have the atoms, and I’m guaranteeing that I’m going to have them under certain special circumstances, as well.” And so, you end up with something that looks a bit more like this, where you’ve got this -- engine five has said those three things, I’m the guy that’s got all the atoms. I’m faking the fact that they’re all in memory. I’m slower than the rest of you. Notice something. None of the other guys know anything about whether it’s stored on disk, let alone how it’s stored on disk. The disk storage kind of strategies are nothing to do with the core database, right? In practice, that could be stored on any number of key value stores, could be stored directly on your file system. It can be stored in NVRAM, gets stored anywhere you like and the rest of the system doesn’t know or care. All it sees is performance and guarantees, right? And, of course, as I said earlier, what that does is it turns the other guys into a place where they go, “Well, in that case, I can throw stuff away whenever I want, because worst case, people are going to just be able to go to this other guy that’s got all the atoms, right?”
So, I come into it through this kind of way of the in-memory and everything else, because it’s very easy for people to misunderstand when they see a disk on a diagram like this. They go, “Oh, I get it. Yeah, okay, let’s start talking about that. I want to understand what your on-disk indexing structures are and everything else.” Stop, right? That disk is really just a key value store that’s taking these atoms that are being serialized -- they’ve been clonked on the disk in some format. Doesn’t really matter, right? What really matters is the protocols that are going on between these guys in memory. And think about it: if engine one needs to get some data, as you can see, it’s able to get it from any of these guys. Eventually, someone might not be able to -- like, we’ve got engine two, here, not being able to get object 62 from somewhere in memory. It goes to the guy that’s got everything on disk, and the guy that’s got everything on disk says, “Here’s object number 62.” Once again, when he does that, it’s not being moved, right? It’s just being replicated. Remember I said that? Nothing ever gets moved. It’s just that engine two is just going to engine five and saying, “I want to replicate that object and have my own private instance of it.” Now it’s sitting in two places, one on this guy, engine five, and one on engine two.
So, the guys that are doing this kind of magic stuff about persisting things, we call them storage managers. Might not be a good name, because it tends to make people think about storage engines and things like that. They really are just engines that happen to do some backing store. And the other ones are called transaction engines. So, if you ever see -- if you go to our website or something, you’ll see a lot of language about transaction engines and storage managers. They’re the same thing, okay? Just one of them is the guy that’s kind of volatile and DRAM-based. The other is a guy that stores some of the atoms, okay? Not much else there that I need to -- oh, let me just talk about this. It’s not architectural how the data is structured in deeper storage. Sometimes people come along and say, “Yes, but is it a column store or is it a row store?” That’s back to the jet engine analogy. Let’s not talk about spark plugs and let’s not talk about valves, okay? Wrong question. On the back end, it could be both. We could have an atom storage strategy which is column storage. We could have one which is row storage. We could change it tomorrow, right, on a running database. Wrong question. And the same is true of exactly what the underlying storage architectures are. We’ve run this thing on a wide variety of different things. Typically, people run it directly on the file system, but you can do a wide variety of different things.
So, furthermore: given that that storage manager is really just a specialized transaction engine, what’s to stop other transaction engines from just doing the same thing? Is there anything in the architecture I’ve just described to you that says there can only be one? No. Nothing in it says there can only be one. I do understand we’re going to have to talk about transaction commits and things like that in a minute. But just for the moment, we’re thinking read only and whatever. There’s no reason why engine four can’t also say, “You know what? I’ll also have all the objects.” Stick them on some storage mechanism, maybe a different storage mechanism from the first guy, right? And now, if one of these other engines decides it wants to load a particular object, it gets a choice. It can go to one of the guys in memory if that object exists in memory. If it doesn’t exist in memory, it can go to one of the storage managers, right? By the way, I know up front, at very low cost, the response time of going to any of these guys. So, when I’ve got two storage managers and I know that the object is in both storage managers, I can look at the response time and go, “I’m going to pick the one that’s networked closest to me.” Okay, which, by the way, might not be the one closest to me if it gets really busy, because the cost function is going to start favoring the other one and you get natural load balancing. Yeah?
Q: Is the cost function something that’s built in when you build a system, or is it something that changes when you run the system -- like, somebody remembers how fast…
Barry Morris: No, it’s a heartbeat. It’s a heartbeat between all the nodes. So, it’s constantly adjusting, which is why you get constant load balancing, okay, because if I’ve got a choice of five places to go and get an atom and everybody goes to the first place, he gets busy -- he’s no longer top of the list. I’m going to go to the next one. So, the system just naturally load-balances, and that’s typical of the kinds of algorithms that go on in the system, okay, but there’s more to it. What I said, if you listen carefully, was -- I said, well, this guy’s really a redundant copy of the other guy. Both of them have got all the atoms. Doesn’t have to be that way. Actually, we recommend that you do it that way, because then you’ve got K-safety, right? You’ve got redundancy, and one of them can go away and it’ll keep going. But you could say, “Well, you store all the even-numbered ones, and I’ll store all the odd-numbered ones,” okay? Now we’ve partitioned the system. From the perspective of the transaction engines, they don’t care. They just go, “Who’s got it? Oh, that guy’s got it. I don’t care why he’s got it.” In fact, we don’t do it based on atom numbers. But if you’re going to partition it, you would typically do it using a SQL predicate, okay? But you can partition it any way you want. You could take the population of the US and partition it by men and women or something, and -- you know? That’s great. But you could also partition it by, you know, state or something. In fact, you could do that in parallel. You could have what we call one storage group, which is these two guys that have partitioned it based on sex, and another storage group somewhere else that’s partitioning it based on first name or something, right, in the same system, at the same time.
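The heartbeat-driven cost function from the start of that answer can be sketched as follows. The numeric costs are made-up stand-ins for whatever latency/load estimate each engine keeps per peer; the transcript doesn’t specify the real cost model.

```python
# Sketch of cost-based source selection: fetch an atom from the holder
# with the lowest current cost, where costs are refreshed by heartbeats.

def pick_source(holders, costs):
    """Choose the holder whose heartbeat-reported cost is lowest."""
    return min(holders, key=lambda engine: costs[engine])

# Costs as last reported by heartbeats between the nodes (illustrative).
costs = {"sm-1": 0.4, "sm-2": 0.9}
first_choice = pick_source({"sm-1", "sm-2"}, costs)

costs["sm-1"] = 1.5          # sm-1 got busy; the next heartbeat reflects it
second_choice = pick_source({"sm-1", "sm-2"}, costs)
```

Because a busy node reports a higher cost and immediately stops being everyone’s first choice, load spreads out without any coordinator -- the "natural load balancing" in the answer above.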
So, this isn’t -- you know, when we hear people talk about, “Can you partition the system?” We’re, like, “No, no, no, stop,” right? You can partition the system lots of different ways at the same time, redundantly, okay? And lots of power there. Yeah?
Q: So, how do you ensure the consistency of the cache?
Barry Morris: I’m going to come back to that. That’s the hard -- that’s really where the patent is, okay? Because a lot of this is fun and interesting. It’s just good design. That’s the hard question. So, let’s move quickly.
So, we’ve already said this. Basically, it can store overlapping, it can be redundant, it can be partitioned, it can be whatever you want. We would normally say to people don’t deploy with a single storage manager -- why would you do that? Much better to have two or three or five. If you’re running it in multiple data centers, put two in each data center and so on. Oops, I just told you that it can run in multiple data centers. We’ll get onto that. So, what does the whole system look like? Pretty much what you expect. So, applications or app servers up there -- you’ve got a thing on the right, which I’m not going to talk about much, which is a whole distributed system of its own. That’s a Raft-based distributed system. Basically, a distributed sort of a -- what would you call it? You’ve got a bunch of brokers and load balancers and agents and stuff like that that have got shared state.
Q: Directory service.
Barry Morris: No, it’s not really so much a directory service. It’s really a domain manager, which is the security aspects of the system. It’s load balancing between them. It’s membership of the database, things like that. But, yeah. So, there’s that, and then we’ve talked about this layer, which is what we call the platform layer, the atom layer, and there’s two pieces to it: the transaction engines and the storage managers, which are really the same thing. That’s what it looks like, and we’re going to have a few more diagrams that look like that.
Before we do that, what about SQL? I’ve said all of this -- sorry, question.
Q: If there’s no coordinator, how does each application know, like, which transaction engine…
Barry Morris: So, good question. We’re going to go through that in greater depth in a little bit, but here’s the answer. When this -- say application one wants to connect. There’s five transaction engines, right? Who do I talk to? I go to the brokers and I say, “Give me a” -- okay? And they’re redundant brokers. So, I go to the broker, I say, “Which transaction engine should I talk to?” There’s a load-balancing algorithm there, which could be round robin, it could be sort of a hash-based thing. It could be sort of a usage-based thing, whatever. It tells you which one to go to. And if that broker goes away, as I say, it’s redundant. You go to the next broker, okay? In fact, that’s also how it works when things fail. If that transaction engine goes away, the application goes, “Oops! What do I do? I go back to the broker, broker gives me another transaction engine, get going again,” all right?
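The broker behaviour in that answer -- a redundant set of brokers handing out transaction engines (round robin here, though the answer notes it could be hash- or usage-based), with the application simply asking again on failure -- might look roughly like this. The API is hypothetical.

```python
import itertools

# Sketch of broker-based connection brokering with round-robin assignment
# and failover to the next broker if one is unreachable.

class Broker:
    def __init__(self, engines):
        self._cycle = itertools.cycle(engines)   # round-robin policy

    def assign(self):
        """Tell the application which transaction engine to talk to."""
        return next(self._cycle)

brokers = [Broker(["te-1", "te-2", "te-3"])]     # normally redundant

def connect():
    for broker in brokers:        # broker down? just try the next one
        try:
            return broker.assign()
        except Exception:
            continue
    raise ConnectionError("no broker reachable")

engine = connect()
```

Failure handling is the same call: if the assigned transaction engine dies mid-session, the application goes back to `connect()` and gets a fresh engine.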
So, I’ve said all of this and I’ve hardly mentioned the word SQL. And yet, if you go to our website, it’s a SQL database, right? SQL database. Why is that? Because that’s where the market is, okay? It’s my little not-funny joke when I say that, you know, there’s a 30 or 40 billion dollar SQL market and a zero billion dollar not-SQL market. And so, we’re after the SQL market. It’s nothing to do with what the architecture can do. And so, let me take you through it. This is basically what a transaction engine looks like. That’s what a storage manager looks like. It’s the same thing. I already told you that. It’s the same executable, different command line flag. And inside them, yes, there’s a SQL engine, sitting on top of the atom layer. And there’s a KV API, a storage API, on the bottom of the engine. In the case of the transaction engine, it doesn’t use that storage API. In the case of the storage manager, it doesn’t use the SQL engine. The SQL engine doesn’t really know it’s talking to a distributed database. It’s a conventional SQL engine. It’s our own. It’s a modern, next generation, very good ANSI support, very rich -- you know, great optimizer, great execution engine, all built from scratch by us. It sits on top of every one of these engines that you see, okay?
So, when a connection comes in, you’re running that whole SQL parsing, SQL execution, SQL optimization, all of that stuff. And so, you might say, “Well, gee, Barry, then couldn’t you replace that engine with something else?” The answer is yes. There’s nothing special about it being SQL, okay? Those atoms don’t know they’re a part of a SQL database. The storage on the disk is stored by value, a self-describing store. It doesn’t know that it’s a SQL database. The SQL part is really only the top part of each of these transaction engines, plus the client drivers and everything else. And so, is it possible for this to run and be a JSON store? Yeah, we have one in the labs. Is it possible for it to be a graph store? Yeah, we have one in the labs. It’s completely possible, yep.
Q: So (inaudible) consume atoms or (inaudible), modeled on the atom attraction?
Barry Morris: So, the layer of the engine which is the atom layer deals only in atoms and knows nothing else -- but it gets asked to hand atoms up to the SQL layer. And the SQL layer has a thin shim that understands it’s not a standard ISAM API; it’s an atom API. It basically says, “I want atom number 791.” It gets it.
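A hypothetical sketch of that shim, assuming an invented `AtomLayer` class: the SQL layer asks for an atom by number, and the atom layer returns it, fetching it if it isn’t already local:

```python
class AtomLayer:
    """Invented stand-in for the atom layer beneath the SQL engine."""
    def __init__(self):
        self.local_cache = {}          # atoms already in this engine's memory

    def fetch_remote(self, atom_id):
        # Stand-in for asking a peer engine or a storage manager for the atom.
        return {"id": atom_id, "records": []}

    def get_atom(self, atom_id):
        # The SQL layer's whole view of the world: ask by id, get an atom.
        if atom_id not in self.local_cache:
            self.local_cache[atom_id] = self.fetch_remote(atom_id)
        return self.local_cache[atom_id]

layer = AtomLayer()
atom = layer.get_atom(791)             # "I want atom number 791." It gets it.
```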
Barry Morris: And then, at the SQL level, it just treats it just like a conventional SQL database.
Barry Morris: Yeah. So, it could, in fact, be a non-SQL database. That’s not where we’re going, okay? We may or may not take that to market at some point. What we’re really interested in is being the world champion at being a distributed SQL database, and we think we’re basically there.
Now, I’m going to walk you through a select -- and the question earlier about how things are connected, here we go. So, application one connects to the broker, the broker says, “Yeah, you need to go to transaction engine one.” And the select statement, which is a read, of course, comes in and connects into that transaction engine. Nothing very interesting there. Sends a select query. At that point -- basically, the query has gone through the SQL parser, gone through the optimizer, and built up an execution plan. At the bottom of the execution plan, it says, “Oh, I need the following set of atoms.” It goes and fetches those atoms. If it can, it’ll fetch them from memory. If it can’t, it’ll fetch them from nearby memory, not memory of a machine that’s in London. And then, if it has to, it’ll go to one of the storage managers. It gets all of those atoms together, and from there on, as we just said, it’s kind of like a conventional SQL system, right? At that point, it’s got all the data, it does what a conventional SQL system does, and it hands back the result.
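That fetch order (local memory, then nearby memory, then a storage manager) can be sketched roughly as follows; the `fetch_atom` function and the dict-based caches are invented stand-ins for the real cost-based lookup:

```python
def fetch_atom(atom_id, local_cache, peer_caches, storage_manager):
    """Fetch an atom from the cheapest place that has it."""
    if atom_id in local_cache:                 # cheapest: already here
        return local_cache[atom_id]
    for peer in peer_caches:                   # next: some nearby TE's memory
        if atom_id in peer:
            local_cache[atom_id] = peer[atom_id]
            return peer[atom_id]
    atom = storage_manager[atom_id]            # last resort: go to an SM
    local_cache[atom_id] = atom
    return atom

local = {1: "atom-1"}
peer = {2: "atom-2"}
sm = {1: "atom-1", 2: "atom-2", 3: "atom-3"}
result = fetch_atom(3, local, [peer], sm)      # only atom 3 needs the SM
```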
Q: (inaudible) the [data E?] driver, it talks to the broker (inaudible)
Barry Morris: Correct. Yeah, correct. And in some cases, we’ve been able to even abstract failure like that, so that when a transaction engine goes away, the JDBC driver itself goes to the broker, finds out, you know, where to go next and reconnects for you. So, with some client-side drivers, the failure isn’t even visible. It’s not a big deal anyway, right? So, that’s it. So, that’s your select. Yeah?
Barry Morris: So, atom affinity is kind of a -- very interesting topic, and I’m keen to talk to some of you smart guys about this, because you -- it’s really kind of cache management, right, is a kind of cache algorithm is what you’re really talking about, and about how does a connection that’s coming in figure -- there’s affinity of the atoms that are already preloaded onto this particular node. The simple way that our customers do that today is by having a hash-based load balancer, right, so that the client effectively is saying, “This is a similar kind of part of the database that I’m talking to, here’s my hash key,” right? And we just use that to redirect you back to the same transaction engine. But that’s a fairly crude thing to do, with a -- that could be automated, yeah.
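A minimal sketch of that hash-based affinity routing, assuming an invented `route` function and a deterministic checksum in place of whatever the real load balancer uses:

```python
import zlib

def route(hash_key: str, engines):
    """Pin a given client hash key to one transaction engine.

    A deterministic checksum (crc32) is used so the same key always lands
    on the same engine, keeping that engine's atom cache warm for it.
    """
    return engines[zlib.crc32(hash_key.encode()) % len(engines)]

engines = ["te-1", "te-2", "te-3"]
first = route("customer-42", engines)
again = route("customer-42", engines)   # same key -> same TE, warm cache
```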
I’m sure everybody here understands MVCC. Jim, my co-founder Jim Starkey, was the guy that built it into the first commercial product 30 years ago. That was Rdb/ELN from Digital, and also InterBase, which is another of his products. It’s the alternative to lock-based concurrency, you know this. Readers don’t block writers, writers don’t block readers. This is what Jim’s whole life has been about as a database guy. And so, as you would expect, NuoDB is kind of a steroids version of MVCC, right? This is a distributed MVCC, done by the guy that kind of invented it.
So, I don’t want to spend too much time on it, because we don’t have a lot of time. But I wanted to just make sure everyone understands is that at a -- so, if you want to think of it as a record level, what’s inside these atoms, okay, is MVCC, right? And that’s the way to understand as we move on to updates. So, updates are quite hard, and I’m warning you, you probably -- if you’re thinking hard, you’re going to come up with 70 questions that I haven’t answered up front, but let’s talk about them. Same deal goes -- you’re going to do an update. I’ve shut it down to one transaction engine and one storage manager, so we can’t get too confused, and off we go. So, we go to the load balancer. Of course, it sends us to the transaction engine. We send our update query, the update query says, “Hey, I need all those objects.” The only place to get the objects from is the storage manager. Calls the storage manager, says, “I want to create my own instances of those objects.” Off we go, it does mutation locally, right? So, this is the first thing where we’re doing an update. That update query said, “You need to change these atoms.” So, he did, right? What does change the atoms mean? It means create new MVCC versions, right? New versions of the records, that’s what this is. That’s why they went red. So, these -- now you have atoms up here that have new versions of the records, as yet uncommitted. They’re dirty, right? They’re dirty versions at this point. And the green ones down here don’t know anything about it, right? So, what happens next is that we get asynchronous -- that’s important -- replication going on, right? Remember I said the transaction engine never just takes this updated thing and says, “Here, here’s the whole atom. Go and store it.” That’s not the architecture. The architecture is to say to the guys locally, “Oh, what are the differences between your last version of this record and the new version of this record?” It’s these things. Okay, great. 
Send that replication message asynchronously, queue it to the guy -- to the other guy, to your kind of replica, right? And that gets sent off whenever it gets sent off, right? And that happens.
Now, this could be within a transaction; you could do 10 million mutations of those objects. And what’s happening is we’re pumping out these replication messages. They’re about 20 bytes typically, okay? They’re very, very small. They’re batched, they’re asynchronous, they’re also encrypted. But anyway, those are being sent off, sent off, sent off to these other guys. And what’s the other guy doing on the other end? When they’re coming in, he’s saying, “Okay, I’m kind of creating these dirty records.” They’re not canonical. It’s not part of the database, it’s not been committed. But, you know, keep going, and eventually, of course, what happens is I’m either going to send a commit or a rollback, because the client just told us to, right? The client just said commit or rollback -- or the system said rollback. The rollback is easy, it’s a no-op. It’s just, “By the way, you know all that stuff I sent you? Forget it, doesn’t matter,” right? The commit is, “Oh, you know that stuff I sent you? It’s now canonical,” right? Exactly how the commit works is quite a long discussion, and quite an important discussion. I hope we can get to it, but I just wanted you to understand, at a simple update, that’s how it works, okay?
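A toy sketch of that flow, with invented names: dirty MVCC versions accumulate on the replica as small deltas arrive asynchronously off a queue; commit makes them canonical, and rollback just forgets them:

```python
from collections import deque

class Replica:
    """Invented stand-in for a node receiving asynchronous deltas."""
    def __init__(self):
        self.dirty = {}       # uncommitted record versions, not canonical
        self.committed = {}   # the canonical database state

    def apply_delta(self, key, value):
        self.dirty[key] = value            # dirty: not yet part of the database

    def finish(self, commit):
        if commit:
            self.committed.update(self.dirty)   # "it's now canonical"
        self.dirty.clear()                      # rollback is just forgetting

queue = deque()                # replication messages, batched and async
replica = Replica()
queue.append(("row-7", "v2"))  # a ~20-byte delta, sent whenever it gets sent
while queue:
    replica.apply_delta(*queue.popleft())
replica.finish(commit=True)    # the commit message arrives behind the deltas
```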
Q: You need to know that the commit has to be synchronous.
Barry Morris: Correct, correct.
Q: -- then you have all these asynchronous -- So, you need to do a second flush, right? “Oh, by the way, I told you a bunch of these other deltas, make sure those get in,” --
Barry Morris: Yes -- is the answer.
Q: Okay. (laughter)
Barry Morris: There’s more to it, but yes.
“Okay, so that was a trivial one, Barry. That’s not how we run our systems, okay?” So, what if you’ve got a multi-replica update, okay? So, what’s changed here is that we’ve got -- by having more of these transaction engines and by having another storage manager, now we’ve got multiple replicas, right? So, let’s see what happens. Same deal, we ask who to connect to. He says, “That guy over there. He has -- doesn’t seem to have been doing any work. He’s got nobody in cache, go for it,” right? So, connect to that guy, with same deal. We send in our update. In this case, we go, oh! We don’t have to fetch everything from down here. We can fetch it from these other places, so we’ll go do that. One of them, we have to fetch from down in the storage manager, we’ll do that, and we go ahead and we do our mutation, which is the same thing. So, now we’ve changed these atoms in some way. And once again, now we’ve got to do our replication stream. And what happens is, as you would expect, we go back to the guy that we replicated it from and we say, “Here’s the replication stream, off you go.” That’s great, isn’t it?
No, not great. The reason is because there are other replicas in the system. And so, you’ll notice that, in fact, this guy, number 62, didn’t just have to update the 62 on TE-4. He also has to update the 62 on SM-2, right? And so, this is kind of a pub-sub mechanism. Think of it as pub-sub: if you’re atom number 62, you’re interested in everything that happens on 62, whether or not somebody replicated from you, okay? At this level, this sort of replication stream that we’re talking about is basically being published out to everyone that’s interested in it, okay? And so, in the more general case -- this is why I simplified it -- the replication updates go to every node that holds a copy. That’s important. Now, this is the area that Andy and I were talking about. Because we didn’t just replicate to the storage manager -- if we just replicated to the storage manager, that would be a shared-disk architecture, okay? Because then everyone else has got a dirty atom and they have to go and pick it up off the storage manager. What’s happening here is that the guy up there, 62 up there on TE-4, is fully up-to-date, right? Doesn’t have to go to the storage manager. And the storage manager’s just kind of a log, almost, as far as he’s concerned. So, of course, we do have to do the same thing with this commit and rollback. How are we doing on time? Yeah?
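The pub-sub idea can be sketched like this; the `AtomBus` class and the per-node queues are invented for illustration. Every holder of atom 62 subscribes to it, so a delta published by one engine reaches all the others, not just the replication source:

```python
from collections import defaultdict

class AtomBus:
    """Invented pub-sub sketch: deltas fan out to every interested node."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # atom id -> interested nodes

    def subscribe(self, atom_id, node):
        self.subscribers[atom_id].append(node)

    def publish(self, atom_id, delta, sender):
        for node in self.subscribers[atom_id]:
            if node is not sender:             # don't echo back to the writer
                node.append((atom_id, delta))  # each node's inbound queue

bus = AtomBus()
te1, te4, sm2 = [], [], []                     # nodes modeled as inbound queues
for node in (te1, te4, sm2):
    bus.subscribe(62, node)                    # all three hold a copy of 62
bus.publish(62, "delta-1", sender=te1)         # TE-1 mutated atom 62
```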
Q: So, in the previous slide -- so, how were the grand (inaudible) in one role? Where are the atoms (inaudible) with them?
Barry Morris: Okay, so, good question. So, remember earlier on I showed you that there are different types of atoms, and there’s one kind I said is called a catalog, okay? So, when you bootstrap, when your transaction engine is bootstrapping, you load one object. Object number one, it’s called, actually, which is the root of the catalog. It’s a tree. The catalog’s a tree, okay? And the catalog is, essentially, a name service. The catalog says, you know, here are all the atoms in the system and here’s where you’ll find them right now, okay? It also tells you the latest -- we were talking about the cost function, the heartbeat. That’s also in the catalog, and there’s a bunch of other stuff that’s in the catalog. Now, you’re going to say, well, how does the catalog keep up to date? Using exactly the same mechanisms. That’s why you use atoms. And remember I said it’s great to have everything being a container? The catalog itself is a distributed system. There’s no central directory -- it’s not like Hadoop, where there’s a central directory service. This directory service is completely distributed and partially replicated. I don’t load the whole catalog tree. I just load the pieces of it that I care about, right?
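A hypothetical sketch of that idea: the catalog as a tree rooted at atom 1, acting as a name service, resolved by walking only the branches you care about (all ids, names, and locations here are invented):

```python
# Invented catalog tree: atom 1 is the root, interior atoms name their
# children, and leaves record where an atom can currently be found.
catalog = {
    1: {"children": {"tables": 10}},           # object number one: the root
    10: {"children": {"accounts": 42}},        # the "tables" branch
    42: {"locations": ["te-2", "sm-1"]},       # leaf: where atom 42 lives now
}

def resolve(path, loaded):
    """Walk the catalog, demand-loading only the branches on the path."""
    atom_id = 1                                # bootstrap: load atom 1 first
    for name in path:
        loaded[atom_id] = catalog[atom_id]     # partial replication of the tree
        atom_id = catalog[atom_id]["children"][name]
    loaded[atom_id] = catalog[atom_id]
    return catalog[atom_id]["locations"]

cache = {}
where = resolve(["tables", "accounts"], cache)
```

Only the three atoms on the path end up in `cache`; the rest of the tree is never loaded, which is the "partially replicated" point.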
So, see if we can move to the hardest part. I’m trying to make sure I’ve got time for questions here. Okay, so, what about update conflict? This is not some sort of single-user analytics database. This is millions of concurrent users. It’s a transactional, OLTP-style, you know, kind of operational database, and you’re going to have lots of people updating in parallel. And, “Barry, are you really sure that you can have 100 TEs and everybody updating in parallel, and this thing’s going to behave in a transactionally ACID fashion?” And the answer is yes, and this is a big part of it. So, these two guys, now -- two applications are saying, “You know what? I want a TE. I want to connect to a TE.” And they go to the brokers and the brokers say, “Yeah, sure. Here are the guys to connect to.” They connect to them, the usual stuff. They send their updates. I changed the colors a bit so they’re not too confusing. You can see that this guy, TE-1, goes and loads a whole bunch of things from various places. This guy, TE-3, loads a bunch of things from different places. But if you’re looking carefully, you’ll notice that actually, they’ve got the same data in some cases.
So, atom 6 has been replicated onto both of them. This -- at least atom 6 has -- and so, what’s going to happen, if they mutate them, is you’ve got a problem, right? Now you’ve got a situation -- this is the hard problem, this is where a lot of the kind of -- yeah, sort of two-phase commit stuff comes in, right? How do you deal with the fact that these two sort of separate transaction engines that are, by definition, highly asynchronous, extremely decoupled -- you know, kind of carrying on in their own way, and suddenly they’ve both got the same atom and they both want to update that same atom and how do you deal with that? What we don’t do is try and fix it at commit time, okay? So, in so many ways, this system’s an optimistic system. But as relates to update conflict, it’s a pessimistic system. What it does is that you have basically a distributed serialization service, which has got a bad name, as well. It’s called a chairman. That makes it sound more powerful than it is. It’s really just a serialization service, and it’s distributed and it’s got fail over mechanisms and so forth. And what happens is both of these guys, these two nodes, these two engines, TE-1 and TE-3, send messages to the serialization service, asynchronously, to say, “You know what? I’m changing this thing. Let me know if that’s a problem.” Sometime in the future, they might get a message back to say, “You’ve got it!” Might get a message back to say, “You didn’t get it,” right? And, by the way, again, they’re autonomous. They can do -- theoretically, they can do anything they want with that answer. I mean, you know, they can’t update it, but they can, at that point, hand back a transaction fail to their -- to the application. 
They can do all sorts of things. In practice, what they tend to do is wait to see whether the winner actually rolls back, in which case they can just continue, because you don’t want to get into livelock, and you guys are probably quite familiar with those kinds of issues.
So, what happens is, in this case, TE-3 lost, okay? Went to the serialization service. The serialization service said no. So, TE-1 goes ahead and does its commit or its rollback.
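A toy version of that serialization service, with an invented `Chairman` class: the first engine to ask for a given atom wins, and concurrent contenders are told no until the winner commits or rolls back:

```python
class Chairman:
    """Invented sketch of the per-atom serialization service."""
    def __init__(self):
        self.owners = {}                 # atom id -> engine holding the write

    def request(self, atom_id, engine):
        # First asker wins; everyone else gets "you didn't get it."
        current = self.owners.setdefault(atom_id, engine)
        return current == engine

    def release(self, atom_id):
        # On commit or rollback, the atom becomes contendable again.
        self.owners.pop(atom_id, None)

chairman = Chairman()
won = chairman.request(6, "te-1")        # TE-1 asks first: "You've got it!"
lost = chairman.request(6, "te-3")       # TE-3 asks second and loses
```

In the real system these requests are asynchronous messages, pipelined across millions of much finer-grained items, not a synchronous call per atom as shown here.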
Andy: And so, to put my professor hat on, would you describe (inaudible) locking?
Barry Morris: It’s -- I’m not going to argue with that. No, I -- and it doesn’t make me uncomfortable, but I think what is interesting -- I mean, the -- there’s a lot of interesting pieces to it.
Barry Morris: I mean, you could -- the reason I’m -- I have no problem agreeing is because, by definition, you know, anything that does what I just described, that does a kind of a conflict avoidance is some kind of a lock. It has to be, right? So, there’s -- there -- but I think that, at -- what’s interesting here is it’s very asynchronous, right? You know, what’s really happening there, I -- this is very simplified kind of toy town thing, you know? You’ve -- you’re -- actually got millions, typically, of atoms for a particular query. An atom, by the way, just so you know, is probably 50 to 100K in size, okay? So, they’re not that big. And so, you’ve got millions of atoms. And then, this kind of -- this kind of conflict resolution doesn’t happen at atom granularity. It happens at much finer granularity than that. But what’s -- so, what’s happening is that the query is most likely trawling through very large numbers of atoms and sending out these messages, these asynchronous messages to say, “Have I got it? Have I got it? Have I got it? Have I got it?” And then, it’s sitting there waiting for the responses, right? And so, there’s a lot of kind of pipelining and interleaving of it, and not very much, you know, sort of synchronous waiting, right? So --
Andy: It’s optimistic or asynchronous?
Barry Morris: Yeah. Okay. Yes. So, that’s pretty much all I wanted to say about some of the deep-down stuff, because I think there’s going to be a lot of questions, or maybe some questions, so -- but what I did want to do is to just give you a kind of a sense of, okay, so why is this so exciting to users? And remember I started with this idea or this chart of kind of just being able to add nodes and it just goes faster, okay? And so, imagine for a moment -- here we’ve got a system, single database -- and, by the way, I should have said earlier: at all times, because of this -- the structure of this thing, it is a single, logical database. The application, the JDBC application only sees a set of tables and a set of rows and a set -- that’s it, right? It doesn’t -- there’s no kind of notion of partitions or anything else at the application level. And, in fact, those things, those kinds of partitions, and other topological changes could be happening while the application’s running. It’s completely independent and orthogonal. So, the application here is running, it’s got, you know, a million users or something cranking away, and it turns out that, you know, we’ve hit our limits. We’ve hit our performance limits, let’s say, and so, we want to get more throughput.
And so, what happens? Well, you know, when you walk through it, it’s actually amazingly trivial. Once you’ve got a system where you can add nodes and take nodes away at will, these answers are all very simple, right? What happens? You bring along your new machine. You’ve just gone and bought it at Best Buy, you take it, you fire up your Linux, you install our stuff, you’re connected in. You have to give it some credentials so that it can connect into the database. It’s there, it sits there, okay? I already told you, it actually does something else. It loads atom 1, right? And it sits there. It’s waiting, all right? It’s not doing anything. It’s not being told to do anything; it waits. It waits until this guy decides, “Oh, you know, I now want to make a connection.” The broker takes a look and says, “Gee, there’s a new transaction engine in town. And the guy’s not very busy. I think he needs some work.” And so, boom, you’ve bootstrapped it, right? Immediately, that connection happens, some kind of query gets issued, demand loading of all the atoms that you need, off you go, right? This can happen in, you know, 50 milliseconds or whatever it takes to fire up a Linux process, you know? Sometimes, people are so excited about what some of the cloud architectures can do, but when we do this on Amazon, we’re sitting, waiting for them to give us a machine, because the database comes up like that.
I’ll give you an example, which we did as a kind of almost provocative thing. Hewlett-Packard came to us a couple of years ago and said, “We’ve got this thing” -- it was called a Moonshot box, which is -- I don’t know if you’re familiar with it, but Moonshot was this kind of high-density server. So, basically, it’s a 4U rack with, like, 45 little micro-servers in it, each with eight gigs of memory and an Intel Atom processor and all this kind of stuff. And they said, “Gee, we want to take this to market, we’re going to do this big launch. You guys are a distributed database. Why can’t you guys run some great benchmark on it for us?” And we said, “Well, because those are not very good processors and -- you know, it’s fine, but for the same money, you could actually go faster on your older machines. We’ll tell you what we can do.” We said, “Let’s break the world record for the number of databases running on a $60,000 server.” And they said, “Well, okay, whatever that means.” And so, we built an emulator for WordPress, which was -- you know, basically each WordPress account had, you know, 1,500 entries and this amount of kind of access and update and everything else. And we ran a NuoDB system across that Moonshot box. We ended up running, on that box of 45 servers, 60,000 databases at the same time -- it was actually more, but let’s call it 60,000.
And you ask, what are we really doing? Well, actually, we were cheating. We were cheating because it wasn’t 60,000 running at once. It was 6,000, which is still a gigantic number, okay? Six thousand at any given moment. In practice, what we were doing was hibernating the databases, because it’s so quick to start these things up that what we would do, between queries, is just say, oh, this guy doesn’t seem to have been doing anything for 10 seconds. Shut down the database; we can get it up in much less than a second, okay? And we have a patent on database hibernation as a consequence. And so, when you’ve got a database system that’s so lightweight, so easy to fire up nodes and quickly, right? It changes the dynamics completely. We also did something, just parenthetically, I say, with it, which is that we said, “Guys, we’ll do something else for you. Put one of your big servers -- whatever they’re called, DL380s or one of their big servers -- next to it, adjacent to the Moonshot box. And what we’ll do is when one of these databases gets hot, we’ll move it without taking it down,” right? What did we do? Well, we noticed, obviously, it gets busy. We go to the other machine, which is on the same network. We fire up an engine, a TE, right? We redirect the work to there and we shut down the old TE on the little machine, right? From the user’s perspective, they don’t see anything. They’ve just moved onto a bigger machine without missing a heartbeat, without losing any data or anything, okay?
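A rough sketch of the hibernation policy described, with an invented `hibernate_idle` helper: anything idle past the timeout is shut down, on the theory that restart is sub-second:

```python
def hibernate_idle(databases, now, idle_timeout=10.0):
    """Return (databases to hibernate, databases left running).

    databases maps name -> timestamp of last activity; anything idle for
    idle_timeout seconds or more is shut down, since it can be restarted
    in well under a second when a query arrives.
    """
    asleep = {name for name, last_used in databases.items()
              if now - last_used >= idle_timeout}
    running = set(databases) - asleep
    return asleep, running

# Three WordPress-style databases with different last-activity times.
dbs = {"blog-1": 100.0, "blog-2": 108.5, "blog-3": 91.0}
asleep, running = hibernate_idle(dbs, now=110.0)
```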
So, that’s called database bursting. We have a patent on that, as well. So, adding a TE -- you know, people talk about, “Well, yeah, I’ve got this great sharded system, you know? And I can actually put it onto more shards.” Yeah, but to do that, you have to go back and repartition and do all sorts of stuff. We can just add as quickly as we like. One way of thinking about what’s going on at this top level is that this is a kind of self-organizing, in-memory partitioning, this demand loading that goes on. It’s like a self-organizing, in-memory partition, and they’re overlapping partitions, obviously. So, that’s pretty cool, all the usual stuff. What about losing a TE? Well, we already said these are DRAM systems, right? They don’t touch the disk. All of the data that’s sitting in that transaction engine right now is one of two things. It’s either committed, in which case we can lose the TE, or it’s not committed, in which case all that’s going to happen is that the client is going to get a transaction fail, because there’s going to be a connection fail. It’s going to turn into a transaction fail, and the application’s going to have to retry the transaction, which it should be designed to do in the first place. So, you can always lose a TE and there is no consequence, okay?
Now, there are some -- there’s quite a lot of under-the-covers kind of stuff that we have to do to kind of reconstruct some of the system state that’s distributed and so on. But in -- but to -- first approximation, you can go along and you can shoot any TE any time. You can shoot all the TEs if you want to and restart another one, and you’ll never -- you -- so, you’re not going to lose a transaction, you’re not going to lose any data. You’re going to keep going.
Q: So, how do you deal with the spikes in latency when that happens… how do you keep those latency spikes down?
Barry Morris: It’s always going to be a challenge to do things like that. We tend not to see significant latency spikes of the sort that you’re describing. You know, when a restart is as quick as it is, then, you know, it’s not like you’re sitting, you know -- you’re -- that -- a node goes away, okay, there’s going to be a delay for a reconnect. There’s also going to be a reduction in capacity until we’ve fired up another one, right? And you can have different strategies. Some people will deliberately kind of over-provision, will run an extra TE or two, you know, whatever. But I think the main -- major latency here is the reconnect. It’s still sub-second, you know? So, TEs can go away.
Now, let’s talk about adding a storage manager. So, we’ve got the system running, and I walk up to it with my shiny new machine and I say, “Oh, you know what? I want another storage manager. Why? Because I think it’s more reliable to have redundancy,” or “I want to partition it differently,” or something. So, I stick the storage manager in and connect it up, again supply the credentials, and what does it do? It takes a look and it says, “Oh, jeez, guys! You guys have got 10 million objects and I’ve got one,” or zero, whatever it is, okay? There’s a problem. It starts loading them, right? How does it load them? Same deal. It loads them from memory if it can. It loads them from wherever’s the fastest place to load them from, okay? So, it just sits there until it’s caught up, right? So, the new storage manager comes online, it figures out what objects it needs to load, loads those objects, and then it puts up its hand and it says, “Guys, I’m ready. I can take part.” Suddenly, everybody else that wants to load objects has another option of where to get objects from, okay? So, very straightforward. Again, there’s detail there about the configuration -- you know, what do you want it to store? Do you want it to store the whole database, or the part of the database for this continent, or whatever it is? But that’s the basic idea. And also, by the way, this also works for partial catch-up. So, you could, for example, say, “I want to have a poor man’s back-up. I’ll just switch off one of these storage managers and put it in my sock drawer for a couple of weeks.” If I bring it back and put it into the system, it’s going to take a look and say, “Oh, I need to catch up.” It’ll catch up and get going again.
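The catch-up behavior can be sketched as follows, with invented names: the new storage manager diffs what the system holds against what it has, loads the missing atoms from whichever peer has them, and only then declares itself ready to serve:

```python
def catch_up(new_sm, peers):
    """Bring a new (or stale) storage manager up to date from its peers.

    new_sm and each peer are dicts mapping atom id -> atom contents; peers
    stand in for TE caches and other SMs, in order of preference.
    """
    wanted = set()
    for peer in peers:
        wanted |= set(peer)                      # everything the system has
    for atom_id in sorted(wanted - set(new_sm)): # only what this SM is missing
        source = next(p for p in peers if atom_id in p)
        new_sm[atom_id] = source[atom_id]        # fastest place that has it
    return "ready"                               # "Guys, I'm ready. I can take part."

sm1 = {1: "a", 2: "b", 3: "c"}      # an existing, complete storage manager
te_cache = {2: "b"}                 # a TE with some atoms already in memory
new_sm = {}                         # the freshly attached machine
status = catch_up(new_sm, [te_cache, sm1])
```

The same loop covers partial catch-up: a storage manager pulled from the sock drawer starts with most atoms present and only fetches the delta.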
Once it’s connected, by the way, and up and running, it is fully a peer of the others. It’s not secondary, right? This isn’t a master and a slave or whatever. Once it’s up and going, you can just switch off the others, right? Let’s suppose it’s one that you configured to have all the objects. You literally can add this one in, have it catch up, and shut all the others down. If you think about that, that means you can move a database while it’s running. I could start that up in the Amazon cloud, for example, into my database that’s running in my data center and wait. At some stage, it’s going to say, “I’m ready and I’m running” and everything else. I can fire up some transaction engines in the cloud; shut everything down in the data center. My whole database is still running. I’ve kept all my data. I was running it at a million transactions per second. I’m still running it at a million transactions per second. People talk about data gravity as being one of the big problems. We’re, like, yeah, kind of, right? We don’t really think it’s a problem when you’ve got this kind of architecture.
So, okay, marketing term. Data gravity is this idea that now we’ve got more and more dynamic systems: you can run Docker elements anywhere you want, and micro-services can be redeployed, and all that kind of stuff. But data, you can’t move, right? Data gravity’s this idea that data is welded to the floor in some way, right? And so, in the wonderful cloud stacks that we’re building, everything is dynamic and everything is movable and everything can move around, but data can’t. Data is stuck where it is, right? And so, data has gravity, and it’s being pulled down to earth. And what we’re saying is, well, yeah, if you’ve got an old-fashioned database, that’s true. And this is kind of an obvious one: losing an SM, you know, depending, again, on how it’s configured, but you presumably configured redundancy into your system. Your disk drive crashes. SM goes away. Operator gets told, “Fire up another SM,” end of story.
So, just to summarize: what you’ve got here is a system that’s a rich ANSI SQL database system. If you look at who our customers are, they are literally moving Oracle and DB2 and SQL Server and all that stuff onto us. Why? Because it is not a lightweight SQL implementation that falls over when you do joins or something like that. It’s a rich ANSI SQL implementation, full ACID semantics, MVCC style. It’s cloud native. All of this stuff that we’re talking about, cloud native. By the way, all driven by REST APIs. This stuff we were talking about earlier, about what happens in the admin layer? The admin layer exports these REST APIs -- fire up an engine for me here, fire up an engine for me there, shut down that engine over there, you know, reconfigure -- it’s all REST APIs, it’s all manageable using Kubernetes or your favorite scripting language. Very cloud native, very kind of scale-out and elastic. In-memory performance, you know -- as I was talking to Andy about earlier, people talk about, well, there are these other sort of elastic SQL database products. And the answer is, yeah, okay, let’s talk latency for a moment. We’ve got in-memory latency, right? Really, really high speed. And that thing that we showed earlier, when I showed you that chart and I said that our database-level latencies are much less than a millisecond. At the time that we did that test, we actually compared it to a well-known document database, and we found that we were 10 times faster, and that was supposed to be a really fast database, okay? So --
Andy: Well-known document -- (laughter)
So, on-demand capacity, we talked about that. Continuous availability, more and more important. All applications nowadays are 24 by seven. Admin automation, I talked about what that’s all about. And what we haven’t really talked about is the fact that, other than the commit protocol -- which, as Andy quite rightly says, can be very important -- almost everything else in the system is asynchronous. And what that means is that we’re much, much less sensitive to poor-latency networks. If you want to put it a different way, that means we can run in multiple Amazon regions. We can run in multiple data centers, and it really is a case of: if I’ve got a bunch of nodes running in, you know, Manhattan, I can walk over to a Jersey City data center, add some nodes into a database that’s running in Manhattan, and instantly I’ve got a distributed database running across that network. Are there trade-offs there? Of course there are. But it will do it a lot better than anyone else will, because it’s happening at this kind of atom replication level.
So, that's really all I have in terms of presentation. I started off saying bear with me, I'll try and build this up and build this up so that by the end of it, you can feel like at least you know roughly what the language is to use if you want to ask me any questions. And I have left some time for people to ask any -- yeah?
So, the question that was asked was: I mentioned that you could add a storage manager in the cloud or somewhere else remote, and that it would just gradually copy across all the atoms, and then, after that, you’re running hybrid or you could run in the cloud. And that’s true. The question was, well, what if that’s a very large database? At that point, that’s really a matter of network bandwidth. There’s not much we can do about fixing the network bandwidth for you. But what we would say is you’re better off just picking up the disk drive, you know, walking across to the -- to wherever you’re wanting it to be, plugging it in over there and letting it catch up on that end, okay? Or you can leave it to run and catch up over a long period of time. But there’s not much that we can do other than that. Yeah?
The durability of the transaction. So, this is one of the areas that I didn't touch on very much. I mentioned that after a transaction engine has completed the transaction and the -- let's say the application is issuing a commit. At that point, what happens is that a commit protocol message is sent out, okay? And it's queued behind all the other messages that are being sent out. The exact semantics of commit in this system are tunable, right? So, as part of the configuration of the system, you can say, "I want commit to mean K-safety in memory," right, at the low end. Or you can say, "I want it to mean K-safety on disk." Or you can say, "I want it to mean safety in multiple data centers" -- there's various things that you can do. And essentially what's happening under the covers is that it's really about understanding the acks, you know? What's coming back from whom, when, right, and at what point is it determined that the transaction is committed, right? So, you can dial it in if you want to, and we've had kind of military-type customers that have said, "We need it to be on oxide on three disks. That's what commit means for us, right? And the old-fashioned databases can't do that. You know, can you commit a transaction on three separate disks?" Yes. We've had other people say, "You know what? Yes, my stuff is transactional, but it's not long-lived. I don't care, you know, I think that my DRAM is perfectly safe. All I need is, you know, three copies in memory and I'm done," right? That also works.
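One way to picture this tunability is as a predicate over the acknowledgements that have come back so far: commit is declared once the acks satisfy the configured durability policy. The following is a minimal sketch under my own assumptions -- the ack format and the policy vocabulary are invented for illustration, not NuoDB's actual protocol.

```python
# Minimal sketch of a tunable commit predicate: the transaction is
# considered committed once the acks received so far satisfy the
# configured durability policy. Ack and policy formats are invented.

from collections import Counter

def is_committed(acks, policy):
    """acks: list of ("memory" | "disk", datacenter) tuples received so far.
    policy: dict of minimum counts, e.g. {"memory": 3} or {"disk": 3}
            or {"disk": 1, "datacenters": 2}."""
    kinds = Counter(kind for kind, _ in acks)
    datacenters = {dc for _, dc in acks}
    # A disk ack implies the copy is also safely in memory.
    if kinds["memory"] + kinds["disk"] < policy.get("memory", 0):
        return False
    if kinds["disk"] < policy.get("disk", 0):
        return False
    if len(datacenters) < policy.get("datacenters", 0):
        return False
    return True

# "Three copies in memory and I'm done":
assert is_committed([("memory", "nyc")] * 3, {"memory": 3})
# "On oxide on three disks" -- one disk ack is not enough:
assert not is_committed([("disk", "nyc"), ("memory", "nyc")], {"disk": 3})
```

Under this framing, the "K-safety in memory" and "three disks" policies from the talk are just different thresholds fed to the same ack-counting machinery.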
Yes, the commit protocol, there is a requirement to wait for response, yeah, absolutely.
Q: Okay, then you can see -- like, for example, you said that latency is very low, but how do you deal with stragglers?
Well, so, let's be clear. You've asked two different questions, okay? The one is: on commit, if you're needing to wait for a response, there's a possibility that through network latency issues, or the machine being busy or something, it's going to take a while -- yes, okay? That's what a commit is. We will have to wait, and typically, when people are running, for example, multi-data center, you try to make sure that your commit is not happening across a WAN, right, and you can do things to configure that. So, that's -- but, yes, you and I could easily construct a scenario in which there are commits trying to run across a WAN so many times a second, and that's not going to run very well.
Your second question, about stragglers, I think I need to reframe slightly, because it depends exactly on what you mean by commit. Now, if you said a commit means that all of them need to come back and say they've got it -- which nobody does, nobody does that -- then, yeah, you're going to go as slow as the slowest soldier, basically, right? That's not what you do, though. What you do instead -- a default commit might be: I'm going to get a commit back from the first storage manager that responds to me, and I'm going to consider that committed.
Barry Morris: Well, no, because I'm waiting for the transaction commit. So, if I don't get the transaction commit, the transaction isn't committed, and you get a failure back up to the application, okay? So, there's different ways in which you can configure it. You can configure it, as I say, to say I want it back from two transaction managers, or from two transaction managers and three in-memory guys, whatever. Part of what's hard to understand here, and maybe is what you're touching on, is that we really don't need everybody to reach consensus in order to commit. What we need is a sort of baseline that's agreed on, on what that commit is. And we figure out the rest in housekeeping, okay? Yeah?
Q: What (inaudible) the number (inaudible) in your DB, like the (inaudible) and the broker system? Are these absolute separate systems or are they reusing them as?
Barry Morris: A huge amount of the stuff that's core is extremely time critical and is not offloaded to other processes. The core engine is a single executable and does a huge amount of the stuff itself, including all the network protocol stuff and whatever, right? There are other things that are separate that are not time critical, you know? And so, this is why we've got the separate kind of Rafted admin system that takes care of some other things. But no, if you're talking about things like how we do the network messaging, it's in effect pub-sub, but it's not an off-the-shelf pub-sub system. That is basically hand-coded in order to get the kinds of performance that we're looking for.
Andy: And one more quick question.
Q: To what extent can you tune transaction engines in the same database differently, so that one, you know, has a store and another has (inaudible) or something like that? Does the autonomy of the atom system enable that heterogeneity…?
Barry Morris: Yes, and I'll give you a concrete and useful example. So, first of all, yes, there's no assumption here that these are the same class of machines, the same number of cores, the same DRAM, the same anything. That's part of the fun of it. There are times when that can be really useful. For example, let's suppose I want to run what's called an HTAP workload, you know? Sort of hybrid transactional/analytical. So, you've got an OLTP workload that's, you know, churning away at, like, a million transactions a second. Someone comes along and says, "Well, I actually want to do some analytics on this." If you're on the cloud, you can say, oh, great, I'm going to grab, for an hour, a machine with a terabyte of DRAM, right, to set up a TE. I might actually start a new storage manager if I want to, just in order to handle that analytics workload. Again, because it's all MVCC, any long-running analytic query is really going to be running against old versions anyway, and so there's not going to be a lot of chatter. And so, there's some very useful kinds of things like that, where you can say, okay, once a day or whatever, I'm going to fire up a big machine, you know, I'm going to do my analytics or my incremental stuff. It's going to have marginal effect on the rest of the system. Shut it down again and go home.
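The MVCC point above -- that a long-running analytics query keeps reading the versions that were current when it started, so it barely contends with the OLTP stream -- can be sketched like this. The version-chain representation here is my own simplification for illustration, not NuoDB's actual atom format.

```python
# Simplified MVCC snapshot read: each record keeps a chain of
# (commit_timestamp, value) versions; a query that started at time T sees
# the newest version committed at or before T, regardless of later writes.

def snapshot_read(versions, start_time):
    """versions: list of (commit_ts, value) pairs, oldest first."""
    visible = [value for ts, value in versions if ts <= start_time]
    return visible[-1] if visible else None

balance = [(1, 100)]      # version committed at t=1
analytics_start = 2       # long-running analytics query begins

balance.append((5, 250))  # OLTP keeps committing new versions...
balance.append((9, 300))

# ...but the analytics query still reads its stable snapshot:
assert snapshot_read(balance, analytics_start) == 100
# A query started now sees the latest committed version:
assert snapshot_read(balance, 10) == 300
```

Because the old versions are immutable, a big analytics TE can chew through them without ever blocking, or being blocked by, the writers.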
Q: And that decision is made by the broker, to spin up the machine for particular queries?
Barry Morris: No, the decision is made by you calling a REST API. You can configure that into Kubernetes or something else if you want. We, as a company, moved away from what we were doing, which was kind of policy-based management, in which we were doing a lot of that stuff, and we said this intelligence is moving, actually, into kind of cloud-management systems. And so, we'd rather work with those guys and let them drive that stuff as part of overall cloud optimization.
Andy: Thanks Barry. (applause)