Barry Devlin, 9sight Lead Consultant, discusses "The Future of Data" and speaks on cloud related topics from his recent book "Business unIntelligence: Insight and Innovation Beyond Analytics and Big Data."
(Eric): Hi everybody, welcome. Thank you, I appreciate it. OK, we’re going to have our presentation started just momentarily. Before we start, everyone make sure to note your menus are in front of you, because there’s going to be a waiter walking around asking you what you would like to eat, and you could also just sign off on the cards if you don’t want to be bothered by the waiter, and just hand them the card, that works too. Whatever works.
We’re going to have a double presentation today, with two awesome speakers, Barry Morris and Barry Devlin. We’re going to start off with Barry Devlin, who is a world-renowned analyst and the author of an awesome book -- hang on -- Business unIntelligence. Please welcome Barry Devlin. (applause)
(Barry Devlin): Thank you Eric. It’s great to be in New York. It’s great to be working with the guys here in NuoDB. It’s -- I’m here talking about the future of information, which is a small topic that just about fits in the 30 minutes that I have. So, you’ll be glad to hear that I’m just not going to be here all morning. You’ll have time to hear the other Barry. I don’t know if you know, but there used to be an old comedy series in the UK called “The Two Ronnies.” I think we’re going to be the two Barrys here today. In “The Two Ronnies,” there was a sort of a straight guy and the funny guy. I think the other Barry is the funny guy here.
So, I’m going to talk here about some acronyms as we go through. Now, I have to admit, I spent many years working for IBM, and that’s why three-letter acronyms, or TLAs were very popular there. Every product had to be a TLA, a three-letter acronym. Hey now, it’s the 2000s, so now we have multi-letter acronyms. We’re working our way up the alphabet. The first two acronyms here, IDEAL and REAL are mine, and they come from this book. The third acronym which is HTAP, comes from Gartner. And you can tell the difference here; my acronyms are pronounceable. Gartner acronyms, I’m afraid, are just -- I don’t know how they came up with them. But we’ll talk about HTAP, which is Hybrid Transaction/Analytical Processing. So the other ones, I’ll tell you what they are later.
So let’s talk about the future of information. But let’s first of all talk about the past. When we talk about information, the past goes back quite a distance. Think about the 70s, think about the 60s even. There’s a few of you who have been around from the 60s, eh? Not very many. I’m among the oldest here. It’s really just depressing at this stage. So back in the 80s, we talked about running the business and managing the business.
And back in the 80s, I wrote the first paper on data warehousing, which the picture up on the top right is from. It’s from the first paper that talked about the architecture of a data warehouse, and it was saying that we needed to have a separate system to look at, to manage, to analyze, to report on the business, and I think that’s been a very key theme and a very ongoing thought right since then, and perhaps even before then. Extract the data from the operational systems, as we see here, and put it into a data warehouse; put it into a separate data warehouse in order to manage the business. And back then, as we talked about this data warehouse, we were looking at being driven by a single version of the truth, consistency, and tactical decision making. And that was the work of that time. We were also talking about running the business, and transactions and records and read-write types of activities, which happened at speed and were about taking action.
Now, when you think about that, that introduces a very big gap here between the operational environment and the data warehouse, or informational environment. And we’ve spent many years doing ETL; we spent many years working across -- working our way across that gap in order to get data from the operational into the data warehouse systems, right? You guys have done it, yeah?
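The gap described here can be sketched in a few lines. This is only an illustration under stated assumptions: SQLite stands in for both systems, and the table names and the helpers `make_operational`, `make_warehouse`, `run_etl`, and `warehouse_total` are hypothetical, not anything from the talk. The point it shows is that the warehouse only knows what the last ETL run copied across.

```python
import sqlite3

def make_operational():
    # Hypothetical operational system: one row per sale.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
    return db

def make_warehouse():
    # Hypothetical warehouse: a separate store holding a summarized copy.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL)")
    return db

def run_etl(operational, warehouse):
    # Extract and transform: summarize the operational rows.
    rows = operational.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    # Load: replace the warehouse snapshot in one transaction.
    with warehouse:
        warehouse.execute("DELETE FROM sales_by_region")
        warehouse.executemany(
            "INSERT INTO sales_by_region (region, total) VALUES (?, ?)", rows
        )

def warehouse_total(warehouse, region):
    row = warehouse.execute(
        "SELECT total FROM sales_by_region WHERE region = ?", (region,)
    ).fetchone()
    return row[0] if row else 0.0
```

Any sale recorded after `run_etl` stays invisible to `warehouse_total` until the next run, which is the latency that the separate-copy architecture builds in.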
And the key point I wanted to make here is that, this architecture was driven by both business needs and technology limitations. So the business needs of the 80s and the 70s said, I don’t want to see the minute-by-minute changes of the business. What I actually need to see is I need to see a consistent view of what’s there and how does it work, and get me a trend and an overview. The technology limitations, well, basically, back in the 80s, these were typically hierarchical databases, or perhaps even flat files. This thing was built, first of all, in a relational database.
Hey, let me let you in on a secret. Back in the mid-80s, when I was working for IBM, one of the reasons that we decided to put the data warehouse onto DB2 at the time was because it wasn’t fast enough to run any operational system, so it was -- the only thing that it was good for was doing the rather slower, and the rather longer-term analytical work. These guys have of course moved on. Most operational systems today are relational based. And we have an enterprise data warehouse, or a data warehouse, which is relational based, but we still have the gap between them. And of course, the relational databases have improved. They got faster, they got more, especially as we’ve got the analytical appliances, we got more power there; we’ve got more things to play with.
And as we look at the business side of it, what’s changed? Well, what has changed, I believe, is I call it the “biz-tech ecosystem.” The biz-tech ecosystem is saying one thing. That is, that when you look at business today, all innovations, all advances are driven by technology. No matter what you think about in terms of moving forward with your business, in terms of new opportunities, in terms of changes or processes, they all depend on technology. And so we have to have this symbiosis that we haven’t had in the past between the technical folks and the business folks. Hey, that’s a new idea. No, it’s not; it’s an old idea, but it’s a new thought that we might actually have to do it eventually.
I suppose, are there any technical folks? Are there any IT folks here in the audience? Oh come on, there must be more of you. All these things are always full of IT folks. So, we’ll talk really about technology over the next few slides, so you’ll be happy, right? But this slide is for the business folks. Look at what’s going on here. At one level, we’ve got this entire change in the market, market flexibility and uncertainty. Huge level of competition. The idea of strategic planning, it’s dead, especially here in New York, right? It’s, let’s plan for tomorrow if we can.
So competition, market flexibility, it’s so fast, moving so fast these days that the idea of tactical decision-making and longer-term decision making is not so interesting. What we get driven by is immediate needs and immediate decisions. We have an enormous amount of externally-sourced information, information that we never had before, from social media, from the Internet of Things coming along. And I put a few Internet of Things around the outside here, you know, getting data from the automobiles, monitoring systems driving changes in the health market. Of course, mobile devices are changing everything. And then there are the household appliances; I’ve just shown here the internet-connected refrigerator, which is a big, big advance as you know. By the way, you do know that the first hacking of a refrigerator has happened. Now who would want to hack a refrigerator?
AU: My dog.
(Barry Devlin): OK, well that’s one way. I didn’t think he was that clever.
So, we’ve got a huge change going on here in the marketplace in terms of business needs and in terms of what’s going on, and that needs fast decision making. We’ve got the need to take appropriate action as fast as possible.
So what does that mean? We have to bridge the gap that is between operational and informational systems. So, I’ve been thinking about this for a while and back around 2008 what I decided to do was to think about this idea that we needed to move from this thought of having multiple copies of information, because there’s so much out there, as far as possible to have a single, if you like, logical space that contained all of the information of the business, as close as possible to a single copy. I’m not saying that’s always possible, but certainly to put the focus on reducing the numbers of copies of data that we have in our business down to a single, if you like, a single instance, as far as possible.
I drew this information space, and I call it the “IDEAL architecture,” which is integrated, distributed, emergent, adaptive, and latent. Now, integrated, distributed, emergent and adaptive I guess you know about. Latent, I’m just saying is hidden. Latent means hidden. And what I wanted to say is, yes I’ve said that this is what it might look like, and these axes tell you interesting things about information and the structure, and how you use it, and how you play with it, and how you manage it, but it’s latent because I’m never going to implement it like that; this is a conceptual idea of how I need to think about information and to think about how it works together.
So, when we think about that, we start to drive down a little bit further. And this idea if I drive further is, if you’ve got this multi-dimensional space, that’s too difficult for most people to handle, so I broke it down into three particular areas that are very different in terms of information. Machine-generated data, that’s the stuff that comes from the Internet of Things; that’s the stuff that comes from the logs of your servers, and so on and so forth. It’s, as it says, generated by machines. And it has a certain set of characteristics which we don’t have time to go into over breakfast, but it makes it different from the other types.
Human-sourced information, the newest one if you like is the stuff that’s coming from social media, and driving an awful lot of the changes in the business area, and making us believe that we can understand what’s going on in our customers’ minds. If you really think that works, that’s fine by me, but yeah, we’re doing that; we’re taking the social media and we’re bringing it in.
The piece in the middle, which is process-mediated data, is the old stuff. Process-mediated data is the stuff that we gather in from transactions, and that needs to be correct because it is the legally-binding basis of what we do for the business. If we lose it, if we get it wrong, we’ve got a problem. We’ve got lawyers on our backs. We’ve got all sorts of issues to deal with if process-mediated is wrong. These two, you know, if we get them 80% or 90% right in many cases, it’s good enough.
So there’s lots of different characteristics here, but I’ve sort of come up with these three different types of information that you can think about in order to understand what it is that you want to do with data and how you need to manage it.
OK, so let’s take it down a little bit deeper. Let’s drive down a little bit deeper into what I call a logical architecture. A logical architecture is getting as close to understanding how we might implement it. So there’s a couple of things I wanted to say about the logical architecture. First of all, the acronym, REAL. It’s Realistic; it’s Extensible; it’s Actionable, and it’s Labile. How many of us know what the word “labile” means? Yeah, I thought so. I had to dig that up out of the dictionary in order to get REAL to be the acronym, because it means “flexible.” And the acronym REAF you can’t really pronounce. So labile, flexible, think about it. There’s a new word you can impress your friends with. “We’re really labile when it gets to implementing our systems,” or, “We’ve got labile systems.” You can play with this. Hey, I’m giving you useful stuff over breakfast, right?
So “labile” means it’s flexible, and it’s flexible in this sense: remember we went -- sorry, can you remember that we had that very layered architecture on the first slide, with the operational system, we moved the data through the layer into the data warehouse; we moved the data through the layers into the data marts. Hey, look, this is in pillars. This is an important change. We’ve just turned this architecture around 90 degrees. And it’s important because what it means is, that we can have data flowing as fast as needed through this architecture, so measures, events, and messages. These are the things that happen out in the real world, OK? We’ve got events that happen, we pick them up from, let’s say, logs. We’ve got messages which are communications between people. We’ve got measures, which are saying, yeah, this is the temperature of New York today, whatever it happens to be. We’ve got all of this information found here, and we bring it into our system here. In this system, we have the ability to have different types of information. I showed you the three types before. Here’s our machine-generated, here’s our process-mediated, here’s our human-sourced.
Now some of you may have heard of a “Data lake.” Have you heard of a data lake? Do you believe in the data lake?
AU: We heard “Data sea.”
(Barry Devlin): Data sea. I think “data swamp.” I don’t think it’s possible to put all data into one system. I think because we are looking at different types of data, and different needs for data, as far as I can see, in the long-term, we’re probably going to have different technologies. And these pillars allow for different technologies, so you can think, well machine-generated data, that could probably -- has to be fast, and I think about a NoSQL environment, I might think of something that is pretty fast. If I’ve got human-sourced information and it’s very flexible, well then Hadoop is a good basis for that.
But for me, up the center, right here at the heart of business, the process-mediated data is always going to be relational, I believe. Now, of course, I’m biased. But that’s my belief, and I invite you to think about why it might be so. Relational has a mathematical, formal logical mathematical structure underneath it that talks about relationships. And relationships are extraordinarily important when I want to manage a business, when I want to be sure that this piece of data relates to this piece of data, and I know how it happens, so the relational model to me is a very key piece.
So the transactions that we take flow through this process-mediated data.
OK, so where is this taking us? How does this work? I want to focus in here now, another bit further, and focus on process-mediated data, and talk about that in terms of what we’re here to talk about today, which is, where is process-mediated data going? Where is this piece of the architecture which was at the core of operational systems data warehouse in the past, where is that going?
So at a logical level, what we have is a unified view. Based on the relational model, and also trying to minimize the number of copies. I said that earlier on as a principle; I want to minimize the number of copies of data. Why? Because managing multiple copies of data is rather expensive and time-consuming because if I’ve got multiple copies of data, I’ve got to understand are they the same, are they different? How do they deal with one another? How do they work together? And also if I’ve got multiple copies of data, I have to copy the data. I (inaudible) extract and transform, but I’ve got to do work in between. And if I want to move to close to real-time decision making, that’s a bit of a problem. So I need to think about, how can I get to an environment that has a single copy of data for the operational real-time analytical pieces?
And this is what this picture is showing. I’m sorry that the words are sort of broken up as we’ve changed resolution on the screen. So transactions come in from the operational processes. We put them into this large box of transactional analytical sort of data. And this is where a hybrid database of transactional analytical data gets to be very interesting. The idea that we could have a single place where we do the operations and the operational analytics, the stuff that we need to do fast, the stuff that we need to do in the moment as soon as it’s there, as soon as the operational transactions arrive, to be able to analyze them, to be able to interact, to be able to take the results of that real-time analysis and bring it back to the operational environment to make changes. This becomes very interesting.
So that’s the sort of idea that there could be a hybrid database of transactional analytical data that we could use, and of course you know that NuoDB is here to tell you about that, and Barry will be going into that in more depth.
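The loop described here, where transactions arrive, get analyzed in place, and the result feeds back into the operational decision, can be sketched on a single store. This is a minimal illustration under stated assumptions: an in-memory SQLite database stands in for a hybrid database (SQLite is not an HTAP system), and the `orders` table, the helper names, and the `credit_limit` rule are all hypothetical.

```python
import sqlite3

def open_store():
    # One in-memory store serves both the operational and the analytical work.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    return conn

def record_order(conn, customer, amount):
    # Operational side: a small read-write transaction.
    with conn:
        conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)",
            (customer, amount),
        )

def customer_total(conn, customer):
    # Analytical side: an aggregate over the same rows, with no copy in between.
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer = ?",
        (customer,),
    ).fetchone()
    return row[0]

def place_order(conn, customer, amount, credit_limit=1000.0):
    # The analytical result is brought back into the operational decision.
    if customer_total(conn, customer) + amount > credit_limit:
        return False
    record_order(conn, customer, amount)
    return True
```

Because the aggregate reads the same rows the transaction just wrote, there is no extract step and no stale copy; whether a real system can do this at scale is exactly the question the rest of the session takes up.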
Now, I’m not telling you that data warehouse is dead, no I’m not. What I’m telling you is that there will still be some piece of that data warehouse called the “core business information repository,” because there will be some things that we want to have a historical base for. There will be some things that we need to keep consistent. We want to do regulatory reporting on some of them. All of those things together, we may need to have some of the old idea of data warehouses and data marts. And we’re talking here about a centralized and distributed type of environment.
So now we come to the Gartner acronym, HTAP. So, this stuff that I’ve been showing you, I wrote about this last year, the early part of 2013. Gartner finally caught up this year, in January of this year, they came up with this research note called “Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation.” Now I think I’ve shown you that already. But Gartner now have agreed with me, so I’m feeling pretty comfortable and confident about myself with probably 69.3% probability.
So, this is the picture they’ve put. And Barry’s going to go into this in much more depth, but they talk about the basis being an in-memory data store. Now, I think this is important. I’m not going to spend too much time on this, but if you want to think about how would we do operational and analytical processing on the same set of data, we have to have a lot of flexibility; we have to have a lot of speed, and that means we have to have an in-memory database. And that’s one of the huge changes that has been emerging, if you like, over the last few years, are these in-memory data stores, the in-memory databases that can enable us to start playing with the thought of analytical and transactional processing in the same environment.
I remember over the years as I was talking about data warehousing and business intelligence through the 90s and into the early 2000s, people would ask me, “Do you ever --” is that for me? So, they used to ask me, is -- do you ever see the day that we can bring back together the operational and informational stuff; I said, no, not in my lifetime. Well, I was wrong yet again because I think we can start to do it now.
So here are some of the things that, you know, that Gartner have identified as part of looking at HTAP, to identify the value of advanced real-time analytics. Because, it’s the advanced real-time analytics that gives you the reason for doing this HTAP processing. And as we talked about a little bit earlier, I didn’t delve into it, but the biz-tech ecosystem drives process change. Look at the way automobile insurance is changing as we enable the cars to talk to the insurance companies, and tell the insurance companies how well or how badly you drive. This becomes real-time insurance, but it’s also an entirely different business process from anything that the motor insurance industry has done before. It simplifies information management infrastructure, because we’re now trying to move this stuff as fast as we can from one environment to the next.
And then, there are a number of challenges to overcome. It’s immature technology; it’s just beginning to come out. There is the established application environment, and the value and the complexity that we need to take into account. And then there’s the coexistence and interoperability of in-memory databases and the traditional technology, so it’s not all plain sailing, although my other Barry will tell you how plain sailing I’m sure it is. But I’m sure there’s work to be done in order to do this.
So this is where I think the opportunity lies; this is where the excitement lies, in the database industry today. This is a word from my sponsors; this is about the book, it’s over there. Some of you have already taken copies. Those of you who haven’t taken copies, please take copies, because Barry doesn’t want to bring them back to Massachusetts. I, personally, am very happy to sign copies for you before I go, and these are first edition, so signed first editions; in the future, they could be very valuable. I will sign. And, more importantly, the stuff that’s in it actually is a much deeper explanation of what I’ve just talked about here very quickly.
So with that, I’d like to just wrap up and give you over to the second Barry. So what are my takeaways from this? Well the first one is that if you start thinking about the conceptual and logical view that I want to have as far as possible just one instance of information, that’s a great place to start. And then look for, what are the business drivers and the technology limitations that ask for more?
The physical implementations change. We’ve moved from centralized to distributed. We moved from SQL to NoSQL to NewSQL. The technical implementations change, but when we work from the logical and conceptual views, as I talked about earlier, that gives us the possibility to have a single place to go.
And finally, the relational model is still best for that core information, and the in-memory, distributed scale-up is what supports the HTAP direction that I’ve talked about.
The picture on the side, who recognizes?
AU: Pandora’s box.
(Barry Devlin): Pandora’s box. And those of you who know the story of Pandora’s box will know that the gods gave Pandora this box and said, “Do not open this nicely wrapped box.” Hey, what a silly thing to do. So Pandora, of course, opened it, and what came out of it? I want the classical education you -- all the evils of the world. But there was one thing left in the box at the end.
(Barry Devlin): Hope. A classically educated man. The last thing that was left in the box was hope. So I hope that leaves you with some hope here for the future of relational databases and database processing. Thank you very much. (applause)
(Eric): Thank you, Barry Devlin. This is his book, Business unIntelligence. Before you guys go, everybody is getting a copy, and you can go to Barry and get it signed. Barry? Barry? OK, so before he -- before you guys go, he will be available to sign it. If anyone wants -- before we move on, though, we’ll do a little bit of Q&A if anybody has questions for Barry Devlin? The guy in the back, right there?
M1: What is the role of graph database -- you talked about how relational databases will go (inaudible) process-mediated data, but is there a role for graph databases here, or where do you think it fits in?
(Barry Devlin): I do think there is a role for graph databases. Whether it’s necessarily part of that core transactional process-mediated data, I’m less sure, but I see it much more (inaudible) in the social media stuff, and in creating some of the more complex interactive relationships that go on, so I don’t think we should ignore any of these technologies. I’m very much technology-agnostic in that sense. I’m thinking, (inaudible) can see what the technology is for the particular business problem you have to solve, and graph data does have good (inaudible) reason to be.
(Eric): We have a question right here in the front.
M2: Thank you for the presentation. I am with the News Corporation. My question simply is, as in the early days when the internet was created with text, photos, and video it got very complicated. So how do you see things going in a video-oriented society with surveillance, management, databases, all that information?
(Barry Devlin): Now, that’s one of the soapbox topics, particularly surveillance, and I don’t really want to go there because otherwise Barry M won’t get a chance. In terms of managing it, I think the difference here for most companies is that the videos represent what I would call the socially-acceptable view of what people are thinking about the world, right? And so you have to interpret it; it’s not real transactional stuff. It doesn’t need to be managed; it’s going to be stored, but are we going to store it forever, I’m not so sure. We might take that one offline afterwards because I think it’s slightly off the topic of what I was talking about here, if you don’t mind.
M2: OK, thank you.
(Eric): We got time for one or two more questions. I see a hand right there. Dave?
M3: Yeah, so, (inaudible). So, you know, I guess, you know, the old (inaudible) relational database model is like a big subscription, licenses, you know, sometimes expensive, and -- but for a new database model, ULA, I mean, (inaudible) open source, and service type of contract, what do you see looking forward? What’s going to be privately most successful maybe for five years, ten years in the future?
(Barry Devlin): It’s a great question. I mean, I always try to avoid these marketing type questions because how you price something is driven mostly by the market. I have a huge personal problem about advertisement-based business models, because I think that that basically involves people selling their information, their personal information in order to drive advertising, and therefore leading to a whole lot of problems around privacy and around surveillance and so on. So I think, my feeling is that eventually, the technical -- the technology industry has to cop onto the fact that if you want to deliver value, you have to invest money to build it, and if you have to invest money to build it, you’ve got to charge money to buy it. It may well be that the old relational database vendors have set the prices too high. It certainly is the case that the driving down of technology prices, the hardware prices, certainly affect how the model works. But I think we have to eventually move back to some sort of charging model in the future. But that’s -- you know, I’m not an expert in that area of marketing stuff.
M4: Hi, (inaudible) data architect. I have such question. I am on the side, of course, absolutely on the side that it’s necessary to start your business [and you to go to logical?] data model, to logical model, whatever will be in the data. At the same time, to look at different models, not conceptually different models, it could be relational model, and again, I agree with the importance of it absolutely. But if to look at some specific model, this dimensional model. And I don’t want to talk about technological advantages, disadvantages, whatever. But one of the advantages of dimensional model, for a person who is not in data business, but in regular business, it’s very often easier to understand. How -- what do you think about your concepts, your lines from point of view of understanding by just business users, which -- (inaudible) for them?
(Barry Devlin): In the context of this session, for me, dimensional models, normalized models are all relational models. And I don’t want to go there, because it’s just -- that’s the debate which will keep us here for the rest of the morning. Just think of it, a relational model can be implemented dimensionally, or it can be implemented in a normalized fashion.
(Eric): OK, that’s all the time we have for questions at the moment, but Barry Devlin will be here for questions after the next presentation, so guys, I just saw a bunch of hands go up. You’ll have a chance; just not right now, please. Thank you very much, Barry Devlin. (applause)
We’re going to proceed with our next presentation, the CEO and cofounder of NuoDB, please welcome Barry Morris. (applause)
(Barry Morris): What a privilege to have Barry with us. It’s not often you get an analyst that’s been a multi-decade distinguished engineer in IBM talking to these kinds of topics. And so he’s got lots of wisdom, and I see most people have got the book. Please take a book if you want, and he’ll be answering some questions later.
I want to take this down this track of this thing that Gartner calls, HTAP, Hybrid Transaction/Analytical Processing. It’s a wonderful Gartner acronym. If you have difficulty remembering it, just read it in reverse, and you can probably remember it better. The paper that Gartner presented earlier this year, essentially said, it switches that. It’s blindingly obvious, and has been for decades, that if you could do analytical processing on your actual transactional data, you’d want to do it, naturally. Everyone’s always wanted to do this. But there are technology limitations and other limitations as to why people haven’t done it. And so HTAP, in a sense, is kind of an obvious thing. But for people that are close to database technology, it’s -- they understand why that’s a problem.
Before I jump into it, I want to just give you another bit of background as to what it is we do, because that’s really the context, and you’re probably wondering what my kind of hidden agenda here is. Let’s just make it a non-hidden agenda. NuoDB is in the vanguard of next-generation database systems. It’s a peer-to-peer system. It’s a system which is a SQL database, can do anything that Oracle can do, but it does it on a scale-out and scale-in basis. It’s also a multi-model database, so -- and you’re going to see more of multi-model coming in the next few years as more and more databases say, “Oh, we can support documents; we can support SQL; we can support all these different kinds of models.” But this is a fully-transactional SQL database, and it has these wonderful characteristics, so the fact that you can just keep adding machines, it’ll keep going faster, you can take machines away if you want. You can run it in multiple locations, multiple datacenters, and it remains transactional. It can store data in multiple places, and have multiple up-to-date copies of the data. Lots of magic things that it can do.
And this is our, sort of marketing slide about some of that stuff. It’s continuously available; any machine can die and the system keeps going, because it’s a peer-to-peer system; there’s no central control. It’s a system that has, what we call multi-tenancy. We’ve been able to run literally 72,000 databases on $60,000 worth of hardware. Really radical capabilities compared to what you’re used to with, either the traditional databases, or this last generation of so-called NoSQL databases.
So, next-generation database system. Don’t want to spend too much time now talking about it, but the hidden agenda is that, it’s really good at this thing called HTAP, and I want to talk more about HTAP.
So, this is the traditional thing, and thanks to Barry for talking us through this in greater depth. This is actually from the Gartner paper, and it describes generally how people have done this combination of, on the one hand, transactional processing, which is here to stay forever, what a core piece of our systems that is, and then how do you do analytics? And generally the solution has been, you offload it. You offload into something, whether that’s a warehouse, a Hadoop cluster, a -- you know, columnar database, or something else, (inaudible) of the same system, but off somewhere else, and you go do your analytics on that. And you say, “Well, you know, kind of doesn’t that work?” Yes, it does work, but there’s some big problems which I’ll get into, and then we’ll talk about what the solutions are.
So, first of all, why would you want to do this hybrid transactional analytic processing? The idea that you could do it on the same data in the same system -- (inaudible). Why do you want to do that? Let me give you a sort of an example of it, and it’s something that you all know very well. Google AdWords. So Google AdWords is obviously a transactional application. It’s actually how Google makes all their money. So it’s the thing that’s out there, and it’s charging you for putting up ads, basically. That’s a transactional thing. It’s also, by the way, an analytical application, because between the time that you do a Google search, and the time that those ads arrive, it’s gone off and done a market optimization. It’s gone off and picked the ads that it needs to pick, and it’s decided, based on all sort of criteria, about how many times that ad’s been shown, and what your budget is, and what the rules are, and whether it should happen in this part of the world or not, all these things, it has to do all those analytics and say, here are the ads I’m going to put up. OK, so that’s an example, and a good example of sort of, this kind of an application. It’s hybrid of transactions and analytics. And really the assertion is, more and more applications are going to look like that. More and more applications are going to have transactional aspects, and then this need to do more or less real-time analytics.
And, so that’s some of the justification for it. However, there’s naturally a problem, and it’s this thing on the left. I’m going to go through these in a little bit more depth, but basically, there’s four big problems with doing it this way, and we’ll go through that.
Just to repeat, Gartner said exactly the same that Barry Devlin said a moment ago, which is, that’s not to say that this is -- tries to address all analytics. That’s not what it’s about. There’s aspects of analytics which are going to be batch, and Hadoop-based, and all sorts of stuff. But there’s this operational analytics, which is on your core transactional data that you’re trying to use typically for optimizing business processes, that’s going to happen through this stuff. And so, basically, this is the Gartner picture of where it’s going to go, and let’s take you through some of that.
So the first thing they say is a big problem is architectural complexity. I know most of you are IT people; I don’t really have to explain why this is a bad thing. It’s something which is very costly to set up. It’s something which is very costly to maintain. It’s very costly for you to actually take this and evolve it very quickly. As long as you’ve got this idea of transactional processing on the left-hand side and have various kinds of ways of replicating that data and pushing it into warehouses, and [cubes?], and all sorts of stuff on the back end, and doing analytics. That’s painful, and you want to not have to do that if you can help it.
The second thing is latency, and this is -- and if you like, is real time. Or, this is actually Google Analytics, and we’ve got to like this idea, that this little red box is basically saying, right now. Right now, this is how many people are actually on this site. That’s analytics, and it’s analytics on an operational system. And more and more applications are going to have that kind of characteristic, and by the way, we’ve got quite used to it in terms of things like Google Analytics.
So this analytic latency, but you know what, that’s a real problem if the data’s having to be shipped out to something else to do the analytics. You want to do it directly on the data that’s in the transaction system.
The third issue is really that you’re not necessarily looking at the right data. So what’s happening in the transactional system, by the way, could be changing at millions of transactions a second. And then, you’re offloading it to somewhere else, which is going to take, at a minimum, milliseconds to get it somewhere else. And then you’re going to do some analytics on it. That analytics is not on the latest data. And that may not matter to you, but in many cases, it really does. So you don’t have the fresh data; you don’t really know which is the fresh data. You’re doing essentially historical analytics, and that’s -- and you’re wanting to bring that closer and closer and closer to real time. So basically, this issue of are you really even looking at the right data?
And lastly, and very good talk about this, is you also don’t want to have lots of copies of the data, because it’s a nightmare to manage many, many copies of the same data purely for the reason of needing to do analytics. If you could do it without having to produce copies of the data, you would do that.
Well I’m not really here to talk much more about that stuff. What I really wanted to talk about is the fact that there are innovations, if you recall Gartner’s comment, which is there were technical limitations to being able to do HTAP historically. So OK, so what’s the big deal? What’s changed in terms of database technology that allows you to now start doing this stuff that they’re talking about? And there’s a few things here, and I’m going to go through them.
The first is the topic of concurrency control. Actually, this is in some ways the most important piece. So if you go back far enough in the history of databases, there were queued systems. OK, they were basically single-threaded queued systems, and so a job comes to the front of the queue, and it might be a job to update some ticketing reservation or something, and when it gets to the front of the queue, it’s like kind of a print spooler or something like that, it gets loaded into the system, it runs, and it runs to completion, and then it gets dumped, and we pick the next thing off the queue. OK, and that’s kind of traditional, kind of non-concurrent batched kind of processing. And what’s great about that is that you don’t have to worry about any concurrency issues. From the point of view of analytics it’s pretty simple. What happens is, you’ve just, the next job that comes up is an analytics job. It has complete control of the whole database. It doesn’t have any kind of sort of concurrency issues of anything else that’s running in the database. It runs to completion; might take a while to do so, but it runs to completion, gets dumped, and you’ve got your report or whatever it is that you wanted to produce. However, of course, there’s lots wrong with systems like this. By the way, there are so-called [model?] databases that use this approach, this kind of single-threaded approach. They will not work for HTAP, because of the problem that that analytics job could take a long time to run, and it’s going to block anything else that’s trying to run.
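The blocking problem with a single-threaded queue can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the job names and durations are hypothetical, and the "clock" is an abstract tick counter, not a measurement of any real system.

```python
from collections import deque

def run_queue(jobs):
    """Run (name, duration) jobs strictly one at a time.

    Returns {name: (start_tick, finish_tick)} so we can see how long
    each job waited behind the jobs queued ahead of it.
    """
    clock = 0
    timeline = {}
    queue = deque(jobs)
    while queue:
        name, duration = queue.popleft()  # take the job at the front of the queue
        start = clock
        clock += duration                 # the job runs to completion, nothing else runs
        timeline[name] = (start, clock)
    return timeline

# Hypothetical workload: two short updates with one long analytics job between them.
jobs = [("update_ticket_1", 1), ("analytics_report", 500), ("update_ticket_2", 1)]
timeline = run_queue(jobs)

# The second one-tick update cannot start until the report has finished.
wait_for_second_update = timeline["update_ticket_2"][0]
print(wait_for_second_update)
```

The second update needs one tick of work but waits for the whole report, which is the reason a purely queued design does not suit a mixed transactional and analytical workload.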
So of course, people said, well wait a minute. You know isn’t it possible for transactional databases to be concurrent? Well, it sort of naturally feels that way, because if a database is going to update your telephone number and my telephone number at the same time, that’s not going to conflict, most likely, and so why can’t they happen concurrently? They can, right? So concurrent databases become an obvious thing to do, and you can get particularly throughput and latency benefits by doing that. But then you’ve got this problem of, when are you actually going to get conflicts? And just bear with me, because this actually comes back to the topic. Just bear with me for a second.
So how do you deal with update conflicts? You have to have locks, OK? That’s the traditional way of doing things. You say, OK, what we’re going to do is essentially lock this piece of data, either lock it exclusively, because I want to write to this thing, or lock it as a shared lock, because I want to read it, and that allows us to manage these cases of where there’s going to be conflicts. And if you’re trying to update, you’re trying to take out $1,000 at the same time I’m trying to take out $1,000, that’s an update conflict, and locks are going to mediate that, and that’s fantastic. This is how the world runs, by the way. Ninety-three percent of databases in the world are run this way; you know, 100% corporations run this way. Locking is the traditional way of doing things.
It turns out that it’s a really problematic way of doing things from the perspective of HTAP. And the reason is, and again, just bear with me, because what happens is, let’s suppose your analytics job starts running. And it’s trying to do some average of a column. What does it have to do to get a coherent result from that? It has to do a read-lock, a shared-lock on a large piece of the database. Along comes your transactional updates, which you can have millions a second, and they start hitting these read-locks, and being told to wait. So what you’ve got is, in lock-based systems, you’ve got a situation where trying to run analytics on a transactional system is going to make both the analytics and the transactions slow down a lot. And by the way, it’s one of those kinds of things which isn’t even linear; it can thrash.
So what happens is, this is actually one of the main reasons that people have offloaded transactional data onto separate systems and said, OK, let’s just run the analytics on a separate system, and allow the high-performance, you know, short workload kind of OLTP stuff to keep flying on a primary system, OK?
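The way a long-running read blocks writers in a lock-based system can be illustrated with a toy lock table. This is a simplified sketch, not how any particular database implements locking: the `LockTable` class, its method names, and the table-level granularity are all assumptions made for the example, and real lock managers queue waiters rather than simply refusing.

```python
class LockTable:
    """Toy lock manager: many shared holders OR one exclusive holder per resource."""

    def __init__(self):
        self.shared = {}     # resource -> set of holders with a shared (read) lock
        self.exclusive = {}  # resource -> single holder with an exclusive (write) lock

    def try_shared(self, resource, who):
        # A reader is refused only while a writer holds the resource.
        if resource in self.exclusive:
            return False
        self.shared.setdefault(resource, set()).add(who)
        return True

    def try_exclusive(self, resource, who):
        # A writer is refused while anyone else is reading or writing.
        other_readers = self.shared.get(resource, set()) - {who}
        if resource in self.exclusive or other_readers:
            return False
        self.exclusive[resource] = who
        return True

    def release(self, resource, who):
        self.shared.get(resource, set()).discard(who)
        if self.exclusive.get(resource) == who:
            del self.exclusive[resource]

locks = LockTable()

# The analytics job takes a shared lock so its average sees a coherent table.
locks.try_shared("accounts", "avg_balance_report")

# While the report runs, every transactional update is told to wait.
blocked = sum(not locks.try_exclusive("accounts", f"update_{i}") for i in range(1000))

# Only after the report releases its lock can an update proceed.
locks.release("accounts", "avg_balance_report")
granted_after = locks.try_exclusive("accounts", "update_0")
print(blocked, granted_after)
```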
Is there an alternative? Yeah, there’s an alternative, and it’s called multiversion concurrency control. And by the way, it’s not new. It was invented by my cofounder, Jim Starkey about 25 years ago. It was initially implemented in Rdb/ELN as part of DEC’s database systems. It was a big feature of InterBase, which was one of his products. It’s the core of how our system works, and by the way, it’s also the core of how many modern database systems work.
So what does it mean? I’ll give you an example of it. Let’s, (inaudible) I’m trying to print out the customer table, and you’re trying to change someone’s address. OK, in a traditional lock-based system, we just described what’s going to happen. We’re going to have some kind of conflict over that.
In multiversion concurrency control, what you do is, you’re maintaining new versions, or you’re creating new versions of records. You’re not actually updating the original versions of the records. And so what happens is, I decide to do my printout. You come along and you start changing things. What happens for you is you start creating new versions of those records. I don’t care about those new versions of records. I care about the versions of the records at the time that I hit the printer button. OK, so using versioning, you can do concurrency control. And you can do it in a way that readers don’t block writers, and writers don’t block readers. You translate that all the way back out to HTAP, it turns out that therefore, you can do arbitrary analytics on a database system like this without having to take any locks. And if you can do it without taking any locks, what you’re doing is you’re allowing the transactional workload to keep going.
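The printout example can be sketched as a toy multiversion store in Python. This is a minimal illustration of the versioning idea only, not NuoDB's implementation: the `VersionedStore` class, its methods, and the sample addresses are hypothetical, and the sketch leaves out write-write conflict handling, uncommitted transactions, and cleanup of old versions.

```python
class VersionedStore:
    """Toy multiversion store: writers append versions, readers see a snapshot."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0      # logical timestamp, bumped on every write

    def write(self, key, value):
        # An update never overwrites; it adds a new version with a later timestamp.
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        # A reader remembers "now" and ignores every version created afterwards.
        return self.clock

    def read(self, key, snapshot_ts):
        visible = [value for ts, value in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

store = VersionedStore()
store.write("cust:1", "12 Old Street")

# I hit the print button: my report is pinned to this point in time.
report_ts = store.snapshot()

# You change the address while my report is still running. No lock is taken.
store.write("cust:1", "34 New Avenue")

report_view = store.read("cust:1", report_ts)         # what my printout sees
latest_view = store.read("cust:1", store.snapshot())  # what a new reader sees
print(report_view, "|", latest_view)
```

The report keeps reading the version that existed when it started while the update proceeds, which is the "readers don't block writers, and writers don't block readers" property described above.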
I apologize if that got a little bit technical for some people, but just, when you hear this kind of HTAP acronym, you’re probably going to find this other acronym somewhere in there, which is called Multiversion Concurrency Control. This is not how traditional databases work. By the way, a lot of them have introduced some aspects of MVCC into them; in fact all of them have, but they’re not MVCC-based.
The second big innovation, and again not really a new idea, but it’s critical to this whole topic, is the topic of row-based or column-based storage. In the middle here, you have a table. Doesn’t matter what the table is, and the traditional way of storing a table in a database is store, essentially the columns contiguously. So you store the record as a whole record. OK that’s kind of an obvious thing to do, and that’s what traditional databases all have done. You know, and sort of, DB2 as talked about earlier, Oracle, SQL Server, and MySQL, all of them are basically row stores; they do this. OK, they store a record, and another record, and another record on the disk.
And that’s really great for OLTP. For transactional processing, this is the right way to do it; it’s very fast. It’s not the right way to do it if you want to get really high-performance analytics. And so, if you look at systems like Vertica, and Sybase IQ, and a whole bunch of other systems, they do this, which is they store all of column 1, and then they store all of column 2, and then they store all of column 3. And there are two main reasons why they do that. The first is, that you only have to pick up the columns if you care about (inaudible) this. OK, suppose there are a thousand columns, and you’re only interested in three. You pick up three, not a thousand. OK, that’s a really big [I/O?]. It also allows you, for the technical amongst you, to do much better compression. So based on, when you’ve got sort of a coherent datatype, you can do all sorts of interesting compression algorithms on that. So, that’s all the kind of technical aspect, but the main point to understand is that, for the analytical part of HTAP, column-based storage gets you big, big wins, OK? Row-based storage gets you wins on the transactional part, and somehow, modern databases that are doing HTAP have to somehow find some hybrid of how they can do these two things, and that’s the second thing that you’re starting to see in terms of these technologies.
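The two benefits of column storage mentioned here, reading fewer values and compressing better, can be shown with a small Python sketch. The table contents, the function names, and the choice of run-length encoding are assumptions for illustration; real column stores use their own layouts and a range of compression schemes.

```python
# A tiny table held row by row, the way a row store lays records out.
rows = [
    {"id": 1, "name": "Ann", "city": "Paris", "balance": 100.0},
    {"id": 2, "name": "Bob", "city": "Tokyo", "balance": 250.0},
    {"id": 3, "name": "Cy", "city": "Paris", "balance": 50.0},
]

def to_columns(rows):
    """Pivot the same data so that each column is stored contiguously."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Compress runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

columns = to_columns(rows)

# Row store: averaging one field still walks every value of every record.
values_touched_row_store = sum(len(row) for row in rows)

# Column store: the same average reads only the one column it needs.
values_touched_column_store = len(columns["balance"])
average_balance = sum(columns["balance"]) / len(columns["balance"])

# A column holds a single datatype, so repeated values compress well once sorted.
compressed_city = run_length_encode(sorted(columns["city"]))
print(values_touched_row_store, values_touched_column_store, compressed_city)
```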
The third big one, and this is about -- in effect, this should just say, “Memory is free,” OK? And this is actually what Gartner’s really focusing on. And this is a chart that goes back to 1955. It’s dollars per megabyte if you can’t read it. It starts off at something like a billion dollars per megabyte, and goes down to something which is a tiny fraction of a penny per megabyte, over a period of 60 years.
But what’s the point? Well, one of the things in these days of big data that we kind of tend to forget is, the vast majority of transactional databases are less than a terabyte in size, OK? What does that mean? Well, I can get a cluster of machines with a terabyte of main memory for 50K, something like that? In other words, for about 50, say $100,000, I can put almost any transactional database in the world in memory. And by the way, this is going to continue, so what we’re saying is that, now we’re talking about databases that are, in principle, memory-resident. NuoDB, by the way, is not a purely in-memory database; it’s what we call a memory-first database, which is to say that it will load the whole database into memory, if you want it to, and it runs effectively at in-memory speeds, but it’s got full durability in a way that in-memory databases don’t.
But because of this, what does this give you in terms of HTAP? Well it gives you -- it gets you away from one of the single biggest issues, which is around I/O bandwidth limitation. And so, when you’re doing analytical processing, quite often, you’re doing full table scans, and things like that, that really hit the I/O subsystems. When you go into an in-memory system, basically for reads, there is no I/O, OK? It’s all network I/O; there’s no disk I/O.
So that’s the third big thing, and Gartner actually focuses on this one particularly. I think that they were quite excited at the time about technology from SAP, which is the HANA technology which is (inaudible) in-memory kind of a system.
The fourth piece of this, which is a little bit less obvious, is about resource contention, OK? So even if I have those first three things, and I bring along some big analytical long-running workload, and I throw it at my OLTP system, it is going to take up processor bandwidth, and network bandwidth, and all sorts of other things that are going to slow down my transactional processing. And so, the answer to that is scale-out. And the ability to say, oh, well in that case, let me just add some more nodes. In fact, why don’t I add a specialized node that’s got lots of memory? Why don’t I add a specialized node that’s associated with a big kind of SSD or something, temporarily? I can go onto -- I know I’ve got some SoftLayer people. On SoftLayer, you could say, just give me a big machine, a terabyte of memory for an hour. OK, and I’ll run my analytics, and I’ll tell it to join into the database. I’ll run my analytics, and then I’ll toss it out, OK?
So, this ability to scale out gives you this ability to do really quite extraordinary levels of analytics without actually even hitting any of the current resources that are in the database, which is to say the machines and the networks, and so forth.
So, those four things are really the sort of the core of why HTAP has become possible. And so if you go back to, this is kind of the picture of what our system looks like, and this is obviously the one that I know best, it’s a system of collaborating peer processes running on an arbitrary number of machines. They could be running on the cloud; they could be running in your data center. They could be running on some combination, by the way. They could be running in multiple data centers. You could have -- this machine could be sitting in Tokyo; this machine could be sitting in Paris, and it runs quite happily doing that stuff. But the fact that you’ve got these peers, in-memory peers, disk-based peers, that you can keep adding allows you to do this kind of resource-based stuff that we’re talking about. We talk about this architecture as, what we call a durable distributed cache, and that gives you the sense of it being this in-memory system that allows you to do this kind of HTAP, kind of processing.
So I don’t have -- I wasn’t really going to come along today to give you a long story about how exactly the product works. It is very interesting, and I think it’s very much a harbinger of where database systems are going to go in the future. You’re going to see them being multi-model; you’re going to see them being transactional. You’re going to see them -- by the way, by “multi-model,” what I mean is, they will support SQL; they will also support other non-SQL technologies, including graph, which was asked about earlier, including document-oriented. You’re going to see systems that look like this, that are capable of doing all of those things. They’re also going to support HTAP, and because applications going forward are increasingly going to look like Google AdWords, and that’s kind of what we’re here to say.
Kind of out of time, so I think what I’m going to do is just perhaps go from here and take some questions. This is my contact address. I have to say that I ran out of cards, so some people asked me for my card earlier and I’ve run out of those, but you can grab my email address if you want to contact me afterwards. So why don’t we take some questions? (applause)
M5: A lot of large firms have embedded investments in their [PI?] infrastructure, and operational data stores, you know, (inaudible) data [live?] sectors (inaudible). And with HTAP and the [glow?] technology, I’m assuming that that significantly, not just that they’re deriving the benefits of providing (inaudible) analytics (inaudible) data, but it could also provide a lot of (inaudible)ation of that data. What [article?] what interest you have seen from firms in taking on their existing data infrastructure and tapping into this new technology that is still emerging, by (inaudible), can you give some comment to that?
(Barry Morris): Sure, yes. So the question is, what kind of market uptake have we seen around, or adoption have we seen of this kind of technology? What’s been interesting is precisely what you said, which is that people see this as strategic. They’re applying it to areas of their systems where there’s the maximum benefit. And so, you know, when people really need to have scale-out transactional processing, you really need it, and you can’t get it anywhere else, and the traditional vendors will say, buy a bigger machine. And we say, you know, add some more machines and take them away again when you’re done. So what you’re seeing is those kinds of applications people love. But the strategic part comes from when people are saying, but this is a consolidation kind of an opportunity. Over time, you can put more and more and more things into a system that can run like this. And it’s running globally, and it’s running, you know, on commodity machines, and it’s very dynamic. You can add machines in the cloud; you can add machines in your data center, all of those sorts of things, that becomes a sort of a strategic initiative for people. And quite often, that’s what we’re seeing is conversations start with a tactical conversation about, can you solve this particular application problem? And then people say, well wait a minute, this is something which is much bigger, and that’s where it tends to go.
M6: Yeah, so I want to know -- in the HTAP arena, but basically, it seems to me that you’re moving to (inaudible) or other thing, but it would be nice to have [simple sort?] of a database actually on that machine, and then, see mostly going to (inaudible) [math and?] database (inaudible). Is that part of the whole HTAP new, (inaudible) fast response time gives you all the application-based capability. It does create a copy of the database, OK, but that could be [new?] (inaudible) cache. Which I assume caching is still OK even in -- only one copy.
(Barry Morris): So the question, in case you didn’t hear it, the question is about, what about mobile devices, which are becoming increasingly powerful, and how much of the database sits on that, and needs to sit on that, and what are the benefits of that? So, the word that comes to mind is latency. And latency is the enemy of all of this stuff. The latency between your mobile device and the data center is very significant, even over kind of LTE technologies and so on. And your data architecture needs to be designed around minimizing the sort of chattiness between the device and the back-end systems. And that means, as you rightly say, that you’re wanting to put, on the local device, you tend to cache data at a minimum, if not have an actual replication data store. The way that this is going in my view is that increasingly, you’re going to see kind of what I like to call, kind of occasionally disconnected, and occasionally connected kinds of models, and that requires database systems to be able to be disconnected, subsequently heal. That’s technology which we haven’t prioritized at this stage, but is part of what’s coming, because the traditional transactional model, in effect requires constant connection. And the reality is that you can’t assume constant connection. You also can’t assume low latency. And so that’s part of the answer.
But for sure, this model is something which is headed towards much more of an edge-centric view of the world, where we will talk about Internet of Things. One of the things that’s not really sort of, I think thought about enough is how are those things going to be connecting? My bet, by the way, is they’re going to be connecting over cell phone networks at very, very high speeds, and that edge computing is going to become really, really critical. Edge computing is naturally distributed, naturally low-latency, naturally requires these kind of AdWords-like applications, and that’s, I think, part of what you’re asking.
M7: That apply to video surveillance, with all the different video and (inaudible)?
(Barry Morris): Absolutely. Absolutely.
(Eric): Next question? Yes.
M8: Yeah, (inaudible), you know, it sounds like you know, you guys [put out?] different areas as well as like (inaudible). (inaudible) column-based (inaudible) type, so, can you give us the, like, best practice, how can [ULS?] and NuoDB, can it be replaced by, like, you know, with a (inaudible), can it be replaced with a (inaudible) platform or something like that?
(Barry Morris): Sure, and what you’re saying is nearly right. I want to just, kind of, before I answer that directly, I want to be very clear about, this system does not look like a database system at all when you look under the covers. And so often when people hear us say it’s a SQL database, they immediately think of, it’s somehow similar to an Oracle or a MySQL, or something like that. If you took a database internals guy and you said, “Here’s the NuoDB source code,” they’d say, “Where’s the database?” Because it’s really an object messaging system that behaves like a database, OK? So that’s the first thing to understand. But I think your question is really about migration. You can take -- we have migration tools that you can point at a MySQL database and point at us and just suck that data in, OK? And your application, you may have to change some SQL, but most of the time, it just works. So it’s a system which from the perspective of the programmer, just looks like a traditional database. Under the covers, it’s a completely different beast.
M9: Good morning, thank you for coming. When you look at where both either your company is, or (inaudible) is today, what are kind of the top two to three challenges that you see towards adoption, and how are you driving (inaudible) adopt or to overcome those challenges?
(Barry Morris): Challenges to adoption. I think that, you know, I always think about things as the kind of technology adoption curve, if you like. OK, and so they’re not just certain kinds of customers, but certain applications within a customer, that naturally are going to be slow to move anywhere. They just, they kind of work, they’ve got enough, they’re running on old technology or whatever, but they’re fine, and they don’t cost too much. Those are not going to go anywhere anytime soon. We still today, there’s a huge amount of the world’s data processing that happens on IBM mainframes using IMS, and mainframe DB2 and all that stuff. It’s fine; it works.
The adoption happens where people have got their pain, OK, and the pain usually is, they’re moving to some kind of modern datacenter architecture, whether that’s a public cloud, private cloud, just commodity scale-out, they’ve got issues which relate to either kind of bursty workloads, or just sheer capacity issues, or continuous availability needs, or whatever. When you’re talking to people about that, there really isn’t an alternative. That’s not to say that 1980s databases aren’t fantastic pieces of technology; they are. But they’re 1980s databases, and they have single points of failure all over the place. They have, you know, significant challenges with modern applications. So it’s a long way around of saying, from our perspective, and I’ve got some of our sales team here, what our sales guys will do is, they will very quickly talk to a customer and say, you know what, you don’t need that stuff. There’s applications of yours that really have these needs, and for that, it’s an easy conversation. It’s really from the customer’s perspective, can you really deliver what you say you deliver? And if we can, then that’s (inaudible).
(Eric): OK, and on that note, I just do want to introduce Bob and Jack. They are representatives of NuoDB, and here to answer additional questions, as well as Barry Morris. We got time for like two more questions. Anybody want to throw one out? Yes?
M10: You showed at the very beginning of your presentation, old architecture when there was some transactional database, and after that, the decision support database, and so on. One of the approaches today that -- no, I know, it’s definitely -- that you also have to some degree similar architecture, you have some big data, Hadoop, or whatever, doesn’t matter what. And also you could have data warehouse, for example, where you’ll really move data, like from transactional database. What is your opinion about this, you know, from (inaudible) point of view, of course?
(Barry Morris): Sure. So, just try and sort of take that quickly, and it is a big topic. There’s analytics -- there’s a sort of spectrum of different kind of analytics, right? And on the one end, you’ve got more or less discovery, you know, which is what a lot of the kind of the big data conversation is about. And you know, using Hadoop-like infrastructures, where there isn’t any -- prestructuring, there’s no indexing, there’s no -- it’s just raw data sitting in some data lake, and I want to discover whether there’s a correlation between -- one of the classic examples, you know, kind of diapers and beer, or whatever, you know, those kinds of things. That’s discovery, right? Then there’s stuff in the middle which is more like classical kind of BI, and then there’s stuff which is process optimization, where you know, look at Tesla, is a great example. By the way, I wouldn’t want to compete with Tesla. These guys are tracking so many metrics on all of the cars that are out there, all the time. They’re able to monitor and model how long it takes for a battery to die, or how many miles people go, or what temperatures people like to keep the car at, and all that stuff. They’re tracking all that data. They’re not tracking it in some kind of Hadoop-like thing; they’re tracking it from more like an Internet of Things point of view, which is process optimization, OK? Which means they can design a better car, that they can advise you that your brakes need work, (inaudible).
So that is much closer to the end of the spectrum that we (inaudible) about, OK? The one end of the spectrum is kind of discovery, and kind of batch analytics, and all that stuff, right? The other end is really optimization. One thing is doing the right thing; the other one is doing it right, OK? And this is the “doing it right” part of it.
(Eric): Last question?
M11: Is there any difference between in-memory and in-database processing?
(Barry Morris): All right, I want to make sure I understood the question right, so, is there a difference between in-memory and?
M11: In-database processing.
(Barry Morris): OK, in-database processing. There is a difference, yes. So, in-memory is about the fact that, as I said, which is that, the opportunity now is to build databases that are optimized around the idea of huge amounts of memory, in our case, huge amounts of distributed memory, OK? That’s about I/O wins, it’s about (inaudible) things like that. In-database processing is basically server-side processing, which is also a really good idea. By the way, each one of these nodes that I described here has an embedded Java virtual machine. You can run arbitrary Java code directly in the database if you want to. That gives you enormous benefits primarily for the reason we talked about before, which is that what kills database -- one of the things that kills database performance is lots of (inaudible) chattiness between a client and the server. And if you can take some of that work and just hand it in a big block to the database engine, and just say, calculate all this stuff, and give me back the results, OK, that gives you other big wins. But that’s actually orthogonal, or somewhat independent to whether it’s (inaudible).
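The chattiness point can be illustrated by counting round trips in a toy client and server. This is a sketch of the general idea only and does not show NuoDB's API or its embedded Java support: the `ToyServer` class, its methods, and the account data are hypothetical, and a Python callable stands in for code shipped to the database.

```python
class ToyServer:
    """Stand-in for a database server that counts client round trips."""

    def __init__(self, balances):
        self.balances = balances
        self.round_trips = 0

    def get_balance(self, account_id):
        # Chatty style: one network round trip per value fetched.
        self.round_trips += 1
        return self.balances[account_id]

    def run_on_server(self, func):
        # In-database style: ship the work once and get only the result back.
        self.round_trips += 1
        return func(self.balances)

server = ToyServer({i: float(i) for i in range(1, 1001)})

# The client pulls every balance across the wire and adds them up itself.
chatty_total = sum(server.get_balance(i) for i in range(1, 1001))
chatty_trips = server.round_trips

# The client hands the whole calculation to the server in one block.
server.round_trips = 0
pushed_total = server.run_on_server(lambda balances: sum(balances.values()))
pushed_trips = server.round_trips

print(chatty_trips, pushed_trips, chatty_total == pushed_total)
```

Both approaches compute the same total; what differs is the number of client-server exchanges, which is where the latency cost accumulates.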
(Eric): That’s all the time we have for questions at the moment. Barry Morris will be here for more questions, as well as Jack and Bob, who are over there. Thank you very much, Barry Morris. (applause) Barry Devlin is here. If you don’t have a book, feel free --