
HTAP Webinar: The Future of Data

Hear Dr. Barry Devlin's and Steve Cellini's thoughts on the future of data and where HTAP fits in that future.

Video transcript: 

Lorita: Hello everyone, and welcome to our webinar, the Future of Data, where does HTAP fit?  I’m from NuoDB, and I will be the moderator for today’s webinar.  I’m joined by Dr. Barry Devlin, founder and principal of Mindsight Consulting, and Steve Cellini, VP of product management, NuoDB.  Gentlemen, thank you both for joining us.  Before we begin today’s presentation, I’d like to review a few logistics.  Our webinar will last just under an hour.  The webinar will be recorded and made available for replay.  Attendees will be muted during the call; however, you may submit questions at any time during the presentation using the questions box in the GoToWebinar control panel.  We’ll answer as many questions as time allows at the end of the presentation.

Today, we’re talking about the future of data, and specifically about hybrid transaction analytical processing.  Barry will discuss the history of data management, how it’s led to where we are today, and where this HTAP concept fits today, and in the future.  We’ll then hear from Steve Cellini, who will take a look at the same question through the NuoDB lens, and talk through how NuoDB can be used simultaneously for highly transactional operations and real-time operational analytics.  Finally, we’ll open up for Q&A.  So first, let’s go ahead and hear from Dr. Barry Devlin.  Barry?

Dr. Barry Devlin: Thanks Lorita.  Good afternoon, everybody.  It is a great pleasure to be with you on a wild and windy evening here in Cape Town, South Africa.  Yeah, I’ve been in this business for so long that I can hardly remember it.  And I wanted to delve back a little bit into the past.  You know, I was part of the original founders, or fathers, or grandfathers of data warehousing, way back in the 1980s.  And the first place that I’m going to take you is back to the 1980s, and what was going on then.  Because it’s important to understand the past before we start talking about the future.  So, let’s move forward. 

In this first slide, I want to point out the picture that’s hidden at the top right-hand side.  This is a picture right back from the first data warehouse architecture that was published in the 1980s.  I don’t expect you to look at it, I just want to prove that I was there.  But the bigger picture that’s here actually talks about how we got to be where we are, and why we are in that space.  Now, if you look at that picture, what you will see is that there are two distinct boxes.  And you know them, if you’ve been involved in data warehousing or in business intelligence, you know these very well.  There is a set of operational systems down at the bottom, and then there’s an enclosed box that’s called the data warehouse, which is further split up into enterprise data warehouse and data mart.  Now, these represent what we were trying to do back in the ’80s.  And it was a separation of running the business versus managing the business.  The difference between operational and informational need.  And when we said run the business, we were talking about using transactions, writing transactions, reading and writing them, records of what was going on.  And it was all about speed and action, and making sure that we were getting things done as fast as we could.

On the other side, we had to manage the business, which was in those days very much about a single version of the truth.  About getting consistency across a wide range of systems that we had inherited since the beginnings of the computer era.  Multiple systems having different data, working in different timeframes with different designs.  And indeed, you know, I understand that there’s still some of that there today.  But of course it was much more diverse back then.  And our business people, the decision makers, were very interested in getting a single view of the truth.  They wanted consistency in order to support what they called tactical decision making.  And in a sense, what that was doing was saying we know how to run the business minute by minute.  We’ve got these operational systems, and we need to build a system that manages the business as well. 

Let me talk a little bit about the biz tech ecosystem.  So in this modern environment of the new business world, which I call the biz tech ecosystem, we have this change of business focus.  A change of business focus from the idea that we need consistency, and that we need a single version of the truth, to an environment where we’re actually being driven very much by the speed of decision making, and appropriate action.  What we see in the modern world is that when we look at the business, we see businesses operating in a market which is extraordinarily flexible and uncertain, facing enormous competition.  We see businesses who are working directly with customer interaction, with customers who have technical knowledge, who are working on mobile devices.  We have businesses who are working in an environment where there is an enormous amount of externally sourced information.  And we have this entire drive to move quickly from the idea that we see something happening, we understand what’s happening, and we move forward with that in order to get the next action in place.  And we see this most particularly in the emergence of the internet of things, and, previously, in the big data area.  And I always like to mention the internet of things, because it allows me to mention this internet connected refrigerator at the top left of the screen.  And remind you that the first hacking of refrigerators actually happened earlier this year.  So, not sure why that might happen, but it’s an interesting way that businesses change.  We have to take account of the fact that everything is interconnected these days. 

So, in this environment, in this environment where we have an enormous amount of information, an enormous speed of decision making, the question is, what does that mean for the information that we have?  And what it means, I believe, is that we must have a single view of information.  A single logical space where all the information used by the business is brought together.  Now this picture that I’m showing you is what I call a logical, or sorry, what I call a conceptual architecture.  A conceptual architecture, which is the basis for making decisions at the very highest level about what the business needs, and how the business should use this information.  This information space has a number of characteristics which I’m not going to speak about at any length, but those of you who are familiar with data warehousing will recognize at least the X axis here, which is the timeliness and consistency concerns that we’ve always had to deal with in terms of data warehousing.

Now, having one conceptual, logical space with all of the information is all very well.  It gives us a starting point from which to think about where we’re going, and what we need to do.  But, beyond that, we need to dive a little deeper, we need to see a bit deeper about what’s going on.  And in order to do that, I want to move from this logical space to a more technical or logical view of what the information might look like.  And that divides that space into a different view.  If you recall the first picture I showed you, it was very clear that we had a layered data architecture.  This layered data architecture was made up, as I said, of the operational systems, the data warehouse, and the data mart.  And we moved data from layer to layer.  And that caused delays, it caused the problems of consistency between the layers.  This architecture that I’m showing you now is based on information pillars, rather than data layers.  That says that we’re going to try to keep the minimal number of copies of information, but we will understand that we have many flavors of implementation that we need to deal with.  We need to be able to mix and match technology as needed. 

So, all sources of data and information arrive into this architecture from the bottom of the chart.  Measures, events, messages, all parts of the internet and the real world that’s out there.  They come through this idea of instantiation, of gathering this information, integration of those measures, events, or messages, into perhaps transactions, which are the legally binding business events that we want to track.  But also, being stored where necessary, raw as they are.  And so I have, looking at this architecture, and looking at the data needs over the years, come to the conclusion that we need to have at least three pillars.  There is the pillar on the left-hand side, which I call machine generated data.  This is the stuff that we record as fast as possible and really corresponds very much to the internet of things.  When we talk about technologies here, we’re probably talking about NoSQL technologies, we might be talking about relational, we’d certainly be talking about messaging technologies as well.

On the far right, we’ve got human sourced information.  Human sourced information is the way that people communicate, the way that people record their impressions of the world.  And it’s the Twitter streams; it’s the Facebook messages, and so on.  Human sourced information is another set of information that we probably want to keep fairly raw in this sense.  And here, of course, you will be thinking of Hadoop.  But in the middle, and where we’re going to focus next, is what’s called process mediated data.  This is where the transactional data, the actual business record, ends up.  This is where we keep the very basic information that allows us to run our business.  And you will see a very important phrase across the middle here, called context setting information.  This, I would like to suggest to you, is a new name, or a new concept, for metadata.  Metadata, which I feel has been somewhat devalued recently by some of the uses that are being made of it by security agencies, and indeed, by businesses throughout the world, as they use data to find out what’s going on in the world, in a way that invades privacy.  But it is essentially context setting, and it tells us how everything fits together.  The beauty of this architecture is that the data and the information flows as fast as needed through it, and is reconciled when necessary.

So, let’s move forward.  And look at the process mediated data pillar much more closely.  Process mediated data, in my view, is where we are going to have to reunite operational and informational data.  In the logical view, we think of it as being unified.  I think of it as being a relational model, simply because the relationships are what’s really important here.  The relationships that tell us what it is that fits together, how things relate to one another.  We want to minimize the number of data copies within this space, because as we talked about earlier, if we want to have speed and we want to have the ability to react fast, we can’t afford to be making too many copies of the data.  So in a logical sense, I want to have a unified view.  But in a physical sense, I know that I’m probably going to have to do some level of multiple copies.  And when I say some level, I want to try to keep that as low as possible.  So, where’s the difference here?  The main difference here is the idea that, in the main part of understanding operations running the business, and understanding the analytics of making quick decisions within the business, I want to try to move to some sort of hybrid transactional analytical database that enables me to do both operations and analytics within that space.  That gives me the possibility to do what we used to call operational BI, and operations together, in the same data space.

Does that mean I’m saying we do away entirely with the enterprise data warehouse and data marts?  Well, I’m not saying that at all.  There will be times when we need these particular constructs.  Certainly, we will need consistency and regulatory reporting in that area, and an EDW is always seen as being a good space for that.  A good thing to have.  Will we always have everything going through the one operational analytical environment?  Well no, there will always be other systems out there that we will need to consolidate data from.  That gives us the need to have a place where we bring data together in an enterprise data warehouse.  There may well be needs for specialized analysis or reporting.  All of those things say that there will be data warehouses and data marts that are separate.  But what I see is that their role becomes much reduced.  So, the real running, operating, and analytics in real-time of the business is in one place, and the EDW and data marts will serve specialized needs in a different way. 

So that gives us the thought that we have a system where the central core processing of the business happens in that place.  And there are centralized and distributed options around that.  What that leads us to then is where HTAP comes in.  Now HTAP is a phrase that was introduced by Gartner earlier this year, in a research note on the 28th of January, in fact, where they talked about hybrid transaction analytical processing fostering opportunities for dramatic business innovation.  Now, when you think about what that phrase HTAP means, it’s saying that within this environment, the need was seen to have the ability to handle both transactions and analysis within a single hybrid environment.  And it’s driven by this need to identify the value of advanced real-time analytics.  At the heart of that, in Gartner’s view, and I agree with this, is an in-memory data store that can have both transactions and analytics running within the same environment.  That’s not to say it’s necessarily easy.  There are challenges to overcome here, and Gartner has listed a number of them, including the immaturity of the technology and the fact that there is an established environment which needs to be dealt with.  But that’s something that Steve is going to deal with in more detail when we get to talk about the products from NuoDB.

So, I’m going to wrap up here a little bit.  I want to mention the fact that much of what we’re talking about here is discussed in my book, Business Unintelligence.  Business Unintelligence, which gives us the view of people, process, and information.  So let’s move forward from that, because I know I’m running a little late with our technological view.  So this idea that we need to move to a single logical space for all of the information of the business allows us to think much more clearly about the possibility of running operational and informational elements in the same space.  The second point I want to make is that we will be looking at how we do these physical implementations.  I showed you the pillars, the information pillars, which enable us to think about those centralized and distributed implementations, and the possibilities that we move from SQL through NoSQL, through this idea of the NewSQL environment.  And I still want to keep us in mind that when we talk about processes and when we talk about running our business, the core information is still best suited to the relational model.  And whether that’s provided in an in-memory, distributed, scale-out environment, well, that’s what Steve is going to tell you about next.  Which is important to the HTAP direction, as we saw from Gartner’s research note of earlier this year.  So with that, I’m going to hand it back over to Lorita to take us forward.

Lorita: Great, thank you so much, Barry.  Steve, why don’t we go ahead and dive into how NuoDB can help address some of the issues that Barry’s brought up? 

Steve Cellini: Thanks Lorita, thanks Barry, and hello everyone.  My name is Steve Cellini, I’m with NuoDB, and I’d like to talk to you about NuoDB’s architecture, value propositions, and in particular how all of this works together in the context of HTAP.  So, as Barry mentioned at the start of his slides, it’s all about avoiding the gap where you really are aiming for a single view of the truth, so that with a single database, you can both run the business and also manage the business.  This means basically mixing your operational workloads, where, you know, you’re doing many thousands of short transactions a second, perhaps, while also running relatively long-running analytic queries at the same time, on the same database.  And the goal here is to be able to gather and act in real-time on that operational intelligence.  As Barry mentioned, your operational data will tell you about what your customers are doing, or what your competition is doing, or, he mentioned connected refrigerators, what the devices in your internet of things are doing, and be able to act in real-time on a single copy of the database.  Which is pretty important.  You know, if you need to make copies of the data, that introduces delays.  Delays, you know, create stale data, which is less valuable data.  And the complexity has to be taken into account as well, while you copy the data and then perhaps have to figure out how to properly protect it, and make sure that it is up to date, you know, when you plan to update it next.

In particular, when your customers are moving to more mobile-based scenarios, where the interactions are therefore typically much more frequent, the expectation is that you’re going to interact in a low latency manner as well.  So basically, this has been, you know, the holy grail for a while.  Barry talked about a slide that he showed back in 1988, talking about this idea, that really has been waiting for a technology breakthrough, and, you know, the technology issues are real, and have not been sort of hidden for a while.  They’ve been out in front, and we’d like to talk to you about how NuoDB can address these.  In a nutshell, NuoDB is a full SQL database with full ACID semantics.  So, it is a database that is completely suitable for operating your day to day business on, from an operational perspective.  Its architecture is unique though, in that it is a distributed system, which provides the ability to support these HTAP scenarios that we’re talking about.

So to get into more detail, first let me talk about NuoDB’s architecture.  NuoDB has a three-tier architecture.  We have a management level, which is represented by the brokers in the schematic.  We have a transaction engine level, we call these TEs.  And then we have a storage level, populated by storage managers, or SMs.  You can think of the TEs as being the place where your SQL is processed.  You can also think of them as an in-memory or caching tier.  The storage managers represent a tier where data is made durable on disk.  So the first thing to notice is that NuoDB is a distributed architecture with many TEs and many SMs.  You have extraordinary flexibility in how you configure your database in terms of number of TEs, number of SMs, and where they’re configured.  And as a result, you’re able to achieve basically a level of scalability that fits your workloads.  You can have multiple TEs spread out on multiple hosts in your system, in order to support particular workloads, or to achieve specific levels of redundancy or reliability.  You can basically dial in the level of scalability or redundancy as needed to achieve your continuous availability goals.  So, we think of this as a highly elastic system, where you can scale up the number of TEs and the number of SMs as needed to fit your performance, or scalability, or reliability goals. 
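The three-tier layout Steve describes (brokers, TEs, SMs) can be sketched as a simple model.  This is purely illustrative; the class names, fields, and the scale_out helper are assumptions made for the sketch, not NuoDB's actual API.

```python
# Illustrative model of the three-tier topology: brokers (management),
# transaction engines (in-memory SQL tier), storage managers (durability).
from dataclasses import dataclass, field
from typing import List

@dataclass
class TransactionEngine:
    host: str  # where SQL is processed and data is cached in memory

@dataclass
class StorageManager:
    host: str  # where data is made durable on disk

@dataclass
class Database:
    brokers: List[str] = field(default_factory=list)  # management tier
    tes: List[TransactionEngine] = field(default_factory=list)
    sms: List[StorageManager] = field(default_factory=list)

    def scale_out(self, te_hosts=(), sm_hosts=()):
        """Dial scalability/redundancy up by adding TEs and SMs."""
        self.tes += [TransactionEngine(h) for h in te_hosts]
        self.sms += [StorageManager(h) for h in sm_hosts]

db = Database(brokers=["broker-1"])
db.scale_out(te_hosts=["host-a", "host-b"], sm_hosts=["host-c"])
print(len(db.tes), len(db.sms))  # 2 1
```

The point of the sketch is simply that the TE and SM tiers are independent lists: either can grow or shrink without touching the other, which is what "dialing in" scalability or redundancy amounts to.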

Another aspect is that if you look at this configuration, with multiple TEs and multiple SMs, you can place these in multiple regions, or geographic zones, to achieve geo-distribution requirements as well.  And you can operate multiple databases on a given host by firing up multiple transaction engines and storage managers as well.  So, NuoDB’s multi-tenancy support is another important aspect that just comes out of this architecture.  And finally, the management tier gives you a high level of automation in configuring your TEs and SMs to fit your scalability, redundancy, and other requirements.  So, through the automation layer, you’re able to tell the system what level of redundancy, what level of distribution, and what level of scalability you require, and the system will manage your transaction engines and storage managers appropriately.  So you’re interacting with the database at a high level. 

If you map this architecture into the HTAP problem space, you get a picture that looks like this.  Your operational workload, represented by the read/write transactions on the left, can be mapped to multiple transaction engines as necessary to support that load.  On the right, you can map your analytical workload to transaction engines as necessary to support your analytical workloads, which differ in their characteristics.  Their compute and memory characteristics, for instance.  Longer running queries, higher CPU requirements.  So, the result is that you can map the analytical workloads to hardware that best matches those workloads.  Maybe it’s high-compute, high-memory hardware, which may not be necessary, for instance, for your operational workloads represented on the left.  And supporting this, you can have one or more storage managers to support the combined workloads.  So, in effect, the NuoDB architecture has always been capable, out of the box, of supporting this HTAP model that Barry was talking about, back in 1988.  It’s this distributed architecture, with its multiple nodes being able to be applied to specific workloads, that makes this possible. 
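The workload mapping Steve describes, operational transactions to one set of TEs and analytical queries to another, can be sketched as a simple routing policy.  All of the names here are hypothetical, chosen for illustration; NuoDB's actual mechanism for directing connections differs.

```python
# Hypothetical routing of workloads to dedicated transaction engine pools.
WORKLOAD_TES = {
    "operational": ["te-oltp-1", "te-oltp-2"],  # low-latency hosts
    "analytical":  ["te-olap-1", "te-olap-2"],  # high-CPU, high-memory hosts
}

def route(workload: str, request_no: int) -> str:
    """Pick a TE for this workload, round-robin across its pool."""
    pool = WORKLOAD_TES[workload]
    return pool[request_no % len(pool)]

print(route("operational", 0))  # te-oltp-1
print(route("analytical", 3))   # te-olap-2
```

Because each workload type has its own pool, a long analytical query only competes for CPU and memory with other analytical queries, never with the operational transactions.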

So if you think about what you’re looking for in HTAP from a performance or a throughput perspective, this graph is one way to look at it.  The red line represents your operational throughput as you add in analytical workload.  And in the ideal case, your operational throughput would remain steady, even as you add analytics workload.  So, you might have multiple analytics workloads, with a variety of queries and whatnot, operating against the same live data as your operational workload, but your operational workload really wouldn’t see any competition for resources or data locking.  This is the ideal state. 

Traditional relational databases have a hard time getting to this ideal state for a couple of reasons.  The first is that some databases have locking models that result in long-running queries blocking, or being blocked by, operational write transactions.  So effectively, as the operational workload updates or inserts, the analytical processing might be blocked by the locks on those rows.  Additionally, relational databases typically are unable to scale beyond their single server, single node architecture.  And the result is that the operational queries and the analytical queries basically compete for the same resources.  Now it’s possible that you could scale up with higher end hardware, but that gets expensive and in the end, you still hit a ceiling.  And furthermore, this approach doesn’t work in a cloud environment, where scale out on commodity hardware is the rule.  So traditional databases basically run into either of these two blocking factors, scale or lock issues, and the level of scale that you can achieve in HTAP would result in that curve shown on the previous slide tailing off pretty quickly. 
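The first problem, lock contention, can be seen in a toy model: under a lock-based scheme, a long analytical scan holds shared locks on every row it reads, so an operational write to any of those rows must wait.  This is a sketch of the general problem, not any specific database's lock manager.

```python
# Toy lock table: a long-running scan read-locks rows, blocking writes.
class LockTable:
    def __init__(self):
        self.shared = set()  # rows read-locked by an analytical scan

    def begin_scan(self, rows):
        self.shared.update(rows)  # the scan locks everything it reads

    def can_write(self, row):
        return row not in self.shared  # writers must wait on locked rows

locks = LockTable()
locks.begin_scan(range(1_000))   # analytical query scans rows 0..999
print(locks.can_write(42))       # False: operational update is blocked
print(locks.can_write(5_000))    # True: row outside the scan
```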

So as I mentioned, NuoDB has an architecture that, from the start, has had a natural advantage for supporting HTAP scenarios.  To show this, we ran a set of tests in Amazon EC2, where we configured an EC2 instance with five operational workloads, which were basically running short-lived transactions to update random rows in a 5 million row database.  And then two other EC2 instances, each running up to five analytical workloads, running long-running queries against the same data.  And then two storage manager hosts, supporting the operational and analytical workloads.  These are the results that we saw in our tests.  Now these results are averages, but they’re repeatable across different sizes of workloads.  Basically what you’re seeing is the expected red line on the bottom showing the operational throughput, basically remaining flat, even as we added, starting at zero, 10 analytical workloads over time.  The analytical workloads were added at about 30-second intervals, and you can see that as we added them, the throughput on the analytical side basically followed a more or less linear scale out, approaching 300,000 queries per second.  And the operational throughput remained more or less steady at about 50,000 updates per second.  So, this is pretty fantastic: the NuoDB architecture was able to, out of the box, and with very simple configuration, show the expected result, the ideal result, for HTAP.  And in this case, you know, it shows a pretty dramatic linear scale out, as workloads were added. 

So, two key aspects of NuoDB’s architecture make this possible.  First is the support for MVCC, multi-version concurrency control.  MVCC avoids the locking limitation we talked about earlier for some relational databases.  Basically, with MVCC, the analytical queries see the latest results, even as the operational queries are updating the results or inserting new data.  So the effect is that both workloads, even with their different characteristics, short- vs. long-running queries, are able to share access to the data, even as the operational workloads are operating in a fully ACID-compliant mode.  The second aspect of the test shows how NuoDB’s scale out architecture really comes into play here.  The operational workloads are able to run on hardware that’s best suited for them, whereas the analytical workloads, in this case the 10 workloads we added on two separate EC2 instances, are able to operate on hardware that’s best suited for them, and are able to basically max out that hardware without getting in the way of the operational workload. 
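The MVCC idea can be illustrated in a few lines: each write produces a new row version stamped with a commit time, and a reader sees the latest version at or before its snapshot, so long-running reads never block writes.  This is a sketch of the general technique, not NuoDB's implementation.

```python
# Minimal MVCC sketch: versioned values plus snapshot reads.
class MVCCStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value)
        self.clock = 0      # logical commit timestamp

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def read(self, key, snapshot_ts):
        """Return the latest value committed at or before snapshot_ts."""
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

store = MVCCStore()
store.write("balance", 100)              # operational write, commit_ts = 1
snapshot = store.clock                   # long analytical query starts here
store.write("balance", 90)               # concurrent update, commit_ts = 2
print(store.read("balance", snapshot))   # 100: the reader's snapshot is stable
print(store.read("balance", store.clock))  # 90: new queries see the latest data
```

Because old versions remain readable, the analytical query and the operational update each proceed without waiting on the other, which is exactly the sharing behavior described above.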

An important aspect of this is that the analytical workloads basically experience hot cache behavior in the memory tier, in the transaction engines assigned to them.  So that effectively, the analytical operations enjoy a high cache hit ratio, and that cache pattern does not sort of get in the way of the operational cache behavior.  So effectively, you have multiple caches, each being optimized for its own particular workload, and you don’t get cache pollution.  And finally, the scale out graph shows that we’re able to do this on commodity hardware in the cloud, and, you know, get the kind of, you know, ideal behavior that up to now hasn’t really been able to be demonstrated.  I guess a key aspect of this is that the NuoDB architecture lets you apply as much hardware resource as you need to apply.  So if you need to apply more than two EC2 instances in order to support more than 10 workloads, you can.  The point here is that you’re able to do this with NuoDB.

So, in summary, let me just talk one more time, and summarize the key aspects of the NuoDB architecture that make this possible.  As you think back to the distributed architecture that I showed in an earlier picture, you can see how the multi-node configuration gives you effectively an arbitrary level of scale out performance; we showed that here with our linear graph.  You can add more hardware, and in this particular case, commodity hardware in a public cloud, to get the level of performance you need.  And the other aspect of the elastic architecture is that as you add nodes, you can effectively approach a level of continuous availability that suits your requirements.  If you need more availability, add more nodes, and if you need redundancy, add those nodes in different availability zones or different regions.  And if you need geo-distribution for better latency in your workloads, or for residency requirements, you can deploy the distributed architecture across multiple geographic regions to support those business requirements. 

NuoDB’s multi-node architecture also gives the ability to support multi-tenancy cases very efficiently, where you want to support multiple logical databases in a given data center.  NuoDB does this very efficiently with the benefits of true multi-tenancy, where each database is properly isolated from a security perspective.  All of this is coordinated through what we call no knobs administration, which gives you a high-level, automated, scenarios-based approach to managing your database to achieve scale and performance, continuous availability, geo-distribution, and multi-tenancy.  And when you put all this together, this is what makes it relatively straightforward for us to demonstrate the HTAP scenario on the NuoDB architecture.

Lorita: Great, thanks so much Steve.  So, I’d like to take this time to remind you that there is a questions panel in the GoToWebinar control panel.  Luckily we’ve had some questions that have come in so far.  And so, I want to just go ahead and pose those to Barry and Steve.  Barry, the first question’s for you.  How is HTAP different from real-time analytics, or is it? 

Dr. Barry Devlin: How is HTAP different from real-time analytics?  Well I think HTAP at some level is a technical way of describing a solution.  Whereas I would say real-time analytics is a business need.  So real-time analytics to me is saying I need to be able to understand what’s going on in my business in real-time, react to it, and take action.  How do I support that?  There are many ways.  One way, of course, is within a traditional data warehouse architecture, to try to move the data more quickly from the operational to the informational environment, and back again.  So, micro-batches or real-time transfer of data, message data, between the two environments.  So that’s one way of supporting a real-time analytics environment.  But the way that we’re talking about here, and the way that Gartner has described it, is to say we would like to be able to have a single set of data on which we run both operational and real-time analytic business needs.  And that hybrid transaction analytical environment is, if you like, the technical basis for a new way, or what I think is a better way, of satisfying real-time analytic needs.

Lorita: Great.  I’d actually like to take this time to let everybody know that we will be giving away half a dozen copies of Dr. Devlin’s Business Unintelligence book.  If you are interested, you can go ahead and submit a question in the GoToWebinar panel; otherwise, we’ll be selecting winners for those books randomly.  In the meantime, there’s a follow-up question for you here, Barry.  In the longer-term, do you see HTAP signaling the end of the data warehouse? 

Dr. Barry Devlin: That’s a good question.  You know, the future is very hard to predict, as they say.  I think that in my experience, no technology has ever gone away.  I know that I used to work for IBM, and I know that there are still customers out there using technology which was invented back in the ‘60s and ‘70s.  You know, sometimes things just don’t go away.  So in that sense, I think data warehouses and data marts will continue to be around, not because people have no choice, but because there are times when it does not make sense to move older systems onto new platforms.  If you’ve got a very good system, part of your operational environment, let’s say, which is working well and continues to do the job that it was designed to do, and rather than having to do a big migration in order to bring everything into the one place, you’re happy to simply bring in that level of information in the integrating way that you used to do in the data warehouse of old, then I would say stick with it.  What I think will happen, though, is that in areas where people are using data warehouses and data marts to support real-time or near real-time analytics, those data warehouses and data marts will shrink or will go away, simply because I think this combined environment in the longer term will give the best results and the best management environment to do what you need to do there. 

Lorita: Great, thank you Barry.  Steve, the next question’s for you.  The question is, how do you handle security with multi-tenancy? 

Steve Cellini: That’s a great question.  I mentioned it briefly when I was talking about multi-tenancy, and in effect it’s not really a special case for us.  The NuoDB architecture provides a very efficient model for hosting a given database.  You basically have a TE, a transaction engine, and an SM, a storage manager; these are fairly lightweight processes.  You can spin them up very quickly, and you can shut them down or quiesce them very quickly as well.  So what you end up having with NuoDB is a model where multi-tenancy means giving each user, or each application, its own database.  And the system provides the expected level of isolation: the process isolation and security model for each of those individual databases that you would expect for any other database.  So it’s really not a special case for us, except that NuoDB makes it very efficient, to the extent where we did a little stunt with the HP Moonshot system, where we hosted tens of thousands of databases on a very, very small cluster.  The result was a great demonstration of how efficiently NuoDB can handle this kind of situation, while providing, effectively, a full-on database for each user or each application.  So effectively, it’s not a special case for us: we handle multiple databases in the multi-tenancy case very efficiently, and security is what you’d expect it to be for any given database.  And this is a significant advantage over multi-tenancy approaches used in other, more monolithic systems, where the application basically has to fake out the data model so that it appears that each user or each application has command of an entire database, when that’s not really the case: the data is effectively interspersed and intermingled, with the resulting issues around performance and security.  So we avoid those completely. 
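The database-per-tenant model Steve describes can be sketched in a few lines. This is a toy illustration only, not NuoDB's actual API; the class and method names are invented for the sketch. The point is the structure: each tenant database gets its own lightweight TE and SM processes, so tenant data is isolated by construction rather than intermingled in one shared schema.

```python
# Toy model (illustrative names, not NuoDB's real API) of per-tenant databases,
# each backed by its own lightweight transaction-engine (TE) and
# storage-manager (SM) process.

class Process:
    """A lightweight database process that can be started or quiesced quickly."""
    def __init__(self, kind):
        self.kind = kind          # "TE" (transactions) or "SM" (durable storage)
        self.running = True

    def quiesce(self):
        self.running = False

class TenantDatabase:
    """One isolated database per tenant/application."""
    def __init__(self, tenant):
        self.tenant = tenant
        self.te = Process("TE")   # handles SQL and transactions
        self.sm = Process("SM")   # handles durable storage
        self.rows = []            # this tenant's data only

    def insert(self, row):
        self.rows.append(row)

class Domain:
    """Hosts many tenant databases side by side."""
    def __init__(self):
        self.databases = {}

    def provision(self, tenant):
        # Spinning up a tenant is just starting two lightweight processes.
        self.databases[tenant] = TenantDatabase(tenant)
        return self.databases[tenant]

domain = Domain()
a = domain.provision("tenant-a")
b = domain.provision("tenant-b")
a.insert({"order": 1})
# tenant-b sees none of tenant-a's rows: isolation comes from the
# architecture, not from the application faking out a shared data model.
print(len(a.rows), len(b.rows))  # 1 0
```

Contrast this with the monolithic approach Steve mentions, where one database holds all tenants' rows and the application must filter by tenant ID on every query, with the attendant performance and security risks.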

Lorita: So the next question is with regard to the mixed operational and analytical workload.  With operational and analytics mixed, is data quality an issue?  It seems we did a lot of quality work in the ETL days, the attendee notes.  Barry, why don’t you go ahead and take a stab at that first? 

Dr. Barry Devlin: Yeah, data quality is always an interesting question to talk about.  Clearly, when you start talking about doing real-time analytics, one of the tradeoffs you end up making, whether you do the analytics in one environment or another, is a tradeoff between cleansing to get good data quality and getting data through fast.  So you always have that contention between those two needs.  When I look at the old approach of having both an operational system and an ETL-fed informational system, I’m not sure that it gets you that much more quality if you’re trying to run your ETL so fast that the data lands in the informational system within, let’s say, a couple of seconds, or even a couple of minutes, of it being created in the operational world.  The quality was created by having time to do the ETL, time to bring in data from other places, time to have the cleansing and so on going on in the background.  So if you start stripping out that time, whether you end up with two separate systems or with one single HTAP system is probably less of an issue.  What comes back to me always, and I’ve said this for many years, but it becomes more and more important in the real-time environment, is making sure that you do the quality work where it needs to be done: where you capture the data in the first place.  That’s where you have to really focus on quality. 

Lorita: Great, thank you Barry.  Steve, one of the questions that just came in for you is: what determines which transaction engine should handle a particular query? 

Steve Cellini: That’s a great question.  I mentioned that one of NuoDB’s built-in capabilities for HTAP comes from the ability to have multiple nodes, with nodes dedicated to particular workloads.  So the obvious question is, how do you do that?  How do you make sure that your operational workload targets a transaction engine, with its particular cache and hardware configuration, that’s appropriate for operational workloads, while doing the same on the analytical side?  It’s actually very straightforward.  NuoDB, out of the box, provides several capabilities.  The first is that when you configure a NuoDB domain, you identify the hosts in that domain, and you can tag those hosts with a tag that basically describes the workloads they should take on.  Secondly, in your ODBC/JDBC connection string, say for your operational workloads, you include that tag, and similarly you use a different tag in your analytical connection string.  And then NuoDB, using its in-the-box affinity load balancer, will at runtime match those tags up, so that the operational workloads are directed to the appropriate TEs, and the analytic workloads go to the analytical TEs with their perhaps high-memory or high-compute configurations.  So it’s very straightforward, right out of the box. 
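The tag-matching Steve outlines can be sketched as a small routing function. This is a hedged illustration, not NuoDB's documented interface: the property name "LBTag", the class names, and the host names are all invented for the sketch. It shows only the mechanism he describes: hosts carry workload tags, each connection carries a tag, and an affinity load balancer pairs them at connect time.

```python
# Toy sketch of tag-based affinity load balancing. "LBTag" and all names
# here are illustrative assumptions, not NuoDB's real connection properties.

class TransactionEngine:
    def __init__(self, host, tags):
        self.host = host
        self.tags = set(tags)     # workload tags this TE was started with

class AffinityLoadBalancer:
    """Routes an incoming connection to a TE whose tags match the client's."""
    def __init__(self, engines):
        self.engines = engines

    def route(self, conn_props):
        want = conn_props.get("LBTag")
        for te in self.engines:
            if want in te.tags:
                return te
        raise RuntimeError(f"no TE tagged {want!r}")

engines = [
    TransactionEngine("te-oltp-1", tags=["oltp"]),       # tuned for fast commits
    TransactionEngine("te-olap-1", tags=["analytics"]),  # high-memory box
]
lb = AffinityLoadBalancer(engines)

# The operational and analytical applications differ only in the tag
# carried in their connection properties:
oltp_te = lb.route({"LBTag": "oltp"})
olap_te = lb.route({"LBTag": "analytics"})
print(oltp_te.host, olap_te.host)  # te-oltp-1 te-olap-1
```

The design point is that the application never names a specific host; it declares a workload class, and the balancer resolves that to whatever appropriately tagged TEs happen to be running.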

Lorita: Great.  Barry, the next question is for you.  What about data lakes?  Where does the concept of data lakes fit in? 

Dr. Barry Devlin: Data lakes, oh, that’s one of my less favorite topics in all the world today.  You know, data lakes are, to me, largely a marketing concept.  When we talk to certain vendors, and there are certain analysts who put a lot of store in this idea of a data lake, I really believe that data lakes are a step backwards.  Back in the old days, there used to be this concept of “build it and they will come.”  And in a way, data lakes are that.  They say: let’s put all of the data into one big place.  We won’t worry about what quality it is.  We won’t worry about what shape it’s in.  But if anybody ever needs to get to it, it’ll be there for them.  The mistake that people make, I think, is that when you do that, the only result is that your data lake turns into a data swamp.  You put the data in there, and the processes of understanding it, finding it, reusing it, and making production decisions on it, you just put all those aside in the hope that people will get to the data.  Maybe I’m old-fashioned, but I feel that quality and control, data management and data governance, are very important things.  And to me, they need a very different approach than we see in how at least some vendors and supporters describe this data lake.  The architectural approach that I take, this pillared approach, very strongly distinguishes between different types of data usage and different data needs.  And of course there is some area, and I think of the Hadoop area, or the human-sourced information area of my own architecture, where you say: I do want the ability simply to play with the data, to have it there for people to explore.  But my goodness, if I’ve got process-mediated data that’s my core business information, I certainly don’t want it floating around like plastic bottles in a data lake.  
I want to know what shape it is, I want to know what’s in it.  I want to know where to find it, and I want to be sure that it’s good quality data.  And to me, that’s why I have very strong opinions against data lakes. 

Lorita: Great, thanks Barry.  Steve, the next question is for you.  The example that you showed, that you ran your test on, was on AWS.  Can NuoDB run on other clouds, or on a private cloud environment? 

Steve Cellini: Yeah, absolutely.  NuoDB can basically run wherever you can run Unix or Mac.  So we have customers running in various public clouds, and certainly in private data centers, hybrid cloud situations, VMs, containers, all of the above. 

Lorita: Great.  Unfortunately, we have run out of time.  I’d like to extend a special thanks to Barry and to Steve for leading our discussion on HTAP, and thanks to you, our audience, for attending our session.  We hope that you found today’s conversation informative and useful.  I want to remind you that if you would like a copy of Dr. Devlin’s book, you can send an email to and we’ll add you to the list; we’ll take the first six names that we get from that list.  Otherwise, feel free to access the resources that you see here.  We look forward to speaking to you on our next webinar.  This concludes today’s webinar, and thank you for attending.