NuoDB Breakfast Series - California

ACI Chief Architect, Steve Emmerich, and NuoDB CTO, Seth Proctor, discuss key database points to consider when building or migrating applications for the cloud and modern data center.

Slides available here

Video transcript: 

(Michael): What we are trying to accomplish here is to give you both insights more from an end user’s perspective, but also share with you some of the insights that our CTO from NuoDB has on this particular topic.  For those who are not familiar with NuoDB, we are a scale-out SQL database designed for the cloud and the modern data center.  We are specifically actually designed to support the development and deployment of global applications.  So think about NuoDB’s ability to be able to not only run in your data center as a single logical database, but also run across data centers in the SQL ACID (inaudible).

So without further ado, let me introduce you to Steve Emmerich.  He is the chief architect of ACI Worldwide.  For those who are not familiar with ACI, they are a leading payment system provider on a global basis.  The company’s, I think, something approximately a billion dollars.  They are processing something in the order of 13 trillion transactions per day, so it’s a very, very impressive company with very significant data management challenges that they have solved in the past.  But you will also learn more about what Steve’s team plans to do in the future.  He will then be followed by Seth Proctor, our CTO.  And we will at the end of this session have half an hour, approximately half an hour to do both a Q and A, and maybe have follow-up questions that you can ask for support of the presenters.

OK, without further ado, Steve Emmerich.

(Steve Emmerich): Thank you, Michael.  Hello, everyone.

(Steve Emmerich): Oh, OK.  OK, great.  Let’s see, yeah.  So I’m delighted to be here.  I have a half an hour and more slides than can fit in a half an hour, so I will be rushing through.  Please stop me if you really don’t grasp something; we’ll have a half an hour after Seth speaks for combined question and answer, but I want to make sure you understand the story I’m telling as I go.  So don’t hesitate to ask questions.

Quickly, I’ve been doing this since I was a teenager.  Data has been an obsession of mine, so has performance and parallelism.  And I’ve done everything from operating system development to core database optimizer development to application optimization at various levels.  I also have a strong interest, generally speaking, in very high performance, high risk, high volume systems: how one mitigates the threats to appropriate response time, appropriate throughput, appropriate availability of the applications.  My role at ACI is really to oversee the architectural direction along these what I call “nonfunctional requirements”: scalability, performance, availability, configurability, maintainability and so forth, as well as specific product lines I oversee the architecture of.

So ACI Worldwide, quickly, is a pure-play payments software vendor.  The company started about 40 years ago, in fact, this year is our 40th birthday, and we were doing a variety of communications-related consulting before the ATM and the credit card industries really sprung to life.  We found an application for some of the communication protocol work we were doing in the banking industry as it relates to ATMs, built the first payment switches for it to drive ATMs and connect them to the host systems of the banks, and it took off from there.  My talk is really organized as a way to describe the way in which the architecture of our core products, our flagship products, has evolved since that time, and how the requirements have not changed, essentially, over that time, but that the business context has evolved tremendously.  And what that means in terms of evolving our architectures to adapt to the needs of our customers, the changing deployment architectures represented partly by the arrival of multiple data centers, elastic deployment, the other characteristics of what people loosely refer to as “cloud computing” while still maintaining, you know, very fast transaction response times across networks, across multiple parties in the payment ecosystem, but while still maintaining that high level of availability such that your payment never just, you know, stops; it’s like having dial tone on your phone, it’s always there.  So that’s the theme of this talk.

We -- our products are myriad.  We have over 25 strategic products.  They’re kind of grouped to serve these four constituencies; consumer banking, so that would be credit card payments, ATM payments, mobile payments -- things that consumers rely on.  Transaction banking, which is -- might also be called commercial banking, in which treasurers of companies are concerned with managing the liquidity of their company on a day-to-day basis to pay the bills and to, you know, deal with incoming monies.  And also just general financial management online, as well as retail focused services, such as bill pay offered to consumers through banks and through billers, and online banking that we use every day, probably everybody in this room uses in one form or another with the banks that they do business with.  Retailers, point of sale, payment switches, which then connect to acquirers, which then connect to credit card associations, which then connect to card issuing banks for debit payments, credit payment, prepaid cards -- all the forms of payment that we experience in our daily lives.

So about 30 years ago, when the -- see, you do recognize -- about 30 years ago, we built BASE24.  BASE24 was built on the Tandem NonStop architecture.  For those of you who’ve been around for a while, it’s -- I don’t need to explain it, but some of you may not have heard of it.  It’s a scale-out architecture, essentially very little is shared in the architecture, and there’s redundancy at every level, not a single point of failure.  Applications need to be written specifically for the architecture in order to take advantage of the NonStop characteristics.  But the net of this is that with our BASE24 product, we have been able to support certain banks running this for 20 to 30 years, with not a single outage, and that includes during upgrades.  So no maintenance windows, no down time for ATMs from a switch perspective; there may be, you know, they may have to refill the cash bins or whatever, but from an overall system perspective, no down time.  That’s the kind of reliability and resilience that the banking customers for these kinds of switches require.  And the same technology is applied to our point of sale support; these products are configurable to handle everything from driving ATMs to driving point of sale, and switching to the payment networks for those transactions.  And we have customers like gas companies that own gas stations, because they wanted to put in place their own payment system in order not to pay payment processors to do it on their behalf once they reach a certain scale.  So these products are installed in the context of merchants and in the context of banks, and 21 of the top 25 banks in the world use BASE24, which is the original NonStop implementation, or the successor implementation, which is called BASE24-eps.  This was -- BASE24 was written in a proprietary language for the NonStop architecture; BASE24-eps was written in C++, and is portable to the NonStop architecture, but also the mainframe, Linux and other flavors of Unix.

You’ll see that the level of redundancy doesn’t stop within each box; it also occurs across data centers through asynchronous replication, and there are provisions in the application for assuring the reconciliation of any conflicting updates on both sides.  So what we have here is an example of an application that deals with the data integrity problems that can arise in a distributed environment, at the application level strictly; there’s no reliance on the database layer actually having the intelligence to do the replication.  There are third-party replicator technologies involved, and the application itself does all the reconciliation of the conflicts.  That’s a lot of system programming to get that to happen, and for the response time to stay consistent as transaction volumes grow.
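
To make the idea of application-level reconciliation concrete, here is a minimal Python sketch. The talk does not describe how ACI actually resolves conflicts, so everything here is an assumption for illustration: the `Update` record, the `reconcile` and `apply_remote` helpers, and the last-writer-wins rule (with the site name as a tie-breaker) are hypothetical and stand in for whatever rule a real payment application would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """One write to a record, as shipped between sites (hypothetical shape)."""
    key: str          # record being updated, e.g. an account identifier
    value: int        # value written at the originating site
    timestamp: float  # time of the write at the originating site
    site: str         # originating data center, used only to break ties


def reconcile(local: Update, remote: Update) -> Update:
    """Pick a winner for two conflicting updates to the same record.

    Assumed rule: last writer wins. Equal timestamps fall back to comparing
    site names, so both sites deterministically choose the same winner.
    """
    if local.key != remote.key:
        raise ValueError("updates are for different records")
    return max(local, remote, key=lambda u: (u.timestamp, u.site))


def apply_remote(store: dict, incoming: Update) -> None:
    """Apply an update replicated asynchronously from the other site."""
    current = store.get(incoming.key)
    store[incoming.key] = incoming if current is None else reconcile(current, incoming)


# Both sites accept a write to the same record before replication catches up.
site_a, site_b = {}, {}
write_a = Update("acct-1", 100, 10.0, "A")
write_b = Update("acct-1", 90, 10.5, "B")
site_a[write_a.key] = write_a
site_b[write_b.key] = write_b

# Replication then delivers each write to the other site.
apply_remote(site_a, write_b)
apply_remote(site_b, write_a)
```

After both deliveries the two stores hold the same record, which is the property the reconciliation step exists to guarantee. Last-writer-wins is only one possible policy; financial data often needs richer, domain-specific merge rules.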

In the next generation, BASE24-eps switch, the fact that it was not bound to a single architecture afforded the opportunity to provide scalability and resilience at a number of other levels.  So it’s not dependent on hardware that has high-availability characteristics built in with redundancy at every point.  The redundancy is provided, again, through software which lifts up into the application tier -- and the middleware that supports it -- as much as possible of the responsibility for assuring that there are no single points of failure.  It can rely on clustering at the OS level, and HACMP at the database level, to assure that failures of those components are handled through failover at those tiers.  WebSphere is clustered -- WebSphere MQ, I mean, is clustered for assuring that at the queue level, there’s no single point of failure.  Each one of the boxes in a given data center can actually scale out to as many integrated servers -- we call them processes, essentially -- to handle whatever the workload is that’s required.  We get into situations here where we’re literally handling tens of thousands of transactions per second in our peak situations.  But the sort of target nominal performance requirement for each box is about 2000 transactions per second.  Again, for redundancy across data centers, the asynchronous replication can be used.  It’s important to emphasize that when I talk about replication between data centers, I’m talking about an active-active mode of operation; that means it’s not one side serving as passive backup for the other.  Both are actively processing transactions, actively connected to whatever the end points are that are injecting transactions into the switch, which then routes to the payment networks.  And both are updating each other asynchronously and reconciling whatever conflicts occur.

So that’s the world we live in.  We have literally hundreds and hundreds, I think probably three or four hundred installations of BASE24 classic, and BASE24-eps.

So the interesting thing to look at is, what are the requirements that surrounded the construction of these systems?  So we -- and how were they realized in the architecture?  So completely predictable performance and scalability with no spikes, from a performance and scalability standpoint, is the requirement, no matter how large we scale.  And in this world of a lot of economies gaining middle-class people who are increasing their number of payments extremely quickly, the scalability challenges this decade go well beyond what they were last decade, by large margins.  And we anticipate that that’s just going to continue at an accelerating pace.  At the same time, the response time requirements, which are currently in the 50 to 100 millisecond range as a maximum, regardless of the volume, we feel need to come down at the switch.  And the reason they need to come down at the switch is because there are value-added services that are occurring in the context of the transaction path for payments.  So fraud detection is a big one, for example.  And there are other analytic uses of the data that can be related to offers being made; for example, on the basis of the transaction history of the individual who is making the payment, which requires additional time, but the overall budget for end to end completion of the transaction can’t increase, because people are even more impatient for completing their transactions than they used to be.  So there’s sort of an ever-increasing need to support growing throughput, while at the same time a need for decreasing response time for every component of the transaction, including, of course, at the data tier, which is part of the reason that all of you are here, to hear about what can happen at the database level.

At the availability level, the fundamental requirement is continuous availability.  Nothing can bring the system down, and that means all of the things that you see listed here, including a number of features that you would typically find in the construction of operating systems, or database systems, which have some characteristics similar to operating systems: queuing to prevent spikes, fair distribution of resources to mixed workloads so that nobody is starved out, and throttling when a part of the system gets overwhelmed so that there’s back pressure on the rate at which transactions can be accepted.  So the degree of redundancy needs to be configurable.  Every customer has their own concept of how much redundancy per data center, how many data centers they need to achieve the level of redundancy they need.  And the reach of different institutions’ payment systems varies a lot; it sometimes is local, sometimes it’s regional, sometimes it’s global.
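
The throttling and back pressure described here can be sketched in a few lines of Python. This is a simplified illustration, not ACI's implementation: the `ThrottledIntake` class and its `max_pending` limit are hypothetical, and a bounded in-memory queue stands in for whatever admission control a real switch uses.

```python
import queue


class ThrottledIntake:
    """Accepts transactions only while the backlog is below a fixed bound."""

    def __init__(self, max_pending: int):
        # A bounded queue: once it is full, new work is refused, not buffered.
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, txn) -> bool:
        """Try to accept a transaction; False tells the sender to back off."""
        try:
            self._pending.put_nowait(txn)
            return True
        except queue.Full:
            return False

    def process_one(self):
        """Take the oldest pending transaction, or None if there is none."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None


intake = ThrottledIntake(max_pending=2)
accepted = [intake.submit(t) for t in ("t1", "t2", "t3")]  # third is refused
```

Refusing work at the edge keeps queueing delay bounded for the transactions already accepted, which is what keeps response times predictable when part of the system is overwhelmed.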

So in this context, the application is the first line of defense.  Everything resolves up to the application.  That’s one of the essential questions here, I think, as it relates to NuoDB’s agenda, which is to what extent does the -- is it optimal for the application to be aware of all the defensive measures that take place, or is it appropriate for some of that defense to be delegated to the database tier?  And it’s an interesting question that we’ll get into more as my talk proceeds.

So this has more to do with the application management impact on availability.  Rolling upgrades are a requirement; controlled schema change is a requirement without any downtime whatsoever, and all of the capabil-- all of the housekeeping and the infrastructure required to do that is built into our applications.

So what is happening in the payments world, in general, that is motivating a change in our architecture?  So there are two topics: what’s happening in the payments world and what’s happening in the IT world.  In the payments world, there’s an increasing focus on real time.  It’s very hard today in the U.S. to get money from one place to another quickly.  I expect everybody here has encountered frustration at one point or another, where they want to get money in somebody else’s hands within an hour, and barring doing a wire transfer, which is expensive and cumbersome, it’s difficult.  You can initiate quickly, you can buy your goods and services with debit payments and credit payments and so forth, but you’re not going to settle -- the funds are not going to settle until that night or two or three days later, typically, for consumer payments.  The notion of any to any payments, the notion of being able to -- P to P type payments, anybody to anybody, or business to business, or business to person -- there’s no sense of a global directory of how you figure out how to send money to a particular individual.  What is the identifying information?  Those are the kinds of questions that are facing the payments industry.  And then, of course, cross-border challenges, multiple stops along the way, and so forth.  So the other issue is that within the institutions that support payments, there are lots of lines of business, they each want to defend their turf, their revenue base, so there’s a lot of institutional resistance to change.

So ACI’s approach is to, as much as possible, diminish the boundaries between the products that we offer and allow for the change that needs to take place, while still serving our traditional incumbent customers.  So we need a single architecture that satisfies the needs of both existing business models and future business models, and existing payment schemes and future payment schemes.  So we’re very aggressively moving away from discrete products that serve each line of business towards integrated solutions that serve multiple lines of business, depending on how they are configured, and support new lines of business that might span the historic lines of business, and just sort of break the walls down.  Then in the IT arena, the deployment architectures are clearly changing radically; monolithic big iron-based applications don’t scale well horizontally, they certainly don’t scale to distributed deployments, and elastic deployment of system resources against varying workloads by season is an important thing, as volumes change -- I’m sorry, as the reach of payment systems becomes less local and more global.  So the seasonal growth and contractions in capacity that are required are going to be big swings, and especially in the context of solutions that have components that serve different constituencies.  One line of business may be very dependent on the Christmas season or the holiday season, while another cycle may be the monthly payroll, or the biweekly payroll.  So you get increases in demand at different points in time for different lines of business.  When each line of business was served by a different product, capacity planning could be done individually by line of business; when you bring them all together into an integrated solution, that’s less likely to be possible.  So elasticity becomes very important.
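
A small worked example shows why seasonality pushes toward elasticity once lines of business share one deployment. The monthly figures below are invented purely for illustration (they are not ACI data), and `capacity_summary` is a hypothetical helper: it compares provisioning for the combined peak all year against provisioning that follows demand month by month.

```python
# Hypothetical monthly peak demand (transactions per second), Jan..Dec.
RETAIL = [40, 40, 45, 45, 50, 50, 55, 55, 60, 70, 90, 120]   # holiday-driven
PAYROLL = [80, 60, 60, 80, 60, 60, 80, 60, 60, 80, 60, 60]   # a different cycle


def capacity_summary(*workloads):
    """Return (combined peak, static capacity-months, elastic capacity-months)."""
    # Demand on the shared deployment is the sum of the workloads each month.
    combined = [sum(month) for month in zip(*workloads)]
    peak = max(combined)
    static_total = peak * len(combined)   # provision for the peak all year
    elastic_total = sum(combined)         # provision only what each month needs
    return peak, static_total, elastic_total


peak, static_total, elastic_total = capacity_summary(RETAIL, PAYROLL)
print(f"combined peak: {peak} tps")
print(f"static provisioning is {elastic_total / static_total:.0%} utilized")
```

With these made-up numbers, a deployment sized for the combined peak sits partly idle for most of the year, which is the gap an elastic deployment is meant to close.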
And of course, there is a general trend towards X86 based data centers, you know, regardless of operating system and other elements of the technology stack.  Unix sales are going down, mainframe sales are going down.  Everything else is going down, and X86 is going up -- just a simple fact.

So we serve -- we’re kind of an old-line provider of applications, so we continue to serve the needs of customers who are mainframe-centric, or AIX-centric, or whatever platform they happen to have.  But the secular trend towards virtualized X86-based data centers is very clear.

So how are we doing this?  We’re doing this with service-oriented architecture that encapsulates the functionality of our traditional products, which eliminates the redundancy across our traditional products, due to the fact that there are many components that are very similar, even for different lines of business.  We also are doing it with a common set of architectural patterns to achieve the same consistent nonfunctional requirements across all of these services.  So on the left, you see the points of entry for payment transactions, on the right you see the networks that are -- and the, you know, other end points that we have to service.  And in the middle, you see our universal payments framework supporting all these service-oriented components.  And this is just a similar more component-centric view of the same thing, highlighting the fact that at the bottom, what are the requirements for the data management piece of the stack?  What does everything I’ve said so far about the availability, performance and scalability requirements mean for that piece of the stack?  Will the architecture continue to make the application the main line of defense in achieving those nonfunctional requirements?  Or will some of that work get delegated elsewhere?  That’s one of the fundamental architectural questions that is facing us.

So in that context, we anticipate that as we move to our solution-oriented architecture, there will be data that remains private to the contributing applications that underlie the services that comprise the universal payments architecture.  As you see up at the top, we have private data, but at the same time, some of the data becomes shared across these contributing applications.  Does it -- it seems redundant and inappropriate for there to be account and customer information, for example, maintained in multiple places; if it’s truly an integrated solution, one view of the customer is going to be important.  And likewise, the underlying frameworks have their own metadata repositories, and that becomes what we refer to as services data.  So for entitlements information, for example, associated with individual -- or privileges associated with individual users or accounts or customers, that would constitute data that was associated with our entitlement service.  Fraud is characterized as a shared service, because across any form of payment, there’s opportunity for fraud, and we have fraud products that have their own need for particular forms of the data that are different than the OLTP form of data that the payment systems use for just pushing the transaction through.  So there are going to be representations of data that are different for different purposes.  That’s just -- OK.

So what you see here is kind of a recapitulation of the requirements that I articulated in relation to BASE24 and BASE24-eps, but highlighting the fact that the requirements don’t change as we move to this service-oriented architecture, as we move to integrated solutions.  We still have all of the response time requirements, we have the throughput requirements, we have the availability requirements.  But what’s different?  What’s different is that because all of these components are accessing shared data now, and there’s an incentive for as much of the data to be shared as possible to leverage it across as many components as possible, one gets into a situation where each of the components’ response time and throughput and availability requirements can actually conflict.  If they’re accessing the same database at the same time, reading and writing, and one of them has a certain response time requirement, for example, and another has a different response time requirement, how is one going to ensure that each of them -- each of those requirements is met without a certain degree of isolation, while at the same time you’re actually sharing the data?  You need physical isolation and you need logical sharing.  The degree of durability of different types of data transactions can vary.

So, for example, when you’re sending a billion dollar wire -- Michael made the remark that we send 13 trillion transactions a day, that’s actually not quite accurate, it’s $13 trillion worth of transactions a day -- and a large percentage of that is big wire transfers between banks.  Banks loan each other money every single day to manage their liquidity; just overnight loans, and a billion dollar transfer is a small transfer in some cases; they just will send a billion dollars overnight in order to cover each other’s liquidity needs -- just how the banking system works.  And with a billion dollars being in transit, you do not trust -- the systems that need to support that are not going to trust -- there can’t be a single nanosecond when there’s a possibility of transaction loss.  And that requires a degree of synchronous behavior in the commit protocols and in the durability model that supports those forms of payments.  But if you’re dealing with $5 and up to $50, $100 credit card payments, which occur at much higher volume than the billion dollar transfers, that degree of durability is not worth the cost that’s required.  So a more asynchronous model of durability is suitable.
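
The contrast between synchronous and asynchronous durability can be sketched as follows. This is an illustrative Python model only: the `Ledger` class, the `HIGH_VALUE_THRESHOLD` cutoff, and the in-memory lists standing in for local and remote storage are all assumptions, and a real system would choose the durability mode by payment type and policy rather than by a single dollar amount.

```python
from enum import Enum


class Durability(Enum):
    SYNCHRONOUS = "synchronous"    # remote copy exists before commit returns
    ASYNCHRONOUS = "asynchronous"  # remote copy is made some time later


HIGH_VALUE_THRESHOLD = 1_000_000   # hypothetical cutoff, in dollars


def durability_for(amount_dollars: float) -> Durability:
    """Choose a durability mode from the payment amount (assumed policy)."""
    if amount_dollars >= HIGH_VALUE_THRESHOLD:
        return Durability.SYNCHRONOUS
    return Durability.ASYNCHRONOUS


class Ledger:
    """Toy ledger: lists stand in for local storage and a remote replica."""

    def __init__(self):
        self.local = []
        self.replica = []
        self.backlog = []   # records committed locally but not yet replicated

    def commit(self, payment_id: str, amount_dollars: float) -> Durability:
        record = (payment_id, amount_dollars)
        mode = durability_for(amount_dollars)
        self.local.append(record)
        if mode is Durability.SYNCHRONOUS:
            # Stand-in for waiting on the remote acknowledgement.
            self.replica.append(record)
        else:
            # Return immediately; replication happens in the background.
            self.backlog.append(record)
        return mode

    def drain_backlog(self) -> None:
        """Ship everything that was committed asynchronously."""
        self.replica.extend(self.backlog)
        self.backlog.clear()


ledger = Ledger()
wire_mode = ledger.commit("wire-1", 1_000_000_000)   # billion dollar wire
card_mode = ledger.commit("card-1", 50)              # small card payment
```

The wire is on the replica before `commit` returns, while the card payment is exposed to loss until `drain_backlog` runs; that window is the cost saved by the asynchronous model.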

So what we need from a durability standpoint can vary by different components within the same solution.  So the degree of -- there are trade-offs between durability and scalability, there are trade-offs -- and then we also need simultaneous coexistence of different models for durability and response time and throughput.  So how much of the responsibility for that can be delegated to the data management layer?  It’s an interesting question.  And then all of the degree -- the configurability issues just add up when you have multiple components interacting within a single solution, as opposed to just being able to manage them within a single discrete product.  So to what extent does the database become the first line of defense?  That’s really the essential question that’s being asked here.

So and then jumping to the deployment architecture considerations, that elasticity requirement -- that’s a new set of requirements for us.  And it’s a fundamental architectural change, it asks the question, does this happen -- at what level does this happen?  Is it again the application’s responsibility, sole responsibility?  Does one lock oneself into a set of APIs that are associated with an elastic model for adding resources and subtracting resources from the pool that is offered by a third party?  Does one do it oneself?  Does one create one’s own platform for that purpose, which can then adapt to whatever deployment environment our applications are deployed in?  We deliver it on premise, we deliver it in our own, you know, private cloud-hosted environment, but even that will change as hybrid models and our customers start to demand the use of their own hybrid, you know, cloud environments start to arise, which will happen because their own provisioning -- they’re not going to provision at their peak requirements all the time.  They’re going to want to burst out into other environments at times.  So we have these considerations to think about, and it’s influencing our fundamental thinking around architecture.

So that pretty much concludes my presentation.  I thank you very much for the opportunity to speak with you. (applause)

(Seth Proctor): All right, so thank you, Steve.  I’m going to talk on a related set of topics, and then as Steve said, afterwards we’ll have some time for Q and A, although also, as Steve said, feel free to jump in if there’s something you really urgently want to talk about while I’m speaking.  I think what you just heard Steve talk through is something that probably everyone has either gone through or is going through, or, you know, knows a friend who’s going through and has been there to counsel them and comfort them, which is this evolution of kind of traditional systems, or previous systems, into modern architectures.  And in Steve’s case, kind of particularly, what happens when you have enterprise systems that represent kind of years of experience, not just in functional requirements, but in the nonfunctional requirements?  And what happens when you’re moving out of one type of architecture or one type of systems into kind of these modern architectures we’re building?  And fundamentally, what does that mean?  What is going to be unique about that experience?  What changes?

And it’s not just about, you know, are you using Docker or Kubernetes, or some tool; it’s not just about are you on Google or Amazon or Azure, or something else like that.  You know, it’s not the software per se, all those things are important, but it’s really stepping back and understanding something about why we’re building the architectures we’re building today, what’s unique about them, what you can do with them, where the challenges are, and how that makes you kind of rethink what you need to bake into each system.  One of the things I think is really exciting right now, really cool about what we’re doing in cloud generally is, I think it’s become both a forcing function and an enabler to take the kinds of experiences that Steve was talking about that have been built in high-end enterprise systems for decades, and start commoditizing that, start making that notion of continuous availability, of resilience, of other things kind of par for the course.  And that’s a little bit what I want to talk about, is why that is, why that’s the only sane way to really architect systems today, and what you get out of that, but also where the challenges are.  And essentially, therefore, what’s unique about cloud.  And by cloud, I don’t mean public cloud, I don’t mean private cloud, I don’t mean OpenStack versus something else -- again, this is not a kind of specific software stack conversation today.  It’s really more about, what is it when we’re talking about cloud architectures, or we’re talking about modernizing systems, whether you want to call it cloud or scale-out systems, horizontal systems, scalable systems, distributed systems -- what is it we’re doing today?

And I think just to start, I think there are a kind of a couple common themes that are in all these conversations, kind of regardless of the software that people are particularly employing.  And you heard Steve talk a little bit about this, right?  One is being able to get to an on-demand model, and that means being able to scale out for capacity, but also being able to get resources when you need it to increase availability, to be able to provision things very dynamically, very easily, whether that’s on your own private infrastructure or public infrastructure, or some mix of things.  That necessarily means we’re talking about systems designed with flexibility in mind, flexibility in terms of the software, flexibility in terms of the management model, flexibility in terms of being able to run on commodity.  Steve talked about the challenges moving from very specific high-end hardware that gives you certain capabilities to move towards more commodity -- that gives you a great deal more choice, a great deal more kind of cost effectiveness, but brings with it its own set of challenges that you have to understand kind of what that means to the overall state of your system.  And this kind of world, something that’s on-demand, something that’s highly flexible, brings with it, by definition, a great deal of complexity.  And so if you don’t think right from the start about how to simplify these systems, you’re going to end up with brilliant, beautiful software that no one can use.

So Steve talked about some of those nonfunctional requirements.  He also talked about kind of how the system as a whole is something that can be managed.  And so that’s why, you know, we’re building into every system today notions of monitoring, of management, of provisioning APIs, of clear ways of working with systems as a whole.  And resiliency, I think resiliency is the most important theme in all of this.  Redundancy is something we’ve all understood for a long time; redundancy is having another copy of something somewhere -- typically that’s wasteful, because you have resources running that aren’t doing anything, they’re just there in case there’s a failure.  And to get to those redundant resources, there’s downtime.  You talk about failover, you talk about switching to another site, you talk about going to your backup -- by definition, that’s downtime, that’s loss of revenue or loss of service availability, or loss of any number of things you heard Steve talk about the requirements to always be running.  And the way you’re always running is not just to employ redundancy, but to have this kind of on-demand flexible model that is designed to always be running in multiple places actively, and be able to react to failures, kind of roll with those failures and bring things back online, move things around, predictively understand how a system is behaving, so that the system is resilient and always available.  If you squint, this isn’t anything new.  This is why Telecom is always there -- Steve made the comment about the dial tone, right?  And we don’t really have many phones with dial tones anymore, but, like, phones with dial tones, when they were a thing, the reason the dial tone was always there is exactly because Telecom has designed, for decades, arguably for more than decades, with exactly these design goals in mind.  And this is why the system always works.  And I think that’s what we’re trying to get to.  
Cloud isn’t about, you know, a particular piece of software, it’s not about a particular, you know, open source community, although it’s all of those things.  But fundamentally, it’s about this kind of philosophy; the dial tone is always there.

And we want this for all the reasons Steve talked about, right?  There are all of these incredibly valuable, really awesome things we get out of modernizing architectures, of recognizing that we’re no longer in the client-server era, that we’re building systems in a fundamentally new way, and you get great things out of it.  Steve talked a little bit about, for example, fraud, right?  What is fraud detection?  Fraud detection is really, among other things, a hybrid workload, right, it’s something where you have operational data, but you have to do analysis on it, and you have to do sometimes short-term historical analysis, other times very long-term historical analysis.  Other times, it’s, you know, it’s much more complicated than that, pulling in different data sets from different locations -- that’s a very complicated task.  And in traditional data systems, that’s very hard.

But there are challenges in all of these things, right?  These are great features to have, but it turns out when you distribute a system, when you make the choice to take something off of one server and run it on multiple servers, and then say actually all those servers are active, and then actually some of those servers might be running in physically different locations, also still active -- that brings with it a little bit of complexity.  One thing that’s going to happen is failures.  Failures will happen at an astonishing rate.  When you’re going to commodity, there’s a reason commodity’s cheaper.  Among other things, it’s more likely to fail, and it’s more likely to fail in more interesting ways.  And as you build increasingly complicated systems, some of the failures can get increasingly subtle.  I fiercely love doing testing and development work on Amazon, because it is really simple.  It is just so easy to do it.  I also tell everyone who’s doing any real-world work to test their stuff on Amazon, because I’ve never seen things fail the way I’ve seen them fail on Amazon.  Subtle virtualization in the networking tier, subtle things where it doesn’t report an instance was gone, but clearly, like, it wasn’t there for a second, because suddenly, you know, something isn’t quite right, and your TCP cache isn’t lined up anymore.

And it’s kind of, like, I -- and you have no visibility into it.  You have no idea what just happened.  These failures do happen, and, you know, that’s the trade-off here.  It is also, by definition, much harder to have a global view of your system, right?  When you think about a service as a holistic service, when you saw the pictures that Steve put up of that EPS architecture and all the different pieces, there are lots of different moving pieces.  And so it becomes much harder to get that single view of what is the service, what’s happening?  How do I get insight into that?  How do I make sure it’s working as a whole?  That’s difficult from a management point of view.  It’s also increasingly a critical issue from the point of view of security and data life cycle, because if your service is really a whole lot of different pieces that are disconnected and sharing data around, because that’s how you take last generation architectures and bring them forward, you’ve now got lots of copies of your data.  You may have different copies of your data in different kinds of systems, in different locations.  Do you remember, you know, which pieces of data represent PII?  Do you remember how they were ingested into the system originally?  Do you remember, you know, whether you carried forward the right access control lists and the right audit rules?  So that’s harder.  And kind of, you know, everything else -- I’m not going to go through all of the litany of why distributed computing is both great fun and hard.  If you’re having trouble sleeping tonight, I highly recommend you go read some of the classic problems, you know, go read about, like, the Two Generals’ Problem or something, and you know, I mean, the challenges in distributed computing are really amazing, but there’s a great deal of power there.  And it’s those kind of things from the previous slide that are really valuable that make it worth kind of chasing this, I think.

And we kind of understand how to do distribution in a number of layers in the system, right?  I think probably everyone in this room has worked with load balancers, or has worked with CDNs, has worked with scale-out storage architectures -- I mean, these are things that we all kind of understand.  You know, we understand how to take, like, a web tier and scale it out with lots of application containers.  The challenge, I’m going to argue, the really hard challenge is in scaling the database, scaling that data tier.  And as Steve said, for most people, the data tier is the platform that you build on, and the interesting question then becomes, what can you assume of that core platform?  What can you carry forward into, therefore, what your applications can do, what your operators can assume, what is the model that you can build on?  Why is it hard?  Why is it hard in this cloud world to scale out the database architecture?

Probably lots of reasons; I’ll give my view on kind of a couple of reasons.  One is when you look at kind of the RDBMS, the traditional architecture, you go back to when Steve said he was first working on software -- sorry Steve -- and you look at kind of those original architectures that came out of IBM, came out of other places, kind of the early database architectures, they were exercises in trying to scale very limited resources.  And so you get this architecture that is a very tight coupling between the disk and where you’re actually working with data.  And we’ve carried that architecture forward 40 years, this idea that a database is where your data is, is where the disk is, and so this is great for vertical scale.  But it means if you go onto Amazon and you’re setting up a database, one of the questions you have to answer is how many IOPS do you want to provision?  And that’s not cloud, by definition.  We just talked about that a few slides ago.  Cloud should be on-demand.  Use the capacity you need, get rid of it when you don’t.  The idea that you have to provision IOPS is nonsense.  But traditional databases are all predicated on access to the disk; that’s the scaling point, that’s how you perform well, and that is, therefore, one of the bottlenecks.  That’s part of why it’s hard to do this scale-out.  You can try to apply caching, but caching, with the exception of some very high-end, expensive custom caching solutions, tends to break the consistency model, kind of the semantics of how a database works.  As a result, these highly available systems that Steve is talking about, continuous availability -- this becomes very expensive, it becomes very hard to build; it becomes very hard to evolve your schema over time, to evolve your operational model over time.  Harnessing commodity infrastructure becomes more and more challenging, because that tight coupling to disk is also kind of a single point of failure.
So we’ve built systems over the years where we assume that the disk has to be hardened in some fashion, and that’s not the architectures that we’re building today.

So this is not something that’s designed to scale out.  But that said, as an industry, I think one of the things that best connects everyone in the computer industry is that we tend, by and large to be pragmatists.  We tend by and large to be very good at saying there are things we have to solve, and even though they can’t be solved, they have to be solved.  So we’ll use the tools we have and we’ll figure it out, and we’ll move forward, right?

So I suspect just as everyone in this room probably relates to the story that Steve was telling of trying to take a previous generation system and move it towards new architectures, probably everyone in this room relates with at least one of a couple of common patterns that we just use all the time in our industry, right?  How do we take the last generation architectures, or how do we take the systems that we’ve been building in the last 10 years that, in a lot of ways, mimic those last generation architectures, but then throw out kind of the complexity to make it kind of simpler to solve certain problems and sacrifice other things?  What have we done?

Well, replication -- Steve talked about that.  Replication is a pretty common technique, either passive replication to say, you know, you can do reads over here, or we’ve got this for DR cases -- more and more people really need multi-master replication, and that’s pretty scary because it does require a lot of coordination of those transactions.  It does put a lot of burden back on the applications to understand how something works.  The failure models are pretty hairy, but it’s a model that you can work with.  The other model, of course, is to say rather than trying to have a single database with some truth, let’s split the database into lots of sub-databases.  So sharding, shared-nothing architectures, systems where you can work with aggregate resources very efficiently, you can scale out, but where you no longer have that single, holistic system view, and that means you can’t run transactions across arbitrary data sets, it means you don’t have consistency across arbitrary data sets.  It means in failure models, you don’t really know what the global state of your system is.  It tends to bring with it many independent single points of failure, instead of a single point of failure, which is either awesome or horrible, depending on your point of view.  But clearly, something that we’ve been building on, and clearly something that will work perfectly well for many classes of systems, because the internet, kind of right now, is an existence proof that you can use this to solve certain problems.
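
[Editor's illustration] The point about sharding, that you lose the ability to run a transaction across arbitrary data sets, can be sketched in a few lines of Python. This is a toy under stated assumptions: the `ShardedStore` class, its hash routing, and the `fail_between` flag are all hypothetical and do not describe any particular product.

```python
import hashlib


class ShardedStore:
    """Toy shared-nothing store: each key lives on exactly one shard."""

    def __init__(self, shard_count):
        # Each shard is an independent dict; shards never coordinate.
        self.shards = [dict() for _ in range(shard_count)]

    def shard_for(self, key):
        # Stable hash so the same key always routes to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)

    def transfer(self, src, dst, amount, fail_between=False):
        # Two separate writes; nothing in this toy makes the pair atomic.
        self.put(src, self.get(src) - amount)
        if fail_between:
            raise RuntimeError("simulated crash between the two writes")
        self.put(dst, self.get(dst) + amount)


store = ShardedStore(shard_count=4)
store.put("alice", 100)
store.put("bob", 50)
try:
    store.transfer("alice", "bob", 30, fail_between=True)
except RuntimeError:
    pass
# The debit happened and the credit did not, so money has vanished.
print(store.get("alice"), store.get("bob"))
```

Once the two halves of an update are independent writes, the application has to detect and repair the partial result itself, which is the burden the speaker describes.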

But then when you look at the kinds of challenges that Steve is talking about, obviously this is not an acceptable architecture for those kinds of systems.  And I’m going to argue that increasingly, what everyone is waking up to is that other industries have had it great for years, where there’s just an assumption that continuous availability and resilience and (inaudible) tolerance and all these things are just, like, they’re not high-end, bespoke things.  They’re, like, you know, this is like socialism now.  These are like God-given rights that we should all have in our systems, and that is kind of contrary to the idea of building these carefully laid out, sharded systems.

You can also go, of course, completely in the opposite direction of what Steve was talking about, and get rid of consistency.  And I’m not talking about specifically SQL transactions; I’m not talking about a specific programming model.  I just mean consistency in the sense that consistency is a way of reasoning about your data.  It’s a way of understanding when I interact with my data, what are the rules about that when someone interacts with that same data at the same time I’m interacting with it?  What are the rules about that?  When a system fails, what do you know about the state of your data?  And when the system comes back, what do you know about the state of your data?  That’s all really important, and whatever your programming model is, whatever you like to work with, you know, you can have consistency of some form, or you can completely get rid of consistency.  And it turns out when you get rid of all those rules about consistency, it’s fairly easy to build data management systems that scale out and work in this cloud world.  But it means now your application has to sort out all this complexity; it means your operator doesn’t really have visibility into how to manage those nonfunctional requirements, and you’re kind of back into a whole series of pain points.

So what are some of the side effects of doing this?  I mean, why aren’t these good, other than the fact that clearly I’m biased and I think that these aren’t good approaches?  One thing that happens is that you end up with applications that are very tightly tied to your deployment model, right?  Dev-ops is a thing that everyone is really hot on right now, really excited about.  We’ve got piles of communities and meet-ups and whatever else.  I am a hundred percent behind the idea that developers should understand what it’s like to operate a real system, like, every developer should have to wear the pager for a week.  I know we don’t have pagers anymore, we don’t have dial tones, but, like, the metaphors, hopefully still work.  Like every developer should have to wear that pager for a week, and understand that they get called at 3:00 in the morning, oh my God, it’s an emergency, everything’s broken, what happened?  Oh, I fat-fingered and lost some column.  Oh, so you’re an idiot and you’re waking me up at 3:00 in the morning to fix this?  Like, yes, you’ve got to go fix it now.  Every developer should understand that’s real, it’s painful, it’s really important.  Every operator should have to sit down and, like, learn a programming language, and understand that, like, code isn’t this thing that just happens.  It’s art, it’s hard, it’s work, it’s design.  You know, I mean, getting these communities together singing songs and whatever is a really good idea.  So I’m completely behind that aspect of dev-ops.  The thing I don’t like is that it’s kind of become this necessary thing, because we think to scale systems, we have this very tight coupling.  And that means every time something about your application logic changes, it has an effect on how you actually run your system.  
It means every time you want to change something about your redundancy model, your replication factor, how you do backup, how you’re using resources, what kinds of services you’re running on -- that’s going to have an impact on your application.  And that’s expensive, it’s fragile, it’s by definition not resilient, it’s by definition not on-demand.  It’s very hard to work with, and it becomes very expensive to try to engineer back towards what these cloud architectures are.

So that’s complexity.  And it is more independent pieces, typically, whether we’re talking about replication or we’re talking about sharding, or we’re talking about non-consistent systems, we’re talking about breaking the holistic view of a service into lots of little, independent things.  It’s much harder to interpret failures; it’s much harder to understand before failures, what are the leading indicators?  What’s happening?  How do you get ahead of failures?  And it’s also, again, complexity.  And then there’s this really interesting trend that’s happening right now, and Steve mentioned this in passing, which is that more and more, no one is thinking about running a service in a single location.  Everyone I talk to today is thinking more and more about what does it mean to run with a global deployment model?  And global deployment could mean many different things for many people.  For a lot of people for many years, it’s meant essentially those active-passive or DR cases.  It’s meant replicate your data to a second data center, so if there’s some catastrophic failure, you have your data, you can get to it.  That has slowly evolved into the idea that, well, I want an active-active deployment, I want two data centers that can both take on workload, and therefore I’m not wasting resources, but I have that failover lull.  That’s evolving, though, as public cloud is becoming ubiquitous and it’s running in multiple different locations, as data center kind of sophistication is growing, and as models are growing more and more towards the idea that people are using the same services in multiple locations, that notion of active-active has become, well, maybe it’s not arbitrary active-active, maybe it’s that I’ve got users here on the West Coast and users on the East Coast, and I really want to give those East Coast users a low-latency experience and West Coast users a low-latency experience.
So it’s about running in multiple locations not just for failure modes, but to provide lower latency to your users in each of those locations.

And then as you start to evolve that, you start to think about where you want your data.  Maybe those East Coast users really tend to access most of their data on the East Coast, and West Coast users are happy, like, let the East Coast -- let us, you know, in Boston stay out in Boston, and you all can stay out here and enjoy the nice weather and keep your data here, then you can simplify, right?  It’s cheaper to maintain your data only in one place, to not have to actually pay the physical cost, the dollar cost of replicating data between data centers, or using multiple places of storage.  Or it may be more specific.  It may be that you’ve got users in places where you cannot replicate their data out.  If you’re running a service here and you decide to expand it to Europe and you’ve got German citizens now using your service, and you’ve got PII on behalf of those German citizens, that data has to stay on disk in Germany.  And French citizens -- that data has to stay on disk in France.  And last week, the EU released a whole new series of recommendations that will probably become law in the next year and a half, that codify even more specificity and complexity about how those rules work, and how data has to be maintained.  So that brings with it all of these challenges about how do you build a cloud scale system while at the same time really respecting and understanding all of these local concerns?
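
[Editor's illustration] The residency constraint described above, where some users' data may only live on disk in one place, can be expressed as a placement policy that sits apart from application logic. The sketch below is a minimal Python example; the region names, the `RESIDENCY_RULES` table, and both function names are hypothetical, and the two rules simply mirror the speaker's German and French examples rather than any actual regulation.

```python
# Hypothetical policy table: country code -> the only region allowed to
# hold that user's data on disk (mirrors the examples in the talk).
RESIDENCY_RULES = {"DE": "eu-germany", "FR": "eu-france"}

# Hypothetical regions used for users with no residency restriction.
DEFAULT_REGIONS = ["us-east", "us-west"]


def storage_regions(country_code):
    """Return the regions allowed to store this user's data."""
    pinned = RESIDENCY_RULES.get(country_code)
    if pinned is not None:
        return [pinned]             # pinned data stays in exactly one place
    return list(DEFAULT_REGIONS)    # unrestricted data may be replicated


def can_replicate(country_code, target_region):
    """True if a copy of this user's data may be written in target_region."""
    return target_region in storage_regions(country_code)


print(storage_regions("DE"))
print(can_replicate("DE", "us-east"))
print(can_replicate("US", "us-west"))
```

Keeping the rule in a policy table rather than in application code is one way to change placement without touching the application, which is the flexibility the speaker goes on to argue for.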

And all of these kinds of data management questions, when you’re thinking about global operations, are, in part, trade-offs between kind of where your concerns are in terms of latency or safety and consistency.  And in traditional systems, that really means you’ve got to pick one or the other, you don’t have a lot of flexibility.  And as Steve said, you know, they’ve got this application architecture they’ve built with flexibility, without kind of assumptions that every application is the same, because every application isn’t the same.  Or rather, every deployment of a given application isn’t the same.  You may write one application and then for different users, deploy it in different functions.  So you don’t want kind of one answer to this, you want flexibility here.  And what that really means is that the architectures that are going to work well in the cloud, that are going to address all of these issues that I’ve been talking about, that are going to give you the features that you want, that are going to be the architectures that scale for global deployment, are the ones that are getting away from that original traditional tight coupling of the database and the disk, and thinking about storage as a separate concern from the service that gives you access to that storage.  And when you think about that storage model as being exactly that, as storage, and you think about the service as the thing that lets you monitor all this, that lets you work with a set of data that provides those consistency rules, but gives you the flexibility to decide how, where and why you scale, you start to have core platform architectures that you can use to build all these other interesting application-level features, both the functional requirements and the nonfunctional requirements, without having to embed all of that complexity into your application.  Put --

So what I was saying, before Microsoft rudely interrupted me, was, essentially there are kind of a couple of traditional architectures that we tend to lump things into when we think about distributed databases.  One of them is the shared-disk approach; probably most people in this room have looked at things like RAC or DB2 pureScale, kind of other systems that scale with the idea that essentially you really still have a disk, that you’re optimizing [RAM?], but you make that disk perhaps bigger, more flexible, and then you build infrastructure around it.  Likewise, the shared-nothing, or kind of sharded approach is kind of the other bucket that we tend to lump things into and say, well, that’s the other way to do it, you break things up, you assume independence, whether that’s actually physically doing that, like having lots of independent servers that don’t talk with each other, whether it’s building technologies that assume that that’s your application access model and optimize for it, which is what some databases do today -- lots of things there.  And a lot of people kind of believe that those are the two truths, so those are the two choices, and you’ve got to be one or the other.

And what we’ve done at NuoDB is step back and say, well, maybe not.  Maybe there is a third door.  And if you can find that third door, maybe there’s something really interesting behind it.  And so what we’ve built is something that we would call a durable distributed cache.  It’s a system that is a caching model, it’s something that runs in memory, assumes that running in memory is important, not just for kind of memory performance reasons, but also because when the database is the thing in memory, when it is the service, not the disk, there are all kinds of interesting optimizations you can start to apply that are very hard for those vertically-scaled disk-oriented systems.  But it’s a caching model, which means that it’s pulling things on-demand into memory; it’s not forcing you to hoist everything into memory and keep it there, so it’s much more efficient than kind of full-memory databases.  It’s much more efficient with your resources; it has that on-demand capacity, it has the ability to expand when you need it, and slow down when you don’t.  It’s distributed, meaning that it’s made up of a peer network where there’s an assumption that there is no central coordinator, no central owner for any task, that everything is independent, very egalitarian, and therefore everything is running redundantly, both in terms of your data, but also how you interact with the software.  And so it’s designed with that resilient model in mind, of always being able to fail, always being able to get ahead of failures by bringing new resources online, and being able to drive that from a policy point of view.  And it’s durable, because that’s the D in ACID.  And if you’re not durable, you’re not really providing what a modern data management system needs.  And durability isn’t just about writing data, it’s about kind of availability of that data, it’s knowing how many different places you may be writing a given object.
It’s about understanding the durability model in the face of very simple failures, and also just really awful, super-complicated failures.
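
[Editor's illustration] The "caching model" half of that description, pulling objects into memory on demand rather than hoisting everything in, while writes still reach durable storage, can be sketched as a read-through, write-through cache. This is a generic single-process toy in Python and makes no claim about how NuoDB is implemented; the `OnDemandCache` class, its LRU eviction policy, and the plain dict standing in for disk are all assumptions.

```python
from collections import OrderedDict


class OnDemandCache:
    """Toy read-through, write-through cache in front of a durable store."""

    def __init__(self, durable_store, capacity):
        self.store = durable_store     # stands in for disk; a plain dict here
        self.capacity = capacity       # max number of objects held in memory
        self.memory = OrderedDict()    # key order doubles as LRU order

    def _remember(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)           # mark as most recently used
        while len(self.memory) > self.capacity:
            self.memory.popitem(last=False)    # evict least recently used

    def read(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        value = self.store[key]        # cache miss: pull the object on demand
        self._remember(key, value)
        return value

    def write(self, key, value):
        self.store[key] = value        # write to the durable store first
        self._remember(key, value)     # then keep the hot copy in memory


disk = {"a": 1, "b": 2, "c": 3}
cache = OnDemandCache(disk, capacity=2)
for key in ("a", "b", "c"):
    cache.read(key)
print(list(cache.memory))   # only the two most recently used keys remain
```

The distributed half of the design (peers, no central coordinator) is deliberately out of scope here; the sketch only shows why memory use tracks the working set rather than the full data size.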

At a high level, when you squint at NuoDB, this is really what one of our running databases looks like.  It’s not a set of independent databases that are doing replication, it’s not a master coordinator model with some other things kind of sitting off to the side.  It’s a peer-to-peer network that’s made up of independent processes that know how to work very efficiently with each other, that understand kind of what each other peer is doing at any given time, and therefore when one peer is doing some work, who else might care about that?  And therefore, it’s a thing that knows how to minimize the coordination; it’s a thing that understands how to work very efficiently with that cached information to run all of the right coordination and consistency protocols.  It’s a system that takes advantage of the on-demand caching model, so that as you scale out across geographies, you have some kind of dynamic locality of access to your data: your East Coast users tend to be there, application patterns follow the path of the sun.  Users tend to travel and access their data in one location or another, but not the same locations at the same time -- all of those are access patterns that form natural locality; those locality patterns exhibit themselves in our on-demand caching model.  Our protocols run based on knowledge of where data lives in cache, and so this is a system that scales out very efficiently for these kinds of cloud architectures.

And as Michael said, you can kind of summarize that by saying, NuoDB is a standards-compliant, traditional ACID SQL database; it can do all the things that, say, an Oracle or a DB2 can do in terms of massive multi-way joins, indexes, very complicated transactions, schema, you know, all that good stuff, but designed assuming it’s all of this cloud stuff that we really care about.  It’s about resiliency, it’s about replication, it’s about kind of always being available, not just in the case of failures, but in the case of expected things, like rolling upgrades and being able to run those things with no downtime.  I won’t say that NuoDB will give you 30 years of uninterrupted performance, because that would be massive hubris, but I will say that that’s the design goal that we’ve set ourselves at NuoDB, is to build an architecture that is designed to do that thing.  And I think a goal is something you aim for maybe more than something you achieve, and from that point of view, that is what we have aimed for, is exactly that set of challenges that Steve was talking about, both functionally and non-functionally.

So, summary -- I think when we’re talking about architecting for the cloud, we’re talking about some things that are really deep and very exciting, really cool.  The fact that just anyone can sit down and say the baseline for my system should be redundancy and should be kind of resiliency in the face of failure, and should be certain security models -- that’s an awesome place to be.  That’s not where we were even a few years ago, and that’s really exciting.  To get that, you have to take that leap past kind of the last-generation technologies, and understand what are the distributed architectures, what are the things that are designed for on-demand capabilities?  You have to think about the layering and the abstractions, you know, if you’re on Twitter at all, not a day goes by that you don’t see someone going micro-services, micro-services, micro-services -- yes, micro-services, awesome.  And for those of us who have done microkernels, that’s not a new idea; like, starting small and layering and building is, like, I think a really good design idea.  Big monolithic systems -- not such a fan of.  That goes doubly true in these kinds of architectures.  But at the same time, you know, building that thing and assuming that it will change, and assuming that, you know, five years from now, we’re going to have amazing hardware that we never knew about, that we’ll understand new things about software that we didn’t have, that the requirements will completely change and something that you assume you had to do in one place now can be pushed down or pushed up the stack.  So those kinds of layering are important, not just kind of to help with the failure models, to help with the monitoring, but to be able to, over time, evolve these systems and keep running.

If you’re interested in NuoDB specifically, is the place where you can read all about what does the product do.  You can find our documentation, you can download the product, you can read our engineering blogs to dive much more deeply into all these topics.  I think I’m going to stop there, I’m going to say thank you, Steve, and get him back up to answer some questions, and I’m going to say thank you to everyone for coming here for breakfast.  So thank you.  (applause)

So I think we’re going to take questions for a little while, and if folks have questions, we can answer them.  If people want to take offline, we’ll be here for a bit.  Happy to take things one-on-one.  But if anyone has a question, raise your hand.  Yes?

M3: Yes, maybe just to set up the question of 15 seconds of -- cloud scale, obviously, shakes out the application program models, not just, you know, we do the same thing, but we try to do it on multiple computers with the latency and resilience modes.  Like an example, if I went most likely through our connection, I am connected to one big (inaudible) in this room.  And if I want to write something like an (inaudible) team type of research that would find all these possible connections through (inaudible), the model of (inaudible) SQL, and here’s the results, say the first degree of connections and the next and the third, and if you embed it into one, it becomes probably not real.

So there’s a tremendous failure in the cloud scale [APR?] based on SQL, (inaudible) people to understand is, a few people probably who would be able to set up models for, you know, (inaudible) and such.  But if you want to move further, there’s a road map.  So NuoDB has a value proposition (inaudible).  But next step, are you planning to do anything more of an evolution in advancing APMs?  Things like (inaudible) on top of NuoDB, where all sorts of internal data structures algorithmically or computationally-friendly could be mapped through some sort of capability, and then form into the NuoDB, and then into persistence?

(Seth Proctor): Yeah, absolutely.  That’s a great series of questions. I’ll try to give you the short version, and then maybe we can talk offline in the longer version.  And then I’ll give you an answer of what we’re doing at NuoDB, and then I’d love to get Steve’s thoughts generally on kind of where he sees evolution happening.  Yes, one of the things that we’re very focused on at NuoDB is exactly what you just described, which is that things are changing in a radical way.  And 30 years ago, maybe everyone was doing SQL.  Today SQL is still a massive market, and it’s still growing at a massive rate.  And that’s, you know, tens of billions of dollars market, you know, single billions growing every year.  But even with that popularity, there are lots of good reasons to use different programming models, to use different data models.  There are different kinds of questions we want to ask, different assumptions about how you optimize those questions, and so Graph is a great example of that.

Part of what we’ve done with NuoDB specifically is, we’ve built a system that internally does not look like a relational database.  Internally, what we’ve built is what really looks like a distributed object system that understands kind of all the rules of ACID and consistency of transactions, but again, not a consistency model that’s tied specifically to SQL.  We understand that, you know, versioning is important for a multi-version concurrency control system.  We understand that, therefore, isolation levels and kind of visibility of data is important.  We understand that most things care about some form of indexing.  Most things care about, you know, kind of some form of structure, and that over time that will have to evolve.  But SQL to us, really, is the choice we made as a front end, because we believe SQL is ubiquitous, we believe SQL, for the kinds of problems Steve is talking about is a requirement to be able to move things forward.  And for many people, that’s still the right programming model for them.
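
[Editor's illustration] The remark that versioning matters for multi-version concurrency control, and that it drives what data a reader can see, can be illustrated with a toy versioned store. The Python sketch below is generic and single-threaded; `VersionedStore`, its commit counter, and its snapshot rule are assumptions made for illustration and do not describe NuoDB's internal object model.

```python
import itertools


class VersionedStore:
    """Toy multi-version store: writers append versions, readers pick a snapshot."""

    def __init__(self):
        self._clock = itertools.count(1)   # monotonically increasing commit ids
        self._versions = {}                # key -> list of (commit_id, value)
        self._last_commit = 0

    def commit(self, key, value):
        # A write never overwrites; it adds a new version with a new id.
        commit_id = next(self._clock)
        self._versions.setdefault(key, []).append((commit_id, value))
        self._last_commit = commit_id
        return commit_id

    def snapshot(self):
        # A reader remembers the newest commit it is allowed to see.
        return self._last_commit

    def read(self, key, snapshot_id):
        # Return the newest version committed at or before the snapshot.
        for commit_id, value in reversed(self._versions.get(key, [])):
            if commit_id <= snapshot_id:
                return value
        return None


store = VersionedStore()
store.commit("balance", 100)
reader_view = store.snapshot()      # reader starts here
store.commit("balance", 80)         # a later writer adds a new version
print(store.read("balance", reader_view))        # reader still sees its snapshot
print(store.read("balance", store.snapshot()))   # a new reader sees the update
```

Because old versions are kept, the earlier reader and the later writer never block each other; which version a reader sees is what an isolation level ultimately pins down.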

M3: And it’s the only (inaudible), so (inaudible) results that sequentially (inaudible).

(Seth Proctor): That’s right.  And we support all the standard, you know, JDBC, ODBC, you know, so Hibernate and .NET Entity and kind of all those other ORMs, so for sure people are programming with those.  We have been building, actually for some time internally, support for some of the alternate models you talked about.  So we actually have been building a graph representation.  We have been looking at a document representation.  And we have been understanding kind of more interestingly than just, how do you say, you know, this is a software that can do something different.  You know, it’s a scale-out graph, or something like that.  I think much more interesting is, what does it mean when something that is a scale-out relational database is also a scale-out graph database?  And LinkedIn is a great example.  There are things in LinkedIn that are clearly relational problems, and things that are clearly graph problems.  And then there are traversals that probably are both.  And what does that look like?  And what does that mean?  And those are different kinds of optimizations; different things you do in terms of caching, in terms of prefetch, in terms of distribution -- Well, I have a gentleman’s agreement with our product manager; he doesn’t talk vision and I don’t talk roadmap.  And that’s why we get along so well.  But we’d be happy kind of offline, I’d be happy if you’re more interested in kind of the details of the roadmap.  Steve, I don’t know if you want to -- we’re using the mic so we can get the recording.  I don’t know if you want to comment just on general latency (inaudible) evolution.

(Steve Emmerich): We have over 50 million lines of production code that we license to our customers.  We have millions of lines of database-related code, I’m sure, embedded in there, probably 99 percent of which, or at least 95 percent of which is bound to use SQL.  The -- and that’s not going to change.  It’s not going to change forever, I would guess, because we will be living with much of that legacy code in one form or another, you know, encapsulated differently, translated to more modern environments differently.  But the data remains -- you know, the data access methods will remain similar.  But we will be changing our offerings.  We will be improving our offerings.  We will be inventing new products that don’t -- are not necessarily bound to the SQL language as we go.  And so I think that 95 percent figure will probably diminish over time, but relatively slowly.  So for a company like us that comes to the table with a lot of existing revenue-producing products that support the evolution of our architecture, you know, the revenue supports the revolution of our architecture, we’re very interested in having a full, rich support for the SQL language, but also, in some of the new options which might improve scalability or response time for specialized use cases.

(Seth Proctor): If there are no other questions, we’ll be around for a while to answer questions one-on-one.  And there’s, I think, more breakfast back there, so thanks again, everyone.  This was fun. (applause)