Why Automation Matters
We announced something pretty exciting this week. If you’ve been following along on the blog you’ve seen that our Swifts release is now available. We spent a lot of time on it and the result is some really great stuff at the core of our system. What kind of stuff? A rockin’ new optimizer paired with richer indexing capabilities for one. A new set of management tools & APIs for another.
Those tools and APIs tie in with the first GA version of our automation capabilities. I’ve talked about automation on this blog in the context of several releases. I’ve also talked a little about how this gets to the usability of the system as a whole. Distributed systems are complicated, and operating one as it scales out is more so. This is why we’re so passionate about the experience of using NuoDB and making that operational experience as simple as possible.
Automation, however, is about more than just making it easier to manage our system. Yes, this is a critical element, but it goes deeper. This week I was speaking (panel and session) at Cloud Expo West, which gave me the chance to catch up with peers & friends and talk about emerging trends. One of the key themes emerging in cloud computing is policy-driven definition. Whether at the infrastructure, platform or service level, this essentially means moving from ad hoc orchestration to problem definitions that let our systems optimize our resources.
Bigger than a breadbox?
Why is policy-driven automation such an important trend? Well, for one thing, it helps us understand the size and scope of the problems we’re trying to solve. It also brings the right resources to bear.
Let’s say you’re spinning up host instances on a public cloud. You could choose specific instance sizes, start some known number of hosts and then watch to see if you’ve got it right. You must do a lot of measurement up-front to guess at sizing, and you’ll need to do ongoing measurement and sampling to understand whether you’re still sized correctly as your workloads change. This is why operational intelligence is another key theme right now.
The thing is, you probably started with some simple goal in mind. You knew something about the problem you were trying to solve or had a vague idea what resources similar applications require. If it’s a development or testing system then it’s solving different problems than a production service. If you’re going into production then you have some goals about numbers of users, or throughput, or uptime or other measurable details. You probably even know how to express those goals to other humans but may not know how to translate them into which resources you need to allocate.
In NuoDB, we’re driving automation based on something we call templates. This interface is still in its early days, but we’re taking a cue from SQL and designing with Declarative Definition in mind. Part of the power of SQL is that it separates expression of a query from how to actually run it: an optimizer takes the problem you want to solve and figures out the best way to make it happen. Likewise, our approach to automation is through an expression of a Service Level Agreement. In this model, you specify what requirements need to be met and we figure out how to optimize resources to address those requirements.
This certainly makes provisioning a database faster and managing a database simpler. It also cuts costs by optimizing resource utilization. The flexible architecture that NuoDB is built on means that as your workloads change the system can adapt automatically. The declarative nature of our policies means that as your requirements change you can update your definition without affecting your running application. This is a big piece of why the industry is moving in the direction of policy-driven automation.
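To make the declarative idea concrete, here is a minimal sketch in Python. It is not NuoDB's template format or API. The `Sla` fields, the `plan_node_count` function and the per-node throughput figure are all hypothetical, chosen only to show the split between stating requirements and deriving resources from them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    """What the database must deliver, not how. All fields are hypothetical."""
    target_tps: int          # sustained transactions per second to support
    tolerated_failures: int  # node failures to absorb while still meeting target_tps
    max_hosts: int           # resource boundary the plan may not exceed


def plan_node_count(sla: Sla, tps_per_node: int) -> int:
    """Derive a node count from the SLA, much as an optimizer derives a query plan.

    tps_per_node is an assumed, measured capacity of a single node.
    """
    needed_for_load = math.ceil(sla.target_tps / tps_per_node)
    total = needed_for_load + sla.tolerated_failures  # spare capacity for failures
    if total > sla.max_hosts:
        raise ValueError(
            f"SLA needs {total} nodes but only {sla.max_hosts} hosts are allowed"
        )
    return total


if __name__ == "__main__":
    production = Sla(target_tps=10_000, tolerated_failures=1, max_hosts=8)
    print(plan_node_count(production, tps_per_node=3_000))
```

If the requirements change, only the `Sla` value changes; the caller never hard-codes a node count. That is the same separation SQL draws between a query and its execution plan.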
Relaxed under pressure.
Another reason that data-driven automation is so important has to do with resiliency. When you choose to distribute a system, to scale it from a single location to multiple places in a network, you’re trying to address specific challenges. Typically, one of these is service availability. By running on multiple computers a service may be able to sustain more failures and still keep running.
Some services distribute by assigning ownership of specific tasks or data to specific servers. When a server fails the service as a whole keeps running but with some temporary or permanent loss to a subset of the overall capability. Techniques like failing-over to replicas are common here. Some systems implement multi-master approaches, but these are usually supported at the cost of latency and overall complexity. In any case, this often affects the ability of the system to react quickly to failures or changes in requirements.
NuoDB is built in a different fashion, driving transactions at an in-memory tier with no notion of a master or owner for any task or piece of data. New nodes can be added and existing nodes can be stopped with little overall impact on the system. From a policy and automation point of view, this means that the system can react very quickly to failure by spinning up new nodes (or using pre-allocated resources) to pick up the slack. Without some initial definition of the SLA a database is trying to address, you couldn’t easily automate this process and know you’re still staying within your resource boundaries.
Building a resilient system, however, isn’t just about reacting to failure. If you have operational data that can be analyzed in real time then you can look for leading indicators, and if a service is driven by an SLA it can be tested against operational data to see if adjustment is needed. Beyond reacting to failure, the service can now get ahead of failures or recognize that a system is under-provisioned before you hit hard scaling limits. This same model also supports compacting resources when they’re not needed or dynamically moving service components if it results in better overall resource utilization.
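The react, anticipate and compact behaviors described above can be sketched as one small decision function. This is an illustration under assumptions, not how NuoDB implements automation. The `Observation` type, the `reconcile` function and the utilization thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """A snapshot of operational data. Both fields are hypothetical."""
    healthy_nodes: int
    avg_utilization: float  # 0.0 (idle) to 1.0 (saturated)


def reconcile(obs: Observation, min_nodes: int, max_nodes: int,
              high_water: float = 0.8, low_water: float = 0.3) -> int:
    """Return the change in node count: positive adds nodes, negative removes them."""
    # React to failure: restore the minimum the SLA requires.
    if obs.healthy_nodes < min_nodes:
        return min_nodes - obs.healthy_nodes
    # Leading indicator: scale out before hitting a hard limit, within bounds.
    if obs.avg_utilization > high_water and obs.healthy_nodes < max_nodes:
        return 1
    # Compact resources when they are not needed, never below the minimum.
    if obs.avg_utilization < low_water and obs.healthy_nodes > min_nodes:
        return -1
    return 0


if __name__ == "__main__":
    # One node of three has failed, so one replacement is requested.
    print(reconcile(Observation(healthy_nodes=2, avg_utilization=0.5),
                    min_nodes=3, max_nodes=6))
```

The `min_nodes` and `max_nodes` bounds stand in for the SLA. Without them, the function could not decide whether adding a node keeps the deployment inside its resource boundaries.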
How do you scale a service and keep it available? Obviously you need redundancy, but just as important you need resiliency. Once a service is automated it’s free to make operational decisions that make the best use of available resources. As deployments grow increasingly distributed and complex it is this resilient nature that guarantees availability of a service while keeping management simple.
I just want to tell you both good luck. We’re all counting on you.
If you don’t recognize the quote then stop reading this right now and go watch the movie Airplane! All set now? Good.
Comedy aside, a lot of our most complex systems, like airplanes, are designed with auto-pilots in mind for a reason. Yes, in the worst failure cases you want something that can kick in and maintain control. In the normal running of the system, however, there is far too much complexity to expect a human to keep track of it all. You want to keep a human in the loop, but you want that human to stay effective and focus on the key decisions.
More to the point, resource management decisions increasingly cross domains. If a service is available in multiple data-centers then different admin teams may be involved. It’s almost certainly true that different security, access and audit rules must be considered. Automating a service shouldn’t just be about how resources like compute or bandwidth are used or what latencies have to be met. In the architectures we’re designing today an SLA must include rules about where data can be stored and how it can be accessed. The SLA is a statement about the running system today and a guarantee for forensics needs tomorrow.
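As a rough illustration of an SLA that carries storage rules as well as performance targets, the sketch below filters candidate hosts by region and keeps a record of every rejection for later audit. The `Host` type, the `place` function and the region labels are hypothetical and do not reflect any real NuoDB interface.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Host:
    name: str
    region: str  # hypothetical label such as "eu-west"


def place(hosts: Iterable[Host],
          allowed_regions: Set[str]) -> Tuple[List[Host], List[str]]:
    """Split hosts into those the policy allows and an audit trail of rejections."""
    eligible: List[Host] = []
    audit: List[str] = []
    for host in hosts:
        if host.region in allowed_regions:
            eligible.append(host)
        else:
            # Record why the host was skipped so the decision can be reviewed later.
            audit.append(f"rejected {host.name}: region {host.region} not allowed")
    return eligible, audit


if __name__ == "__main__":
    candidates = [Host("a", "eu-west"), Host("b", "us-east")]
    eligible, audit = place(candidates, allowed_regions={"eu-west"})
    print([h.name for h in eligible])
    print(audit)
```

Returning the audit trail alongside the placement decision reflects the point above: the policy describes the running system today and leaves evidence for forensics tomorrow.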
Once the system knows what problems it is trying to solve, there’s an opportunity for tuning and optimization. For instance, response-time latencies could inform caching or pre-fetch strategies. Contrasting throughput requirements with predicted request rates can suggest a change in load-balancing or prioritization. These kinds of observations are often localized in widely distributed systems, which makes it hard for any one admin to manage changes directly or understand what global effect a local change can have.
Visibility across physical and conceptual domains is where policy-driven automation really starts to shine. In this model many complex, nuanced details can be captured through simple problem-statements that are given to the system to drive. Running a service on auto-pilot frees the operator to focus on the key decisions. It unlocks resiliency and enables real-time optimization. That auto-pilot can figure out how to size & shape your deployment so you’re free to work on the problems you’re trying to solve. And yes, like I said at the start, it makes for a much nicer user experience.
So what’s next?
If you’ve clicked through any of the links I provided then you may see this isn’t a new theme for us. NuoDB was designed from the start as a simple, composable service that lends itself to automation. We’ve spent a long time working on what problems we can help you solve with this in mind. The industry as a whole is starting to focus in this direction. We’re excited about what this means, and what a core service like a database running on auto-pilot gives you. As we head into 2015 you’ll see more on this theme. In the meantime, please take a look at our latest release and give us feedback!