Don’t Jump in the Data Lake

32. 47. 19. 7. 85.


Congratulations! I just gave you five very important, valuable numbers. Or did I?

If they were tomorrow’s winning Powerball numbers, then certainly. But maybe they’re monthly income numbers. Or sports scores. Or temperatures. Who knows?

Such is the problem of context. Without the appropriate context, data are inherently worthless. Separate data from their metadata, and you’ve just killed the Golden Data Goose.

If we scale up this example, we shine a light on the core challenge of data lakes. There are a few common definitions of data lake, but perhaps the most straightforward is "a large object-based storage repository that holds data in its native format until it is needed," or alternatively, "a massive, easily accessible, centralized repository of large volumes of structured and unstructured data."

True, there may be metadata in a data lake, thrown in along with the data they describe – but there is no commonality among such metadata, and furthermore, the context of the information in the lake is likely to be lost, just as a bucket of water poured into a real lake loses its identity.
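To make the point concrete, here is a minimal illustration in Python. The field names and values are hypothetical, not drawn from any particular system: the same five numbers from the opening are worthless on their own, but become interpretable once their metadata (what they measure, in what unit, from what source, and when) travels with them.

```python
# Context-free values: scores? temperatures? incomes? Nobody can say.
raw_values = [32, 47, 19, 7, 85]

# The same values with their metadata attached (hypothetical field names).
described_values = [
    {
        "value": v,
        "metric": "daily_active_users",
        "unit": "thousands",
        "source": "app_telemetry",
        "date": f"2023-01-0{i + 1}",
    }
    for i, v in enumerate(raw_values)
]

print(described_values[0])
# {'value': 32, 'metric': 'daily_active_users', 'unit': 'thousands',
#  'source': 'app_telemetry', 'date': '2023-01-01'}
```

Pour only the left-hand list into a lake and the right-hand context is gone for good.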

If data lakes present such challenges, then why are we talking about them, and worse, actually implementing them?
The main reason: because we can.

With today’s data collection and storage technologies, and in particular Hadoop and the Hadoop Distributed File System (HDFS), we now have the ability to collect and retain vast swaths of diverse data sets in their raw, as-is formats, in hopes that someone will find value in them down the road "just-in-time," where any necessary processing and analytics take place in real time, at the moment of need.

This era of data abundance is relatively new. Only a handful of years ago, we had no choice but to transform and summarize diverse data sets ahead of time in order to populate our data marts and data warehouses.

Today, in contrast, we can simply store everything, ostensibly without caring about what such data are good for or how they are organized, on the off chance that someone will come along and find a good use for them.

Yet as with the numbers in the example above, data by themselves may not be useful at all. Simply collecting them without proper care may not only yield large quantities of useless information; it may also strip the potential usefulness from information that would otherwise have had value.

The Dark Underbelly of Big Data

This dumbing down of the information we collect is the dark underbelly of the big data movement. In our mad rush to maximize the quantity of data we can collect and analyze, we risk sacrificing the quality of those data, in hopes that some new analytics engine will magically restore that quality.

We may think of big data analytics as analogous to mining for gold, separating the rare bits of precious metal from vast quantities of dross. But we’ll never find our paydirt if we strip away the value during the processes of data collection and analysis.

Perhaps we should go back to the Online Analytical Processing (OLAP) days, when we carefully processed and organized our information ahead of time in order to facilitate subsequent analysis. Even with today’s big data technologies, there are reasons to remain with such a "just-in-case" approach to data management, rather than the just-in-time perspective of data lake proponents.

In reality, however, this choice between just-in-case and just-in-time approaches to data management is a false dichotomy. The best approach is a combination of these extremes, favoring one or the other depending on the nature of the data in question and the purpose that people intend to put them toward.

Moving Up to the Logical Layer of Abstraction

To understand how to approach this decision, it is essential to move up one layer of abstraction. We don’t want to focus our efforts on physical data lakes, but rather logical ones.

With a physical data lake, we can point to the storage technology underlying the lake and rest assured we’ve precisely located our information. With a logical data lake, we can consider multiple data stores – perhaps in one location, or possibly scattered across the cloud, static or in a state of flux – as a single, heterogeneous data lake.

As a result, we only move information when there’s a need to do so, and we only process such information when appropriate, depending upon the goals of the task at hand. Such movement and processing can happen beneath the logical abstraction layer, invisible to the users of analytics tools.
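As a rough sketch of what that logical layer might look like, consider the following Python example. It is a hypothetical illustration, not any specific product’s API: datasets from multiple physical stores are registered in a single logical catalog, the metadata (context) stays with each entry, and the data themselves are moved and processed only when someone actually asks for them.

```python
from typing import Any, Callable, Dict


class LogicalDataLake:
    """Presents heterogeneous physical stores as one logical catalog.

    Data stay in place; a loader is invoked only when a caller asks
    for a dataset, so movement and processing happen beneath this
    abstraction layer, invisible to users of analytics tools.
    """

    def __init__(self) -> None:
        # Catalog maps a logical name to its metadata and a lazy loader.
        self._catalog: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, metadata: Dict[str, Any],
                 loader: Callable[[], Any]) -> None:
        """Register a dataset without moving or reading it."""
        self._catalog[name] = {"metadata": metadata, "loader": loader}

    def describe(self, name: str) -> Dict[str, Any]:
        """Return the metadata (context) kept alongside the data."""
        return self._catalog[name]["metadata"]

    def fetch(self, name: str) -> Any:
        """Move and process the data only now, at the time of need."""
        return self._catalog[name]["loader"]()


# Usage with two hypothetical stores, one on-premises and one in the cloud.
lake = LogicalDataLake()
lake.register(
    "daily_sales",
    metadata={"store": "on-prem HDFS", "format": "CSV", "unit": "USD"},
    loader=lambda: [["2023-01-01", 1200.0], ["2023-01-02", 950.0]],
)
lake.register(
    "sensor_readings",
    metadata={"store": "cloud object storage", "format": "JSON", "unit": "°C"},
    loader=lambda: [{"ts": "2023-01-01T00:00Z", "value": 21.4}],
)

print(lake.describe("daily_sales"))   # the context comes first ...
print(lake.fetch("daily_sales"))      # ... the data move only when needed
```

The design choice worth noting is that the catalog never copies anything at registration time; it records where the data live and what they mean, which is precisely the context a physical, pour-everything-in lake tends to lose.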

The key to making the logical data lake work properly – at speed and at scale – is to implement the appropriate architecture. And yet, as our technology advances increasingly rapidly, our essential architectural best practices tend to lag behind. The result is advanced technology that nobody knows how to use properly. Such is the situation today with data lakes.

Yes, we can pour all our multi-structured data into a data lake. We have the technology. Filling the lake is no problem. But unless we leverage appropriate architecture to maintain the context for those data, we’ll have nothing in our lake but a whole lot of numbers.

Jason Bloomberg is the leading industry analyst and expert on achieving digital transformation by architecting business agility in the enterprise. He writes for Forbes, Wired, DevX, and his biweekly newsletter, the Cortex. As president of Intellyx, he advises business executives on their digital transformation initiatives, trains architecture teams on Agile Architecture, and helps technology vendors and service providers communicate their agility stories. His latest book is The Agile Architecture Revolution (Wiley, 2013).
