If you are tuned in to the latest technology concepts around big data, you’ve likely heard the term “data lake.” The image conjures up a large reservoir of water—and that’s what a data lake is, in concept: a reservoir. Only it’s for data.
A data lake holds a vast amount of raw, unstructured data in its native format.
Because the data stays in its native format, all you need is a device that supports a flat file system, which means you can use a mainframe if you want. The data is then moved to other servers for processing. Most enterprises go with the Hadoop Distributed File System (HDFS), because it is designed for fast processing of large data sets and is already used in the big data environments where a data lake is likely to be deployed.
That support for native-format data brings a key benefit. “If I want to get a ridiculous amount of data and figure out what to do with it later, that fits in the mantra of what we do with data lakes now,” says Michael Hiskey, head of strategy at Semarchy, a vendor of data management software.
“We have things known and unknown[, and the approach] people on the data lake side are taking [is to] keep everything that might be interesting and take order out of madness later. We could not guess today what’s valuable from the things I’m throwing away, but that could turn out to be interesting in the future,” he says.
Jake Stein, CEO of Stitch, an ETL service that connects multiple cloud data sources, echoed the future-proofing sentiment. “If you’re not sure when you’re going to use the data and it’s not important to have subsecond access and want to store it in a low-cost form, the data lake is the right format. It’s often a case of if you don’t capture the data now, you will never get it again, so it’s important to future-proof yourself in that aspect.”
Data repositories are nothing new; data warehouses have been around for decades. And while it is natural to compare data warehouses to data lakes, there are fundamental differences that separate data warehouses from data lakes, ranging from the kind of data stored to how it is processed.
One of the key differences between a data lake and a data warehouse is that a data lake does not require special hardware or software, unlike a data warehouse.
As noted, a data lake holds a vast amount of raw, unstructured data in its native format, whereas the data warehouse is much more structured into folders, rows, and columns. As a result, a data lake is much more flexible about its data than a data warehouse is.
That’s important because of the 80 percent rule: Back in 1998, Merrill Lynch estimated that 80 percent of corporate data is unstructured, and that has remained essentially true. That in turn means data warehouses are severely limited in their potential data analysis scope.
Hiskey argues that data lakes are more useful than data warehouses because you can gather and store data now, even if you are not yet using parts of it, and then go back weeks, months, or years later to analyze old data that might otherwise have been discarded.
A flexibility-related difference between the data lake and the data warehouse is schema-on-read vs. schema-on-write. A schema is a logical description of the entire database, with the name and description of records of all record types.
A data warehouse applies schema-on-write, so you have to know exactly how to structure the data before you save it. That means a lot of preparation before intake, or at least before storage. By contrast, data lakes apply schema-on-read, so you can format the data as you read and process it. Schema-on-read means you can throw everything into a bucket, like log files, web files, or things with no meaningful structure, and then figure it out later.
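To make the contrast concrete, here is a minimal Python sketch, not a production design. The sample events, the `events` table, and the function names `write_to_warehouse` and `read_from_lake` are all hypothetical, and an in-memory SQLite table and a plain list stand in for a real warehouse and a real lake.

```python
import json
import sqlite3

# Hypothetical raw events as they might arrive from a log source.
RAW_EVENTS = [
    '{"user": "ana", "action": "login", "ms": 120}',
    '{"user": "bo", "action": "search", "query": "lakes"}',
    'not even json',
]


def write_to_warehouse(lines):
    """Schema-on-write: the table layout is fixed before anything is stored."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE events (user TEXT NOT NULL, action TEXT NOT NULL, ms INTEGER NOT NULL)"
    )
    rejected = 0
    for line in lines:
        try:
            rec = json.loads(line)
            # Every record must supply exactly the columns chosen up front.
            db.execute(
                "INSERT INTO events VALUES (?, ?, ?)",
                (rec["user"], rec["action"], rec["ms"]),
            )
        except (ValueError, KeyError):
            rejected += 1  # data that does not fit the schema is lost at intake
    return db, rejected


def read_from_lake(lake, fields):
    """Schema-on-read: raw lines are interpreted only when a query needs them."""
    rows = []
    for line in lake:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # unreadable for this reader, but the raw line stays in the lake
        rows.append({name: rec.get(name) for name in fields})
    return rows


# The "lake" here is just the untouched raw lines.
lake = list(RAW_EVENTS)
db, rejected = write_to_warehouse(RAW_EVENTS)
print("warehouse rows:", db.execute("SELECT COUNT(*) FROM events").fetchone()[0])
print("rejected at write time:", rejected)
print("lake view:", read_from_lake(lake, ["user", "query"]))
```

In this sketch the warehouse turns away anything that does not match its columns, while the lake keeps every line and lets a later reader decide which fields matter. The price is that the lake reader parses the raw data on every query.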
“A data warehouse is highly structured. You have to really understand the data before you do anything on it,” said Joe Wilhelmy, director of data engineering at the American Association of Insurance Services (AAIS). “With a data lake, you can bring it iteratively through a maturity cycle from raw source data to structured projection. You can see it along the way [and] don’t have to be beholden to data engineers and IT to productize that data before it’s usable.”
Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When someone runs a business query against certain metadata, all the data carrying matching tags is then analyzed to answer the query.
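The identifier-plus-tags idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular product works: the `LakeObject` and `Catalog` classes, the tag names, and the use of random UUIDs as identifiers are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    payload: bytes  # the data in its native format, never reshaped
    tags: dict      # extended metadata describing the payload
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class Catalog:
    """A toy in-memory catalog; a real lake would persist this somewhere."""

    def __init__(self):
        self._objects = {}

    def put(self, payload, **tags):
        # Assign a unique identifier and attach the metadata tags at intake.
        obj = LakeObject(payload, tags)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def find(self, **wanted):
        # Select every object whose tags match all of the requested values.
        return [
            obj
            for obj in self._objects.values()
            if all(obj.tags.get(key) == value for key, value in wanted.items())
        ]


catalog = Catalog()
catalog.put(b"GET /index.html 200", source="web", kind="log")
catalog.put(b"<html>...</html>", source="web", kind="page")
catalog.put(b"42.1", source="sensor", kind="reading")

# Only the objects tagged source="web" are handed on for analysis.
for obj in catalog.find(source="web"):
    print(obj.object_id, obj.tags)
```

The payloads are never inspected during the lookup. The query works purely from the tags, which is why the quality of the metadata matters so much in a lake.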
Unlike data warehouses, data lakes don’t have an underlying database. Instead, they use a flat file system. With a database, you have to choose the data and columns before you write to it. The trade-off is that inserting data into a database might take a while, but queries then run a lot faster than in a data lake, which has to process the data as it is read.
“With a data lake, you can put data into a store any way you like. That allows you to write data with a flexible schema and query later, but orders of magnitude slower,” said Stein. “The one element those servers don’t do well is metadata management. Things like what goes in which folder, when is it aged out. You have to roll your own when doing a service like that.”
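As a rough idea of what "rolling your own" metadata management might involve, the Python sketch below covers the two chores Stein mentions: deciding which folder data goes in, and deciding when it is aged out. The `raw/<source>/<date>` layout, the function names, and the 30-day retention window are assumptions made for illustration only.

```python
from datetime import date, timedelta


def partition_path(source: str, day: date) -> str:
    """Decide which folder a day's worth of data from one source goes in."""
    return f"raw/{source}/{day.isoformat()}"


def aged_out(paths, today: date, max_age_days: int):
    """Return the folders whose date is older than the retention window."""
    cutoff = today - timedelta(days=max_age_days)
    expired = []
    for path in paths:
        # The last path segment is the ISO date written by partition_path.
        folder_day = date.fromisoformat(path.rsplit("/", 1)[-1])
        if folder_day < cutoff:
            expired.append(path)
    return expired


folders = [
    partition_path("web", date(2024, 1, 1)),
    partition_path("web", date(2024, 3, 1)),
]
print(aged_out(folders, today=date(2024, 3, 10), max_age_days=30))
```

Encoding the date in the folder name keeps the retention check cheap, since nothing inside the folders has to be opened to decide what can be removed.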