The Right Data Storage Strategy: What you need to know about Data Warehouses and Data Lakes to make the right data storage decision for your data science project
Is it just about Data Warehouses or Data Lakes?
Choosing the right data storage strategy for data science is important. A lot has been written about Data Warehouses and Data Lakes, and about which approach is better for data scientists to build, train, test and deploy their Artificial Intelligence (AI) and Machine Learning (ML) algorithms. Complicating the issue are the latest data management advancements, which are blurring the line between the Data Warehouse and the Data Lake and offering the possibility to scale efficiently to exabytes.
In this post, we will quickly go over the use cases that Data Warehouses and Data Lakes were traditionally designed for, and the new data management technologies data scientists can now choose from. We will then focus on which approach is better suited for AI.
Data Warehouses and Data Lakes are both used to store large amounts of data. Both use data extracted from transactional systems, IoT devices and external data sources. Both can keep large amounts of historical data, so data scientists can perform trend analysis and compare today's numbers with past data.
The data warehouse data storage strategy we know
Data Warehouses were designed for querying and advanced analytics. They use highly structured relational data models, well suited to efficient high-speed queries (reading the data). The purpose of the data in the Data Warehouse is pre-defined to meet specific business goals, so only useful data is collected from the source systems; it is then cleaned and processed with those specific business use cases in mind before being stored. But is it the right data storage strategy? Data Warehouses are highly structured: they need to be carefully designed, they take a long time to update (writing the data), they are not easily modified, and they don't scale as naturally as Data Lakes.
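To make this schema-on-write idea concrete, here is a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in for a real warehouse engine. The table and column names are hypothetical; the point is that the structure is fixed up front and records are cleaned before they are stored.

```python
import sqlite3

# A tiny warehouse-style star schema: structure is defined up front
# (schema-on-write), and data is validated before it is stored.
conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse engine
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        category   TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        sale_date  TEXT NOT NULL,                 -- ISO 8601 date
        amount     REAL NOT NULL CHECK (amount >= 0)
    );
""")

def load_sale(row: dict) -> None:
    """Clean and normalize a raw record before it enters the warehouse."""
    amount = round(float(row["amount"]), 2)   # normalize precision
    sale_date = row["date"].strip()[:10]      # keep only YYYY-MM-DD
    conn.execute(
        "INSERT INTO fact_sales (product_id, sale_date, amount) VALUES (?, ?, ?)",
        (int(row["product_id"]), sale_date, amount),
    )

conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
load_sale({"product_id": "1", "date": "2024-01-15T09:30:00", "amount": "19.991"})

# Fast, structured reads are what this model is optimized for.
for row in conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category
"""):
    print(row)
```

The trade-off shows up on the write side: changing this schema later means migrating tables, which is exactly why Data Warehouses are slow to modify.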
The data lake data storage strategy we know
Data Lakes were designed to store (very) large amounts of raw data. The purpose of the data in the Data Lake doesn't necessarily need to be pre-defined; users access the data and explore it however they see fit. In most cases, data flows from the source systems to the Data Lake in its natural state, or nearly untransformed. All data is welcome: data that may never be used and data that might someday be used. Data is then stored in a flat, unstructured architecture, which enables rapid updates (writing the data) and allows the Data Lake to scale massively and horizontally. But because of this design, Data Lakes are not efficient for high-speed queries (reading the data) and might not be the right data storage strategy for data science.
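For contrast with the warehouse sketch above, here is a minimal sketch of lake-style ingestion in Python. A local directory stands in for an object store such as S3, and the event payloads are hypothetical; records land exactly as received, with no schema imposed at write time.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LAKE = Path("datalake/raw/events")  # stand-in for s3://bucket/raw/events

def ingest(event: dict) -> Path:
    """Land a record as-is: no cleaning, no schema (schema-on-read).
    Any structure is imposed later, by whoever reads the data."""
    ts = datetime.now(timezone.utc)
    # Partition by arrival date so later jobs can read subsets of files.
    out_dir = LAKE / f"date={ts:%Y-%m-%d}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{ts:%H%M%S%f}.json"
    out_file.write_text(json.dumps(event))
    return out_file

# Heterogeneous payloads are all welcome, whether or not they are ever used.
ingest({"sensor": "t-101", "temp_c": 21.7})
ingest({"user": "alice", "page": "/pricing", "referrer": None})
```

Writes are trivially fast and the layout scales horizontally, but any query now has to scan and parse raw files, which is why reads are the weak point.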
The new meta warehouse we need to know about
Some organizations opt for both a Data Warehouse and a Data Lake, in order to take advantage of both models and cover every use case data scientists might come up with. The two can coexist just fine, but this approach quickly becomes doubly expensive to maintain and scale.
A more economical data storage strategy involves applying more structure to the Data Lake in order to speed up queries. It entails building a Meta Data Warehouse on top of the Data Lake. The Meta Warehouse indexes the data in the Data Lake and applies structured views. The Data Lake can be built using open-source systems like Apache™ Hadoop®, or cloud object stores like AWS® S3, Microsoft® Azure Data Lake Storage or Google Cloud Storage (GCS). These systems can handle any data type and scale very well. Apache™ Hive®, which provides structured views of the data, can be used for the Meta Warehouse. Other systems, like Snowflake® or Delta Lake from Databricks®, also provide these kinds of hybrid solutions. But these architectures have shortcomings. Organizations find themselves needing to clean the data from their operational systems after it has been ingested into the Data Lake, and as the Data Lake grows larger and larger, cleaning it up becomes very expensive.
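As a sketch of what a "structured view over the lake" looks like in practice, the PySpark snippet below registers a Hive external table over Parquet files already sitting in the lake. The bucket path, table name and columns are hypothetical, and a configured S3 connector is assumed; the key point is that only metadata is written, and no data is copied.

```python
from pyspark.sql import SparkSession

# Assumes a Spark installation with Hive support and a configured
# S3 connector; all names and paths below are hypothetical.
spark = (
    SparkSession.builder
    .appName("meta-warehouse-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Register a structured view over raw files already in the lake.
# Only metadata goes into the Hive metastore; the data stays put.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_events (
        event_id STRING,
        amount   DOUBLE,
        event_ts TIMESTAMP
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION 's3a://example-lake/raw/sales_events/'
""")

# Discover partition directories that already exist in the bucket.
spark.sql("MSCK REPAIR TABLE sales_events")

# Downstream users can now query the lake with plain SQL.
spark.sql("""
    SELECT event_date, SUM(amount) AS revenue
    FROM sales_events
    GROUP BY event_date
""").show()
```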
The Hadoop Distributed File System (HDFS), the founding layer of the Hadoop framework, was designed as a distributed file system that works with larger blocks of data than traditional file systems, in order to achieve faster I/O operations on very big data sets. Hive maintains a "map" that keeps track of where the data is stored, i.e. which data sits in which file. It also manages SQL queries against the underlying storage layer, e.g. HDFS. Hive is great for terabyte-scale databases, but it runs into limitations when your data grows to petabytes or exabytes. Although Hive, in most setups, runs I/O operations on data subsets (as opposed to the whole Data Lake), it still stores all the metadata, including all the different schemas, in one centralized metastore. This centralized-metastore architecture can become a bottleneck and prevent Hive from scaling efficiently.
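Continuing the hypothetical sales_events table from the previous sketch, the snippet below shows the subset-reading behavior: a filter on the partition column lets the engine read only the matching directory of files. Note, though, that resolving which files those are goes through the single central metastore on every query, which is exactly where the bottleneck forms at very large scale.

```python
# Partition pruning: filtering on the partition column limits the
# scan to one directory of files, but the central metastore is still
# consulted to resolve schemas and file locations for every query.
daily = spark.sql("""
    SELECT SUM(amount) AS revenue
    FROM sales_events
    WHERE event_date = DATE '2024-01-15'
""")
daily.show()
```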
This blog post is an excerpt from an article posted by Paul Sinaï on Towards Data Science. Read the full article. Discover the ForePaaS machine learning platform.
For more articles on cloud infrastructure, data, analytics, machine learning, and data science, follow me (Paul Sinaï) on Towards Data Science.
Get started with ForePaaS for FREE!
Discover how to make your journey towards successful ML/Analytics painless.
The image used in this post is a royalty-free image from Unsplash.