The New Generation Data Lake

The petabyte architecture you cannot afford to miss


This blog post is an excerpt from an article by Paul Sinaï published on Towards Data Science. Read the full article.


The volumes of data used for Machine Learning projects keep growing. Data scientists and data engineers have turned to Data Lakes to store very large volumes of data and extract meaningful insights. Data Lake architectures have evolved over the years to scale to hundreds of terabytes with acceptable read/write speeds. But most Data Lakes, whether open-source or proprietary, have hit a performance/cost wall at the petabyte scale.

Scaling to petabytes with fast query speeds requires a new architecture. Fortunately, a new open-source petabyte architecture is here. The critical ingredient is a new generation of table formats offered by open-source solutions like Apache Hudi, Delta Lake, and Apache Iceberg. These components enable Data Lakes to scale to petabytes while keeping queries fast.


To understand how these new table formats come to the rescue, we need to know which components of current Data Lake architectures scale well and which do not. When a single component fails to scale, it becomes a bottleneck and prevents the entire Data Lake from scaling to petabytes efficiently.


We will focus on the open-source Data Lake ecosystem to identify which components scale well and which can prevent a Data Lake from scaling to petabytes. We will then see how Iceberg can help us scale massively. The lessons learned here also apply to proprietary Data Lakes.


Continue reading the full article on Towards Data Science.

