What are Data Pipelines?
This is the 1st part of a series of two blog articles called “Data Pipelines Vs. ML Pipelines – Similarities, and Differences”. To read the second part, please click here.
Enterprises of all sizes and across numerous industries are now looking at Machine Learning (ML) to help them better understand and serve their customers, gain new production and sales insight, and get an edge on their competitors. But building an ML solution is a complex undertaking. Before embarking on an ML project, doing your homework is highly recommended. An excellent place to start is understanding what Data Pipelines are and how they differ from ML Pipelines. This two-part blog series will look at the differences and similarities between these two types of pipelines.
For simplicity, we will not distinguish Artificial Intelligence (AI) from Machine Learning (ML) and Deep Learning (DL), as their pipelines are similar. We will use the term ML to encompass all three.
How are data pipelines and machine learning pipelines similar?
The Data Pipeline, used for Reporting and Analytics, and the ML Pipeline, used to learn and make predictions, have many similarities. Data Engineers build Data Pipelines for Business Users, whereas Data Scientists construct and operate the ML Pipeline. Both pipelines access data from corporate systems and intelligent devices and store the collected data in data stores. Both go through data transformation to scrub the raw data and prepare it for analysis or learning. Both keep historical data. Both need to be scalable, secure, and hosted on the cloud. Both need to be monitored and maintained regularly.
However, several essential characteristics distinguish them. To better understand them, we will first review the main features of the Data Pipeline before describing in detail the ML Pipeline in part 2 of this blog series.
Definition of a data pipeline
The Data Pipeline comprises several specific modules and processes designed to enable reporting, analysis, and forecasting capabilities. The Data Pipeline moves data from an enterprise’s operational systems to a central data store on-premise or in the cloud. Data from various connected devices and IoT systems can also be added to the pipeline for specific business cases.
Continuous maintenance and monitoring are essential to make the Data Pipeline modules and processes run smoothly and correctly. Problems that arise must be quickly resolved, and the systems they rely on (software, hardware, and networking components) must be kept up to date. Data Pipelines are also often adjusted to reflect business changes.
[Figure: The Data Pipeline]
As shown in the figure above, the basic architecture of a Data Pipeline can be split into three main modules: Data Extraction, Data Storage, and Data Access. We quickly look at each of these modules below.
1. What is data extraction in the data pipeline?
Retrieving data from the enterprise operational systems and connected devices is the first module of the Data Pipeline. At this point of the process, the pipeline collects raw data from numerous separate enterprise data systems (ERP, CRM), production systems, and application logs. Extraction processes are set up to extract the data from each data source.
Two types of extraction mechanisms are possible. Batch processes ingest the data as sets of records, selected according to criteria defined by the Data Engineers. They can run on a set schedule or be triggered by external events. Streaming is an alternative data ingestion paradigm in which data sources automatically pass along individual records or units of information one by one. Most enterprises use streaming ingestion only when they need near-real-time data.
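To make the difference between the two ingestion mechanisms concrete, here is a minimal Python sketch. The record layout, the function names `ingest_batch` and `ingest_stream`, and the batch size are all hypothetical, chosen only for illustration. A real pipeline would read from an ERP, a CRM, or a message queue rather than from an in-memory list.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical source records; a real pipeline would read these from
# an operational system or an application log.
SOURCE_RECORDS = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 25.5},
    {"id": 3, "amount": 7.25},
]


def ingest_batch(records: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Batch ingestion: hand records over in sets of up to batch_size."""
    batch: List[Dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def ingest_stream(records: Iterable[Dict]) -> Iterator[Dict]:
    """Streaming ingestion: pass each record along individually."""
    for record in records:
        yield record


batches = list(ingest_batch(SOURCE_RECORDS, batch_size=2))
events = list(ingest_stream(SOURCE_RECORDS))
```

With three source records and a batch size of two, the batch path produces two sets of records, while the streaming path produces three individual records.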
Once the raw data is ingested, there are two possible next steps, depending on an enterprise’s business use case and operational infrastructure. The first possibility is to transform the raw data before loading it into a central data store. The second is to load the raw data into a central data store first and transform it afterwards.
The first possibility is more conventional and commonly referred to in the Data Warehouse process as Extraction, Transformation, and Loading (ETL). In this case, the data is transformed to match a unified data format (defined by the enterprise’s business needs) and then loaded into a central data store, called a Data Warehouse. Enterprises with ERP and CRM systems do not necessarily need to duplicate their raw operational data in a separate raw data store. They could instead keep it at the operational level in their ERP and CRM data stores.
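The ETL order of operations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the Data Warehouse, and the `raw_orders` rows, the `orders` table, and the cleaning rules in `transform` are hypothetical.

```python
import sqlite3

# Hypothetical raw rows extracted from two operational systems
# that format the same customer differently.
raw_orders = [
    {"source": "erp", "customer": " Acme Corp ", "total": "120.50"},
    {"source": "crm", "customer": "acme corp", "total": "79.5"},
]


def transform(row: dict) -> tuple:
    """Map a raw row to the unified format assumed for this sketch."""
    customer = row["customer"].strip().title()
    total = float(row["total"])
    return (row["source"], customer, total)


# Extract -> Transform -> Load: only transformed rows reach the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (source TEXT, customer TEXT, total REAL)")
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [transform(row) for row in raw_orders],
)
warehouse.commit()

total_by_customer = warehouse.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer"
).fetchall()
```

Because the customer names are unified before loading, the warehouse query sees a single customer rather than two differently spelled ones.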
The other, more modern approach, called Extraction, Loading, and Transformation (ELT), loads the raw operational data directly into a central data store, called a Data Lake, before transforming it. This approach is gaining in popularity as data storage costs drop. In this paradigm, the Data Lake becomes the central raw historical data store, and the data transformation takes place while the raw data is moved from the Data Lake to the Data Warehouse.
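A minimal ELT sketch in Python follows, again under stated assumptions. For brevity, a single in-memory SQLite database plays both roles: the hypothetical `lake_orders` table stands in for the Data Lake and `dw_orders` for the Data Warehouse. In practice these would usually be separate systems.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Load: raw records land in the lake untouched (every value kept as text).
db.execute("CREATE TABLE lake_orders (source TEXT, customer TEXT, total TEXT)")
db.executemany(
    "INSERT INTO lake_orders VALUES (?, ?, ?)",
    [
        ("erp", " Acme Corp ", "120.50"),
        ("crm", "ACME CORP", "79.5"),
    ],
)

# Transform: cleaning happens while data moves from the lake to the warehouse.
db.execute("CREATE TABLE dw_orders (source TEXT, customer TEXT, total REAL)")
db.execute(
    """
    INSERT INTO dw_orders
    SELECT source, UPPER(TRIM(customer)), CAST(total AS REAL)
    FROM lake_orders
    """
)
db.commit()

lake_rows = db.execute("SELECT COUNT(*) FROM lake_orders").fetchone()[0]
raw_customer = db.execute(
    "SELECT customer FROM lake_orders WHERE source = 'erp'"
).fetchone()[0]
dw_rows = db.execute(
    "SELECT customer, total FROM dw_orders ORDER BY total"
).fetchall()
```

The raw rows remain in the lake exactly as they arrived, which is what makes the lake usable as a historical record, while the warehouse holds the cleaned copy.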
In either case, what matters is transforming the raw data into actionable data that can be used for decision-making. Raw data from disparate sources is often inconsistent and must be brought into a common format before it can be queried and analyzed together, which makes the data transformation process critical.
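As one small example of such an inconsistency, different source systems often write the same date in different ways. The Python sketch below normalizes them into a single representation. The list of formats is an assumption made for illustration, and the day-first reading of slash-separated dates is a choice that a real pipeline would have to confirm per source.

```python
from datetime import date, datetime

# Assumed formats used by the hypothetical source systems.
# "%d/%m/%Y" reads slash-separated dates as day first.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def normalize_date(value: str) -> date:
    """Try each known format in turn and return a single date type."""
    cleaned = value.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


raw_dates = ["2021-03-05", "05/03/2021", " 20210305 "]
normalized = [normalize_date(d) for d in raw_dates]
```

All three raw values resolve to the same date, so records from the three sources can be joined or compared on it. Values in an unknown format raise an error instead of being guessed.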
2. What is data storage in the data pipeline?
Whether an enterprise decides to store its raw data in a Data Lake before transferring it into a Data Warehouse depends on its business needs and policies. Data Lakes are not a good fit for interactive Analytics and Reporting. They’re more useful for traceability and auditing. Another motivator is that storage and computing have become relatively cheap, allowing enterprises to reap the benefits of having both a Data Lake and a Data Warehouse built on top of it.
The typical data store structure of the Data Warehouse is a Relational Database Management System (RDBMS). These specialized databases are essential, as they contain all of an enterprise’s cleaned and mastered data in a centralized location, serving as an enterprise’s single source of truth.
Another essential part of the Data Warehouse is Metadata. Metadata adds business context to the data. It also complements the data with additional information about the transformations applied to the raw data before loading it to the Data Warehouse.
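A minimal sketch of what such metadata might look like for a single warehouse column is shown below. The `ColumnMetadata` class and all of its fields are hypothetical. Real warehouses typically keep this information in a data catalog or in dedicated metadata tables.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnMetadata:
    """Hypothetical metadata record for one warehouse column."""
    name: str
    business_definition: str  # business context for the data
    source_system: str        # where the raw data came from
    transformations: List[str] = field(default_factory=list)


order_total = ColumnMetadata(
    name="total",
    business_definition="Order value, taxes included",
    source_system="erp",
)

# Record each transformation applied before loading into the warehouse.
order_total.transformations.append("trimmed whitespace")
order_total.transformations.append("cast text to REAL")
```

Keeping the list of transformations next to the business definition lets a Business User see both what a column means and how it was derived from the raw data.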
3. What is data access in the data pipeline?
Data access is the last module in the Data Pipeline. Data access tools are used to query the data in the Data Warehouse. There are many different data access tools, each designed to meet the specific business needs of a department within the enterprise.
Access tools range from ad-hoc query and reporting tools to more advanced parameter-driven Analytics applications. These Analytics applications, also referred to as Business Intelligence applications, are designed to hide the complexity of the data from Business Users and Executives by presenting it in understandable business terms. Other notable Analytics applications are Online Analytical Processing (OLAP) applications. An OLAP application stores subsets of the Data Warehouse in specialized Data Marts for department-specific analytics, such as finance or sales. Other, more refined OLAP solutions use these Data Marts for Data Modeling and Analytical Forecasting.
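The idea of a Data Mart as a department-specific subset of the Data Warehouse can be sketched as follows. The `orders` table, its columns, and the `sales_mart` name are hypothetical, and an in-memory SQLite database stands in for the warehouse. A real OLAP tool would add much more, such as multidimensional cubes.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE orders (department TEXT, region TEXT, total REAL)")
dw.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("sales", "EMEA", 100.0),
        ("sales", "EMEA", 50.0),
        ("sales", "APAC", 30.0),
        ("finance", "EMEA", 999.0),
    ],
)

# The mart holds only the sales subset, pre-aggregated by region.
dw.execute(
    """
    CREATE TABLE sales_mart AS
    SELECT region, SUM(total) AS revenue, COUNT(*) AS order_count
    FROM orders
    WHERE department = 'sales'
    GROUP BY region
    """
)

mart = dw.execute(
    "SELECT region, revenue, order_count FROM sales_mart ORDER BY region"
).fetchall()
```

The sales team queries the small `sales_mart` table instead of the full warehouse, and rows belonging to other departments never appear in it.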
Thank you for reading the first part of this two-part blog. To continue reading the second part, please click here: 2nd part.
For more articles on data, analytics, machine learning, and data science, follow me on Towards Data Science.