First Part: Introduction and Data Pipelines
A Series of 2 blog articles
This is the 1st part in a series of two blog articles called “Data Pipelines Vs. AI Pipelines, The Similarities and Differences – Analyzed”. If you are looking for the second part, please click here.
Enterprises of all sizes and across numerous industries are now looking at AI to help them better understand and serve their customers, gain new production and sales insights, and get an edge on their competitors. But building an AI solution is a complex undertaking, and doing your homework before embarking on an AI project is highly recommended. A good place to start is by understanding what a Data Pipeline is and how it differs from an AI Pipeline. In this two-part blog series, we will look at the differences and similarities between these two types of pipelines.
For simplicity, we will not distinguish Artificial Intelligence (AI) from Machine Learning (ML) and Deep Learning (DL), as their pipelines are similar. We will simply use AI to encompass all three.
The Data Pipeline, which is used for Reporting and Analytics, and the AI Pipeline, which is used to learn and make predictions, have many similarities. Data Engineers build Data Pipelines for Business Users, whereas Data Scientists both build and use the AI Pipeline. Both pipelines access data from corporate systems and smart devices, and store the collected data in data stores. Both go through a process of data transformation to scrub the raw data and prepare it for analysis or learning. Both keep historical data. Both need to be scalable and secure, and both can be hosted in the cloud. Both need to be monitored and maintained on a regular basis.
However, there are several important characteristics that distinguish them. To better understand them, we will first review the main characteristics of the Data Pipeline before describing in detail the AI Pipeline in part 2 of this blog series.
The Data Pipeline
The Data Pipeline is made up of several distinctive modules and processes designed to enable an organization’s reporting, analysis, and forecasting capabilities. The Data Pipeline moves data from an enterprise’s operational systems to a central data store on premise or in the cloud. For certain business cases, data from various connected devices and IoT systems can also be added to the pipeline.
Continuous maintenance and monitoring are important to make the Data Pipeline modules and processes run smoothly and correctly. Problems that arise must be quickly resolved, and the systems the pipeline relies on (software, hardware, and networking components) must be kept up to date. Data Pipelines are also often adjusted to reflect business changes.
Figure: The Data Pipeline
As can be seen in the figure above, the basic architecture of a Data Pipeline can be split into three main modules: Data Extraction, Data Storage, and Data Access. We briefly look at each of these modules below.
1. Data Extraction
Retrieving data from the enterprise operational systems and connected devices is the first module of the Data Pipeline. At this point in the process, the pipeline collects raw data from numerous separate enterprise data systems (ERP, CRM), production systems, and application logs. Extraction processes are set up to extract the data from each of these data sources.
Two types of extraction mechanisms are possible. Batch processes can be used to ingest the data as sets of records according to criteria set by the Data Engineers. These can be run on a set schedule or be triggered by external factors. Streaming is an alternative data ingestion paradigm where data sources automatically pass along individual records or units of information one by one. Most enterprises use streaming ingestion only when they need near-real-time data.
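The two ingestion mechanisms described above can be contrasted with a minimal sketch. The function names, the `orders` sample data, and the batch size are all hypothetical, chosen only for illustration. A real pipeline would read from an actual source system or message queue.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]


def batch_ingest(source: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield records in fixed-size groups, as a scheduled batch job might."""
    batch: List[Record] = []
    for record in source:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, partially filled batch
        yield batch


def stream_ingest(source: Iterable[Record]) -> Iterator[Record]:
    """Yield records one by one, as soon as the source provides them."""
    for record in source:
        yield record


# Hypothetical sample data standing in for an operational system.
orders = [{"order_id": i, "amount": 10.0 * i} for i in range(1, 6)]

batches = list(batch_ingest(orders, batch_size=2))
streamed = list(stream_ingest(orders))
```

Both functions deliver the same records. The difference is the unit of work: the batch version hands over groups of records, while the streaming version hands over each record individually.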
Once the raw data is ingested, there are two possible next steps, depending on an enterprise’s business use-case and its operational infrastructure. The first possibility is to transform the raw data before loading it into a central data store. The second is to store the raw data first, then transform it and store it in a central data store.
The first possibility is more conventional, and is commonly referred to in the Data Warehouse process as Extraction, Transformation and Loading (ETL). In this case, the data is transformed to match a unified data format (defined by the enterprise’s business needs), and then loaded into a central data store, called a Data Warehouse. Enterprises with ERP and CRM systems do not necessarily need to duplicate their raw operational data in a central raw data store. They could instead keep it at the operational level in their ERP and CRM data stores.
The other, more modern approach, called Extraction, Loading and Transformation (ELT), is to load the raw operational data directly into a central data store, called a Data Lake, before transforming it. This approach is gaining in popularity as data storage costs drop. In this paradigm, the Data Lake becomes the central raw historical data store, and the data transformation takes place while the raw data is moved from the Data Lake to the Data Warehouse.
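As a rough illustration of the ETL order of operations, the sketch below transforms records in memory and only then loads them into a warehouse table. An in-memory SQLite database stands in for the Data Warehouse. The table name `dim_customer`, the sample records, and the normalization rules are assumptions made for this example.

```python
import sqlite3

# Hypothetical raw records from two source systems with inconsistent formats.
raw_crm = [{"customer": " Alice ", "country": "ca"}]
raw_erp = [{"customer": "BOB", "country": "US "}]


def transform(record):
    """Normalize a raw record to a unified format (assumed rules)."""
    name = record["customer"].strip().title()
    country = record["country"].strip().upper()
    return (name, country)


def run_etl(conn, sources):
    """Extract from each source, transform in memory, then load."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer (name TEXT, country TEXT)"
    )
    for source in sources:
        rows = [transform(record) for record in source]  # transform first
        conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)  # then load
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a Data Warehouse
run_etl(warehouse, [raw_crm, raw_erp])
loaded = warehouse.execute(
    "SELECT name, country FROM dim_customer ORDER BY name"
).fetchall()
```

Note that the raw, untransformed records are never stored centrally here. They remain only in the source systems.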
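The ELT ordering can be sketched in the same way. Here raw payloads are first loaded unchanged into a lake table, and the transformation happens afterwards, while the data moves into a warehouse table. The table names `lake_raw_orders` and `wh_orders`, the JSON payload layout, and the use of a single in-memory SQLite database for both stores are simplifying assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the Data Lake: raw payloads are stored exactly as received.
conn.execute("CREATE TABLE lake_raw_orders (source TEXT, payload TEXT)")
# Stand-in for the Data Warehouse: typed, cleaned columns.
conn.execute("CREATE TABLE wh_orders (order_id INTEGER, amount REAL)")

# Hypothetical raw events, with every value still a string.
raw_events = [
    ("erp", json.dumps({"id": "1", "amount": "19.90"})),
    ("web", json.dumps({"id": "2", "amount": "5"})),
]

# Load: write the raw data to the lake without modifying it.
conn.executemany("INSERT INTO lake_raw_orders VALUES (?, ?)", raw_events)

# Transform: read from the lake, convert types, write to the warehouse.
lake_rows = conn.execute("SELECT payload FROM lake_raw_orders").fetchall()
for (payload,) in lake_rows:
    record = json.loads(payload)
    conn.execute(
        "INSERT INTO wh_orders VALUES (?, ?)",
        (int(record["id"]), float(record["amount"])),
    )
conn.commit()

lake_count = conn.execute("SELECT COUNT(*) FROM lake_raw_orders").fetchone()[0]
warehouse_rows = conn.execute(
    "SELECT order_id, amount FROM wh_orders ORDER BY order_id"
).fetchall()
```

Because the lake keeps the original payloads, the transformation step can be rerun or changed later without going back to the source systems.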
In either case, what is important is to transform the raw data into actionable data that can be used for decision-making. Raw data from disparate sources is often inconsistent, and must be modified to meet common querying and analysis requirements, hence making the data transformation process a critical function.
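One common form of inconsistency is the same value arriving in different formats from different sources. The sketch below normalizes dates to a single representation. The list of accepted formats and the sample values are assumptions, and a production pipeline would likely need more formats and stricter validation.

```python
from datetime import date, datetime

# Assumed set of date formats used by the hypothetical source systems.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalize_date(text: str) -> date:
    """Parse a date string in any known format, or raise ValueError."""
    cleaned = text.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"Unrecognized date format: {text!r}")


raw_dates = ["2023-03-01", "01/03/2023", " 01.03.2023 "]
normalized = [normalize_date(d).isoformat() for d in raw_dates]
```

Rejecting unrecognized values with an error, rather than guessing, keeps bad records from silently reaching the data used for decision-making.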
2. Data Storage
Whether an enterprise decides to store its raw data in a Data Lake before transferring it into a Data Warehouse depends on its business needs and policies. Data Lakes are not a good fit for interactive Analytics and Reporting. They are more useful for traceability and auditing. Another motivator is that storage and computing have become relatively cheap, allowing enterprises to reap the benefits of having both a Data Lake and a Data Warehouse built on top of it.
The typical data store structure of the Data Warehouse is a Relational Database Management System (RDBMS). These specialized databases are important, as they contain all of an enterprise’s cleaned and mastered data in a centralized location, serving as an enterprise’s single source of truth.
Another important part of the Data Warehouse is the Metadata. Metadata adds business context to the data. It also complements the data with additional information about the transformations applied to the raw data before it was loaded to the Data Warehouse.
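To make the two roles of Metadata concrete, the sketch below records a business description alongside the list of transformations applied to a column. The `ColumnMetadata` class and its field names are hypothetical, not a standard schema. Real Data Warehouses typically keep this information in a catalog or dedicated metadata tables.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnMetadata:
    """Hypothetical metadata record for one Data Warehouse column."""
    name: str
    business_description: str  # business context for the column
    source_system: str         # where the raw data came from
    transformations: List[str] = field(default_factory=list)  # applied steps


revenue = ColumnMetadata(
    name="net_revenue_usd",
    business_description="Invoice amount net of tax, in US dollars",
    source_system="ERP",
)
# Record each transformation applied before loading.
revenue.transformations.append("converted local currency to USD")
revenue.transformations.append("removed tax component")
```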
3. Data Access
Data access is the last module in the Data Pipeline. Data access tools are used to query the data in the Data Warehouse. There are many different data access tools, each designed to meet the specific business needs of an enterprise’s departments.
Access tools range from ad-hoc query and reporting tools to more advanced parameter-driven Analytics applications. These Analytics applications, also referred to as Business Intelligence applications, are designed to hide the complexity of the data from Business Users and Executives by presenting the data in understandable business terms. Another notable type of Analytics application is the Online Analytical Processing (OLAP) application. An OLAP application stores subsets of the Data Warehouse data in specialized Data Marts for department-specific analytics, such as finance or sales. Other, more refined OLAP solutions make use of these Data Marts for Data Modeling and Analytical Forecasting.
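The idea of a Data Mart as a department-specific subset of the Data Warehouse can be sketched as follows. The table names `wh_sales` and `mart_sales_by_region`, the sample rows, and the choice of an aggregate by region are assumptions for illustration. Actual OLAP products build and store such subsets in their own ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a Data Warehouse

# Hypothetical warehouse table covering several departments.
conn.execute("CREATE TABLE wh_sales (region TEXT, department TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO wh_sales VALUES (?, ?, ?)",
    [
        ("East", "sales", 100.0),
        ("East", "sales", 50.0),
        ("West", "sales", 75.0),
        ("West", "finance", 20.0),
    ],
)

# Data Mart: a pre-aggregated subset restricted to the sales department.
conn.execute(
    """
    CREATE TABLE mart_sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM wh_sales
    WHERE department = 'sales'
    GROUP BY region
    """
)

mart_rows = conn.execute(
    "SELECT region, total_amount FROM mart_sales_by_region ORDER BY region"
).fetchall()
```

Department users then query the small mart table instead of the full warehouse table.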
Thank you for reading the first part of this two-part blog. To continue reading the second part, please click here: 2nd part.