Profile PictureSaikat Dutta

Data Engineering Beyond Hype #3

Data Pipeline Architecture

The variety of ways you can build your Data Pipeline

Read Time: 9 min

I would like to apologize for this edition being delayed by 2 weeks.

The topics I wanted to cover in this edition, are vast and diverse. 

I got lost in finding the best possible literature. I felt a considerable gap in how they are structured.

So I took a week extra than planned, but without further ado, let’s start.

This is a long email, but probably this is an exhaustive reference for Data Pipeline Architecture.


Recap

In the last episode, we learned what Data Engineering is, and what are the best practices for Data Ingestion.

We discussed batch, streaming, and lambda architecture for data processing.

We discussed 7 best practices of Data Ingestion.

In this episode, we will talk about Data Pipelines as a whole.


Why Data Pipelines:

Pipelines are used for transferring something from one point to another.

However, a Data Pipeline does much more.

A Data pipeline is a sequence of steps that collects raw data and converts it into actionable insights or information about the business.

But why do we need Data Pipelines?

Source systems in a large enterprise are extremely varied. For example, a large enterprise might collect data from 1. Different Transactional Systems, 2. Social Media, 3. Documents 4. Videos 5. ERP systems 6. Marketing Systems, ex. Hubspot 7. 3rd party analytics platforms, like Google Analytics.

Hence, RAW data from these systems might have different protocols and different structures. It is extremely complicated to try to standardize them for analysis.

Data Pipelines are used to collect all the data from disparate sources into a unified system.

Once the data is unified, Data Pipelines also help standardize the data, remove invalid data, cleanse it and perform aggregation and transformations based on subsequent downstream requirements. 

Types of Data Pipeline

Data Pipeline categorization can be done based on different parameters. However, let us consider different types based on schedule, the freshness of data, and the architecture of data processing.

  • Batch Processing: Data Pipelines are scheduled at specific intervals. Useful for daily dashboards or periodic reporting. Data freshness is low.
  • Stream Processing: Data Pipelines execute continuously as data is streamed. Used when the difference between event time and processing time needs to be minimized e.g., tracking anomaly, fraud, dynamic pricing, etc. Data is fresh but often not accurate.
  • Batch + Stream Processing: Combines the power of long-running batch models with streaming analytics. Data is still fresh and inconsistencies are removed by using output from Batch processing.

What are the main components of a Data Pipeline?

  1. Sources
  2. Destination / Sinks
  3. Transformations of Data and Business Logic
  4. Data Quality Validation

There can be more components, e.g: Data Modelling, Data visualization, however, these 4 are the basic building blocks.

So how do you use these building blocks to solve different use cases in ETL?


Let's look at different Data Pipeline Architectures to achieve this:

RAW DATA LOAD

As the name suggests these pipelines are used to move data in raw format. The data is unchanged. Involves only Extract and Load, no transformations are applied. These are typically run only once or in specific schedules per day. 


Pros:

  1. Data Movement from one DB to Data Lake, From on Premise Data Store to object storage on the cloud.
  2. Data is typically loaded in compressed open file formats like Parque/ORC/Deltalke/Iceberg etc. Hence less storage is occupied in Datalake.
  3. Helps to make the data transformation pipelines idempotent. Ex. Even if the subsequent data transformation pipelines are run multiple times in the day or rerun, they will result in the same data.
  4. Helps to debug source data issues, without referring to source data multiple times.
  5. Suited for Data Lake workloads and mostly on cloud only.

Cons:

  1. Duplication of data in another layer.
  2. Not suited for on-premise Data Warehouse workloads.

Extract Transform Load ( ETL ) :

Extract-Transform-Load (ETL) as is the most widely used data pipeline pattern. From the early 1990’s it was the standard to integrate data into a data warehouse. 

Data is extracted from operational data stores and then unified, cleaned, and transformed as per business needs. Data is also modeled to replicate real-world entities.

Ex. Loan application data is taken from loan origination platforms, loan payments, transactions are taken from loan receivable platforms, and credit and other 3rd party data are taken from flat files, and data feeds. 

All these data are then segregated to form Loan Fact, Loan Transaction Fact, Loan payment fact, different dimension tables related to the loan, etc.

Pros:

  1. Suited for Data Warehouse workloads. 
  2. Works well when complex data transformations are required.
  3. Results in only One Copy, as no redundant raw data layer is needed.
  4. If all sources are not ready at the same time, ETL can be useful. The transformation will happen only when all data is available.

Cons:

  1. Highly Time consuming and has High Latency.
  2. Micro batch processing can be done, but still latency is inevitable.
  3. High computing power is needed for complex transformations.
  4. Due to 1 and 2, it's not practical with Big Data.
  5. You often have vendor lock-in with ETL tools and Data Warehouse providers.

Extract Load Transform ( ELT ):

Extract-Load-Transform, became popular as a replacement for ETL with an increase in data volume. Waiting for all transformation work to complete delays the availability of data for business use. Hence data is loaded immediately after extraction. AI/ML processing can happen without data modeling.

Pros:

  1. Suitable Data Lake implementations
  2. Decreased latency as data is available much sooner for processing.
  3. All kinds of data structured, instructed, etc can be loaded.

Cons:

  1. Increased duplication of data in different layers. Results in two+ copy(Raw and Transformed layer / optional serving layer)
  2. Data Quality Suffers, data is exposed without deduplication and cleansing.
  3. Often ends up creating a Data Swamp.
  4. People often forget or ignore the T — Transformation.

Extract Transform Load Transform ( ETLT ):

The Extract-Transform-Load-Transform (ETLT) pattern is a hybrid of ETL and ELT. The first stage involves light transformations, independent of all other data sources. Data cleansing, format standardization, and masking of sensitive data are typical kinds of first-stage light transformations. 

Pros:

  1. Smaller transformation, data cleaning, and filtration is done while extracting data.
  2. very practical pipeline design solves data privacy through masking.
  3. As some filtration or transformation is done initially, the chances of data becoming a swamp is less.
  4. Enables Lakehouse architecture.

Cons:

  1. Still limited by processing power and latency.
  2. Still generates two copy.

All the variations of ELT/ETL and Data Lake or Warehouse are limited by 

  1. Centralized Data team
  2. Increased time to provision new data sources
  3. Lesser flexibility is serving different departments and use cases ( ML/AI/Marketing/Sales/HR etc.)

Data Mesh and Data Fabric architecture are modern alternatives to Data Lake / Data Warehouse / Lakehouse-based centralized processing. The key enablers for Data Mesh/Fabric are Data Virtualization and Dedicated Links.

DATA VIRTUALIZATION:

Data virtualization, illustrated in figure 6, serves data pipelines differently than the other patterns. Most pipelines create physical copies of data. 

Virtualization delivers data as views without physically storing a separate version. 

Virtualization works with layers of data abstraction. The source layer is the least abstract. The integration layer combines and connects data from disparate sources providing views similar to the transformation results of ETL processing. 

The business layer presents data with semantic context, and the application layer structures the semantic view for specific use cases.

  1. Unlike ETL processing virtualization processes are initiated by a query. 
  2. The query is issued in the application and semantic context and is translated through integration layers and source layers to connect with the right data sources. 
  3. The response traverses the path in the opposite direction to acquire source data, transform and integrate, present a semantic view, and deliver an application view of the data. 

Pros:

  1. Virtualization can work well when people want the freshest data possible. 
  2. Virtualization integrates and transforms only data that is requested and used
  3. Zero Copy architecture, No data replication.

Cons:

  1. Only good for modest data volumes
  2. May struggle to perform with complex transformations or huge data.

DATA STREAM PROCESSING

The best alternative for low latency ETL is Data Stream Processing.

Data Strem processing involves a streaming data source, ex. logs of sensors and IOT devices, location, and trip data from moving vehicles in logistics, etc.

Stream processing involves filtering, and transforming a continuous stream of data.

Processing begins by parsing the events to isolate each unique event as a distinct record and only events relevant to the use case are processed further.

At the destination end of the data flow, there are two different patterns that emerge. 

For some use cases, the event is posted to a message queue or an event log, or an events archive. Once posted or stored, the downstream data pipeline does data consumption later. 

For other use cases, the events are pushed to monitoring or alerting applications. Ex. real-time anomaly in freight carriers, dynamic price based on demand, etc.

Pros:

  1. Stream processing pipelines work with sensors and Internet of Things (IoT) devices.
  2.  This pattern is particularly useful for machine learning.
  3. Useful for log analytics
  4. Data Freshness is high

Cons:

  1. In absence of Lambda architecture, data accuracy is not high. 

CHANGE DATA CAPTURE

CDC along with an OLAP use case can solve data latency issues. CDC allows fresh data to be available for processing in ETL pipelines.

There are mainly two types of CDC applications: 

  1. Transaction date based 
  2. Database log based 
  3. Trigger-based 
  4.  Differential Method Based ( Diff ) 

Pros: 

  1. Reduced latency for Datawarehouse workloads.
  2. Increased data freshness, near real-time or micro-batch ETL.

Cons:

  1. Increased complexity.
  2. CDC applications can add extra cost.

Now that we have covered the different types and architectures of pipelines, there are a few common design patterns that apply to almost all processing techniques. 

Repeatable and Idempotent:

It's important that ETL pipelines are able to re-run from a fixed timeline and produce the same results. This is extremely important in order to debug failures or load data post failures.

  1. Store data in an immutable way, such as a list of transactions with date and time.
  2. This will allow the computation to be idempotent and produce the same results no matter if run retrospectively.

Automated Framework — Templatize

It's extremely important to automate data pipelines using an orchestration tool / Dags/corn jobs. The pipelines should resolve all dependencies and run on their own.

Without automation, manual error and delay will burden the workforce and kill data quality.

Templates are a must-have for different data pipelines. This allows for faster provisioning of new sources, and reduced development and test cycles for data pipelines. 

Auditable — (Monitoring and logging)

It's important to learn at any point in time what pipelines are running, what is completed successfully, and what failed. In case of any failures, clear logging should be in place.

This will drastically reduce debugging and error handling times and effort. 

This should also allow for Data Obesrvability and Data Lineage, to allow everyone to see what transformations have been applied through which pipelines.

Reliable and Secure ( Data Validation and QA)

Data Quality is a key parameter to look for to validate the results of a Data Pipeline. The pipelines should be designed in a way that the output is reliable and that maximum data quality is achieved.

DQ check must be done in both source and destination, and alerting must be configured.

This will build confidence among the stakeholders to use the data for downstream reporting.

Conclusion:

This is probably an extremely exhaustive list of architectures for Data Pipeline covered in a single blog. I will cover the practical implementation of some of these in the future in a separate article.

For beginners, this might seem like too much. Hence do take it slow, cover one at a time and come back to the blog whenever you need any references. 

However, these are fundamental building structures for Data Pipelines and are tool agnostic. If we build strong foundational knowledge of each and are able to different pros and cons, that will help us to build the pipelines in any ETL/ELT/DI/Orchestration tools.

References:

  1. Data Virtualization — How it works.
  2. Data Pipeline design patterns.
  3. “Data Pipeline Design Principles” by Kovid Rathee 
  4. 3 Design Principles for Engineering Databy Jiaqi Liu
  5. Data-Mesh-and-Data-Virtualization-are-not-the-Same-Thing by Terada.

See you again next week.

Whenever you’re ready, there are 3 ways I can help you in your Career Growth:

  1. Let me help you mentor with your career journey here.
  2. Download an exhaustive Data Engineering Roadmap here.
  3. Grow Your LinkedIn brand and get diverse here.

0 comments

Current user avatar

Data Engineering Beyond Hype #4: 60 Years of Data Processing

Data Engineering Beyond Hype #2

See all posts from Saikat Dutta

Powered by