Tuesday, April 21, 2026

Data Engineering Playbook Part 3: A Practical Guide to Building Data Pipelines

This post summarizes practical pipeline architecture and operations for 2026. We cover ETL/ELT/Zero-ETL decision criteria, dbt three-layer modeling, orchestration trade-offs across Airflow, Dagster, and Prefect, and Kafka+Flink streaming patterns. We close with reliability principles and a production-ready monitoring checklist.

Series outline

  • Part 1: Overview and 2026 Trends (published)
  • Part 2: Data Architecture Design (published)
  • Part 3: A Practical Guide to Building Data Pipelines (current post)
  • Part 4: Data Quality and Governance (upcoming)
  • Part 5: Cloud Infrastructure and FinOps (upcoming)
  • Part 6: AI-Native Data Engineering (upcoming)
  • Part 7: DataOps Team Operations Playbook (upcoming)

Table of contents

  1. What is a data pipeline?
  2. ETL vs ELT vs Zero-ETL: which should you choose?
  3. Ingestion: how to bring data in
  4. Transformation: practical dbt patterns
  5. Orchestration: Airflow vs Dagster vs Prefect
  6. Streaming design with Kafka + Flink
  7. Reliability: building pipelines you can trust
  8. Monitoring and alerting
  9. Practical checklist

1. What is a data pipeline?

A data pipeline is an automated flow that collects data from multiple sources, applies required transformations, and reliably delivers data to analytical or serving systems.
In practice, it is your team's data delivery system.

Source systems (DB, API, events, files, SaaS)
-> Ingestion
-> Transformation
-> Storage
-> Serving (BI, ML, Apps)

When this system is unstable, the impact quickly spreads to dashboards, forecasting, ML quality, and operational decisions.
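As a toy illustration, the four stages above can be sketched as composed functions. All names here are hypothetical, not from any framework; the point is only that each stage hands a well-defined output to the next:

```python
# Toy sketch of the pipeline stages as composable functions.
# All names and data are illustrative.

def ingest(sources):
    # Collect raw records from each source system
    return [record for source in sources for record in source]

def transform(records):
    # Apply a required transformation: keep valid rows, normalize fields
    return [
        {"order_id": r["id"], "amount": r["amount"]}
        for r in records
        if r.get("amount", 0) >= 0
    ]

def store(records, warehouse):
    warehouse.extend(records)
    return warehouse

def serve(warehouse):
    # Serving layer: e.g. a metric consumed by BI
    return sum(r["amount"] for r in warehouse)

warehouse = []
sources = [[{"id": 1, "amount": 120.0}], [{"id": 2, "amount": -5.0}]]
store(transform(ingest(sources)), warehouse)
print(serve(warehouse))  # only the valid order survives: 120.0
```

When any one stage is flaky, every downstream consumer of `serve` inherits the problem, which is why the rest of this post focuses on making each stage reliable.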


2. ETL vs ELT vs Zero-ETL: which should you choose?

Core question: where does transformation happen?

ETL: Source -> Transform -> Load
- Transform in a separate processing layer first
- Still strong for strict compliance environments

ELT: Source -> Load -> Transform
- Load raw first, transform inside the warehouse
- Default pattern for modern cloud analytics

Zero-ETL: Source -> Direct integration/query -> Analysis
- Minimizes pipeline operations and data movement
- Still needs quality, governance, and semantic controls

| Category | ETL | ELT |
| --- | --- | --- |
| Transform location | Separate processing tier | In-warehouse |
| Raw data retention | Limited | Easier reprocessing |
| Schema evolution | More rigid | More flexible |
| Security/compliance | Pre-load masking/cleaning | Post-load controls required |
| Cost model | External processing costs | Warehouse compute costs |

2026 recommendation

  • Start with ELT as the default.
  • Use ETL or hybrid for high-regulation data flows and legacy constraints.
  • Treat Zero-ETL as movement optimization, not a complete governance solution.

3. Ingestion: how to bring data in

Batch ingestion

| Tool | Strength | Best fit |
| --- | --- | --- |
| Fivetran | Managed connectors, schema drift handling | Fast SaaS onboarding |
| Airbyte | Open source, extensible | Lower vendor lock-in |
| AWS Glue | Serverless ETL | AWS-centric stacks |

Streaming ingestion

Event Sources -> Kafka -> Flink -> Serving Store
- App events / IoT / CDC streams
- Targets: Iceberg, warehouse, OLAP DB

CDC (Change Data Capture) streams row-level changes from transactional databases.

PostgreSQL -> Debezium (CDC) -> Kafka -> Data Lake/Warehouse
(near real-time replication)
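On the consumer side, a Debezium change event arrives as a JSON envelope whose payload carries `op`, `before`, and `after` fields. A minimal sketch of applying such events to a local table follows; the row fields and values are made up for illustration:

```python
# Minimal handler for a Debezium-style change-event envelope.
# The payload shape ("op", "before", "after") follows Debezium's
# standard envelope; the row contents here are illustrative.

def apply_change(table, event):
    payload = event["payload"]
    op = payload["op"]          # "c"=create, "u"=update, "d"=delete, "r"=snapshot read
    if op in ("c", "u", "r"):
        row = payload["after"]
        table[row["id"]] = row
    elif op == "d":
        table.pop(payload["before"]["id"], None)
    return table

table = {}
create = {"payload": {"op": "c", "before": None,
                      "after": {"id": 1, "status": "placed"}}}
update = {"payload": {"op": "u", "before": {"id": 1, "status": "placed"},
                      "after": {"id": 1, "status": "shipped"}}}
apply_change(table, create)
apply_change(table, update)
print(table[1]["status"])  # "shipped"
```

Replaying the same events in order is idempotent here, which is exactly the property you want when a CDC consumer restarts from an earlier offset.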

4. Transformation: practical dbt patterns

dbt handles the "T" in ELT with a SQL-first, software-engineering-friendly workflow.

Three-layer modeling

my_dbt_project/
  models/
    staging/
      stg_orders.sql
      stg_customers.sql
    intermediate/
      int_order_items_joined.sql
    marts/
      core/fct_orders.sql
      marketing/dim_customers.sql
  tests/
  macros/
  dbt_project.yml

  • staging: source normalization
  • intermediate: joins and reusable business logic
  • marts: final models consumed by BI/ML

Example staging model

with source as (
  select * from {{ source('raw', 'orders') }}
),
renamed as (
  select
    id as order_id,
    user_id as customer_id,
    status as order_status,
    created_at as ordered_at,
    amount_cents / 100.0 as order_amount_usd
  from source
)
select * from renamed

Example tests

version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: order_status
        tests:
          - not_null
          - accepted_values:
              values: ['placed', 'shipped', 'delivered', 'cancelled']

Materialization guide

Table
- Good for final marts queried directly by BI/ML

View
- Good for light transformations

Incremental
- Good for large append-heavy datasets
- Requires robust unique_key and upsert strategy

Ephemeral
- Good for helper logic reused in larger models
- Compiled into CTEs, no physical table
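The "robust unique_key and upsert strategy" requirement for incremental models comes down to merge-by-key semantics. A plain-Python sketch of what an incremental upsert does in the warehouse (the data and key name are illustrative):

```python
# Sketch of incremental-upsert semantics: new batches are merged into the
# existing table by unique key, so reruns update rather than duplicate.
# This mirrors what a dbt incremental model does in-warehouse;
# the rows and key name are illustrative.

def incremental_upsert(existing, new_batch, unique_key="order_id"):
    merged = {row[unique_key]: row for row in existing}
    for row in new_batch:          # later rows win on key collision
        merged[row[unique_key]] = row
    return list(merged.values())

table = [{"order_id": 1, "status": "placed"}]
batch = [{"order_id": 1, "status": "shipped"},   # update existing key
         {"order_id": 2, "status": "placed"}]    # insert new key
table = incremental_upsert(table, batch)
print(sorted(r["order_id"] for r in table))  # [1, 2]
```

Without a reliable `unique_key`, the same merge degrades into blind appends, and every rerun silently duplicates rows.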

5. Orchestration: Airflow vs Dagster vs Prefect

An orchestrator manages execution order, retries, dependencies, failure handling, and scheduling.

| Category | Airflow | Dagster | Prefect |
| --- | --- | --- | --- |
| Core model | DAG/Task | Asset | Flow/Task |
| Strength | Ecosystem and maturity | Lineage and asset semantics | Python-first UX |
| Trade-off | Higher operational complexity | Learning curve | Fewer very-large-scale references |

6. Streaming design with Kafka + Flink

Kafka fundamentals

Topic: event category
Partition: unit of parallelism and ordering scope
Consumer Group: cooperative consumption group
Offset: record position in a partition
Retention: how long events are kept
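To make these terms concrete, here is a toy in-memory model of a topic. This is purely illustrative and not the Kafka client API; it only demonstrates that ordering holds within a partition and that a record's key determines its partition:

```python
# Toy in-memory model of Kafka topic/partition/offset semantics.
# Ordering is guaranteed only within a partition; records with the
# same key always land in the same partition. Illustrative only.

def hash_key(key):
    # Stand-in for Kafka's real key hashing (murmur2); deterministic on purpose
    return sum(key.encode())

class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        p = hash_key(key) % len(self.partitions)   # key -> partition
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1      # (partition, offset)

topic = Topic(num_partitions=3)
for event in ["created", "paid", "shipped"]:
    topic.produce("order-42", event)               # same key -> same partition

p = hash_key("order-42") % 3
print(topic.partitions[p])  # ['created', 'paid', 'shipped'], in produce order
```

This is why keying a stream by entity ID (here, the order) is the standard way to get per-entity ordering without giving up parallelism across partitions.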

Flink processing example

from pyflink.common import Duration, WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaOffsetsInitializer, KafkaSource

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)

# Kafka source (broker/topic are placeholders; requires the Kafka connector jar)
kafka_source = KafkaSource.builder() \
    .set_bootstrap_servers("localhost:9092") \
    .set_topics("transactions") \
    .set_starting_offsets(KafkaOffsetsInitializer.latest()) \
    .set_value_only_deserializer(SimpleStringSchema()) \
    .build()

transactions = env.from_source(
    kafka_source,
    # Tolerate up to 5 seconds of out-of-order events
    WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5)),
    "Kafka Transactions"
)

Streaming lakehouse pattern

Event Sources
-> Kafka topics (durable log)
-> Flink (filter/join/window)
-> Iceberg (Bronze/Silver)
-> Trino/Spark SQL (batch/ad-hoc)
-> Redis/Cassandra (low-latency serving)

7. Reliability: building pipelines you can trust

Principle 1: Idempotency

Running the same pipeline multiple times should produce the same result.

insert into orders_summary
select date, sum(amount) as total
from orders
group by date
on conflict (date) do update
set total = excluded.total;

Principle 2: Fail fast

Stop early when data quality checks fail.

def validate_orders(df):
    if df[['order_id', 'customer_id', 'amount']].isnull().any().any():
        raise ValueError("Required column contains NULL")
    if (df['amount'] < 0).any():
        raise ValueError("Negative amount detected")

Principle 3: Reprocessability

Preserve Bronze raw data so Silver/Gold can be recomputed safely.

1) Keep Bronze immutable
2) Fix transformation logic
3) Rebuild Silver
4) Rebuild Gold marts
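The rebuild flow above can be sketched in a few lines: Silver is a pure function of immutable Bronze, so fixing the transform and re-running is always safe. Names and data here are illustrative:

```python
# Sketch of reprocessability: Silver is recomputed as a pure function of
# immutable Bronze, so a logic fix only requires re-running the transform.

bronze = [  # raw, append-only; never mutated by downstream jobs
    {"id": 1, "amount_cents": 1999},
    {"id": 2, "amount_cents": 500},
]

def build_silver(bronze_rows, cents_per_unit=100.0):
    # The transform reads Bronze and writes a fresh Silver; no in-place edits
    return [{"id": r["id"], "amount": r["amount_cents"] / cents_per_unit}
            for r in bronze_rows]

silver = build_silver(bronze)   # after a logic fix, just run this again
assert bronze == [{"id": 1, "amount_cents": 1999},
                  {"id": 2, "amount_cents": 500}]  # Bronze is untouched
print(silver[0]["amount"])  # 19.99
```

The same reasoning extends to Gold: as long as each layer is derived, not edited in place, a bad deploy costs you a re-run, not a data-loss incident.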

Principle 4: Backfill support

Support safe historical reprocessing for selected date ranges.
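A minimal backfill sketch: enumerate the daily partitions in a range and reprocess each one idempotently. `process_partition` is a placeholder for your actual per-day job:

```python
from datetime import date, timedelta

# Minimal backfill sketch: enumerate daily partitions in a range and
# reprocess each one idempotently. `process_partition` is a placeholder.

def daterange(start, end):
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

def backfill(start, end, process_partition):
    for day in daterange(start, end):
        # Each partition is rewritten in full, so reruns are safe
        process_partition(day)

processed = []
backfill(date(2026, 4, 1), date(2026, 4, 3), processed.append)
print(processed)  # three daily partitions: 2026-04-01 .. 2026-04-03
```

Because each day is an independent, fully-rewritten partition, a failed backfill can be resumed from the last completed date instead of starting over.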


8. Monitoring and alerting

Four key metrics

1) Success rate
2) Data freshness
3) Volume anomaly
4) Runtime trend
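As an example of metric 3, a simple volume-anomaly check can flag a run whose row count deviates too far from the trailing average. The window and threshold values here are illustrative choices, not recommendations:

```python
# Sketch of a volume-anomaly check: flag today's row count when it
# deviates more than `threshold` (as a fraction) from the trailing
# average. Window size and threshold are illustrative.

def volume_anomaly(history, today, threshold=0.5):
    baseline = sum(history) / len(history)
    deviation = abs(today - baseline) / baseline
    return deviation > threshold

history = [1000, 1100, 950, 1050]     # recent daily row counts
print(volume_anomaly(history, 1020))  # False: within normal range
print(volume_anomaly(history, 300))   # True: likely missing data upstream
```

Even a crude check like this catches the most common silent failure: a pipeline that "succeeds" while loading a fraction of the expected rows.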

Alerts must be actionable

def alert_slack_on_failure(context):
    dag_id = context['dag'].dag_id
    task_id = context['task_instance'].task_id
    log_url = context['task_instance'].log_url
    message = f"[FAIL] {dag_id}.{task_id} | {log_url}"
    send_slack_message(channel="#data-alerts", message=message)

9. Practical checklist

Design

  • ETL/ELT/Zero-ETL decision rationale documented
  • Sensitive-data controls integrated into pipeline design

Ingestion

  • Schema change detection and response policy defined
  • CDC latency SLA documented

Transformation

  • Clear staging/intermediate/marts separation
  • not_null and unique coverage for critical models

Operations

  • Retry/backfill/reprocess paths validated
  • Alerts mapped to clear operator actions

Closing

The goal is not simply to make data flow.
The goal is to continuously deliver trustworthy data products.

Technology choices (ELT, dbt, orchestration, streaming) should always be evaluated by one standard: how reliably they support that outcome in production.

In Part 4, we will move from pipeline mechanics to governance implementation, including DataGovOps, data contracts, catalog practices, and access-control patterns.
