Data Engineering Playbook Part 3: A Practical Guide to Building Data Pipelines
This post summarizes practical pipeline architecture and operations for 2026. We cover ETL/ELT/Zero-ETL decision criteria, dbt three-layer modeling, orchestration trade-offs across Airflow, Dagster, and Prefect, and Kafka+Flink streaming patterns. We close with reliability principles and a production-ready monitoring checklist.
Series outline
- Part 1: Overview and 2026 Trends (published)
- Part 2: Data Architecture Design (published)
- Part 3: A Practical Guide to Building Data Pipelines (current post)
- Part 4: Data Quality and Governance (upcoming)
- Part 5: Cloud Infrastructure and FinOps (upcoming)
- Part 6: AI-Native Data Engineering (upcoming)
- Part 7: DataOps Team Operations Playbook (upcoming)
Table of contents
- What is a data pipeline?
- ETL vs ELT vs Zero-ETL: which should you choose?
- Ingestion: how to bring data in
- Transformation: practical dbt patterns
- Orchestration: Airflow vs Dagster vs Prefect
- Streaming design with Kafka + Flink
- Reliability: building pipelines you can trust
- Monitoring and alerting
- Practical checklist
1. What is a data pipeline?
A data pipeline is an automated flow that collects data from multiple sources, applies required transformations, and reliably delivers data to analytical or serving systems.
In practice, it is your team's data delivery system.
Source systems (DB, API, events, files, SaaS)
-> Ingestion
-> Transformation
-> Storage
-> Serving (BI, ML, Apps)
When this system is unstable, the impact quickly spreads to dashboards, forecasting, ML quality, and operational decisions.
2. ETL vs ELT vs Zero-ETL
Core question: where does transformation happen?
ETL: Source -> Transform -> Load
- Transform in a separate processing layer first
- Still strong for strict compliance environments
ELT: Source -> Load -> Transform
- Load raw first, transform inside the warehouse
- Default pattern for modern cloud analytics
Zero-ETL: Source -> Direct integration/query -> Analysis
- Minimizes pipeline operations and data movement
- Still needs quality, governance, and semantic controls
| Category | ETL | ELT |
|---|---|---|
| Transform location | Separate processing tier | In-warehouse |
| Raw data retention | Often not retained | Retained, enabling reprocessing |
| Schema evolution | More rigid | More flexible |
| Security/compliance | Pre-load masking/cleaning | Post-load controls required |
| Cost model | External processing costs | Warehouse compute costs |
2026 recommendation
- Start with ELT as the default.
- Use ETL or hybrid for high-regulation data flows and legacy constraints.
- Treat Zero-ETL as movement optimization, not a complete governance solution.
3. Ingestion: how to bring data in
Batch ingestion
| Tool | Strength | Best fit |
|---|---|---|
| Fivetran | Managed connectors, schema drift handling | Fast SaaS onboarding |
| Airbyte | Open source, extensible | Lower vendor lock-in |
| AWS Glue | Serverless ETL | AWS-centric stacks |
Streaming ingestion
Event Sources -> Kafka -> Flink -> Serving Store
- App events / IoT / CDC streams
- Targets: Iceberg, warehouse, OLAP DB
CDC (Change Data Capture) streams row-level changes from transactional databases.
PostgreSQL -> Debezium(CDC) -> Kafka -> Data Lake/Warehouse
(near real-time replication)
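For context, registering a Debezium Postgres connector with Kafka Connect might look like the sketch below. Hostnames, database names, and credentials are placeholders, and the exact property set depends on your Debezium version:

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "changeme",
    "database.dbname": "shop",
    "table.include.list": "public.orders",
    "topic.prefix": "shop"
  }
}
```

Each captured table becomes a Kafka topic (here, `shop.public.orders`) carrying row-level change events downstream.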
4. Transformation: practical dbt patterns
dbt handles the "T" in ELT with a SQL-first, software-engineering-friendly workflow.
Three-layer modeling
my_dbt_project/
models/
staging/
stg_orders.sql
stg_customers.sql
intermediate/
int_order_items_joined.sql
marts/
core/fct_orders.sql
marketing/dim_customers.sql
tests/
macros/
dbt_project.yml
- staging: source normalization
- intermediate: joins and reusable business logic
- marts: final models consumed by BI/ML
Example staging model
with source as (
select * from {{ source('raw', 'orders') }}
),
renamed as (
select
id as order_id,
user_id as customer_id,
status as order_status,
created_at as ordered_at,
amount_cents / 100.0 as order_amount_usd
from source
)
select * from renamed
Example tests
version: 2
models:
- name: fct_orders
columns:
- name: order_id
tests: [unique, not_null]
- name: order_status
tests:
- not_null
- accepted_values:
values: ['placed', 'shipped', 'delivered', 'cancelled']
Materialization guide
Table
- Good for final marts queried directly by BI/ML
View
- Good for light transformations
Incremental
- Good for large append-heavy datasets
- Requires robust unique_key and upsert strategy
Ephemeral
- Good for helper logic reused in larger models
- Compiled into CTEs, no physical table
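As a sketch of the incremental pattern, the model below builds on the `stg_orders` staging example above; the `ordered_at` column and cutoff logic are illustrative assumptions:

```sql
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_status,
    ordered_at,
    order_amount_usd
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- Only process rows newer than what already exists in the target table.
  where ordered_at > (select max(ordered_at) from {{ this }})
{% endif %}
```

With a reliable `unique_key`, late-arriving updates to existing orders are merged rather than duplicated.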
5. Orchestration: Airflow vs Dagster vs Prefect
An orchestrator manages execution order, retries, dependencies, failure handling, and scheduling.
| Category | Airflow | Dagster | Prefect |
|---|---|---|---|
| Core model | DAG/Task | Asset | Flow/Task |
| Strength | Ecosystem and maturity | Lineage and asset semantics | Python-first UX |
| Trade-off | Higher operational complexity | Learning curve | Fewer very-large-scale references |
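Whichever tool you pick, the core value is the same: declarative retries, dependencies, and scheduling. As intuition, here is a minimal plain-Python sketch of the retry-with-backoff behavior all three orchestrators provide declaratively (`flaky_extract` is a made-up task for illustration):

```python
import time


def run_with_retries(task, max_retries=3, base_delay=1.0):
    """Re-run a failing task with exponential backoff between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # exhausted retries: surface the failure
            time.sleep(base_delay * 2 ** attempt)


# Example: a task that fails twice with a transient error, then succeeds.
calls = {"n": 0}

def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient source error")
    return "ok"

result = run_with_retries(flaky_extract, max_retries=3, base_delay=0.01)
```

In Airflow this is `retries`/`retry_delay` on a task; in Dagster and Prefect, retry policies on assets and tasks.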
6. Streaming design with Kafka + Flink
Kafka fundamentals
Topic: event category
Partition: unit of parallelism and ordering scope
Consumer Group: cooperative consumption group
Offset: record position in a partition
Retention: how long events are kept
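The partition concept is easiest to see in code. Kafka's default partitioner hashes the record key (murmur2) modulo the partition count, so all events with the same key land in the same partition and keep their relative order; the crc32 hash below is illustrative only:

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Sketch of key-based partitioning: same key -> same partition.

    This is why ordering is guaranteed per key within a partition,
    not across the whole topic.
    """
    return zlib.crc32(key) % num_partitions


# All events for one customer land in the same partition, preserving order.
p1 = choose_partition(b"customer-42", 6)
p2 = choose_partition(b"customer-42", 6)
```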
Flink processing example
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common import WatermarkStrategy, Duration

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)

# kafka_source is a KafkaSource configured elsewhere
# (see pyflink.datastream.connectors.kafka).
transactions = env.from_source(
    kafka_source,
    # Tolerate events arriving up to 5 seconds out of order.
    WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5)),
    "Kafka Transactions",
)
Streaming lakehouse pattern
Event Sources
-> Kafka topics (durable log)
-> Flink (filter/join/window)
-> Iceberg (Bronze/Silver)
-> Trino/Spark SQL (batch/ad-hoc)
-> Redis/Cassandra (low-latency serving)
7. Reliability: building pipelines you can trust
Principle 1: Idempotency
Running the same pipeline multiple times should produce the same result.
-- Requires a unique constraint on orders_summary(date).
insert into orders_summary (date, total)
select date, sum(amount) as total
from orders
group by date
on conflict (date) do update
set total = excluded.total;
Principle 2: Fail fast
Stop early when data quality checks fail.
import pandas as pd

def validate_orders(df: pd.DataFrame) -> None:
    # Reject the whole batch rather than loading partially bad data.
    if df[['order_id', 'customer_id', 'amount']].isnull().any().any():
        raise ValueError("Required column contains NULL")
    if (df['amount'] < 0).any():
        raise ValueError("Negative amount detected")
Principle 3: Reprocessability
Preserve Bronze raw data so Silver/Gold can be recomputed safely.
1) Keep Bronze immutable
2) Fix transformation logic
3) Rebuild Silver
4) Rebuild Gold marts
Principle 4: Backfill support
Support safe historical reprocessing for selected date ranges.
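A common sketch is to drive backfills one partition at a time, so each run is independent and idempotent; `backfill_partitions` is a hypothetical helper assuming daily partitions:

```python
from datetime import date, timedelta


def backfill_partitions(start: date, end: date):
    """Yield one date per daily partition in the inclusive range,
    so each partition can be reprocessed independently."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


# Reprocess three days; each date would parameterize one pipeline run.
days = list(backfill_partitions(date(2026, 1, 1), date(2026, 1, 3)))
```

Combined with idempotent writes (Principle 1), rerunning any single day is safe.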
8. Monitoring and alerting
Four key metrics
1) Success rate
2) Data freshness
3) Volume anomaly
4) Runtime trend
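Freshness is the metric teams most often skip, so here is a minimal sketch; `is_stale` is a hypothetical helper and the one-hour SLA is illustrative:

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_loaded_at: datetime, sla: timedelta, now: datetime) -> bool:
    """Return True when the newest loaded record is older than the SLA."""
    return now - last_loaded_at > sla


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = is_stale(now - timedelta(minutes=30), timedelta(hours=1), now)  # False
stale = is_stale(now - timedelta(hours=2), timedelta(hours=1), now)     # True
```

A check like this runs on a schedule and feeds the alerting path below it.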
Alerts must be actionable
def alert_slack_on_failure(context):
    # Airflow on_failure_callback: context carries the failed task's metadata.
    dag_id = context['dag'].dag_id
    task_id = context['task_instance'].task_id
    log_url = context['task_instance'].log_url
    message = f"[FAIL] {dag_id}.{task_id} | {log_url}"
    # send_slack_message is your team's Slack client wrapper.
    send_slack_message(channel="#data-alerts", message=message)
9. Practical checklist
Design
- ETL/ELT/Zero-ETL decision rationale documented
- Sensitive-data controls integrated into pipeline design
Ingestion
- Schema change detection and response policy defined
- CDC latency SLA documented
Transformation
- Clear staging/intermediate/marts separation
- not_null and unique coverage for critical models
Operations
- Retry/backfill/reprocess paths validated
- Alerts mapped to clear operator actions
Closing
The goal is not simply to make data flow.
The goal is to continuously deliver trustworthy data products.
Technology choices (ELT, dbt, orchestration, streaming) should always be evaluated by one standard: how reliably they support that outcome in production.
In Part 4, we will move from pipeline mechanics to governance implementation, including DataGovOps, data contracts, catalog practices, and access-control patterns.