Data Engineering Playbook Part 3: A Practical Guide to Building Data Pipelines
This post summarizes practical pipeline architecture and operations for 2026. We cover ETL/ELT/Zero-ETL decision criteria, dbt three-layer modeling, orchestration trade-offs across Airflow, Dagster, and Prefect, and Kafka+Flink streaming patterns. We close with reliability principles and a production-ready monitoring checklist.
Series outline
- Part 1: Overview and 2026 Trends (published)
- Part 2: Data Architecture Design (published)
- Part 3: A Practical Guide to Building Data Pipelines (current post)
- Part 4: Data Quality and Governance (upcoming)
- Part 5: Cloud Infrastructure and FinOps (upcoming)
- Part 6: AI-Native Data Engineering (upcoming)
- Part 7: DataOps Team Operations Playbook (upcoming)
Table of contents
- What is a data pipeline?
- ETL vs ELT vs Zero-ETL: which should you choose?
- Ingestion: how to bring data in
- Transformation: practical dbt patterns
- Orchestration: Airflow vs Dagster vs Prefect
- Streaming design with Kafka + Flink
- Reliability: building pipelines you can trust
- Monitoring and alerting
- Practical checklist
1. What is a data pipeline?
A data pipeline is an automated flow that collects data from multiple sources, applies required transformations, and reliably delivers data to analytical or serving systems.
In practice, it is your team's data delivery system.
Source systems (DB, API, events, files, SaaS)
-> Ingestion
-> Transformation
-> Storage
-> Serving (BI, ML, Apps)
When this system is unstable, the impact quickly spreads to dashboards, forecasting, ML quality, and operational decisions.
2. ETL vs ELT vs Zero-ETL
Core question: where does transformation happen?
ETL: Source -> Transform -> Load
- Transform in a separate processing layer first
- Still strong for strict compliance environments
ELT: Source -> Load -> Transform
- Load raw first, transform inside the warehouse
- Default pattern for modern cloud analytics
Zero-ETL: Source -> Direct integration/query -> Analysis
- Minimizes pipeline operations and data movement
- Still needs quality, governance, and semantic controls
| Category | ETL | ELT |
|---|---|---|
| Transform location | Separate processing tier | In-warehouse |
| Raw data retention | Often not retained | Retained, enabling reprocessing |
| Schema evolution | More rigid | More flexible |
| Security/compliance | Pre-load masking/cleaning | Post-load controls required |
| Cost model | External processing costs | Warehouse compute costs |
2026 recommendation
- Start with ELT as the default.
- Use ETL or hybrid for high-regulation data flows and legacy constraints.
- Treat Zero-ETL as movement optimization, not a complete governance solution.
3. Ingestion: how to bring data in
Batch ingestion
| Tool | Strength | Best fit |
|---|---|---|
| Fivetran | Managed connectors, schema drift handling | Fast SaaS onboarding |
| Airbyte | Open source, extensible | Lower vendor lock-in |
| AWS Glue | Serverless ETL | AWS-centric stacks |
Streaming ingestion
Event Sources -> Kafka -> Flink -> Serving Store
- App events / IoT / CDC streams
- Targets: Iceberg, warehouse, OLAP DB
CDC (Change Data Capture) streams row-level changes from transactional databases.
PostgreSQL -> Debezium(CDC) -> Kafka -> Data Lake/Warehouse
(near real-time replication)
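For context, registering a Debezium Postgres connector with Kafka Connect might look like the sketch below. Hostnames, database names, and credentials are placeholders, and the exact property set depends on your Debezium version:

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "changeme",
    "database.dbname": "shop",
    "table.include.list": "public.orders",
    "topic.prefix": "shop"
  }
}
```

Each captured table becomes a Kafka topic (here, `shop.public.orders`) carrying row-level change events downstream.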
4. Transformation: practical dbt patterns
dbt handles the "T" in ELT with a SQL-first, software-engineering-friendly workflow.
Three-layer modeling
my_dbt_project/
models/
staging/
stg_orders.sql
stg_customers.sql
intermediate/
int_order_items_joined.sql
marts/
core/fct_orders.sql
marketing/dim_customers.sql
tests/
macros/
dbt_project.yml
- staging: source normalization
- intermediate: joins and reusable business logic
- marts: final models consumed by BI/ML
Example staging model
with source as (
select * from {{ source('raw', 'orders') }}
),
renamed as (
select
id as order_id,
user_id as customer_id,
status as order_status,
created_at as ordered_at,
amount_cents / 100.0 as order_amount_usd
from source
)
select * from renamed
Example tests
version: 2
models:
- name: fct_orders
columns:
- name: order_id
tests: [unique, not_null]
- name: order_status
tests:
- not_null
- accepted_values:
values: ['placed', 'shipped', 'delivered', 'cancelled']
Materialization guide
Table
- Good for final marts queried directly by BI/ML
View
- Good for light transformations
Incremental
- Good for large append-heavy datasets
- Requires robust unique_key and upsert strategy
Ephemeral
- Good for helper logic reused in larger models
- Compiled into CTEs, no physical table
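As a sketch of the incremental pattern, the model below builds on the `stg_orders` staging example above; the `ordered_at` column and cutoff logic are illustrative assumptions:

```sql
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_status,
    ordered_at,
    order_amount_usd
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- Only process rows newer than what already exists in the target table.
  where ordered_at > (select max(ordered_at) from {{ this }})
{% endif %}
```

With a reliable `unique_key`, late-arriving updates to existing orders are merged rather than duplicated.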
5. Orchestration: Airflow vs Dagster vs Prefect
An orchestrator manages execution order, retries, dependencies, failure handling, and scheduling.
| Category | Airflow | Dagster | Prefect |
|---|---|---|---|
| Core model | DAG/Task | Asset | Flow/Task |
| Strength | Ecosystem and maturity | Lineage and asset semantics | Python-first UX |
| Trade-off | Higher operational complexity | Learning curve | Fewer very-large-scale references |
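Whichever tool you pick, the core value is the same: declarative retries, dependencies, and scheduling. As intuition, here is a minimal plain-Python sketch of the retry-with-backoff behavior all three orchestrators provide declaratively (`flaky_extract` is a made-up task for illustration):

```python
import time


def run_with_retries(task, max_retries=3, base_delay=1.0):
    """Re-run a failing task with exponential backoff between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # exhausted retries: surface the failure
            time.sleep(base_delay * 2 ** attempt)


# Example: a task that fails twice with a transient error, then succeeds.
calls = {"n": 0}

def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient source error")
    return "ok"

result = run_with_retries(flaky_extract, max_retries=3, base_delay=0.01)
```

In Airflow this is `retries`/`retry_delay` on a task; in Dagster and Prefect, retry policies on assets and tasks.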
6. Streaming design with Kafka + Flink
Kafka fundamentals
Topic: event category
Partition: unit of parallelism and ordering scope
Consumer Group: cooperative consumption group
Offset: record position in a partition
Retention: how long events are kept
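The partition concept is easiest to see in code. Kafka's default partitioner hashes the record key (murmur2) modulo the partition count, so all events with the same key land in the same partition and keep their relative order; the crc32 hash below is illustrative only:

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Sketch of key-based partitioning: same key -> same partition.

    This is why ordering is guaranteed per key within a partition,
    not across the whole topic.
    """
    return zlib.crc32(key) % num_partitions


# All events for one customer land in the same partition, preserving order.
p1 = choose_partition(b"customer-42", 6)
p2 = choose_partition(b"customer-42", 6)
```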
Flink processing example
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common import WatermarkStrategy, Duration

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)

# kafka_source is a KafkaSource configured elsewhere
# (see pyflink.datastream.connectors.kafka).
transactions = env.from_source(
    kafka_source,
    # Tolerate events arriving up to 5 seconds out of order.
    WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5)),
    "Kafka Transactions",
)
Streaming lakehouse pattern
Event Sources
-> Kafka topics (durable log)
-> Flink (filter/join/window)
-> Iceberg (Bronze/Silver)
-> Trino/Spark SQL (batch/ad-hoc)
-> Redis/Cassandra (low-latency serving)
7. Reliability: building pipelines you can trust
Principle 1: Idempotency
Running the same pipeline multiple times should produce the same result.
-- Requires a unique constraint on orders_summary(date).
insert into orders_summary (date, total)
select date, sum(amount) as total
from orders
group by date
on conflict (date) do update
set total = excluded.total;
Principle 2: Fail fast
Stop early when data quality checks fail.
import pandas as pd

def validate_orders(df: pd.DataFrame) -> None:
    # Reject the whole batch rather than loading partially bad data.
    if df[['order_id', 'customer_id', 'amount']].isnull().any().any():
        raise ValueError("Required column contains NULL")
    if (df['amount'] < 0).any():
        raise ValueError("Negative amount detected")
Principle 3: Reprocessability
Preserve Bronze raw data so Silver/Gold can be recomputed safely.
1) Keep Bronze immutable
2) Fix transformation logic
3) Rebuild Silver
4) Rebuild Gold marts
Principle 4: Backfill support
Support safe historical reprocessing for selected date ranges.
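A common sketch is to drive backfills one partition at a time, so each run is independent and idempotent; `backfill_partitions` is a hypothetical helper assuming daily partitions:

```python
from datetime import date, timedelta


def backfill_partitions(start: date, end: date):
    """Yield one date per daily partition in the inclusive range,
    so each partition can be reprocessed independently."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


# Reprocess three days; each date would parameterize one pipeline run.
days = list(backfill_partitions(date(2026, 1, 1), date(2026, 1, 3)))
```

Combined with idempotent writes (Principle 1), rerunning any single day is safe.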
8. Monitoring and alerting
Four key metrics
1) Success rate
2) Data freshness
3) Volume anomaly
4) Runtime trend
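Freshness is the metric teams most often skip, so here is a minimal sketch; `is_stale` is a hypothetical helper and the one-hour SLA is illustrative:

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_loaded_at: datetime, sla: timedelta, now: datetime) -> bool:
    """Return True when the newest loaded record is older than the SLA."""
    return now - last_loaded_at > sla


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = is_stale(now - timedelta(minutes=30), timedelta(hours=1), now)  # False
stale = is_stale(now - timedelta(hours=2), timedelta(hours=1), now)     # True
```

A check like this runs on a schedule and feeds the alerting path below it.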
Alerts must be actionable
def alert_slack_on_failure(context):
    # Airflow on_failure_callback: context carries the failed task's metadata.
    dag_id = context['dag'].dag_id
    task_id = context['task_instance'].task_id
    log_url = context['task_instance'].log_url
    message = f"[FAIL] {dag_id}.{task_id} | {log_url}"
    # send_slack_message is your team's Slack client wrapper.
    send_slack_message(channel="#data-alerts", message=message)
9. Practical checklist
Design
- ETL/ELT/Zero-ETL decision rationale documented
- Sensitive-data controls integrated into pipeline design
Ingestion
- Schema change detection and response policy defined
- CDC latency SLA documented
Transformation
- Clear staging/intermediate/marts separation
- not_null and unique coverage for critical models
Operations
- Retry/backfill/reprocess paths validated
- Alerts mapped to clear operator actions
Closing
The goal is not simply to make data flow.
The goal is to continuously deliver trustworthy data products.
Technology choices (ELT, dbt, orchestration, streaming) should always be evaluated by one standard: how reliably they support that outcome in production.
In Part 4, we will move from pipeline mechanics to governance implementation, including DataGovOps, data contracts, catalog practices, and access-control patterns.