Wednesday, April 22, 2026

Data Engineering Playbook — Part 4: Data Quality & Governance Deep Dive

A data lake without governance becomes a data swamp. This post covers the six data quality dimensions and SLA tier design, the four principles of DataGovOps, Data Contract design with CI/CD enforcement, integrated catalog and lineage operations, PII masking techniques, Snowflake RBAC design, and everything you need to implement data quality and governance as code in 2026.

Series outline

  • Part 1 — Overview & 2026 Key Trends (published)
  • Part 2 — Data Architecture Design (published)
  • Part 3 — A Practical Guide to Building Data Pipelines (published)
  • Part 4 — Data Quality & Governance Deep Dive (this post)
  • Part 5 — Cloud & Infrastructure (FinOps, IaC) (upcoming)
  • Part 6 — AI-Native Data Engineering (upcoming)
  • Part 7 — DataOps & Team Operations Playbook (upcoming)

Table of contents

  1. Why data quality and governance?
  2. The six data quality dimensions framework
  3. DataGovOps — governance as code
  4. Data Contracts — the promise between producers and consumers
  5. Building and operating a data catalog
  6. Data lineage — tracking the journey of data
  7. Data observability — the infrastructure of trust
  8. PII management & compliance automation
  9. Access control & role design
  10. Practical checklist

1. Why data quality and governance?

Data governance in 2026 is no longer optional — it is a strategic differentiator. The quality of an AI system's output depends entirely on the quality of its training data. No matter how sophisticated the model, if the underlying data is wrong, the results cannot be trusted.

The real cost of poor data quality shows up across the board: it is the root cause of most governance failures, it has become the bottleneck for AI model performance, and as AI regulations tighten globally — through frameworks such as the EU AI Act, GDPR, CCPA, and various national data protection laws — compliance failures can result in fines worth tens of millions of euros.

"A data lake without governance becomes a data swamp. AI without data quality is a confident mistake machine."

Governance vs compliance

The two concepts are often confused, but they serve different purposes.

| | Data governance | Data compliance |
| --- | --- | --- |
| Definition | An internal framework for how data is managed and controlled | Adherence to external laws and regulations |
| Focus | Defining internal policies, roles, and quality standards | Meeting external requirements such as GDPR, HIPAA, and CCPA |
| Relationship | Governance is the foundation | Compliance is the outcome |

Governance defines how data is managed. Compliance verifies that management happens within the rules.


2. The six data quality dimensions framework

You cannot measure data quality with a single "is it correct?" question. In practice, quality is measured and targeted across six distinct dimensions.

| Dimension | Definition | How to measure |
| --- | --- | --- |
| Accuracy | Does the data correctly reflect real-world values? | Random sample cross-check, source system comparison |
| Completeness | Are all required fields populated? | NOT NULL rate, field fill rate (%) |
| Validity | Does the data conform to allowed formats, ranges, and rules? | Regex checks, allowed value lists, range validation |
| Consistency | Are values consistent across systems and over time? | Cross-system join validation, historical comparison |
| Uniqueness | Are there no duplicate records? | PK duplicate count, fuzzy record detection |
| Timeliness | Is the data sufficiently up to date? | Last updated timestamp, freshness SLA |

Measuring quality as code: Great Expectations

Great Expectations lets you define all six dimensions as code and validate them automatically.

import great_expectations as gx
import pandas as pd

class DataQualityError(Exception):
    """Raised when a quality gate fails and the pipeline run must stop."""

context = gx.get_context()

# Register the orders DataFrame as an in-memory data asset.
# (Fluent API shown here; exact call names vary slightly across GX releases.)
df = pd.read_parquet("data/orders_sample.parquet")
datasource = context.sources.add_pandas("orders_datasource")
data_asset = datasource.add_dataframe_asset(name="orders")
batch_request = data_asset.build_batch_request(dataframe=df)

suite = context.add_expectation_suite("orders_quality_suite")

# ① Accuracy: order_amount must be positive
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="order_amount_usd",
        min_value=0.01,
        max_value=99999.99
    )
)

# ② Completeness: required columns must not be null
for col in ["order_id", "customer_id", "order_status", "ordered_at"]:
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToNotBeNull(column=col)
    )

# ③ Uniqueness: order_id must have no duplicates
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)

# ④ Validity: order_status must be one of the allowed values
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="order_status",
        value_set={"placed", "shipped", "delivered", "cancelled", "refunded"}
    )
)

# ⑤ Volume: at least 100 orders per day
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=100,
        max_value=1_000_000
    )
)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite=suite
)
results = validator.validate()

if not results.success:
    raise DataQualityError(f"Quality check failed: {results.statistics}")

Quality SLA tier design

Applying the same quality standard to every dataset blows up costs. Tier them by business impact.

| Tier | Target datasets | Accuracy target | Freshness SLA | Monitoring cadence |
| --- | --- | --- | --- | --- |
| Gold | KPIs, financial reports, AI training data | 99.9% | Within 1 hour | Real-time |
| Silver | Operational analytics, marketing dashboards | 99.0% | Within 4 hours | Every 1 hour |
| Bronze | Exploratory analysis, raw archive | 95.0% | Within 24 hours | Daily |
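To make the tiers actionable in pipeline code, the thresholds above can be captured in a small registry. This is a minimal sketch; the `QualitySla` class and `sla_for` helper are illustrative names, not part of any framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QualitySla:
    accuracy_target_pct: float
    freshness_max_hours: int
    check_interval_minutes: int  # 0 = continuous / real-time

# Thresholds taken from the tier table; the registry itself is illustrative.
SLA_TIERS = {
    "gold":   QualitySla(99.9, 1, 0),
    "silver": QualitySla(99.0, 4, 60),
    "bronze": QualitySla(95.0, 24, 1440),
}

def sla_for(dataset_tier: str) -> QualitySla:
    """Look up the SLA a dataset must meet, defaulting to the strictest tier."""
    return SLA_TIERS.get(dataset_tier.lower(), SLA_TIERS["gold"])
```

A scheduler can then read `check_interval_minutes` to decide how often to run the checks defined earlier, instead of hard-coding a cadence per pipeline.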

3. DataGovOps — governance as code

DataGovOps is the defining governance paradigm of 2026. Just as DevOps automated software delivery, DataGovOps handles compliance procedures, audit trails, and lineage tracking through code and automation rather than manual oversight.

The era of managing governance with spreadsheets is over. Governance is now a first-class engineering discipline embedded directly into the development workflow.

The four principles of DataGovOps

Principle 1: Policy as Code Define governance rules as executable code, not human-readable documents. Version-control them in Git and apply them automatically via CI/CD.

Principle 2: Shift Left Run quality and security checks at the start of the pipeline, not the end. Start validating the moment data enters from source systems.

Principle 3: Automation First Keep only what genuinely requires human judgment as a manual step. Automate PII detection, access control, audit logging, and lineage tracking.

Principle 4: Continuous Monitoring Not a one-time setup — monitor pipeline health 24/7. When anomalies are detected, automated alerts and blocking should kick in immediately.
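Principle 1 can be made concrete with a small policy check that runs in CI. This is a hedged sketch: the rule set and metadata shape below are illustrative, and a real deployment would load its policies from version-controlled modules.

```python
def check_governance_policy(meta: dict) -> list[str]:
    """Return a list of policy violations for one dataset's metadata.

    The rules are examples only: every dataset needs an owner, a known
    classification, and a masking policy on each PII field.
    """
    violations = []
    if not meta.get("owner"):
        violations.append("missing owner")
    if meta.get("classification") not in {"public", "internal", "restricted", "confidential"}:
        violations.append("missing or unknown classification")
    for field in meta.get("fields", []):
        if field.get("pii") and not field.get("masking_policy"):
            violations.append(f"PII field '{field['name']}' has no masking policy")
    return violations
```

Wired into a CI job, a non-empty violation list fails the build, which is exactly the shift-left behavior Principles 1 and 2 describe.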

Governance role definitions (RACI)

Governance cannot function without clear ownership. Define who is responsible for every dataset before anything else.

| Role | Responsibilities |
| --- | --- |
| Data Owner | Senior-level. Ultimately accountable for the business use of a specific dataset. Approves access, decides data classification. |
| Data Steward | Operational lead. Maintains metadata, monitors quality rules, manages the business glossary. |
| Data Engineer | Implements pipelines, codes quality checks, builds lineage tracking. |
| Data Product Manager | Bridge between business users and the technical team. Manages datasets as products. |
| CDO (Chief Data Officer) | Sets governance strategy and policies, coordinates stakeholders. |

Choosing a governance framework model

| Model | Characteristics | Best fit |
| --- | --- | --- |
| Centralized | A central data team manages all policies | Heavily regulated industries: finance, healthcare |
| Federated (Data Mesh) | Domain teams manage their own policies | Large technology companies |
| Hybrid (federated standards) | Enterprise-wide standards from the center, execution at domain level | The most popular model in 2026 |

4. Data Contracts — the promise between producers and consumers

What is a Data Contract?

A Data Contract is a formal agreement between a data producer and its consumers. It codifies — and automatically enforces — what a dataset guarantees: schema, freshness, volume, and semantic meaning.

By 2026, Data Contracts have moved from theory into everyday practice. Producers validate contract compliance before data reaches consumers. Consumers can detect unexpected schema changes or volume drops before dashboards and models break.

A Data Contract is also a communication mechanism that forces source teams to coordinate with the data engineering team whenever they need to make a change.

Data Contract YAML specification

# data_contract_orders.yaml

apiVersion: v1
kind: DataContract
metadata:
  name: orders
  owner: "order-domain-team@company.com"
  version: "2.1.0"
  status: active
  created_at: "2026-01-15"
  updated_at: "2026-04-01"

schema:
  fields:
    - name: order_id
      type: STRING
      nullable: false
      description: "Unique order identifier (UUID)"
      pii: false

    - name: customer_email
      type: STRING
      nullable: true
      description: "Customer email address"
      pii: true                  # PII flag → triggers automatic masking
      classification: SENSITIVE

    - name: order_amount_usd
      type: FLOAT64
      nullable: false
      constraints:
        min: 0.01
        max: 99999.99

    - name: order_status
      type: STRING
      nullable: false
      constraints:
        allowed_values:
          - placed
          - shipped
          - delivered
          - cancelled
          - refunded

    - name: ordered_at
      type: TIMESTAMP
      nullable: false
      description: "Order creation timestamp (UTC)"

quality:
  freshness:
    max_age_hours: 2
    warn_after_hours: 1
  completeness:
    min_completeness_pct: 99.5
  volume:
    min_daily_rows: 500
    max_daily_rows: 5000000
    anomaly_threshold_pct: 30
  uniqueness:
    unique_columns:
      - order_id

slo:
  availability: 99.9
  latency_p95_seconds: 60
  incident_response_minutes: 30

consumers:
  - name: "revenue-dashboard"
    team: "analytics"
  - name: "fraud-detection-model"
    team: "ml-platform"

change_policy:
  breaking_changes_notice_days: 14
  deprecation_notice_days: 30
  versioning_strategy: semantic
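The change policy above relies on detecting breaking changes before they ship. Here is a minimal sketch of such a detector, comparing the `schema.fields` lists of two contract versions; the function name and the exact rule set are illustrative, not a standard tool.

```python
def find_breaking_changes(old_fields: list[dict], new_fields: list[dict]) -> list[str]:
    """Compare two versions of a contract's schema.fields list and report
    changes that require a major version bump: removed fields, type changes,
    and fields that consumers could rely on being NOT NULL becoming nullable.
    Real tooling would also check constraint changes and renamed fields.
    """
    old = {f["name"]: f for f in old_fields}
    new = {f["name"]: f for f in new_fields}
    breaking = []
    for name, spec in old.items():
        if name not in new:
            breaking.append(f"removed field: {name}")
        elif new[name].get("type") != spec.get("type"):
            breaking.append(f"type change on {name}: {spec.get('type')} -> {new[name].get('type')}")
        elif spec.get("nullable", True) is False and new[name].get("nullable", True) is True:
            breaking.append(f"{name} became nullable")
    return breaking
```

Run against the previous and proposed contract YAML in CI, a non-empty result would block the merge until the producer bumps the major version and starts the 14-day notice period.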

Automated contract enforcement — CI/CD integration

# contract_validator.py — runs in GitHub Actions / GitLab CI
import yaml
import pandas as pd
from dataclasses import dataclass
from typing import List

@dataclass
class ContractViolation:
    contract_name: str
    dimension: str
    field: str
    message: str
    severity: str  # "error" | "warning"

def validate_contract(df: pd.DataFrame, contract_path: str) -> List[ContractViolation]:
    violations = []

    with open(contract_path) as f:
        contract = yaml.safe_load(f)

    schema_fields = {f["name"]: f for f in contract["schema"]["fields"]}

    for field_name, field_spec in schema_fields.items():

        if field_name not in df.columns:
            violations.append(ContractViolation(
                contract_name=contract["metadata"]["name"],
                dimension="schema",
                field=field_name,
                message=f"Field '{field_name}' not found in data",
                severity="error"
            ))
            continue

        if not field_spec.get("nullable", True):
            null_count = df[field_name].isnull().sum()
            if null_count > 0:
                violations.append(ContractViolation(
                    contract_name=contract["metadata"]["name"],
                    dimension="completeness",
                    field=field_name,
                    message=f"NOT NULL violation: {null_count:,} nulls found",
                    severity="error"
                ))

        constraints = field_spec.get("constraints", {})
        if "allowed_values" in constraints:
            invalid = df[
                ~df[field_name].isin(constraints["allowed_values"])
                & df[field_name].notna()
            ]
            if len(invalid) > 0:
                violations.append(ContractViolation(
                    contract_name=contract["metadata"]["name"],
                    dimension="validity",
                    field=field_name,
                    message=f"Allowed value violation: {invalid[field_name].unique()[:5].tolist()}",
                    severity="error"
                ))

        if "min" in constraints:
            below_min = df[df[field_name] < constraints["min"]]
            if len(below_min) > 0:
                violations.append(ContractViolation(
                    contract_name=contract["metadata"]["name"],
                    dimension="validity",
                    field=field_name,
                    message=f"Below minimum ({constraints['min']}): {len(below_min):,} rows",
                    severity="error"
                ))

    return violations


if __name__ == "__main__":
    import sys

    df = pd.read_parquet("data/orders_sample.parquet")
    violations = validate_contract(df, "contracts/data_contract_orders.yaml")

    errors = [v for v in violations if v.severity == "error"]
    warnings = [v for v in violations if v.severity == "warning"]

    for v in violations:
        icon = "❌" if v.severity == "error" else "⚠️"
        print(f"{icon} [{v.dimension}] {v.field}: {v.message}")

    if errors:
        print(f"\n{len(errors)} error(s) found — stopping the pipeline.")
        sys.exit(1)
    else:
        print(f"\n✅ Contract validation passed ({len(warnings)} warning(s))")

Contract tooling ecosystem

| Tool | Highlights | Open-source |
| --- | --- | --- |
| dbt Contracts | Declare schema contracts on dbt models; auto-detect breaking changes in CI | Yes |
| Schemata | Python/TypeScript contract library with schema evolution support | Yes |
| Kafka Schema Registry | Enforces schema compatibility for streaming data (Confluent) | Partial |
| OpenDataContract | YAML-based open standard specification | Yes |
| Atlan | Integrates contract metadata with catalog, lineage, and policy management | No |

5. Building and operating a data catalog

What is a data catalog?

A data catalog is a system that centrally manages metadata for all data assets in an organization. It answers: "Where is this data, what does it mean, can I trust it, and who owns it?" A well-run catalog can cut data discovery time from hours to minutes.

Without a catalog:
  "Where does this number come from?" → Slack DM → Ask 3 people → Answer 2 hours later

With a catalog:
  "Where does this number come from?" → Search the catalog → Lineage, owner, and quality in 2 minutes

What belongs in a catalog

| Metadata type | Contents |
| --- | --- |
| Technical metadata | Schema, data types, partitions, row count, last updated time, storage location |
| Business metadata | Business glossary links, owner, description, use cases, tags |
| Operational metadata | Pipeline run history, quality scores, SLA compliance, incident history |
| Governance metadata | Data classification (PII/public/restricted), access policies, retention period, compliance tags |
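These metadata types can be modeled together in a minimal catalog record. A sketch for illustration only; real catalog tools have far richer schemas, and the field names here are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One asset's record, combining the metadata types above.
    Illustrative field names, not any specific catalog tool's schema."""
    name: str
    owner: str                       # business metadata
    description: str                 # business metadata
    classification: str              # governance metadata
    tags: list = field(default_factory=list)
    quality_score: float = 0.0       # operational metadata

def search(entries, keyword: str):
    """Naive discovery: match the keyword against name, description, and tags."""
    kw = keyword.lower()
    return [e for e in entries
            if kw in e.name.lower()
            or kw in e.description.lower()
            or any(kw in t.lower() for t in e.tags)]
```

Even this toy model shows why the catalog answers "where does this number come from?" in minutes: owner, meaning, and trust signals live in one searchable place.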

Key data catalog tools in 2026

| Tool | Highlights | Best fit |
| --- | --- | --- |
| Atlan | AI-powered metadata, natural language search, governance integration | Modern data stacks |
| Alation | Enterprise-grade, behavior-based recommendations, Soda integration | Large enterprises, finance |
| Collibra | Powerful governance workflows, built for regulated industries | Finance, pharma |
| OpenMetadata | Open-source, self-hosted, highly customizable | Small teams, cost-sensitive |
| DataHub | Open-source, large-scale (LinkedIn-built) | Tech companies |
| Databricks Unity Catalog | Integrated governance within the Databricks ecosystem | Databricks shops |

Catalog adoption roadmap

An effective catalog does not try to register every data asset from day one. Start with the highest-business-impact domains and expand incrementally.

Phase 1 (weeks 1–4): Foundation
  - Select tooling and connect the tech stack (Snowflake, dbt, Airflow)
  - Register the 30 most critical Gold layer tables
  - Assign data owners and write business descriptions

Phase 2 (months 1–3): Automation
  - Set up automated metadata crawlers
  - Auto-sync dbt docs to the catalog
  - Automate PII tagging
  - Display data quality scores in the catalog

Phase 3 (months 3–6): Governance depth
  - Integrate access request workflows with the catalog
  - Build the business glossary
  - Enable full lineage visualization
  - Transition to managing data as Data Products

6. Data lineage — tracking the journey of data

What is data lineage?

Data lineage tracks the full journey of data from its source through transformation, loading, and consumption. It answers: "Where does this KPI number come from?" all the way to "Which dashboards are affected if I change this source table?"

Data lineage example: tracing a revenue KPI

[Postgres orders table]
       │ CDC
       ▼
[Bronze: raw_orders (Iceberg)]
       │ dbt stg_orders
       ▼
[Silver: stg_orders (cleaned)]
       │ dbt fct_daily_revenue
       ▼
[Gold: fct_daily_revenue]
       │
   ┌───┴──────────┐
   ▼              ▼
[Tableau       [ML model
 revenue       training
 dashboard]    data]

Three levels of lineage

① Table-level lineage: Which table is derived from which. fct_orders is built from stg_orders and dim_customers.

② Column-level lineage: Which source column a given column derives from. fct_orders.revenue is computed from stg_orders.amount_cents / 100.

③ Row-level lineage: Traces a specific record back to its source. Used mainly for audit and regulatory purposes. The most granular level, and the most expensive to implement.

OpenLineage — the lineage standard

OpenLineage is an open standard for collecting data lineage. Airflow, dbt, Spark, Flink, and other major tools all support emitting OpenLineage events.

{
  "eventType": "COMPLETE",
  "eventTime": "2026-04-19T06:00:00Z",
  "run": {
    "runId": "3b452f3c-a462-4c78-bf8f-f9f553e5c8e1"
  },
  "job": {
    "namespace": "data-platform",
    "name": "dbt.fct_daily_revenue"
  },
  "inputs": [
    {
      "namespace": "snowflake://company.us-east-1",
      "name": "analytics.silver.stg_orders"
    },
    {
      "namespace": "snowflake://company.us-east-1",
      "name": "analytics.silver.stg_customers"
    }
  ],
  "outputs": [
    {
      "namespace": "snowflake://company.us-east-1",
      "name": "analytics.gold.fct_daily_revenue"
    }
  ]
}
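An event like the one above can be assembled with nothing but the standard library. This is a hedged sketch: the official `openlineage-python` client is the usual way to emit events, and a real event also carries `producer` and facet fields omitted here.

```python
import json
import uuid
from datetime import datetime, timezone

def build_openlineage_event(job_name: str, inputs: list, outputs: list) -> dict:
    """Assemble a COMPLETE run event shaped like the sample above.

    In production you would POST this to your lineage collector or use the
    openlineage-python client; the namespace values are deployment-specific.
    """
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "data-platform", "name": job_name},
        "inputs": [{"namespace": ns, "name": n} for ns, n in inputs],
        "outputs": [{"namespace": ns, "name": n} for ns, n in outputs],
    }

event = build_openlineage_event(
    "dbt.fct_daily_revenue",
    inputs=[("snowflake://company.us-east-1", "analytics.silver.stg_orders")],
    outputs=[("snowflake://company.us-east-1", "analytics.gold.fct_daily_revenue")],
)
print(json.dumps(event, indent=2))
```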

Impact analysis in practice

Before changing a source table schema, use lineage to automatically generate a list of affected downstream assets.

"We plan to drop the customer_email column from the orders table."
           │
           ▼
  [Run lineage impact analysis]
           │
           ▼
Affected assets:
  ⚠️ stg_orders (Silver) — uses customer_email
  ⚠️ fct_customer_ltv (Gold) — depends on stg_orders
  ⚠️ Tableau customer dashboard — depends on fct_customer_ltv
  ⚠️ Marketing email campaign ML model — uses customer_email directly
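Given a downstream adjacency list (hand-written here for illustration; in practice derived from OpenLineage events or the catalog), the affected-asset list is a simple graph traversal:

```python
from collections import deque

# Edges point downstream: source -> assets that consume it directly.
# This toy graph mirrors the example above.
LINEAGE = {
    "orders": ["stg_orders", "marketing_email_model"],
    "stg_orders": ["fct_customer_ltv"],
    "fct_customer_ltv": ["tableau_customer_dashboard"],
}

def downstream_impact(asset: str, graph: dict) -> set:
    """Breadth-first walk collecting every asset affected by a change."""
    affected, queue = set(), deque(graph.get(asset, []))
    while queue:
        node = queue.popleft()
        if node not in affected:
            affected.add(node)
            queue.extend(graph.get(node, []))
    return affected
```

Running `downstream_impact("orders", LINEAGE)` surfaces all four affected assets from the scenario above, which is exactly the report a lineage tool generates before a schema change is approved.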

7. Data observability — the infrastructure of trust

Observability vs monitoring

| | Monitoring | Observability |
| --- | --- | --- |
| Approach | Watches for known failures | Discovers unknown failures |
| Question | "Did this pipeline run?" | "Can I trust this data?" |
| Method | Threshold-based alerts | ML-based anomaly detection |

Data observability applies the APM (Application Performance Monitoring) concept from software systems to data pipelines. In 2026, the winning organizations are not those with the most data — they are those with the most trustworthy data.

The five signals of "data downtime"

The five detection dimensions from the Monte Carlo framework:

| Signal | What it means |
| --- | --- |
| Freshness | Data older than expected ("Why isn't yesterday's data here yet?") |
| Volume | More or fewer rows than expected (suddenly 0 rows, or a 10× spike) |
| Schema | Unexpected column additions, deletions, or type changes |
| Distribution | Statistical shifts in values (average order amount suddenly 10× higher) |
| Lineage | Automatically tracking the downstream impact of upstream asset changes |
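The volume and distribution signals boil down to statistical outlier detection. A deliberately simple sketch using a z-score over recent daily row counts; commercial tools add seasonality and trend modeling on top of this idea.

```python
import statistics

def is_volume_anomaly(history: list, today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it sits more than z_threshold standard
    deviations from the recent mean. Needs at least two historical points."""
    if len(history) < 2:
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold
```

With a week of counts around 1,000 rows/day, a sudden drop to 0 trips the check while normal day-to-day variation does not.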

Soda in practice — SQL-based quality checks

Soda integrates SQL-based quality checks seamlessly with dbt, Airflow, and CI/CD pipelines.

# soda_checks_orders.yaml
# Run: soda scan -d snowflake orders

checks for orders:

  # Freshness
  - freshness(ordered_at):
      name: "Orders data freshness within 2 hours"
      warn: when > 2h
      fail: when > 4h

  # Volume anomaly detection
  - row_count > 0:
      name: "Detect empty table"

  - row_count:
      name: "Daily volume within normal range"
      warn: when not between 500 and 1000000
      fail: when not between 200 and 2000000

  # Completeness
  - missing_count(order_id) = 0:
      name: "No nulls in order_id"

  - missing_percent(customer_id) < 0.1%:
      name: "customer_id completeness >= 99.9%"

  # Validity
  - invalid_percent(order_status) < 0.01%:
      valid values: [placed, shipped, delivered, cancelled, refunded]
      name: "order_status allowed value check"

  # Uniqueness
  - duplicate_count(order_id) = 0:
      name: "No duplicate order_ids"

  # Referential integrity
  - values in (customer_id) must exist in customers (customer_id):
      name: "Customer ID referential integrity"

Observability tools comparison 2026

| Tool | Strengths | Pricing | Open-source |
| --- | --- | --- | --- |
| Monte Carlo | ML-based anomaly detection, auto-lineage, coined "data downtime" | Enterprise | No |
| Soda | SQL-based, excellent dbt/Airflow integration, CI/CD friendly | Free tier available | Partial |
| Great Expectations | Rich Expectation library | Open-source | Yes |
| Metaplane | Automated monitoring setup, easy warehouse integration | SMB-friendly | No |
| dbt built-in tests | Quality checks inside dbt projects, no extra tooling needed | Included with dbt plans | Yes |

8. PII management & compliance automation

PII classification framework

PII (Personally Identifiable Information) is any data that can be used to identify an individual. As AI regulations advance globally — including the EU AI Act, GDPR updates, CCPA, and various national data protection frameworks — managing PII in data pipelines is no longer optional.

| Classification | Example fields | Treatment |
| --- | --- | --- |
| Direct identifiers | Name, SSN, passport number, email | Anonymization or tokenization |
| Quasi-identifiers | Date of birth, address, ZIP code, IP | Pseudonymization or masking |
| Sensitive data | Health records, financial accounts, credit scores | Strict access control + encryption + audit log |
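Automatic PII detection usually precedes masking. Below is a minimal regex-based scanner for illustration; the patterns and the 80% match threshold are assumptions, and production scanners combine column-name heuristics with ML classifiers.

```python
import re

# Illustrative patterns; tune per locale and data domain.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone_kr": re.compile(r"^01\d-\d{3,4}-\d{4}$"),
    "credit_card": re.compile(r"^(?:\d{4}-){3}\d{4}$"),
}

def detect_pii_columns(rows: list, sample_threshold: float = 0.8) -> dict:
    """Scan a list of row dicts and tag a column as PII when at least
    sample_threshold of its non-null values match one of the patterns."""
    tags = {}
    if not rows:
        return tags
    for col in rows[0]:
        values = [r[col] for r in rows if r.get(col) is not None]
        if not values:
            continue
        for label, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in values if isinstance(v, str) and pattern.match(v))
            if hits / len(values) >= sample_threshold:
                tags[col] = label
                break
    return tags
```

The resulting tags can feed the `pii: true` flags in the Data Contract and the automatic catalog tagging described earlier.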

Data masking techniques

# data_masking.py — applied at pipeline ingestion time

import hashlib
import re
from typing import Optional

class DataMasker:

    @staticmethod
    def tokenize(value: Optional[str], salt: str = "company_secret") -> Optional[str]:
        """
        Tokenize: convert to a consistent pseudonym (preserves JOIN capability).
        Suitable for fields like customer_id where analysis requires JOINs.
        In production, load the salt from a secrets manager, not source code.
        """
        if value is None:
            return None
        return hashlib.sha256(f"{value}{salt}".encode()).hexdigest()[:16]

    @staticmethod
    def mask_email(email: str) -> str:
        """Partial masking: john.doe@company.com → jo***@company.com"""
        if not email or "@" not in email:
            return email
        local, domain = email.split("@", 1)
        masked_local = local[:2] + "***"
        return f"{masked_local}@{domain}"

    @staticmethod
    def mask_phone(phone: str) -> str:
        """Phone masking: 010-1234-5678 → 010-****-5678"""
        if not phone:
            return phone
        return re.sub(r'(\d{3})-(\d{4})-(\d{4})', r'\1-****-\3', phone)

    @staticmethod
    def mask_credit_card(cc: str) -> str:
        """Card masking: 4111-1111-1111-1234 → ****-****-****-1234"""
        if not cc:
            return cc
        digits = re.sub(r'\D', '', cc)
        return f"****-****-****-{digits[-4:]}"

    @staticmethod
    def anonymize(value) -> str:
        """Full anonymization: when individual identification is not needed at all."""
        return "REDACTED"


def apply_pii_masking(df):
    """Apply PII masking during Bronze → Silver transition."""
    masker = DataMasker()

    if "customer_email" in df.columns:
        df["customer_email"] = df["customer_email"].apply(masker.mask_email)

    if "phone_number" in df.columns:
        df["phone_number"] = df["phone_number"].apply(masker.mask_phone)

    if "customer_id" in df.columns:
        df["customer_id"] = df["customer_id"].apply(masker.tokenize)

    if "ssn" in df.columns:
        df.drop(columns=["ssn"], inplace=True)

    return df

PII flow in the Medallion architecture

[Bronze Layer]          ← Raw PII preserved (pii_admin access only)
  customer_email: "john@co.com"
  phone: "010-1234-5678"
  customer_id: "usr_abc123"
        │
        │  Masking applied automatically in Bronze → Silver pipeline
        ▼
[Silver Layer]          ← Masked data (analytics team can access)
  customer_email: "jo***@co.com"
  phone: "010-****-5678"
  customer_id: "a3f2b9..."    ← Tokenized (JOIN-capable)
        │
        ▼
[Gold Layer]            ← Aggregated data with no PII (org-wide access)
  revenue_by_segment: {...}
  daily_order_count: 1234

9. Access control & role design

Principle of Least Privilege

Every user and service should have only the minimum permissions required for their job. This is also the most frequently violated security principle in data platforms.

-- Snowflake RBAC example

-- Role hierarchy design
CREATE ROLE analyst_gold;      -- Gold read-only (all analysts)
CREATE ROLE analyst_silver;    -- Silver read-only
CREATE ROLE analyst_bronze;    -- Bronze read-only
CREATE ROLE data_engineer;     -- Silver/Gold write + Bronze read
CREATE ROLE pii_admin;         -- Full read including Bronze PII columns

-- Role inheritance
GRANT ROLE analyst_bronze TO ROLE analyst_silver;
GRANT ROLE analyst_silver TO ROLE analyst_gold;
GRANT ROLE analyst_gold TO ROLE data_engineer;

-- Table-level permissions
GRANT SELECT ON ALL TABLES IN SCHEMA gold TO ROLE analyst_gold;
GRANT SELECT ON ALL TABLES IN SCHEMA silver TO ROLE analyst_silver;

-- Column-level Dynamic Data Masking
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('pii_admin') THEN val
    WHEN CURRENT_ROLE() IN ('data_engineer') THEN
      REGEXP_REPLACE(val, '(^.{2}).*(@.*)', '\\1***\\2')
    ELSE '***REDACTED***'
  END;

ALTER TABLE silver.customers
MODIFY COLUMN email
SET MASKING POLICY email_mask;

Data access request workflow

Analyst A needs access to PII-containing data
           │
           ▼
[Submit access request via data catalog]
  - Requested dataset: bronze.customers (contains PII)
  - Purpose: churn customer segmentation analysis
  - Duration: 30 days
           │
           ▼
[Automated review triggers]
  ✅ Does requester's department match the data use policy?
  ✅ GDPR purpose limitation check (data minimization principle)
  ✅ Can the existing Silver masked data satisfy the need? → No
           │
           ▼
[Approval notification sent to data owner]
  → Owner approves (24-hour SLA)
           │
           ▼
[Temporary access automatically granted]
  + Audit log automatically created
  + Access automatically expires after 30 days
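The automated review step in the workflow above can be sketched as a policy lookup. Everything here (the `POLICY` table, team names, and limits) is illustrative, not a real product's API.

```python
from dataclasses import dataclass

@dataclass
class AccessRequest:
    requester_team: str
    dataset: str
    purpose: str
    duration_days: int

# Illustrative policy: which teams may request which Bronze datasets,
# and the maximum grant length before escalation.
POLICY = {
    "bronze.customers": {"allowed_teams": {"analytics", "ml-platform"}, "max_days": 30},
}

def review_request(req: AccessRequest) -> tuple[bool, str]:
    """Automated pre-screen run before the data owner sees the request."""
    rule = POLICY.get(req.dataset)
    if rule is None:
        return False, "dataset not governed by an access policy"
    if req.requester_team not in rule["allowed_teams"]:
        return False, "team not permitted for this dataset"
    if req.duration_days > rule["max_days"]:
        return False, f"duration exceeds {rule['max_days']}-day maximum"
    if not req.purpose.strip():
        return False, "purpose is required (GDPR purpose limitation)"
    return True, "forwarded to data owner for approval"
```

Only requests that pass this gate reach a human, which keeps the owner's 24-hour approval SLA realistic.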

10. Practical checklist

Data quality

  • All Gold tables covered by checks across all six quality dimensions?
  • Quality gates in place at each pipeline stage (Bronze→Silver, Silver→Gold)?
  • Quality SLA tiers (Gold/Silver/Bronze) defined per dataset?
  • Automatic alerting and pipeline blocking on quality check failures?
  • dbt tests or Great Expectations / Soda integrated into CI/CD?

Governance framework

  • Data owner assigned for all major datasets?
  • Data classification scheme (public/internal/restricted/confidential) defined and applied?
  • Governance policies managed as code (Policy as Code), not documents?
  • Governance model (centralized/federated/hybrid) chosen to match org structure?

Data Contracts

  • Data Contracts (YAML specifications) written for major Gold tables?
  • Contracts automatically validated in the CI/CD pipeline?
  • Process in place to notify consumers 14+ days before breaking changes?
  • Schema Registry applied to Kafka streaming data?

Data catalog & lineage

  • Data catalog deployed with major assets registered?
  • Data lineage (source → transform → consume) tracked automatically?
  • Downstream impact analysis possible before schema changes?
  • Business glossary linked to the catalog?

PII & compliance

  • PII columns automatically detected and tagged?
  • PII masking automatically applied during Bronze→Silver transition?
  • Column-level access control (Dynamic Data Masking) in place?
  • Access request/approval/expiry workflow automated?
  • Data retention policies defined and automatically enforced?
  • Audit logs generated in compliance with applicable regulations (GDPR/CCPA/local data protection laws)?

Closing thoughts

Data quality and governance are not "build it once and you're done" projects. They are living systems that must evolve continuously as business changes, data grows, and regulations tighten.

The most important paradigm shift of 2026 is this: stop viewing governance as "regulation that slows down data engineering," and start viewing it as "an engineering practice that guarantees the reliability of data products." Pipelines with governance built in enable faster decision-making, more trustworthy AI, and operations free from compliance risk.

The next part goes deep into the cloud infrastructure and cost optimization (FinOps) that underpins all of this — IaC, multi-cloud strategy, and compute cost governance.


Part 5 preview: Cloud & Infrastructure Deep Dive

  • AWS vs GCP vs Azure data platform comparison
  • Codifying data infrastructure with Terraform (IaC)
  • FinOps — practical cloud data cost optimization
  • Operating data platforms on Kubernetes
  • Multi-cloud & hybrid cloud strategies
