Data Engineering Playbook — Part 4: Data Quality & Governance Deep Dive
A data lake without governance becomes a data swamp. This post covers everything you need to implement data quality and governance as code in 2026: the six data quality dimensions and SLA tier design, the four principles of DataGovOps, Data Contract design with CI/CD enforcement, data catalog and lineage operations, data observability, PII masking techniques, and Snowflake RBAC design.
Series outline
- Part 1 — Overview & 2026 Key Trends (published)
- Part 2 — Data Architecture Design (published)
- Part 3 — A Practical Guide to Building Data Pipelines (published)
- Part 4 — Data Quality & Governance Deep Dive (this post)
- Part 5 — Cloud & Infrastructure (FinOps, IaC) (upcoming)
- Part 6 — AI-Native Data Engineering (upcoming)
- Part 7 — DataOps & Team Operations Playbook (upcoming)
Table of contents
- Why data quality and governance?
- The six data quality dimensions framework
- DataGovOps — governance as code
- Data Contracts — the promise between producers and consumers
- Building and operating a data catalog
- Data lineage — tracking the journey of data
- Data observability — the infrastructure of trust
- PII management & compliance automation
- Access control & role design
- Practical checklist
1. Why data quality and governance?
Data governance in 2026 is no longer optional — it is a strategic differentiator. The quality of an AI system's output depends entirely on the quality of its training data. No matter how sophisticated the model, if the underlying data is wrong, the results cannot be trusted.
The real cost of poor data quality shows up across the board: it is the root cause of most governance failures, it has become the bottleneck for AI model performance, and as AI regulations tighten globally — through frameworks such as the EU AI Act, GDPR, CCPA, and various national data protection laws — compliance failures can result in fines worth tens of millions of euros.
"A data lake without governance becomes a data swamp. AI without data quality is a confident mistake machine."
Governance vs compliance
The two concepts are often confused, but they serve different purposes.
| | Data governance | Data compliance |
|---|---|---|
| Definition | An internal framework for how data is managed and controlled | Adherence to external laws and regulations |
| Focus | Defining internal policies, roles, and quality standards | Meeting external requirements such as GDPR, HIPAA, and CCPA |
| Relationship | Governance is the foundation | Compliance is the outcome |
Governance defines how data is managed. Compliance verifies that management happens within the rules.
2. The six data quality dimensions framework
You cannot measure data quality with a single "is it correct?" question. In practice, quality is measured and targeted across six distinct dimensions.
| Dimension | Definition | How to measure |
|---|---|---|
| Accuracy | Does the data correctly reflect real-world values? | Random sample cross-check, source system comparison |
| Completeness | Are all required fields populated? | NOT NULL rate, field fill rate (%) |
| Validity | Does the data conform to allowed formats, ranges, and rules? | Regex checks, allowed value lists, range validation |
| Consistency | Are values consistent across systems and over time? | Cross-system join validation, historical comparison |
| Uniqueness | Are there no duplicate records? | PK duplicate count, fuzzy record detection |
| Timeliness | Is the data sufficiently up to date? | Last updated timestamp, freshness SLA |
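Several of these dimensions can be computed without any framework. A minimal sketch over a list of records, where the field names and the allowed-status set are illustrative:

```python
def quality_metrics(rows, required_fields, allowed_status, pk="order_id"):
    """Compute completeness, uniqueness, and validity rates over a list of dicts."""
    n = len(rows)
    # Completeness: share of rows where every required field is populated
    complete = sum(all(r.get(f) is not None for f in required_fields) for r in rows)
    # Uniqueness: share of distinct primary-key values
    unique_pks = len({r[pk] for r in rows})
    # Validity: share of rows whose status is in the allowed set
    valid = sum(r.get("order_status") in allowed_status for r in rows)
    return {
        "completeness_pct": 100 * complete / n,
        "uniqueness_pct": 100 * unique_pks / n,
        "validity_pct": 100 * valid / n,
    }

rows = [
    {"order_id": "a", "order_status": "placed"},
    {"order_id": "b", "order_status": "teleported"},  # invalid status
    {"order_id": "b", "order_status": "shipped"},     # duplicate PK
    {"order_id": "c", "order_status": None},          # missing required field
]
m = quality_metrics(rows, ["order_id", "order_status"],
                    {"placed", "shipped", "delivered"})
print(m)  # {'completeness_pct': 75.0, 'uniqueness_pct': 75.0, 'validity_pct': 50.0}
```

Accuracy, consistency, and timeliness need external references (source systems, history, clocks), which is exactly where dedicated tooling earns its keep.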
Measuring quality as code: Great Expectations
Great Expectations lets you define these quality dimensions as code and validate them automatically. The suite below covers accuracy, completeness, uniqueness, and validity, plus a volume check:

```python
import great_expectations as gx

context = gx.get_context()
datasource = context.sources.add_pandas("orders_datasource")
data_asset = datasource.add_dataframe_asset(name="orders")
suite = context.add_expectation_suite("orders_quality_suite")

# ① Accuracy: order_amount must be positive
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="order_amount_usd",
        min_value=0.01,
        max_value=99999.99,
    )
)

# ② Completeness: required columns must not be null
for col in ["order_id", "customer_id", "order_status", "ordered_at"]:
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToNotBeNull(column=col)
    )

# ③ Uniqueness: order_id must have no duplicates
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)

# ④ Validity: order_status must be one of the allowed values
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="order_status",
        value_set={"placed", "shipped", "delivered", "cancelled", "refunded"},
    )
)

# ⑤ Volume: at least 100 orders per day
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=100,
        max_value=1_000_000,
    )
)

validator = context.get_validator(
    datasource_name="orders_datasource",
    data_asset_name="orders",
    expectation_suite=suite,
)
results = validator.validate()
if not results.success:
    # Fail the pipeline run; substitute your project's own error type here
    raise RuntimeError(f"Quality check failed: {results.statistics}")
```
Quality SLA tier design
Applying the same quality standard to every dataset blows up costs. Tier them by business impact.
| Tier | Target datasets | Accuracy target | Freshness SLA | Monitoring cadence |
|---|---|---|---|---|
| Gold | KPIs, financial reports, AI training data | 99.9% | Within 1 hour | Real-time |
| Silver | Operational analytics, marketing dashboards | 99.0% | Within 4 hours | Every 1 hour |
| Bronze | Exploratory analysis, raw archive | 95.0% | Within 24 hours | Daily |
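The tier table can live as code next to the pipelines, so SLA lookups are programmatic rather than tribal knowledge. A sketch; the dataset-to-tier assignments are hypothetical examples:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityTier:
    accuracy_target_pct: float
    freshness_sla_hours: int
    monitoring_cadence: str

# Tier definitions mirroring the table above (values are policy, not code)
TIERS = {
    "gold":   QualityTier(99.9, 1,  "real-time"),
    "silver": QualityTier(99.0, 4,  "hourly"),
    "bronze": QualityTier(95.0, 24, "daily"),
}

# Hypothetical dataset → tier assignments, e.g. kept alongside the data contract
DATASET_TIERS = {"fct_daily_revenue": "gold", "stg_orders": "silver"}

def freshness_sla_hours(dataset: str) -> int:
    """Look up the freshness SLA for a dataset; unknown datasets default to bronze."""
    tier = DATASET_TIERS.get(dataset, "bronze")
    return TIERS[tier].freshness_sla_hours

print(freshness_sla_hours("fct_daily_revenue"))  # 1
print(freshness_sla_hours("some_raw_table"))     # 24
```

Defaulting unknown datasets to the cheapest tier keeps monitoring costs bounded while anything business-critical gets promoted explicitly.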
3. DataGovOps — governance as code
DataGovOps is the defining governance paradigm of 2026. Just as DevOps automated software delivery, DataGovOps handles compliance procedures, audit trails, and lineage tracking through code and automation rather than manual oversight.
The era of managing governance with spreadsheets is over. Governance is now a first-class engineering discipline embedded directly into the development workflow.
The four principles of DataGovOps
Principle 1: Policy as Code. Define governance rules as executable code, not human-readable documents. Version-control them in Git and apply them automatically via CI/CD.
Principle 2: Shift Left. Run quality and security checks at the start of the pipeline, not the end. Start validating the moment data enters from source systems.
Principle 3: Automation First. Keep only what genuinely requires human judgment as a manual step. Automate PII detection, access control, audit logging, and lineage tracking.
Principle 4: Continuous Monitoring. Not a one-time setup — monitor pipeline health 24/7. When anomalies are detected, automated alerts and blocking should kick in immediately.
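Principle 1 becomes concrete once a rule is executable. A minimal Policy-as-Code sketch, where the policy keys and the metadata shape are illustrative rather than any particular tool's format:

```python
# The policy lives in version control as data; CI evaluates it against
# dataset metadata and fails the build on violations.
POLICY = {
    "require_owner": True,
    "pii_requires_masking": True,
}

def evaluate_policy(dataset_meta: dict) -> list[str]:
    """Return a list of human-readable policy violations (empty = compliant)."""
    violations = []
    if POLICY["require_owner"] and not dataset_meta.get("owner"):
        violations.append("dataset has no assigned owner")
    if POLICY["pii_requires_masking"]:
        for col in dataset_meta.get("columns", []):
            if col.get("pii") and not col.get("masking_policy"):
                violations.append(f"PII column '{col['name']}' has no masking policy")
    return violations

meta = {
    "owner": "order-domain-team@company.com",
    "columns": [
        {"name": "order_id", "pii": False},
        {"name": "customer_email", "pii": True},  # PII but unmasked → violation
    ],
}
print(evaluate_policy(meta))  # ["PII column 'customer_email' has no masking policy"]
```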
Governance role definitions (RACI)
Governance cannot function without clear ownership. Define who is responsible for every dataset before anything else.
| Role | Responsibilities |
|---|---|
| Data Owner | Senior-level. Ultimately accountable for the business use of a specific dataset. Approves access, decides data classification. |
| Data Steward | Operational lead. Maintains metadata, monitors quality rules, manages the business glossary. |
| Data Engineer | Implements pipelines, codes quality checks, builds lineage tracking. |
| Data Product Manager | Bridge between business users and the technical team. Manages datasets as products. |
| CDO (Chief Data Officer) | Sets governance strategy and policies, coordinates stakeholders. |
Choosing a governance framework model
| Model | Characteristics | Best fit |
|---|---|---|
| Centralized | A central data team manages all policies | Heavily regulated industries: finance, healthcare |
| Federated (Data Mesh) | Domain teams manage their own policies | Large technology companies |
| Hybrid (federated standards) | Enterprise-wide standards from center, execution at domain level | The most popular model in 2026 |
4. Data Contracts
What is a Data Contract?
A Data Contract is a formal agreement between a data producer and its consumers. It codifies — and automatically enforces — what a dataset guarantees: schema, freshness, volume, and semantic meaning.
By 2026, Data Contracts have moved from theory into everyday practice. Producers validate contract compliance before data reaches consumers. Consumers can detect unexpected schema changes or volume drops before dashboards and models break.
A Data Contract is also a communication mechanism that forces source teams to coordinate with the data engineering team whenever they need to make a change.
Data Contract YAML specification
```yaml
# data_contract_orders.yaml
apiVersion: v1
kind: DataContract

metadata:
  name: orders
  owner: "order-domain-team@company.com"
  version: "2.1.0"
  status: active
  created_at: "2026-01-15"
  updated_at: "2026-04-01"

schema:
  fields:
    - name: order_id
      type: STRING
      nullable: false
      description: "Unique order identifier (UUID)"
      pii: false
    - name: customer_email
      type: STRING
      nullable: true
      description: "Customer email address"
      pii: true  # PII flag → triggers automatic masking
      classification: SENSITIVE
    - name: order_amount_usd
      type: FLOAT64
      nullable: false
      constraints:
        min: 0.01
        max: 99999.99
    - name: order_status
      type: STRING
      nullable: false
      constraints:
        allowed_values:
          - placed
          - shipped
          - delivered
          - cancelled
          - refunded
    - name: ordered_at
      type: TIMESTAMP
      nullable: false
      description: "Order creation timestamp (UTC)"

quality:
  freshness:
    max_age_hours: 2
    warn_after_hours: 1
  completeness:
    min_completeness_pct: 99.5
  volume:
    min_daily_rows: 500
    max_daily_rows: 5000000
    anomaly_threshold_pct: 30
  uniqueness:
    unique_columns:
      - order_id

slo:
  availability: 99.9
  latency_p95_seconds: 60
  incident_response_minutes: 30

consumers:
  - name: "revenue-dashboard"
    team: "analytics"
  - name: "fraud-detection-model"
    team: "ml-platform"

change_policy:
  breaking_changes_notice_days: 14
  deprecation_notice_days: 30
  versioning_strategy: semantic
```
Automated contract enforcement — CI/CD integration
```python
# contract_validator.py — runs in GitHub Actions / GitLab CI
import sys
import yaml
import pandas as pd
from dataclasses import dataclass
from typing import List


@dataclass
class ContractViolation:
    contract_name: str
    dimension: str
    field: str
    message: str
    severity: str  # "error" | "warning"


def validate_contract(df: pd.DataFrame, contract_path: str) -> List[ContractViolation]:
    violations = []
    with open(contract_path) as f:
        contract = yaml.safe_load(f)
    schema_fields = {f["name"]: f for f in contract["schema"]["fields"]}

    for field_name, field_spec in schema_fields.items():
        # Schema: the contracted field must exist at all
        if field_name not in df.columns:
            violations.append(ContractViolation(
                contract_name=contract["metadata"]["name"],
                dimension="schema",
                field=field_name,
                message=f"Field '{field_name}' not found in data",
                severity="error",
            ))
            continue

        # Completeness: NOT NULL enforcement
        if not field_spec.get("nullable", True):
            null_count = df[field_name].isnull().sum()
            if null_count > 0:
                violations.append(ContractViolation(
                    contract_name=contract["metadata"]["name"],
                    dimension="completeness",
                    field=field_name,
                    message=f"NOT NULL violation: {null_count:,} nulls found",
                    severity="error",
                ))

        # Validity: allowed values and range constraints
        constraints = field_spec.get("constraints", {})
        if "allowed_values" in constraints:
            invalid = df[
                ~df[field_name].isin(constraints["allowed_values"])
                & df[field_name].notna()
            ]
            if len(invalid) > 0:
                violations.append(ContractViolation(
                    contract_name=contract["metadata"]["name"],
                    dimension="validity",
                    field=field_name,
                    message=f"Allowed value violation: {invalid[field_name].unique()[:5].tolist()}",
                    severity="error",
                ))
        if "min" in constraints:
            below_min = df[df[field_name] < constraints["min"]]
            if len(below_min) > 0:
                violations.append(ContractViolation(
                    contract_name=contract["metadata"]["name"],
                    dimension="validity",
                    field=field_name,
                    message=f"Below minimum ({constraints['min']}): {len(below_min):,} rows",
                    severity="error",
                ))
    return violations


if __name__ == "__main__":
    df = pd.read_parquet("data/orders_sample.parquet")
    violations = validate_contract(df, "contracts/data_contract_orders.yaml")
    errors = [v for v in violations if v.severity == "error"]
    warnings = [v for v in violations if v.severity == "warning"]

    for v in violations:
        icon = "❌" if v.severity == "error" else "⚠️"
        print(f"{icon} [{v.dimension}] {v.field}: {v.message}")

    if errors:
        print(f"\n{len(errors)} error(s) found — stopping the pipeline.")
        sys.exit(1)
    else:
        print(f"\n✅ Contract validation passed ({len(warnings)} warning(s))")
```
Contract tooling ecosystem
| Tool | Highlights | Open-source |
|---|---|---|
| dbt Contracts | Declare schema contracts on dbt models; auto-detect breaking changes in CI | ✅ |
| Schemata | Python/TypeScript contract library with schema evolution support | ✅ |
| Kafka Schema Registry | Enforces schema compatibility for streaming data (Confluent) | Partial |
| OpenDataContract | YAML-based open standard specification | ✅ |
| Atlan | Integrates contract metadata with catalog, lineage, and policy management | ❌ |
5. Building and operating a data catalog
What is a data catalog?
A data catalog is a system that centrally manages metadata for all data assets in an organization. It answers: "Where is this data, what does it mean, can I trust it, and who owns it?" A well-run catalog can cut data discovery time from hours to minutes.
- Without a catalog: "Where does this number come from?" → Slack DM → ask three people → answer two hours later
- With a catalog: "Where does this number come from?" → search the catalog → lineage, owner, and quality in two minutes
What belongs in a catalog
| Metadata type | Contents |
|---|---|
| Technical metadata | Schema, data types, partitions, row count, last updated time, storage location |
| Business metadata | Business glossary links, owner, description, use cases, tags |
| Operational metadata | Pipeline run history, quality scores, SLA compliance, incident history |
| Governance metadata | Data classification (PII/public/restricted), access policies, retention period, compliance tags |
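As a rough sketch, the four metadata types can be modeled as one record per asset (a deliberately simplified model; real catalogs such as DataHub or OpenMetadata define far richer schemas):

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    # Technical metadata
    name: str
    schema_fields: list
    row_count: int
    # Business metadata
    owner: str
    description: str
    tags: list = field(default_factory=list)
    # Operational metadata
    quality_score: float = 0.0
    # Governance metadata
    classification: str = "internal"  # public / internal / restricted / confidential
    retention_days: int = 365

entry = CatalogEntry(
    name="gold.fct_daily_revenue",
    schema_fields=["date", "revenue_usd"],
    row_count=1_234,
    owner="analytics@company.com",
    description="Daily revenue by calendar date",
)
print(entry.classification)  # internal
```

Keeping all four metadata types on one record is what lets a single catalog search answer location, meaning, trust, and ownership at once.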
Key data catalog tools in 2026
| Tool | Highlights | Best fit |
|---|---|---|
| Atlan | AI-powered metadata, natural language search, governance integration | Modern data stacks |
| Alation | Enterprise-grade, behavior-based recommendations, Soda integration | Large enterprises, finance |
| Collibra | Powerful governance workflows, built for regulated industries | Finance, pharma |
| OpenMetadata | Open-source, self-hosted, highly customizable | Small teams, cost-sensitive |
| DataHub | Open-source, large-scale (LinkedIn-built) | Tech companies |
| Databricks Unity Catalog | Integrated governance within the Databricks ecosystem | Databricks shops |
Catalog adoption roadmap
An effective catalog does not try to register every data asset from day one. Start with the highest-business-impact domains and expand incrementally.
Phase 1 (weeks 1–4): Foundation
- Select tooling and connect the tech stack (Snowflake, dbt, Airflow)
- Register the 30 most critical Gold layer tables
- Assign data owners and write business descriptions
Phase 2 (months 1–3): Automation
- Set up automated metadata crawlers
- Auto-sync dbt docs to the catalog
- Automate PII tagging
- Display data quality scores in the catalog
Phase 3 (months 3–6): Governance depth
- Integrate access request workflows with the catalog
- Build the business glossary
- Enable full lineage visualization
- Transition to managing data as Data Products
6. Data lineage — tracking the journey of data
What is data lineage?
Data lineage tracks the full journey of data from its source through transformation, loading, and consumption. It answers: "Where does this KPI number come from?" all the way to "Which dashboards are affected if I change this source table?"
Data lineage example: tracing a revenue KPI
```
[Postgres orders table]
        │ CDC
        ▼
[Bronze: raw_orders (Iceberg)]
        │ dbt stg_orders
        ▼
[Silver: stg_orders (cleaned)]
        │ dbt fct_daily_revenue
        ▼
[Gold: fct_daily_revenue]
        │
   ┌────┴─────────┐
   ▼              ▼
[Tableau       [ML model
 revenue        training
 dashboard]     data]
```
Three levels of lineage
① Table-level lineage: Which table is derived from which. fct_orders is built from stg_orders and dim_customers.
② Column-level lineage: Which source column a given column derives from. fct_orders.revenue is computed from stg_orders.amount_cents / 100.
③ Row-level lineage: Traces a specific record back to its source. Used mainly for audit and regulatory purposes. The most granular level, and the most expensive to implement.
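Table-level lineage is just a directed graph, and impact analysis is a downstream traversal of it. A sketch using plain dictionaries, with edges mirroring the revenue-KPI example above:

```python
from collections import deque

# Table-level lineage as a directed graph: edge = upstream -> downstream
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_daily_revenue"],
    "fct_daily_revenue": ["tableau_revenue_dashboard", "ml_training_data"],
}

def downstream_impact(asset: str) -> set[str]:
    """Breadth-first traversal: every asset affected if `asset` changes."""
    affected, queue = set(), deque([asset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

print(sorted(downstream_impact("raw_orders")))
# ['fct_daily_revenue', 'ml_training_data', 'stg_orders', 'tableau_revenue_dashboard']
```

Real lineage tools build this graph automatically from query logs and OpenLineage events instead of a hand-maintained dict, but the impact-analysis query is the same traversal.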
OpenLineage — the lineage standard
OpenLineage is an open standard for collecting data lineage. Airflow, dbt, Spark, Flink, and other major tools all support emitting OpenLineage events.
```json
{
  "eventType": "COMPLETE",
  "eventTime": "2026-04-19T06:00:00Z",
  "run": {
    "runId": "3b452f3c-a462-4c78-bf8f-f9f553e5c8e1"
  },
  "job": {
    "namespace": "data-platform",
    "name": "dbt.fct_daily_revenue"
  },
  "inputs": [
    {
      "namespace": "snowflake://company.us-east-1",
      "name": "analytics.silver.stg_orders"
    },
    {
      "namespace": "snowflake://company.us-east-1",
      "name": "analytics.silver.stg_customers"
    }
  ],
  "outputs": [
    {
      "namespace": "snowflake://company.us-east-1",
      "name": "analytics.gold.fct_daily_revenue"
    }
  ]
}
```
Impact analysis in practice
Before changing a source table schema, use lineage to automatically generate a list of affected downstream assets.
```
"We plan to drop the customer_email column from the orders table."
        │
        ▼
[Run lineage impact analysis]
        │
        ▼
Affected assets:
  ⚠️ stg_orders (Silver) — uses customer_email
  ⚠️ fct_customer_ltv (Gold) — depends on stg_orders
  ⚠️ Tableau customer dashboard — depends on fct_customer_ltv
  ⚠️ Marketing email campaign ML model — uses customer_email directly
```
7. Data observability — the infrastructure of trust
Observability vs monitoring
| | Monitoring | Observability |
|---|---|---|
| Approach | Watches for known failures | Discovers unknown failures |
| Question | "Did this pipeline run?" | "Can I trust this data?" |
| Method | Threshold-based alerts | ML-based anomaly detection |
Data observability applies the APM (Application Performance Monitoring) concept from software systems to data pipelines. In 2026, the winning organizations are not those with the most data — they are those with the most trustworthy data.
The five signals of "data downtime"
The five detection dimensions from the Monte Carlo framework:
| Signal | What it means |
|---|---|
| Freshness | Data older than expected ("Why isn't yesterday's data here yet?") |
| Volume | More or fewer rows than expected (suddenly 0 rows, or a 10× spike) |
| Schema | Unexpected column additions, deletions, or type changes |
| Distribution | Statistical shifts in values (average order amount suddenly 10× higher) |
| Lineage | Automatically tracking the downstream impact of upstream asset changes |
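Commercial tools detect these signals with ML, but the core idea can be sketched with simple statistics: flag a daily volume that deviates sharply from recent history. The threshold and history window here are illustrative:

```python
import statistics

def volume_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates more than z_threshold standard
    deviations from recent history (a crude stand-in for ML-based detection)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # Perfectly flat history: any deviation at all is suspicious
        return today != mean
    return abs(today - mean) / stdev > z_threshold

history = [10_000, 10_250, 9_800, 10_100, 9_950, 10_050, 10_200]
print(volume_anomaly(history, 10_300))  # False — within normal variation
print(volume_anomaly(history, 0))       # True  — sudden empty load
```

The same z-score pattern applies to distribution signals (e.g. mean order amount) once you track the statistic per run instead of the row count.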
Soda in practice — SQL-based quality checks
Soda integrates SQL-based quality checks seamlessly with dbt, Airflow, and CI/CD pipelines.
```yaml
# soda_checks_orders.yaml
# Run: soda scan -d snowflake orders

checks for orders:
  # Freshness
  - freshness(ordered_at):
      name: "Orders data freshness within 2 hours"
      warn: when > 2h
      fail: when > 4h

  # Volume anomaly detection
  - row_count > 0:
      name: "Detect empty table"
  - row_count:
      name: "Daily volume within normal range"
      warn: when not between 200 and 2000000
      fail: when not between 500 and 1000000

  # Completeness
  - missing_count(order_id) = 0:
      name: "No nulls in order_id"
  - missing_percent(customer_id) < 0.1%:
      name: "customer_id completeness >= 99.9%"

  # Validity
  - invalid_percent(order_status) < 0.01%:
      valid values: [placed, shipped, delivered, cancelled, refunded]
      name: "order_status allowed value check"

  # Uniqueness
  - duplicate_count(order_id) = 0:
      name: "No duplicate order_ids"

  # Referential integrity
  - values in (customer_id) must exist in customers (customer_id):
      name: "Customer ID referential integrity"
```
Observability tools comparison 2026
| Tool | Strengths | Pricing | Open-source |
|---|---|---|---|
| Monte Carlo | ML-based anomaly detection, auto-lineage, coined "data downtime" | Enterprise | ❌ |
| Soda | SQL-based, excellent dbt/Airflow integration, CI/CD friendly | Free tier available | Partial |
| Great Expectations | Open-source, rich Expectation library | Open-source | ✅ |
| Metaplane | Automated monitoring setup, easy warehouse integration | SMB-friendly | ❌ |
| dbt built-in tests | Quality checks inside dbt projects, no extra tooling needed | Included with dbt plans | ✅ |
8. PII management & compliance automation
PII classification framework
PII (Personally Identifiable Information) is any data that can be used to identify an individual. As AI regulations advance globally — including the EU AI Act, GDPR updates, CCPA, and various national data protection frameworks — managing PII in data pipelines is no longer optional.
| Classification | Example fields | Treatment |
|---|---|---|
| Direct identifiers | Name, SSN, passport number, email | Anonymization or tokenization |
| Quasi-identifiers | Date of birth, address, ZIP code, IP | Pseudonymization or masking |
| Sensitive data | Health records, financial accounts, credit scores | Strict access control + encryption + audit log |
Data masking techniques
```python
# data_masking.py — applied at pipeline ingestion time
import hashlib
import re


class DataMasker:
    @staticmethod
    def tokenize(value: str, salt: str = "company_secret") -> str:
        """
        Tokenize: convert to a consistent pseudonym (preserves JOIN capability).
        Suitable for fields like customer_id where analysis requires JOINs.
        """
        if value is None:
            return None
        return hashlib.sha256(f"{value}{salt}".encode()).hexdigest()[:16]

    @staticmethod
    def mask_email(email: str) -> str:
        """Partial masking: john.doe@company.com → jo***@company.com"""
        if not email or "@" not in email:
            return email
        local, domain = email.split("@", 1)
        masked_local = local[:2] + "***"
        return f"{masked_local}@{domain}"

    @staticmethod
    def mask_phone(phone: str) -> str:
        """Phone masking: 010-1234-5678 → 010-****-5678"""
        if not phone:
            return phone
        return re.sub(r'(\d{3})-(\d{4})-(\d{4})', r'\1-****-\3', phone)

    @staticmethod
    def mask_credit_card(cc: str) -> str:
        """Card masking: 4111-1111-1111-1234 → ****-****-****-1234"""
        if not cc:
            return cc
        digits = re.sub(r'\D', '', cc)
        return f"****-****-****-{digits[-4:]}"

    @staticmethod
    def anonymize(value) -> str:
        """Full anonymization: when individual identification is not needed at all."""
        return "REDACTED"


def apply_pii_masking(df):
    """Apply PII masking during the Bronze → Silver transition."""
    masker = DataMasker()
    if "customer_email" in df.columns:
        df["customer_email"] = df["customer_email"].apply(masker.mask_email)
    if "phone_number" in df.columns:
        df["phone_number"] = df["phone_number"].apply(masker.mask_phone)
    if "customer_id" in df.columns:
        df["customer_id"] = df["customer_id"].apply(masker.tokenize)
    if "ssn" in df.columns:
        df.drop(columns=["ssn"], inplace=True)  # never propagate SSNs downstream
    return df
```
PII flow in the Medallion architecture
```
[Bronze Layer] ← Raw PII preserved (pii_admin access only)
  customer_email: "john@co.com"
  phone: "010-1234-5678"
  customer_id: "usr_abc123"
        │
        │ Masking applied automatically in Bronze → Silver pipeline
        ▼
[Silver Layer] ← Masked data (analytics team can access)
  customer_email: "jo***@co.com"
  phone: "010-****-5678"
  customer_id: "a3f2b9..."  ← Tokenized (JOIN-capable)
        │
        ▼
[Gold Layer] ← Aggregated data with no PII (org-wide access)
  revenue_by_segment: {...}
  daily_order_count: 1234
```
9. Access control & role design
Principle of Least Privilege
Every user and service should have only the minimum permissions required for their job. This is also the most frequently violated security principle in data platforms.
```sql
-- Snowflake RBAC example
-- Role hierarchy design
CREATE ROLE analyst_gold;    -- Gold read-only (all analysts)
CREATE ROLE analyst_silver;  -- Silver read-only
CREATE ROLE analyst_bronze;  -- Bronze read-only
CREATE ROLE data_engineer;   -- Silver/Gold write + Bronze read
CREATE ROLE pii_admin;       -- Full read including Bronze PII columns

-- Role inheritance
GRANT ROLE analyst_bronze TO ROLE analyst_silver;
GRANT ROLE analyst_silver TO ROLE analyst_gold;
GRANT ROLE analyst_gold TO ROLE data_engineer;

-- Table-level permissions
GRANT SELECT ON ALL TABLES IN SCHEMA gold TO ROLE analyst_gold;
GRANT SELECT ON ALL TABLES IN SCHEMA silver TO ROLE analyst_silver;

-- Column-level Dynamic Data Masking
-- Note: unquoted role names are stored uppercase, and CURRENT_ROLE()
-- returns the uppercase form, so compare against 'PII_ADMIN', not 'pii_admin'
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
    WHEN CURRENT_ROLE() IN ('DATA_ENGINEER') THEN
      REGEXP_REPLACE(val, '(^.{2}).*(@.*)', '\\1***\\2')
    ELSE '***REDACTED***'
  END;

ALTER TABLE silver.customers
  MODIFY COLUMN email
  SET MASKING POLICY email_mask;
```
Data access request workflow
```
Analyst A needs access to PII-containing data
        │
        ▼
[Submit access request via data catalog]
  - Requested dataset: bronze.customers (contains PII)
  - Purpose: churn customer segmentation analysis
  - Duration: 30 days
        │
        ▼
[Automated review triggers]
  ✅ Does requester's department match the data use policy?
  ✅ GDPR purpose limitation check (data minimization principle)
  ✅ Can the existing Silver masked data satisfy the need? → No
        │
        ▼
[Approval notification sent to data owner]
  → Owner approves (24-hour SLA)
        │
        ▼
[Temporary access automatically granted]
  + Audit log automatically created
  + Access automatically expires after 30 days
```
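The automated review step above can be sketched as a simple rule check. The policy table, allowed purposes, and grant cap below are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical purpose-limitation policy: dataset → purposes allowed by its owner
ALLOWED_PURPOSES = {"bronze.customers": {"churn-analysis", "fraud-investigation"}}
MAX_GRANT_DAYS = 90  # illustrative cap on temporary grants

def review_access_request(dataset: str, purpose: str, days: int) -> dict:
    """Auto-screen a request: reject over-long grants, escalate unlisted
    purposes to the data owner, and stamp an expiry on approvals."""
    if days > MAX_GRANT_DAYS:
        return {"decision": "reject",
                "reason": f"grants are capped at {MAX_GRANT_DAYS} days"}
    if purpose not in ALLOWED_PURPOSES.get(dataset, set()):
        return {"decision": "escalate",
                "reason": "purpose not on the data use policy"}
    return {"decision": "approve",
            "expires_on": str(date.today() + timedelta(days=days))}

print(review_access_request("bronze.customers", "churn-analysis", 30)["decision"])   # approve
print(review_access_request("bronze.customers", "marketing-blast", 30)["decision"])  # escalate
```

Only the "escalate" branch reaches a human, which is the Automation First principle applied to access control: policy checks run automatically, judgment calls go to the owner.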
10. Practical checklist
Data quality
- All Gold tables covered by checks across all six quality dimensions?
- Quality gates in place at each pipeline stage (Bronze→Silver, Silver→Gold)?
- Quality SLA tiers (Gold/Silver/Bronze) defined per dataset?
- Automatic alerting and pipeline blocking on quality check failures?
- dbt tests or Great Expectations / Soda integrated into CI/CD?
Governance framework
- Data owner assigned for all major datasets?
- Data classification scheme (public/internal/restricted/confidential) defined and applied?
- Governance policies managed as code (Policy as Code), not documents?
- Governance model (centralized/federated/hybrid) chosen to match org structure?
Data Contracts
- Data Contracts (YAML specifications) written for major Gold tables?
- Contracts automatically validated in the CI/CD pipeline?
- Process in place to notify consumers 14+ days before breaking changes?
- Schema Registry applied to Kafka streaming data?
Data catalog & lineage
- Data catalog deployed with major assets registered?
- Data lineage (source → transform → consume) tracked automatically?
- Downstream impact analysis possible before schema changes?
- Business glossary linked to the catalog?
PII & compliance
- PII columns automatically detected and tagged?
- PII masking automatically applied during Bronze→Silver transition?
- Column-level access control (Dynamic Data Masking) in place?
- Access request/approval/expiry workflow automated?
- Data retention policies defined and automatically enforced?
- Audit logs generated in compliance with applicable regulations (GDPR/CCPA/local data protection laws)?
Closing thoughts
Data quality and governance are not "build it once and you're done" projects. They are living systems that must evolve continuously as business changes, data grows, and regulations tighten.
The most important paradigm shift of 2026 is this: stop viewing governance as "regulation that slows down data engineering," and start viewing it as "an engineering practice that guarantees the reliability of data products." Pipelines with governance built in enable faster decision-making, more trustworthy AI, and operations free from compliance risk.
The next part goes deep into the cloud infrastructure and cost optimization (FinOps) that underpins all of this — IaC, multi-cloud strategy, and compute cost governance.
Part 5 preview: Cloud & Infrastructure Deep Dive
- AWS vs GCP vs Azure data platform comparison
- Codifying data infrastructure with Terraform (IaC)
- FinOps — practical cloud data cost optimization
- Operating data platforms on Kubernetes
- Multi-cloud & hybrid cloud strategies