Data Engineering Playbook — Part 4: Data Quality & Governance Deep Dive
A data lake without governance becomes a data swamp. This post covers everything you need to implement data quality and governance as code in 2026: the six data quality dimensions and SLA tier design, the four principles of DataGovOps, Data Contract design with CI/CD enforcement, data catalog and lineage operations, data observability, PII masking techniques, and Snowflake RBAC design.
Series outline
- Part 1 — Overview & 2026 Key Trends (published)
- Part 2 — Data Architecture Design (published)
- Part 3 — A Practical Guide to Building Data Pipelines (published)
- Part 4 — Data Quality & Governance Deep Dive (this post)
- Part 5 — Cloud & Infrastructure (FinOps, IaC) (upcoming)
- Part 6 — AI-Native Data Engineering (upcoming)
- Part 7 — DataOps & Team Operations Playbook (upcoming)
Table of contents
- Why data quality and governance?
- The six data quality dimensions framework
- DataGovOps — governance as code
- Data Contracts — the promise between producers and consumers
- Building and operating a data catalog
- Data lineage — tracking the journey of data
- Data observability — the infrastructure of trust
- PII management & compliance automation
- Access control & role design
- Practical checklist
1. Why data quality and governance?
Data governance in 2026 is no longer optional — it is a strategic differentiator. The quality of an AI system's output depends entirely on the quality of its training data. No matter how sophisticated the model, if the underlying data is wrong, the results cannot be trusted.
The real cost of poor data quality shows up across the board: it is the root cause of most governance failures, it has become the bottleneck for AI model performance, and as AI regulations tighten globally — through frameworks such as the EU AI Act, GDPR, CCPA, and various national data protection laws — compliance failures can result in fines worth tens of millions of euros.
"A data lake without governance becomes a data swamp. AI without data quality is a confident mistake machine."
Governance vs compliance
The two concepts are often confused, but they serve different purposes.
| | Data governance | Data compliance |
|---|---|---|
| Definition | An internal framework for how data is managed and controlled | Adherence to external laws and regulations |
| Focus | Defining internal policies, roles, and quality standards | Meeting external requirements such as GDPR, HIPAA, and CCPA |
| Relationship | Governance is the foundation | Compliance is the outcome |
Governance defines how data is managed. Compliance verifies that management happens within the rules.
2. The six data quality dimensions framework
You cannot measure data quality with a single "is it correct?" question. In practice, quality is measured and targeted across six distinct dimensions.
| Dimension | Definition | How to measure |
|---|---|---|
| Accuracy | Does the data correctly reflect real-world values? | Random sample cross-check, source system comparison |
| Completeness | Are all required fields populated? | NOT NULL rate, field fill rate (%) |
| Validity | Does the data conform to allowed formats, ranges, and rules? | Regex checks, allowed value lists, range validation |
| Consistency | Are values consistent across systems and over time? | Cross-system join validation, historical comparison |
| Uniqueness | Are there no duplicate records? | PK duplicate count, fuzzy record detection |
| Timeliness | Is the data sufficiently up to date? | Last updated timestamp, freshness SLA |
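Several of these dimensions can be computed without any framework. A minimal sketch over a list of records, where the field names and the allowed-status set are illustrative:

```python
def quality_metrics(rows, required_fields, allowed_status, pk="order_id"):
    """Compute completeness, uniqueness, and validity rates over a list of dicts."""
    n = len(rows)
    # Completeness: share of rows where every required field is populated
    complete = sum(all(r.get(f) is not None for f in required_fields) for r in rows)
    # Uniqueness: share of distinct primary-key values
    unique_pks = len({r[pk] for r in rows})
    # Validity: share of rows whose status is in the allowed set
    valid = sum(r.get("order_status") in allowed_status for r in rows)
    return {
        "completeness_pct": 100 * complete / n,
        "uniqueness_pct": 100 * unique_pks / n,
        "validity_pct": 100 * valid / n,
    }

rows = [
    {"order_id": "a", "order_status": "placed"},
    {"order_id": "b", "order_status": "teleported"},  # invalid status
    {"order_id": "b", "order_status": "shipped"},     # duplicate PK
    {"order_id": "c", "order_status": None},          # missing required field
]
m = quality_metrics(rows, ["order_id", "order_status"],
                    {"placed", "shipped", "delivered"})
print(m)  # {'completeness_pct': 75.0, 'uniqueness_pct': 75.0, 'validity_pct': 50.0}
```

Accuracy, consistency, and timeliness need external references (source systems, history, clocks), which is exactly where dedicated tooling earns its keep.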
Measuring quality as code: Great Expectations
Great Expectations lets you define these quality dimensions as code and validate them automatically. The suite below covers accuracy, completeness, uniqueness, and validity, plus a volume check:

```python
import great_expectations as gx

context = gx.get_context()
datasource = context.sources.add_pandas("orders_datasource")
data_asset = datasource.add_dataframe_asset(name="orders")
suite = context.add_expectation_suite("orders_quality_suite")

# ① Accuracy: order_amount must be positive
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="order_amount_usd",
        min_value=0.01,
        max_value=99999.99,
    )
)

# ② Completeness: required columns must not be null
for col in ["order_id", "customer_id", "order_status", "ordered_at"]:
    suite.add_expectation(
        gx.expectations.ExpectColumnValuesToNotBeNull(column=col)
    )

# ③ Uniqueness: order_id must have no duplicates
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)

# ④ Validity: order_status must be one of the allowed values
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="order_status",
        value_set={"placed", "shipped", "delivered", "cancelled", "refunded"},
    )
)

# ⑤ Volume: at least 100 orders per day
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=100,
        max_value=1_000_000,
    )
)

validator = context.get_validator(
    datasource_name="orders_datasource",
    data_asset_name="orders",
    expectation_suite=suite,
)
results = validator.validate()
if not results.success:
    # Fail the pipeline run; substitute your project's own error type here
    raise RuntimeError(f"Quality check failed: {results.statistics}")
```
Quality SLA tier design
Applying the same quality standard to every dataset blows up costs. Tier them by business impact.
| Tier | Target datasets | Accuracy target | Freshness SLA | Monitoring cadence |
|---|---|---|---|---|
| Gold | KPIs, financial reports, AI training data | 99.9% | Within 1 hour | Real-time |
| Silver | Operational analytics, marketing dashboards | 99.0% | Within 4 hours | Every 1 hour |
| Bronze | Exploratory analysis, raw archive | 95.0% | Within 24 hours | Daily |
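The tier table can live as code next to the pipelines, so SLA lookups are programmatic rather than tribal knowledge. A sketch; the dataset-to-tier assignments are hypothetical examples:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityTier:
    accuracy_target_pct: float
    freshness_sla_hours: int
    monitoring_cadence: str

# Tier definitions mirroring the table above (values are policy, not code)
TIERS = {
    "gold":   QualityTier(99.9, 1,  "real-time"),
    "silver": QualityTier(99.0, 4,  "hourly"),
    "bronze": QualityTier(95.0, 24, "daily"),
}

# Hypothetical dataset → tier assignments, e.g. kept alongside the data contract
DATASET_TIERS = {"fct_daily_revenue": "gold", "stg_orders": "silver"}

def freshness_sla_hours(dataset: str) -> int:
    """Look up the freshness SLA for a dataset; unknown datasets default to bronze."""
    tier = DATASET_TIERS.get(dataset, "bronze")
    return TIERS[tier].freshness_sla_hours

print(freshness_sla_hours("fct_daily_revenue"))  # 1
print(freshness_sla_hours("some_raw_table"))     # 24
```

Defaulting unknown datasets to the cheapest tier keeps monitoring costs bounded while anything business-critical gets promoted explicitly.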
3. DataGovOps — governance as code
DataGovOps is the defining governance paradigm of 2026. Just as DevOps automated software delivery, DataGovOps handles compliance procedures, audit trails, and lineage tracking through code and automation rather than manual oversight.
The era of managing governance with spreadsheets is over. Governance is now a first-class engineering discipline embedded directly into the development workflow.
The four principles of DataGovOps
Principle 1: Policy as Code. Define governance rules as executable code, not human-readable documents. Version-control them in Git and apply them automatically via CI/CD.
Principle 2: Shift Left. Run quality and security checks at the start of the pipeline, not the end. Start validating the moment data enters from source systems.
Principle 3: Automation First. Keep only what genuinely requires human judgment as a manual step. Automate PII detection, access control, audit logging, and lineage tracking.
Principle 4: Continuous Monitoring. Not a one-time setup — monitor pipeline health 24/7. When anomalies are detected, automated alerts and blocking should kick in immediately.
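Principle 1 becomes concrete once a rule is executable. A minimal Policy-as-Code sketch, where the policy keys and the metadata shape are illustrative rather than any particular tool's format:

```python
# The policy lives in version control as data; CI evaluates it against
# dataset metadata and fails the build on violations.
POLICY = {
    "require_owner": True,
    "pii_requires_masking": True,
}

def evaluate_policy(dataset_meta: dict) -> list[str]:
    """Return a list of human-readable policy violations (empty = compliant)."""
    violations = []
    if POLICY["require_owner"] and not dataset_meta.get("owner"):
        violations.append("dataset has no assigned owner")
    if POLICY["pii_requires_masking"]:
        for col in dataset_meta.get("columns", []):
            if col.get("pii") and not col.get("masking_policy"):
                violations.append(f"PII column '{col['name']}' has no masking policy")
    return violations

meta = {
    "owner": "order-domain-team@company.com",
    "columns": [
        {"name": "order_id", "pii": False},
        {"name": "customer_email", "pii": True},  # PII but unmasked → violation
    ],
}
print(evaluate_policy(meta))  # ["PII column 'customer_email' has no masking policy"]
```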
Governance role definitions (RACI)
Governance cannot function without clear ownership. Define who is responsible for every dataset before anything else.
| Role | Responsibilities |
|---|---|
| Data Owner | Senior-level. Ultimately accountable for the business use of a specific dataset. Approves access, decides data classification. |
| Data Steward | Operational lead. Maintains metadata, monitors quality rules, manages the business glossary. |
| Data Engineer | Implements pipelines, codes quality checks, builds lineage tracking. |
| Data Product Manager | Bridge between business users and the technical team. Manages datasets as products. |
| CDO (Chief Data Officer) | Sets governance strategy and policies, coordinates stakeholders. |
Choosing a governance framework model
| Model | Characteristics | Best fit |
|---|---|---|
| Centralized | A central data team manages all policies | Heavily regulated industries: finance, healthcare |
| Federated (Data Mesh) | Domain teams manage their own policies | Large technology companies |
| Hybrid (federated standards) | Enterprise-wide standards from center, execution at domain level | The most popular model in 2026 |
4. Data Contracts
What is a Data Contract?
A Data Contract is a formal agreement between a data producer and its consumers. It codifies — and automatically enforces — what a dataset guarantees: schema, freshness, volume, and semantic meaning.
By 2026, Data Contracts have moved from theory into everyday practice. Producers validate contract compliance before data reaches consumers. Consumers can detect unexpected schema changes or volume drops before dashboards and models break.
A Data Contract is also a communication mechanism that forces source teams to coordinate with the data engineering team whenever they need to make a change.
Data Contract YAML specification
```yaml
# data_contract_orders.yaml
apiVersion: v1
kind: DataContract

metadata:
  name: orders
  owner: "order-domain-team@company.com"
  version: "2.1.0"
  status: active
  created_at: "2026-01-15"
  updated_at: "2026-04-01"

schema:
  fields:
    - name: order_id
      type: STRING
      nullable: false
      description: "Unique order identifier (UUID)"
      pii: false
    - name: customer_email
      type: STRING
      nullable: true
      description: "Customer email address"
      pii: true  # PII flag → triggers automatic masking
      classification: SENSITIVE
    - name: order_amount_usd
      type: FLOAT64
      nullable: false
      constraints:
        min: 0.01
        max: 99999.99
    - name: order_status
      type: STRING
      nullable: false
      constraints:
        allowed_values:
          - placed
          - shipped
          - delivered
          - cancelled
          - refunded
    - name: ordered_at
      type: TIMESTAMP
      nullable: false
      description: "Order creation timestamp (UTC)"

quality:
  freshness:
    max_age_hours: 2
    warn_after_hours: 1
  completeness:
    min_completeness_pct: 99.5
  volume:
    min_daily_rows: 500
    max_daily_rows: 5000000
    anomaly_threshold_pct: 30
  uniqueness:
    unique_columns:
      - order_id

slo:
  availability: 99.9
  latency_p95_seconds: 60
  incident_response_minutes: 30

consumers:
  - name: "revenue-dashboard"
    team: "analytics"
  - name: "fraud-detection-model"
    team: "ml-platform"

change_policy:
  breaking_changes_notice_days: 14
  deprecation_notice_days: 30
  versioning_strategy: semantic
```
Automated contract enforcement — CI/CD integration
```python
# contract_validator.py — runs in GitHub Actions / GitLab CI
import sys
import yaml
import pandas as pd
from dataclasses import dataclass
from typing import List


@dataclass
class ContractViolation:
    contract_name: str
    dimension: str
    field: str
    message: str
    severity: str  # "error" | "warning"


def validate_contract(df: pd.DataFrame, contract_path: str) -> List[ContractViolation]:
    violations = []
    with open(contract_path) as f:
        contract = yaml.safe_load(f)
    schema_fields = {f["name"]: f for f in contract["schema"]["fields"]}

    for field_name, field_spec in schema_fields.items():
        # Schema: the contracted field must exist at all
        if field_name not in df.columns:
            violations.append(ContractViolation(
                contract_name=contract["metadata"]["name"],
                dimension="schema",
                field=field_name,
                message=f"Field '{field_name}' not found in data",
                severity="error",
            ))
            continue

        # Completeness: NOT NULL enforcement
        if not field_spec.get("nullable", True):
            null_count = df[field_name].isnull().sum()
            if null_count > 0:
                violations.append(ContractViolation(
                    contract_name=contract["metadata"]["name"],
                    dimension="completeness",
                    field=field_name,
                    message=f"NOT NULL violation: {null_count:,} nulls found",
                    severity="error",
                ))

        # Validity: allowed values and range constraints
        constraints = field_spec.get("constraints", {})
        if "allowed_values" in constraints:
            invalid = df[
                ~df[field_name].isin(constraints["allowed_values"])
                & df[field_name].notna()
            ]
            if len(invalid) > 0:
                violations.append(ContractViolation(
                    contract_name=contract["metadata"]["name"],
                    dimension="validity",
                    field=field_name,
                    message=f"Allowed value violation: {invalid[field_name].unique()[:5].tolist()}",
                    severity="error",
                ))
        if "min" in constraints:
            below_min = df[df[field_name] < constraints["min"]]
            if len(below_min) > 0:
                violations.append(ContractViolation(
                    contract_name=contract["metadata"]["name"],
                    dimension="validity",
                    field=field_name,
                    message=f"Below minimum ({constraints['min']}): {len(below_min):,} rows",
                    severity="error",
                ))
    return violations


if __name__ == "__main__":
    df = pd.read_parquet("data/orders_sample.parquet")
    violations = validate_contract(df, "contracts/data_contract_orders.yaml")
    errors = [v for v in violations if v.severity == "error"]
    warnings = [v for v in violations if v.severity == "warning"]

    for v in violations:
        icon = "❌" if v.severity == "error" else "⚠️"
        print(f"{icon} [{v.dimension}] {v.field}: {v.message}")

    if errors:
        print(f"\n{len(errors)} error(s) found — stopping the pipeline.")
        sys.exit(1)
    else:
        print(f"\n✅ Contract validation passed ({len(warnings)} warning(s))")
```
Contract tooling ecosystem
| Tool | Highlights | Open-source |
|---|---|---|
| dbt Contracts | Declare schema contracts on dbt models; auto-detect breaking changes in CI | ✅ |
| Schemata | Python/TypeScript contract library with schema evolution support | ✅ |
| Kafka Schema Registry | Enforces schema compatibility for streaming data (Confluent) | Partial |
| OpenDataContract | YAML-based open standard specification | ✅ |
| Atlan | Integrates contract metadata with catalog, lineage, and policy management | ❌ |
5. Building and operating a data catalog
What is a data catalog?
A data catalog is a system that centrally manages metadata for all data assets in an organization. It answers: "Where is this data, what does it mean, can I trust it, and who owns it?" A well-run catalog can cut data discovery time from hours to minutes.
- Without a catalog: "Where does this number come from?" → Slack DM → ask three people → answer two hours later
- With a catalog: "Where does this number come from?" → search the catalog → lineage, owner, and quality in two minutes
What belongs in a catalog
| Metadata type | Contents |
|---|---|
| Technical metadata | Schema, data types, partitions, row count, last updated time, storage location |
| Business metadata | Business glossary links, owner, description, use cases, tags |
| Operational metadata | Pipeline run history, quality scores, SLA compliance, incident history |
| Governance metadata | Data classification (PII/public/restricted), access policies, retention period, compliance tags |
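As a rough sketch, the four metadata types can be modeled as one record per asset (a deliberately simplified model; real catalogs such as DataHub or OpenMetadata define far richer schemas):

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    # Technical metadata
    name: str
    schema_fields: list
    row_count: int
    # Business metadata
    owner: str
    description: str
    tags: list = field(default_factory=list)
    # Operational metadata
    quality_score: float = 0.0
    # Governance metadata
    classification: str = "internal"  # public / internal / restricted / confidential
    retention_days: int = 365

entry = CatalogEntry(
    name="gold.fct_daily_revenue",
    schema_fields=["date", "revenue_usd"],
    row_count=1_234,
    owner="analytics@company.com",
    description="Daily revenue by calendar date",
)
print(entry.classification)  # internal
```

Keeping all four metadata types on one record is what lets a single catalog search answer location, meaning, trust, and ownership at once.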
Key data catalog tools in 2026
| Tool | Highlights | Best fit |
|---|---|---|
| Atlan | AI-powered metadata, natural language search, governance integration | Modern data stacks |
| Alation | Enterprise-grade, behavior-based recommendations, Soda integration | Large enterprises, finance |
| Collibra | Powerful governance workflows, built for regulated industries | Finance, pharma |
| OpenMetadata | Open-source, self-hosted, highly customizable | Small teams, cost-sensitive |
| DataHub | Open-source, large-scale (LinkedIn-built) | Tech companies |
| Databricks Unity Catalog | Integrated governance within the Databricks ecosystem | Databricks shops |
Catalog adoption roadmap
An effective catalog does not try to register every data asset from day one. Start with the highest-business-impact domains and expand incrementally.
Phase 1 (weeks 1–4): Foundation
- Select tooling and connect the tech stack (Snowflake, dbt, Airflow)
- Register the 30 most critical Gold layer tables
- Assign data owners and write business descriptions
Phase 2 (months 1–3): Automation
- Set up automated metadata crawlers
- Auto-sync dbt docs to the catalog
- Automate PII tagging
- Display data quality scores in the catalog
Phase 3 (months 3–6): Governance depth
- Integrate access request workflows with the catalog
- Build the business glossary
- Enable full lineage visualization
- Transition to managing data as Data Products
6. Data lineage — tracking the journey of data
What is data lineage?
Data lineage tracks the full journey of data from its source through transformation, loading, and consumption. It answers: "Where does this KPI number come from?" all the way to "Which dashboards are affected if I change this source table?"
Data lineage example: tracing a revenue KPI
```
[Postgres orders table]
        │ CDC
        ▼
[Bronze: raw_orders (Iceberg)]
        │ dbt stg_orders
        ▼
[Silver: stg_orders (cleaned)]
        │ dbt fct_daily_revenue
        ▼
[Gold: fct_daily_revenue]
        │
   ┌────┴─────────┐
   ▼              ▼
[Tableau       [ML model
 revenue        training
 dashboard]     data]
```
Three levels of lineage
① Table-level lineage: Which table is derived from which. fct_orders is built from stg_orders and dim_customers.
② Column-level lineage: Which source column a given column derives from. fct_orders.revenue is computed from stg_orders.amount_cents / 100.
③ Row-level lineage: Traces a specific record back to its source. Used mainly for audit and regulatory purposes. The most granular level, and the most expensive to implement.
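Table-level lineage is just a directed graph, and impact analysis is a downstream traversal of it. A sketch using plain dictionaries, with edges mirroring the revenue-KPI example above:

```python
from collections import deque

# Table-level lineage as a directed graph: edge = upstream -> downstream
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_daily_revenue"],
    "fct_daily_revenue": ["tableau_revenue_dashboard", "ml_training_data"],
}

def downstream_impact(asset: str) -> set[str]:
    """Breadth-first traversal: every asset affected if `asset` changes."""
    affected, queue = set(), deque([asset])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

print(sorted(downstream_impact("raw_orders")))
# ['fct_daily_revenue', 'ml_training_data', 'stg_orders', 'tableau_revenue_dashboard']
```

Real lineage tools build this graph automatically from query logs and OpenLineage events instead of a hand-maintained dict, but the impact-analysis query is the same traversal.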
OpenLineage — the lineage standard
OpenLineage is an open standard for collecting data lineage. Airflow, dbt, Spark, Flink, and other major tools all support emitting OpenLineage events.
```json
{
  "eventType": "COMPLETE",
  "eventTime": "2026-04-19T06:00:00Z",
  "run": {
    "runId": "3b452f3c-a462-4c78-bf8f-f9f553e5c8e1"
  },
  "job": {
    "namespace": "data-platform",
    "name": "dbt.fct_daily_revenue"
  },
  "inputs": [
    {
      "namespace": "snowflake://company.us-east-1",
      "name": "analytics.silver.stg_orders"
    },
    {
      "namespace": "snowflake://company.us-east-1",
      "name": "analytics.silver.stg_customers"
    }
  ],
  "outputs": [
    {
      "namespace": "snowflake://company.us-east-1",
      "name": "analytics.gold.fct_daily_revenue"
    }
  ]
}
```
Impact analysis in practice
Before changing a source table schema, use lineage to automatically generate a list of affected downstream assets.
```
"We plan to drop the customer_email column from the orders table."
        │
        ▼
[Run lineage impact analysis]
        │
        ▼
Affected assets:
  ⚠️ stg_orders (Silver) — uses customer_email
  ⚠️ fct_customer_ltv (Gold) — depends on stg_orders
  ⚠️ Tableau customer dashboard — depends on fct_customer_ltv
  ⚠️ Marketing email campaign ML model — uses customer_email directly
```
7. Data observability — the infrastructure of trust
Observability vs monitoring
| | Monitoring | Observability |
|---|---|---|
| Approach | Watches for known failures | Discovers unknown failures |
| Question | "Did this pipeline run?" | "Can I trust this data?" |
| Method | Threshold-based alerts | ML-based anomaly detection |
Data observability applies the APM (Application Performance Monitoring) concept from software systems to data pipelines. In 2026, the winning organizations are not those with the most data — they are those with the most trustworthy data.
The five signals of "data downtime"
The five detection dimensions from the Monte Carlo framework:
| Signal | What it means |
|---|---|
| Freshness | Data older than expected ("Why isn't yesterday's data here yet?") |
| Volume | More or fewer rows than expected (suddenly 0 rows, or a 10× spike) |
| Schema | Unexpected column additions, deletions, or type changes |
| Distribution | Statistical shifts in values (average order amount suddenly 10× higher) |
| Lineage | Automatically tracking the downstream impact of upstream asset changes |
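Commercial tools detect these signals with ML, but the core idea can be sketched with simple statistics: flag a daily volume that deviates sharply from recent history. The threshold and history window here are illustrative:

```python
import statistics

def volume_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates more than z_threshold standard
    deviations from recent history (a crude stand-in for ML-based detection)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # Perfectly flat history: any deviation at all is suspicious
        return today != mean
    return abs(today - mean) / stdev > z_threshold

history = [10_000, 10_250, 9_800, 10_100, 9_950, 10_050, 10_200]
print(volume_anomaly(history, 10_300))  # False — within normal variation
print(volume_anomaly(history, 0))       # True  — sudden empty load
```

The same z-score pattern applies to distribution signals (e.g. mean order amount) once you track the statistic per run instead of the row count.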
Soda in practice — SQL-based quality checks
Soda integrates SQL-based quality checks seamlessly with dbt, Airflow, and CI/CD pipelines.
```yaml
# soda_checks_orders.yaml
# Run: soda scan -d snowflake orders

checks for orders:
  # Freshness
  - freshness(ordered_at):
      name: "Orders data freshness within 2 hours"
      warn: when > 2h
      fail: when > 4h

  # Volume anomaly detection
  - row_count > 0:
      name: "Detect empty table"
  - row_count:
      name: "Daily volume within normal range"
      warn: when not between 200 and 2000000
      fail: when not between 500 and 1000000

  # Completeness
  - missing_count(order_id) = 0:
      name: "No nulls in order_id"
  - missing_percent(customer_id) < 0.1%:
      name: "customer_id completeness >= 99.9%"

  # Validity
  - invalid_percent(order_status) < 0.01%:
      valid values: [placed, shipped, delivered, cancelled, refunded]
      name: "order_status allowed value check"

  # Uniqueness
  - duplicate_count(order_id) = 0:
      name: "No duplicate order_ids"

  # Referential integrity
  - values in (customer_id) must exist in customers (customer_id):
      name: "Customer ID referential integrity"
```
Observability tools comparison 2026
| Tool | Strengths | Pricing | Open-source |
|---|---|---|---|
| Monte Carlo | ML-based anomaly detection, auto-lineage, coined "data downtime" | Enterprise | ❌ |
| Soda | SQL-based, excellent dbt/Airflow integration, CI/CD friendly | Free tier available | Partial |
| Great Expectations | Open-source, rich Expectation library | Open-source | ✅ |
| Metaplane | Automated monitoring setup, easy warehouse integration | SMB-friendly | ❌ |
| dbt built-in tests | Quality checks inside dbt projects, no extra tooling needed | Included with dbt plans | ✅ |
8. PII management & compliance automation
PII classification framework
PII (Personally Identifiable Information) is any data that can be used to identify an individual. As AI regulations advance globally — including the EU AI Act, GDPR updates, CCPA, and various national data protection frameworks — managing PII in data pipelines is no longer optional.
| Classification | Example fields | Treatment |
|---|---|---|
| Direct identifiers | Name, SSN, passport number, email | Anonymization or tokenization |
| Quasi-identifiers | Date of birth, address, ZIP code, IP | Pseudonymization or masking |
| Sensitive data | Health records, financial accounts, credit scores | Strict access control + encryption + audit log |
Data masking techniques
```python
# data_masking.py — applied at pipeline ingestion time
import hashlib
import re


class DataMasker:
    @staticmethod
    def tokenize(value: str, salt: str = "company_secret") -> str:
        """
        Tokenize: convert to a consistent pseudonym (preserves JOIN capability).
        Suitable for fields like customer_id where analysis requires JOINs.
        """
        if value is None:
            return None
        return hashlib.sha256(f"{value}{salt}".encode()).hexdigest()[:16]

    @staticmethod
    def mask_email(email: str) -> str:
        """Partial masking: john.doe@company.com → jo***@company.com"""
        if not email or "@" not in email:
            return email
        local, domain = email.split("@", 1)
        masked_local = local[:2] + "***"
        return f"{masked_local}@{domain}"

    @staticmethod
    def mask_phone(phone: str) -> str:
        """Phone masking: 010-1234-5678 → 010-****-5678"""
        if not phone:
            return phone
        return re.sub(r'(\d{3})-(\d{4})-(\d{4})', r'\1-****-\3', phone)

    @staticmethod
    def mask_credit_card(cc: str) -> str:
        """Card masking: 4111-1111-1111-1234 → ****-****-****-1234"""
        if not cc:
            return cc
        digits = re.sub(r'\D', '', cc)
        return f"****-****-****-{digits[-4:]}"

    @staticmethod
    def anonymize(value) -> str:
        """Full anonymization: when individual identification is not needed at all."""
        return "REDACTED"


def apply_pii_masking(df):
    """Apply PII masking during the Bronze → Silver transition."""
    masker = DataMasker()
    if "customer_email" in df.columns:
        df["customer_email"] = df["customer_email"].apply(masker.mask_email)
    if "phone_number" in df.columns:
        df["phone_number"] = df["phone_number"].apply(masker.mask_phone)
    if "customer_id" in df.columns:
        df["customer_id"] = df["customer_id"].apply(masker.tokenize)
    if "ssn" in df.columns:
        df.drop(columns=["ssn"], inplace=True)  # never propagate SSNs downstream
    return df
```
PII flow in the Medallion architecture
```
[Bronze Layer] ← Raw PII preserved (pii_admin access only)
  customer_email: "john@co.com"
  phone: "010-1234-5678"
  customer_id: "usr_abc123"
        │
        │ Masking applied automatically in Bronze → Silver pipeline
        ▼
[Silver Layer] ← Masked data (analytics team can access)
  customer_email: "jo***@co.com"
  phone: "010-****-5678"
  customer_id: "a3f2b9..."  ← Tokenized (JOIN-capable)
        │
        ▼
[Gold Layer] ← Aggregated data with no PII (org-wide access)
  revenue_by_segment: {...}
  daily_order_count: 1234
```
9. Access control & role design
Principle of Least Privilege
Every user and service should have only the minimum permissions required for their job. This is also the most frequently violated security principle in data platforms.
```sql
-- Snowflake RBAC example
-- Role hierarchy design
CREATE ROLE analyst_gold;    -- Gold read-only (all analysts)
CREATE ROLE analyst_silver;  -- Silver read-only
CREATE ROLE analyst_bronze;  -- Bronze read-only
CREATE ROLE data_engineer;   -- Silver/Gold write + Bronze read
CREATE ROLE pii_admin;       -- Full read including Bronze PII columns

-- Role inheritance
GRANT ROLE analyst_bronze TO ROLE analyst_silver;
GRANT ROLE analyst_silver TO ROLE analyst_gold;
GRANT ROLE analyst_gold TO ROLE data_engineer;

-- Table-level permissions
GRANT SELECT ON ALL TABLES IN SCHEMA gold TO ROLE analyst_gold;
GRANT SELECT ON ALL TABLES IN SCHEMA silver TO ROLE analyst_silver;

-- Column-level Dynamic Data Masking
-- Note: unquoted role names are stored uppercase, and CURRENT_ROLE()
-- returns the uppercase form, so compare against 'PII_ADMIN', not 'pii_admin'
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
    WHEN CURRENT_ROLE() IN ('DATA_ENGINEER') THEN
      REGEXP_REPLACE(val, '(^.{2}).*(@.*)', '\\1***\\2')
    ELSE '***REDACTED***'
  END;

ALTER TABLE silver.customers
  MODIFY COLUMN email
  SET MASKING POLICY email_mask;
```
Data access request workflow
```
Analyst A needs access to PII-containing data
        │
        ▼
[Submit access request via data catalog]
  - Requested dataset: bronze.customers (contains PII)
  - Purpose: churn customer segmentation analysis
  - Duration: 30 days
        │
        ▼
[Automated review triggers]
  ✅ Does requester's department match the data use policy?
  ✅ GDPR purpose limitation check (data minimization principle)
  ✅ Can the existing Silver masked data satisfy the need? → No
        │
        ▼
[Approval notification sent to data owner]
  → Owner approves (24-hour SLA)
        │
        ▼
[Temporary access automatically granted]
  + Audit log automatically created
  + Access automatically expires after 30 days
```
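The automated review step above can be sketched as a simple rule check. The policy table, allowed purposes, and grant cap below are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical purpose-limitation policy: dataset → purposes allowed by its owner
ALLOWED_PURPOSES = {"bronze.customers": {"churn-analysis", "fraud-investigation"}}
MAX_GRANT_DAYS = 90  # illustrative cap on temporary grants

def review_access_request(dataset: str, purpose: str, days: int) -> dict:
    """Auto-screen a request: reject over-long grants, escalate unlisted
    purposes to the data owner, and stamp an expiry on approvals."""
    if days > MAX_GRANT_DAYS:
        return {"decision": "reject",
                "reason": f"grants are capped at {MAX_GRANT_DAYS} days"}
    if purpose not in ALLOWED_PURPOSES.get(dataset, set()):
        return {"decision": "escalate",
                "reason": "purpose not on the data use policy"}
    return {"decision": "approve",
            "expires_on": str(date.today() + timedelta(days=days))}

print(review_access_request("bronze.customers", "churn-analysis", 30)["decision"])   # approve
print(review_access_request("bronze.customers", "marketing-blast", 30)["decision"])  # escalate
```

Only the "escalate" branch reaches a human, which is the Automation First principle applied to access control: policy checks run automatically, judgment calls go to the owner.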
10. Practical checklist
Data quality
- All Gold tables covered by checks across all six quality dimensions?
- Quality gates in place at each pipeline stage (Bronze→Silver, Silver→Gold)?
- Quality SLA tiers (Gold/Silver/Bronze) defined per dataset?
- Automatic alerting and pipeline blocking on quality check failures?
- dbt tests or Great Expectations / Soda integrated into CI/CD?
Governance framework
- Data owner assigned for all major datasets?
- Data classification scheme (public/internal/restricted/confidential) defined and applied?
- Governance policies managed as code (Policy as Code), not documents?
- Governance model (centralized/federated/hybrid) chosen to match org structure?
Data Contracts
- Data Contracts (YAML specifications) written for major Gold tables?
- Contracts automatically validated in the CI/CD pipeline?
- Process in place to notify consumers 14+ days before breaking changes?
- Schema Registry applied to Kafka streaming data?
Data catalog & lineage
- Data catalog deployed with major assets registered?
- Data lineage (source → transform → consume) tracked automatically?
- Downstream impact analysis possible before schema changes?
- Business glossary linked to the catalog?
PII & compliance
- PII columns automatically detected and tagged?
- PII masking automatically applied during Bronze→Silver transition?
- Column-level access control (Dynamic Data Masking) in place?
- Access request/approval/expiry workflow automated?
- Data retention policies defined and automatically enforced?
- Audit logs generated in compliance with applicable regulations (GDPR/CCPA/local data protection laws)?
Closing thoughts
Data quality and governance are not "build it once and you're done" projects. They are living systems that must evolve continuously as business changes, data grows, and regulations tighten.
The most important paradigm shift of 2026 is this: stop viewing governance as "regulation that slows down data engineering," and start viewing it as "an engineering practice that guarantees the reliability of data products." Pipelines with governance built in enable faster decision-making, more trustworthy AI, and operations free from compliance risk.
The next part goes deep into the cloud infrastructure and cost optimization (FinOps) that underpins all of this — IaC, multi-cloud strategy, and compute cost governance.
Part 5 preview: Cloud & Infrastructure Deep Dive
- AWS vs GCP vs Azure data platform comparison
- Codifying data infrastructure with Terraform (IaC)
- FinOps — practical cloud data cost optimization
- Operating data platforms on Kubernetes
- Multi-cloud & hybrid cloud strategies