Data Governance Explained — Catalogs, Lineage, Quality & GDPR Compliance

DodaTech Updated Jun 15, 2026 6 min read

Data governance is the set of policies, processes, and tools that ensure data is accurate, accessible, consistent, and protected across an organization.

What You’ll Learn

In this tutorial, you’ll learn about data catalogs, data lineage, data quality frameworks, metadata management, GDPR/CCPA compliance requirements, data contracts, and tools such as Apache Atlas and DataHub.

Why It Matters

Without data governance, organizations end up with data silos, inconsistent metrics, compliance violations, and untrustworthy data. IBM estimated in 2016 that poor data quality costs the US economy about $3.1 trillion a year. In regulated industries such as banking and healthcare, governance is a legal requirement rather than an optional practice.

Real-World Use

A bank needs to know exactly where customer data flows, who accesses it, and how it’s processed. When a regulator asks “show us all data related to customer X,” the governance system traces lineage from the source system through every transformation to reports. Durga Antivirus Pro’s data handling follows strict governance policies.


graph TD
  subgraph "Data Governance Pillars"
    A[Data Catalog] --> B[Metadata Management]
    C[Data Lineage] --> D[Impact Analysis]
    E[Data Quality] --> F[Trust & Reliability]
    G[Compliance] --> H[GDPR / CCPA]
  end
  I[Data Contracts] --> J[Producer/Consumer Agreements]
  A --> K[Search & Discovery]
  C --> K
  E --> L[DataOps]
  G --> L

Data Catalog

A data catalog is an inventory of all data assets in an organization. Think of it as a search engine for your data.

Feature	Purpose
Data discovery	Find datasets, tables, and columns
Business glossary	Define what terms mean (e.g., “Active Customer” = paid in last 90 days)
Tagging	Label data by domain, PII status, quality level
Certification	Mark datasets as “trusted” or “bronze/silver/gold”
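
To make the four features concrete, here is a minimal in-memory sketch of a catalog. It is not how Apache Atlas, DataHub, or any real catalog works internally; the class names, fields, and sample datasets are all hypothetical and chosen only to mirror the table above.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset in the catalog (all field names are illustrative)."""
    name: str
    description: str
    owner: str
    tags: set = field(default_factory=set)   # e.g. {"pii", "sales"}
    certification: str = "bronze"            # bronze / silver / gold


class DataCatalog:
    def __init__(self):
        self.entries = {}    # dataset name -> CatalogEntry
        self.glossary = {}   # business term -> definition

    def register(self, entry):
        self.entries[entry.name] = entry

    def define_term(self, term, definition):
        self.glossary[term] = definition

    def search(self, keyword):
        """Discovery: case-insensitive match on name or description."""
        kw = keyword.lower()
        return sorted(
            e.name for e in self.entries.values()
            if kw in e.name.lower() or kw in e.description.lower()
        )

    def find_by_tag(self, tag):
        """Tagging: list every dataset carrying a given label."""
        return sorted(e.name for e in self.entries.values() if tag in e.tags)


catalog = DataCatalog()
catalog.register(CatalogEntry("customers", "Customer master records", "crm_team",
                              tags={"pii", "crm"}, certification="gold"))
catalog.register(CatalogEntry("orders", "Customer orders from the web shop", "sales_team",
                              tags={"sales"}))
catalog.define_term("Active Customer", "Paid in the last 90 days")

print(catalog.search("customer"))   # ['customers', 'orders']
print(catalog.find_by_tag("pii"))   # ['customers']
```

A production catalog would persist this metadata and harvest it automatically from source systems; the sketch only shows the shape of the information.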

Tools

Apache Atlas: Open-source, integrated with Hadoop ecosystem
DataHub: LinkedIn’s open-source metadata platform
Amundsen: Lyft’s data discovery platform
Alation, Collibra: Enterprise commercial catalogs

Data Lineage

Lineage shows where data comes from, how it’s transformed, and where it goes. End-to-end lineage traces from source systems through ETL pipelines to dashboards.

class LineageNode:
    def __init__(self, name, node_type):
        self.name = name
        self.node_type = node_type  # e.g. database, stream, transform, dashboard
        self.inputs = []
        self.outputs = []

    def add_input(self, node):
        self.inputs.append(node)

    def add_output(self, node):
        self.outputs.append(node)

class LineageGraph:
    def __init__(self):
        self.nodes = {}

    def add_node(self, name, node_type):
        self.nodes[name] = LineageNode(name, node_type)

    def add_edge(self, from_node, to_node):
        self.nodes[from_node].add_output(self.nodes[to_node])
        self.nodes[to_node].add_input(self.nodes[from_node])

    def trace_upstream(self, node_name, depth=0):
        node = self.nodes.get(node_name)
        if not node:
            return
        prefix = "  " * depth
        print(f"{prefix}{node.name} ({node.node_type})")
        for inp in node.inputs:
            self.trace_upstream(inp.name, depth + 1)

    def trace_downstream(self, node_name, depth=0):
        node = self.nodes.get(node_name)
        if not node:
            return
        prefix = "  " * depth
        print(f"{prefix}{node.name} ({node.node_type})")
        for out in node.outputs:
            self.trace_downstream(out.name, depth + 1)

# Build a lineage graph
g = LineageGraph()
g.add_node("orders_db", "database")
g.add_node("kafka_orders", "stream")
g.add_node("etl_job", "transform")
g.add_node("sales_dw", "data_warehouse")
g.add_node("sales_report", "dashboard")

g.add_edge("orders_db", "kafka_orders")
g.add_edge("kafka_orders", "etl_job")
g.add_edge("etl_job", "sales_dw")
g.add_edge("sales_dw", "sales_report")

print("=== Upstream from sales_report ===")
g.trace_upstream("sales_report")
print("\n=== Downstream from orders_db ===")
g.trace_downstream("orders_db")

Expected output:

=== Upstream from sales_report ===
sales_report (dashboard)
  sales_dw (data_warehouse)
    etl_job (transform)
      kafka_orders (stream)
        orders_db (database)

=== Downstream from orders_db ===
orders_db (database)
  kafka_orders (stream)
    etl_job (transform)
      sales_dw (data_warehouse)
        sales_report (dashboard)
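
Lineage is most useful for impact analysis: before changing a node, list everything downstream that could break. The sketch below is one possible way to do that with a breadth-first search over a plain adjacency dict. The `finance_report` node is hypothetical, added only to show fan-out, and the visited set is there so the traversal would also terminate on a graph that contains a cycle.

```python
from collections import deque

# Edges point downstream: producer -> list of consumers.
# "finance_report" is a hypothetical extra consumer, added to show fan-out.
edges = {
    "orders_db": ["kafka_orders"],
    "kafka_orders": ["etl_job"],
    "etl_job": ["sales_dw"],
    "sales_dw": ["sales_report", "finance_report"],
}


def impacted_by(start, edges):
    """Return every node reachable downstream of `start` (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:   # visited check stops cycles from looping forever
                seen.add(child)
                queue.append(child)
    seen.discard(start)             # in a cyclic graph the start can be re-reached
    return sorted(seen)


print(impacted_by("etl_job", edges))
# ['finance_report', 'sales_dw', 'sales_report']
```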

Data Quality

Data quality is measured across six dimensions:

Dimension	Definition	Example
Completeness	Are all required values present?	Customer email is 95% filled
Accuracy	Is the data correct?	Revenue matches invoice system
Consistency	Is it the same across systems?	Same customer name in CRM and billing
Timeliness	Is it up to date?	Stock prices within 1 second
Uniqueness	Are there duplicates?	No duplicate customer records
Validity	Does it conform to rules?	Email matches regex pattern
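
Three of these dimensions (completeness, uniqueness, validity) can be measured directly on a batch of records. The following is a minimal sketch using only the standard library; the sample rows are made up, the email pattern is deliberately simplified, and the functions assume a non-empty list of rows.

```python
import re

# Hypothetical sample records; None marks a missing value.
customers = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "not-an-email"},
    {"id": 3, "email": "bo@example.com"},   # duplicate id
]

# Deliberately simple pattern for illustration, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, column):
    """Share of rows where the column is present and not None."""
    return sum(r.get(column) is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Share of distinct values among all rows (1.0 means no duplicates)."""
    values = [r.get(column) for r in rows]
    return len(set(values)) / len(values)


def validity(rows, column, pattern):
    """Share of non-missing values that match the pattern."""
    present = [r[column] for r in rows if r.get(column) is not None]
    if not present:
        return 1.0
    return sum(bool(pattern.match(v)) for v in present) / len(present)


print(completeness(customers, "email"))                   # 0.75
print(uniqueness(customers, "id"))                        # 0.75
print(round(validity(customers, "email", EMAIL_RE), 2))   # 0.67
```

Accuracy, consistency, and timeliness usually need a second system or a clock to compare against, so they are left out of this sketch.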

GDPR and CCPA Compliance

Requirement	What It Means
Right to be forgotten	Delete all data for a user on request
Data portability	Export user data in machine-readable format
Consent management	Record what users consented to and when
Data retention	Automatically delete data after policy period
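Breach notification	Notify the supervisory authority within 72 hours of becoming aware of a breach (GDPR)
DSR (Data Subject Request)	Respond to user data requests within one month under GDPR (extendable for complex requests) or 45 days under CCPA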

from datetime import datetime, timezone

class GDPRCompliance:
    def __init__(self):
        self.consents = {}  # user_id -> {purpose: timestamp}
        self.personal_data = {}  # user_id -> {field: value}

    def record_consent(self, user_id, purpose):
        # Store a timezone-aware UTC timestamp so consent times are unambiguous
        self.consents.setdefault(user_id, {})[purpose] = datetime.now(timezone.utc)
        print(f"Consent recorded for {user_id}: {purpose}")

    def check_consent(self, user_id, purpose):
        return purpose in self.consents.get(user_id, {})

    def delete_user(self, user_id):
        """Right to be forgotten"""
        self.consents.pop(user_id, None)
        self.personal_data.pop(user_id, None)
        print(f"All data deleted for {user_id}")

    def export_user(self, user_id):
        """Data portability"""
        return {
            "consents": self.consents.get(user_id, {}),
            "data": self.personal_data.get(user_id, {}),
        }

compliance = GDPRCompliance()
compliance.record_consent("user_42", "email_marketing")
compliance.record_consent("user_42", "analytics")
print(f"Can email user_42? {compliance.check_consent('user_42', 'email_marketing')}")
compliance.delete_user("user_42")

Expected output:

Consent recorded for user_42: email_marketing
Consent recorded for user_42: analytics
Can email user_42? True
All data deleted for user_42
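
Two other rows of the requirements table, data retention and DSR response deadlines, come down to date arithmetic. The sketch below assumes a 365-day retention policy and approximates the DSR response window as 30 days; both values, and the record layout, are illustrative assumptions rather than legal guidance.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)   # assumed policy period
DSR_WINDOW = timedelta(days=30)   # assumed response window


def purge_expired(records, now):
    """Keep only records younger than the retention period."""
    return [r for r in records if now - r["created_at"] < RETENTION]


def dsr_due_date(received_at):
    """Latest date by which a data subject request should be answered."""
    return received_at + DSR_WINDOW


# A fixed "now" keeps the example reproducible.
now = datetime(2026, 6, 15, tzinfo=timezone.utc)
records = [
    {"user_id": "user_1", "created_at": datetime(2024, 1, 10, tzinfo=timezone.utc)},
    {"user_id": "user_2", "created_at": datetime(2026, 3, 1, tzinfo=timezone.utc)},
]

kept = purge_expired(records, now)
print([r["user_id"] for r in kept])   # ['user_2']
print(dsr_due_date(now).date())       # 2026-07-15
```

In practice the purge would run as a scheduled job against each store that holds personal data, which is one reason lineage and PII tagging matter for compliance.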

Data Contracts

A data contract is an agreement between data producers and consumers specifying schema, SLAs, and quality guarantees.

# data_contract.yaml
dataset: customer_orders
producer: order_service
schema:
  fields:
    - name: order_id
      type: string
      required: true
    - name: customer_email
      type: string
      format: email
      pii: true
    - name: amount
      type: double
      constraints:
        - min: 0
    - name: created_at
      type: timestamp
      freshness: 60m  # Data must be < 60 min old
quality_sla:
  completeness: 0.99
  accuracy: 0.999
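
A contract is only useful if something enforces it. As a rough sketch, the field rules above can be checked record by record. To keep the example dependency-free, the contract is mirrored as a Python dict instead of being parsed from the YAML file; the `validate` function and its error messages are hypothetical, and the freshness and SLA checks are omitted.

```python
import re

# The contract mirrored as a Python dict so the sketch needs no YAML parser.
contract = {
    "order_id": {"type": str, "required": True},
    "customer_email": {"type": str, "format": "email"},
    "amount": {"type": float, "min": 0},
}

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simplified pattern


def validate(record, contract):
    """Return a list of human-readable violations (empty list = valid)."""
    errors = []
    for name, rule in contract.items():
        value = record.get(name)
        if value is None:
            if rule.get("required"):
                errors.append(f"{name}: missing required field")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if rule.get("format") == "email" and not EMAIL_RE.match(value):
            errors.append(f"{name}: not a valid email")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
    return errors


good = {"order_id": "A-1", "customer_email": "ana@example.com", "amount": 19.99}
bad = {"customer_email": "nope", "amount": -5.0}

print(validate(good, contract))  # []
print(validate(bad, contract))
# ['order_id: missing required field', 'customer_email: not a valid email', 'amount: below minimum 0']
```

A producer could run a check like this before publishing, so that contract violations are caught at the source instead of in a consumer's dashboard.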

Common Mistakes

Governance without automation: Manual governance doesn’t scale. Automate lineage, quality checks, and compliance workflows.
Treating governance as a one-time project: Governance is ongoing. Policies need regular review and enforcement.
Ignoring data lineage until something breaks: Without lineage, a broken dashboard takes days to diagnose. Document lineage from day one.
Setting unrealistic data quality thresholds: 100% accuracy is rarely achievable. Set realistic targets and measure improvement.
Not classifying PII data upfront: When PII isn’t tagged, compliance requests (GDPR) require full scans. Tag PII at ingestion.
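
As an illustration of the last point, PII tagging at ingestion can start with simple heuristics. The sketch below flags columns by name hints and by values that look like email addresses. The hint list, the pattern, and the sample batch are assumptions for demonstration; real classifiers use far richer rules and usually a human review step.

```python
import re

# Hypothetical detection rules: column-name hints plus a value pattern.
PII_NAME_HINTS = ("email", "phone", "ssn", "address", "birth")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def tag_pii_columns(rows):
    """Return the set of column names that look like they hold PII."""
    flagged = set()
    for row in rows:
        for column, value in row.items():
            if any(hint in column.lower() for hint in PII_NAME_HINTS):
                flagged.add(column)
            elif isinstance(value, str) and EMAIL_RE.match(value):
                flagged.add(column)   # value looks like an email address
    return flagged


batch = [
    {"order_id": "A-1", "contact": "ana@example.com", "phone_number": "555-0100", "amount": 19.99},
    {"order_id": "A-2", "contact": "bo@example.com", "phone_number": None, "amount": 5.00},
]

print(sorted(tag_pii_columns(batch)))   # ['contact', 'phone_number']
```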

Practice Questions

What is data lineage? The record of data’s origin, transformations, and movement across systems. It enables impact analysis and debugging.
What are the six dimensions of data quality? Completeness, Accuracy, Consistency, Timeliness, Uniqueness, Validity.
What is a data catalog? A searchable inventory of data assets with metadata, business definitions, and ownership information.
What does the GDPR “right to be forgotten” require? Organizations must delete all personal data for a user upon request, including from backups and downstream systems, unless a legal exemption such as a statutory retention obligation applies.
What is a data contract? A formal agreement between data producers and consumers specifying schema, freshness, and quality SLAs.

Challenge

Design a data governance framework for an e-commerce company. Include: catalog structure, lineage tracking approach, data quality checks for the “orders” dataset, and GDPR compliance workflows.

Real-World Task

Check if your company has a data catalog. If yes, find a dataset you frequently use — is its metadata accurate? Is lineage documented? Are quality checks in place?

Mini Project: Data Quality Monitor

Build a Python tool that connects to a database, runs data quality checks (completeness, uniqueness, freshness, validity), and generates a report. Use Great Expectations if possible.

Security angle: Data governance and security go hand in hand. Knowing where sensitive data lives, who accesses it, and how it flows is essential for compliance and breach prevention.

What’s Next

Real-Time Analytics — Next Lesson

Review: Apache Flink

Review: Data Warehousing

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
