Data Governance Explained — Catalogs, Lineage, Quality & GDPR Compliance
Data governance is the set of policies, processes, and tools that ensure data is accurate, accessible, consistent, and protected across an organization.
What You’ll Learn
In this tutorial, you’ll learn about data catalogs, data lineage, data quality frameworks, metadata management, GDPR/CCPA compliance requirements, and data contracts, along with tools such as Apache Atlas and DataHub.
Why It Matters
Without data governance, organizations end up with data silos, inconsistent metrics, compliance violations, and untrustworthy data. IBM estimated in 2016 that poor data quality costs the US economy about $3.1 trillion a year. In regulated industries such as banking and healthcare, governance is a legal obligation rather than a choice.
Real-World Use
A bank needs to know exactly where customer data flows, who accesses it, and how it’s processed. When a regulator asks “show us all data related to customer X,” the governance system traces lineage from the source system through every transformation to reports.
graph TD
subgraph "Data Governance Pillars"
A[Data Catalog] --> B[Metadata Management]
C[Data Lineage] --> D[Impact Analysis]
E[Data Quality] --> F[Trust & Reliability]
G[Compliance] --> H[GDPR / CCPA]
end
I[Data Contracts] --> J[Producer/Consumer Agreements]
A --> K[Search & Discovery]
C --> K
E --> L[DataOps]
G --> L
Data Catalog
A data catalog is an inventory of all data assets in an organization. Think of it as a search engine for your data.
| Feature | Purpose |
|---|---|
| Data discovery | Find datasets, tables, and columns |
| Business glossary | Define what terms mean (e.g., “Active Customer” = paid in last 90 days) |
| Tagging | Label data by domain, PII status, quality level |
| Certification | Mark datasets as “trusted” or “bronze/silver/gold” |
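The features in the table can be illustrated with a toy, in-memory catalog. This is only a sketch: `CatalogEntry`, `DataCatalog`, and the sample datasets are hypothetical names invented for the example, and real catalogs such as DataHub or Apache Atlas expose far richer models and APIs.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset in a toy catalog (hypothetical structure)."""
    name: str
    description: str
    owner: str
    tags: set = field(default_factory=set)
    certification: str = "bronze"  # bronze / silver / gold


class DataCatalog:
    def __init__(self):
        self.entries = {}

    def register(self, entry):
        self.entries[entry.name] = entry

    def search(self, keyword):
        """Data discovery: match the keyword against names and descriptions."""
        kw = keyword.lower()
        return sorted(
            e.name for e in self.entries.values()
            if kw in e.name.lower() or kw in e.description.lower()
        )

    def with_tag(self, tag):
        """Tagging: list every asset carrying a label such as 'pii'."""
        return sorted(e.name for e in self.entries.values() if tag in e.tags)


catalog = DataCatalog()
catalog.register(CatalogEntry("customers", "Customer master data", "crm_team", {"pii", "sales"}, "gold"))
catalog.register(CatalogEntry("orders", "Orders placed by customers", "order_team", {"sales"}, "silver"))
catalog.register(CatalogEntry("web_logs", "Raw clickstream events", "platform_team", {"pii"}))

print(catalog.search("customer"))  # ['customers', 'orders']
print(catalog.with_tag("pii"))     # ['customers', 'web_logs']
```

Keeping tags and certification on the entry itself means a single lookup answers both “where is the PII?” and “which datasets can I trust?”.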
Tools
- Apache Atlas: Open-source, integrated with Hadoop ecosystem
- DataHub: LinkedIn’s open-source metadata platform
- Amundsen: Lyft’s data discovery platform
- Alation, Collibra: Enterprise commercial catalogs
Data Lineage
Lineage shows where data comes from, how it’s transformed, and where it goes. End-to-end lineage traces from source systems through ETL pipelines to dashboards.
class LineageNode:
    def __init__(self, name, node_type):
        self.name = name
        self.node_type = node_type  # e.g. database, stream, transform, dashboard
        self.inputs = []   # upstream nodes
        self.outputs = []  # downstream nodes

    def add_input(self, node):
        self.inputs.append(node)

    def add_output(self, node):
        self.outputs.append(node)


class LineageGraph:
    def __init__(self):
        self.nodes = {}

    def add_node(self, name, node_type):
        self.nodes[name] = LineageNode(name, node_type)

    def add_edge(self, from_node, to_node):
        # Record the edge on both ends so it can be walked in either direction
        self.nodes[from_node].add_output(self.nodes[to_node])
        self.nodes[to_node].add_input(self.nodes[from_node])

    def trace_upstream(self, node_name, depth=0):
        # Print the node, then recurse into everything that feeds it.
        # Assumes the graph has no cycles.
        node = self.nodes.get(node_name)
        if not node:
            return
        prefix = "  " * depth
        print(f"{prefix}{node.name} ({node.node_type})")
        for inp in node.inputs:
            self.trace_upstream(inp.name, depth + 1)

    def trace_downstream(self, node_name, depth=0):
        # Print the node, then recurse into everything it feeds
        node = self.nodes.get(node_name)
        if not node:
            return
        prefix = "  " * depth
        print(f"{prefix}{node.name} ({node.node_type})")
        for out in node.outputs:
            self.trace_downstream(out.name, depth + 1)


# Build a lineage graph
g = LineageGraph()
g.add_node("orders_db", "database")
g.add_node("kafka_orders", "stream")
g.add_node("etl_job", "transform")
g.add_node("sales_dw", "data_warehouse")
g.add_node("sales_report", "dashboard")
g.add_edge("orders_db", "kafka_orders")
g.add_edge("kafka_orders", "etl_job")
g.add_edge("etl_job", "sales_dw")
g.add_edge("sales_dw", "sales_report")

print("=== Upstream from sales_report ===")
g.trace_upstream("sales_report")
print("\n=== Downstream from orders_db ===")
g.trace_downstream("orders_db")

Expected output:

=== Upstream from sales_report ===
sales_report (dashboard)
  sales_dw (data_warehouse)
    etl_job (transform)
      kafka_orders (stream)
        orders_db (database)

=== Downstream from orders_db ===
orders_db (database)
  kafka_orders (stream)
    etl_job (transform)
      sales_dw (data_warehouse)
        sales_report (dashboard)

Data Quality
Data quality is measured across six dimensions:
| Dimension | Definition | Example |
|---|---|---|
| Completeness | Are all required values present? | Customer email is 95% filled |
| Accuracy | Is the data correct? | Revenue matches invoice system |
| Consistency | Is it the same across systems? | Same customer name in CRM and billing |
| Timeliness | Is it up to date? | Stock prices within 1 second |
| Uniqueness | Are there duplicates? | No duplicate customer records |
| Validity | Does it conform to rules? | Email matches regex pattern |
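Three of these dimensions (completeness, uniqueness, and validity) are easy to score directly on a batch of rows. The sketch below assumes a small hypothetical `customers` list and a deliberately simple email pattern; it illustrates the idea and is not a substitute for a framework such as Great Expectations.

```python
import re

# Hypothetical sample rows; in practice these would come from a table or file.
customers = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "not-an-email"},
    {"id": 3, "email": "raj@example.com"},
]

# Deliberately simple pattern for illustration, not a full email validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, column):
    """Share of rows where the column is populated."""
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows, column):
    """Share of distinct values among all rows."""
    values = [r[column] for r in rows]
    return len(set(values)) / len(values)


def validity(rows, column, pattern):
    """Share of populated values that match the pattern."""
    present = [r[column] for r in rows if r.get(column)]
    if not present:
        return 1.0
    return sum(1 for v in present if pattern.match(v)) / len(present)


report = {
    "email_completeness": completeness(customers, "email"),
    "id_uniqueness": uniqueness(customers, "id"),
    "email_validity": validity(customers, "email", EMAIL_RE),
}
for metric, score in report.items():
    print(f"{metric}: {score:.2f}")
# email_completeness: 0.75
# id_uniqueness: 0.75
# email_validity: 0.67
```

Expressing each dimension as a ratio between 0 and 1 makes it straightforward to compare the scores against a target threshold later.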
GDPR and CCPA Compliance
| Requirement | What It Means |
|---|---|
| Right to be forgotten | Delete a user’s personal data on request, unless a legal exemption applies |
| Data portability | Export user data in machine-readable format |
| Consent management | Record what users consented to and when |
| Data retention | Automatically delete data after policy period |
| Breach notification | Report breaches to the supervisory authority within 72 hours of becoming aware of them (GDPR) |
| DSR (Data Subject Request) | Respond to user data requests within one month under GDPR, or 45 days under CCPA |
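The retention and breach-notification rows are essentially date arithmetic. As a rough sketch, assuming hypothetical retention periods in a `RETENTION` dict and records that carry a `created_at` timestamp (real periods come from your legal and policy teams):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per dataset; real values come from policy.
RETENTION = {
    "web_logs": timedelta(days=90),
    "support_tickets": timedelta(days=365),
}


def expired(records, dataset, now):
    """Return the records older than the dataset's retention period."""
    cutoff = now - RETENTION[dataset]
    return [r for r in records if r["created_at"] < cutoff]


def breach_report_deadline(detected_at):
    """Latest reporting time under the 72-hour rule in the table."""
    return detected_at + timedelta(hours=72)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
logs = [
    {"id": "a", "created_at": datetime(2024, 1, 15, tzinfo=timezone.utc)},
    {"id": "b", "created_at": datetime(2024, 5, 20, tzinfo=timezone.utc)},
]

to_delete = expired(logs, "web_logs", now)
print([r["id"] for r in to_delete])  # ['a']

deadline = breach_report_deadline(datetime(2024, 6, 1, 9, 0, tzinfo=timezone.utc))
print(deadline)  # 2024-06-04 09:00:00+00:00
```

Passing `now` in as an argument, rather than calling the clock inside the function, keeps the retention check easy to test and to run as a scheduled job.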
from datetime import datetime


class GDPRCompliance:
    def __init__(self):
        self.consents = {}       # user_id -> {purpose: timestamp}
        self.personal_data = {}  # user_id -> {field: value}

    def record_consent(self, user_id, purpose):
        self.consents.setdefault(user_id, {})[purpose] = datetime.now()
        print(f"Consent recorded for {user_id}: {purpose}")

    def check_consent(self, user_id, purpose):
        return purpose in self.consents.get(user_id, {})

    def delete_user(self, user_id):
        """Right to be forgotten"""
        self.consents.pop(user_id, None)
        self.personal_data.pop(user_id, None)
        print(f"All data deleted for {user_id}")

    def export_user(self, user_id):
        """Data portability"""
        return {
            "consents": self.consents.get(user_id, {}),
            "data": self.personal_data.get(user_id, {}),
        }


compliance = GDPRCompliance()
compliance.record_consent("user_42", "email_marketing")
compliance.record_consent("user_42", "analytics")
print(f"Can email user_42? {compliance.check_consent('user_42', 'email_marketing')}")
compliance.delete_user("user_42")

Expected output:

Consent recorded for user_42: email_marketing
Consent recorded for user_42: analytics
Can email user_42? True
All data deleted for user_42

Data Contracts
A data contract is an agreement between data producers and consumers specifying schema, SLAs, and quality guarantees.
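An agreement of this kind is only useful if records are actually checked against it. One possible consumer-side check is sketched below; the `CONTRACT` dict and the `violations` helper are hypothetical and exist only for illustration, and a real setup would typically load the rules from a contract file rather than hard-code them.

```python
# Hypothetical contract rules, written as a plain dict for illustration.
CONTRACT = {
    "order_id": {"type": str, "required": True},
    "customer_email": {"type": str, "required": False},
    "amount": {"type": float, "required": False, "min": 0},
}


def violations(record, contract=CONTRACT):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for name, rule in contract.items():
        value = record.get(name)
        if value is None:
            if rule["required"]:
                problems.append(f"{name}: missing required field")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: below minimum {rule['min']}")
    return problems


print(violations({"order_id": "o-1", "amount": 19.99}))
# []
print(violations({"amount": -5.0}))
# ['order_id: missing required field', 'amount: below minimum 0']
```

Returning a list of problems instead of raising on the first one lets a pipeline report every violation in a batch at once.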
# data_contract.yaml
dataset: customer_orders
producer: order_service
schema:
  fields:
    - name: order_id
      type: string
      required: true
    - name: customer_email
      type: string
      format: email
      pii: true
    - name: amount
      type: double
      constraints:
        - min: 0
    - name: created_at
      type: timestamp
freshness: 60m  # Data must be < 60 min old
quality_sla:
  completeness: 0.99
  accuracy: 0.999

Common Mistakes
- Governance without automation: Manual governance doesn’t scale. Automate lineage, quality checks, and compliance workflows.
- Treating governance as a one-time project: Governance is ongoing. Policies need regular review and enforcement.
- Ignoring data lineage until something breaks: Without lineage, a broken dashboard takes days to diagnose. Document lineage from day one.
- Setting unrealistic data quality thresholds: 100% accuracy is rarely achievable. Set realistic targets and measure improvement.
- Not classifying PII data upfront: When PII isn’t tagged, compliance requests (GDPR) require full scans. Tag PII at ingestion.
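The last point, tagging PII at ingestion, can be approximated with simple heuristics. The sketch below is a rough illustration only: `PII_NAME_HINTS`, `EMAIL_VALUE`, and `tag_pii_columns` are hypothetical names, and production classifiers usually combine column names, value patterns, sampling, and human review.

```python
import re

# Hypothetical detectors, kept intentionally small for the example.
PII_NAME_HINTS = ("email", "phone", "ssn", "address", "birth")
EMAIL_VALUE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def tag_pii_columns(rows):
    """Return the set of column names that look like PII in a batch of rows."""
    tagged = set()
    for row in rows:
        for column, value in row.items():
            # Flag by column name first, then by the shape of the value
            if any(hint in column.lower() for hint in PII_NAME_HINTS):
                tagged.add(column)
            elif isinstance(value, str) and EMAIL_VALUE.fullmatch(value):
                tagged.add(column)
    return tagged


batch = [
    {"order_id": "o-1", "contact": "ana@example.com", "phone_number": "555-0100", "amount": 20.0},
    {"order_id": "o-2", "contact": "raj@example.com", "phone_number": None, "amount": 35.5},
]
print(sorted(tag_pii_columns(batch)))  # ['contact', 'phone_number']
```

Here `contact` is caught by its values and `phone_number` by its name, which is why checking both signals matters when column names are not descriptive.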
Practice Questions
What is data lineage? The record of data’s origin, transformations, and movement across systems. It enables impact analysis and debugging.
What are the six dimensions of data quality? Completeness, Accuracy, Consistency, Timeliness, Uniqueness, Validity.
What is a data catalog? A searchable inventory of data assets with metadata, business definitions, and ownership information.
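What does the GDPR “right to be forgotten” require? Organizations must erase a user’s personal data without undue delay when asked, unless an exemption such as a legal retention obligation applies. The erasure has to reach downstream systems as well, and backups need a documented plan for removing the data.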
What is a data contract? A formal agreement between data producers and consumers specifying schema, freshness, and quality SLAs.
Challenge
Design a data governance framework for an e-commerce company. Include: catalog structure, lineage tracking approach, data quality checks for the “orders” dataset, and GDPR compliance workflows.
Real-World Task
Check if your company has a data catalog. If yes, find a dataset you frequently use — is its metadata accurate? Is lineage documented? Are quality checks in place?
Mini Project: Data Quality Monitor
Build a Python tool that connects to a database, runs data quality checks (completeness, uniqueness, freshness, validity), and generates a report. Use Great Expectations if possible.
Security angle: Data governance and security go hand in hand. Knowing where sensitive data lives, who accesses it, and how it flows is essential for compliance and breach prevention.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
What’s Next
Congratulations on completing this Data Governance tutorial! Here’s where to go from here:
- Practice daily — Consistency is more important than long study sessions
- Build a project — Apply what you learned by building something real
- Explore related topics — Check out other tutorials in the same category
- Join the community — Discuss with other learners and share your progress
Remember: every expert was once a beginner. Keep coding!