Introduction
When OCR opens a HIPAA breach investigation, the first question is not "what was in the breach?" — it is "show us everywhere that data went." If your team cannot trace a member record from its source system through every transformation, join, and output table to the point of exposure, you have a serious problem. That is a data lineage problem, and it is not hypothetical. [RADV](/terms/RADV) auditors ask the same question about risk adjustment data: how did this diagnosis code get into the final submission, and what were its intermediate states?
Data lineage in healthcare is a compliance control. Treating it as a nice-to-have engineering feature is a mistake your compliance team will eventually pay for.
What Data Lineage Actually Means
Data lineage tracks the origin, movement, and transformation of data over time. In a healthcare data warehouse, this means answering:
- Where did this member record come from (Epic, Facets, enrollment file)?
- What transformations were applied (deduplication, standard code mapping, type casting)?
- What downstream tables or reports consume it?
- Who accessed it at each stage?
- When did it change, and what changed?
Lineage operates at two levels:
Column-level lineage: tracks how individual fields propagate through transformations. This is what you need for HIPAA breach scope determination — specifically, identifying which PHI fields were exposed at each point in the pipeline.
Table-level lineage: tracks data flow at the table/dataset level. Sufficient for most audit readiness use cases but insufficient for breach investigation.
Why HIPAA Breach Reporting Requires Lineage
Under HIPAA's Breach Notification Rule, a covered entity must notify affected individuals without unreasonable delay, and no later than 60 days after discovering a breach. To meet that requirement, your team must be able to:
- Identify the set of PHI that was exposed (which records, which fields)
- Determine who had access to those records
- Establish the timeline of exposure
Without column-level lineage, step 1 becomes a manual investigation that can take weeks. With lineage, you can query your graph: "Show me every table and user that touched member records from this source between these dates."
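That graph query is, at its core, a downstream traversal. A minimal sketch (table names hypothetical; in practice the adjacency map would be loaded from your lineage store):

```python
from collections import deque

# Hypothetical table-level lineage graph: edges point downstream.
downstream = {
    "raw.member_file": ["staging.members"],
    "staging.members": ["warehouse.member_demographics", "warehouse.member_claims"],
    "warehouse.member_demographics": ["reporting.member_roster"],
}

def breach_scope(source: str) -> set[str]:
    """Return every table reachable downstream of an exposed source."""
    seen, queue = set(), deque([source])
    while queue:
        table = queue.popleft()
        for child in downstream.get(table, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(breach_scope("raw.member_file")))
```

In a real investigation you would join this reachable set against access logs to get the user list, and filter edges by the exposure window.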
Why RADV Audits Require Lineage
CMS's Risk Adjustment Data Validation (RADV) audits verify that diagnosis codes submitted for risk adjustment are supported by medical record documentation. Auditors select a sample of HCC-coded claims and trace them back to the source.
Your data team needs to answer: "Show me the transformation from raw 837 claim through condition normalization to the final HCC submission." Without lineage, that reconstruction is manual and error-prone. With lineage, it is a query.
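The RADV question is the mirror image of the breach question: instead of walking downstream from a source, you walk upstream from the submission. A sketch, with hypothetical table names standing in for your actual pipeline:

```python
# Hypothetical parent map (upstream dependencies) for the RADV path.
parents = {
    "submissions.hcc_final": ["warehouse.hcc_conditions"],
    "warehouse.hcc_conditions": ["staging.claims_normalized"],
    "staging.claims_normalized": ["raw.claims_837"],
}

def upstream_path(table: str) -> list[str]:
    """Walk the upstream chain from a submission table back to its raw source."""
    path = [table]
    while parents.get(table):
        table = parents[table][0]  # follow the first upstream edge
        path.append(table)
    return path

print(" <- ".join(upstream_path("submissions.hcc_final")))
```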
Implementing Lineage in Snowflake
Snowflake's Access History and Object Dependencies features provide foundational lineage data. For column-level lineage, Snowflake's ACCESS_HISTORY view in the SNOWFLAKE.ACCOUNT_USAGE schema tracks which columns were read in each query.
```sql
-- Find every query that accessed PHI columns in the last 30 days
SELECT DISTINCT
    ah.query_id,
    ah.user_name,
    ah.query_start_time,
    obj.value:objectName::STRING AS object_name
FROM snowflake.account_usage.access_history ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) obj
WHERE ah.query_start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP)
  AND obj.value:objectName::STRING ILIKE '%member_demographics%'
ORDER BY ah.query_start_time DESC;
```
For full pipeline lineage (not just query lineage), pair Snowflake Access History with a lineage tool that reads your dbt DAG. dbt's manifest.json encodes the full transformation graph — every model, its SQL, its upstream dependencies, and its downstream consumers.
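Because manifest.json already contains a `parent_map` of upstream dependencies, extracting lineage edges from it is a few lines of Python (the model names below are hypothetical stand-ins for a real manifest):

```python
import json

def lineage_edges(manifest: dict) -> list[tuple[str, str]]:
    """Extract (upstream, downstream) edges from a dbt manifest's parent_map."""
    edges = []
    for node, upstreams in manifest.get("parent_map", {}).items():
        for upstream in upstreams:
            edges.append((upstream, node))  # edge points downstream
    return edges

# In a real project: manifest = json.load(open("target/manifest.json"))
manifest = {
    "parent_map": {
        "model.warehouse.member_demographics": [
            "model.warehouse.stg_members",
            "source.warehouse.raw.member_file",
        ]
    }
}
print(lineage_edges(manifest))
```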
Implementing Lineage in Databricks
Databricks Unity Catalog provides column-level lineage natively on Databricks Runtime 11.3 LTS and above. Enable it at the catalog level and query the lineage graph via the Unity Catalog UI or REST API.
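A sketch of hitting the lineage REST API for a single PHI column (the endpoint path follows Databricks' lineage-tracking API at time of writing — verify against your workspace's API version; the host and column names are hypothetical):

```python
from urllib.parse import urlencode

def column_lineage_request(host: str, table_name: str, column_name: str) -> str:
    """Build the GET URL for Unity Catalog column-level lineage."""
    params = urlencode({"table_name": table_name, "column_name": column_name})
    return f"https://{host}/api/2.0/lineage-tracking/column-lineage?{params}"

url = column_lineage_request(
    "my-workspace.cloud.databricks.com",  # hypothetical workspace host
    "phi.member_demographics",
    "member_ssn",
)
print(url)
# Execute with a PAT or OAuth token, e.g.:
# requests.get(url, headers={"Authorization": f"Bearer {token}"})
```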
For pipeline lineage in Spark jobs:
```python
# Use Delta Lake's history API to track changes to a PHI table
from delta.tables import DeltaTable

delta_table = DeltaTable.forName(spark, "phi.member_demographics")
history = delta_table.history(50)  # last 50 operations

history.select(
    "version", "timestamp", "operation",
    "operationParameters", "userMetadata"
).show(truncate=False)
```
Delta's time travel combined with Unity Catalog lineage gives you both "what changed" and "who read it" — the two questions every breach investigation starts with.
Lineage Architecture Pattern
A practical lineage architecture for a healthcare data warehouse has three components:
1. Capture layer: Automated scanners that extract lineage from SQL transformations (dbt manifest, Spark query plans, stored procedure ASTs), pipeline orchestrators (Airflow DAGs), and cloud platform APIs (Snowflake Access History, BigQuery Lineage).
2. Storage layer: A lineage graph database (or a purpose-built catalog like Atlan, DataHub, or OpenLineage-compatible store) that stores nodes (tables, columns, pipelines) and edges (upstream/downstream relationships).
3. Query layer: An interface — ideally API-accessible — that lets compliance, audit, and engineering teams trace data paths, identify PHI exposure scope, and export audit evidence.
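The capture layer typically emits lineage as run events. A minimal sketch of an OpenLineage-style event (field names follow the OpenLineage spec; the job and table names are hypothetical):

```python
from datetime import datetime, timezone
from uuid import uuid4

def lineage_event(job_name: str, inputs: list[str], outputs: list[str]) -> dict:
    """Build an OpenLineage-style run event for one pipeline step."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid4())},
        "job": {"namespace": "warehouse", "name": job_name},
        "inputs": [{"namespace": "snowflake", "name": t} for t in inputs],
        "outputs": [{"namespace": "snowflake", "name": t} for t in outputs],
    }

event = lineage_event(
    "normalize_member_demographics",
    inputs=["raw.member_file"],
    outputs=["warehouse.member_demographics"],
)
print(event["job"]["name"], "->", [o["name"] for o in event["outputs"]])
```

The storage layer ingests these events as graph edges; the query layer runs the upstream and downstream traversals shown earlier.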
The Schema Drift Problem
Lineage graphs break when schemas change without notice. A dropped column, a renamed table, or a type change silently severs lineage edges. For PHI columns, this is particularly dangerous — a renamed PHI field can fall out of your masking policy if your catalog does not track the rename.
Gate schema changes with a diff tool: run the Schema Diff to generate a precise diff of any DDL change before it deploys, giving your governance team visibility into structural changes that affect lineage and masking coverage.
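The core of that gate is a schema diff plus a check against your masking policy. A minimal sketch, assuming schemas are represented as column-to-type maps and masked columns are tracked in a set (all names hypothetical):

```python
MASKED_COLUMNS = {"member_ssn", "member_dob"}  # columns under a masking policy

def schema_diff(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Diff two schema versions: dropped, added, and retyped columns."""
    return {
        "dropped": [c for c in old if c not in new],
        "added":   [c for c in new if c not in old],
        "retyped": [c for c in old if c in new and old[c] != new[c]],
    }

old = {"member_id": "INT", "member_ssn": "VARCHAR(11)", "member_dob": "DATE"}
new = {"member_id": "INT", "ssn": "VARCHAR(11)", "member_dob": "VARCHAR(10)"}

diff = schema_diff(old, new)
# A dropped masked column alongside a new unmasked column is the rename-risk
# signal: "member_ssn" may have become "ssn" and fallen out of the policy.
at_risk = [c for c in diff["dropped"] if c in MASKED_COLUMNS]
print(diff, at_risk)
```

A CI gate can block the deploy whenever `at_risk` is non-empty until the masking policy and lineage catalog are updated.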
Key Takeaways
- Column-level lineage is required for HIPAA breach scope determination. Table-level lineage alone is insufficient.
- RADV audits require traceable transformation paths from raw claim through HCC submission. Lineage makes this a query, not a manual reconstruction.
- Snowflake Access History and Databricks Unity Catalog provide native lineage capture. Pair with dbt manifest for transformation graph completeness.
- Lineage graphs break when schemas change silently. Use Schema Diff to gate DDL changes and protect lineage coverage.
- Start with your PHI tables. Full-warehouse lineage is aspirational; PHI-complete lineage is achievable and defensible.
mdatool Team
The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.