Data Lineage in Healthcare: Why Your Compliance Team Should Care

Data lineage is not just a data engineering concern. For HIPAA breach reporting and RADV audits, knowing exactly where PHI moved — and when — is a compliance requirement.

mdatool Team · April 21, 2026 · 8 min read
Tags: data lineage, HIPAA, RADV, compliance, Snowflake, Databricks

Introduction

When OCR opens a HIPAA breach investigation, the first question is not "what was in the breach?" It is "show us everywhere that data went." If your team cannot trace a member record from its source system through every transformation, join, and output table to the point of exposure, you have a serious problem. That is a data lineage problem, and it is not hypothetical. RADV auditors ask the same question about risk adjustment data: how did this diagnosis code get into the final submission, and what were its intermediate states?

Data lineage in healthcare is a compliance control. Treating it as a nice-to-have engineering feature is a mistake your compliance team will eventually pay for.


What Data Lineage Actually Means

Data lineage tracks the origin, movement, and transformation of data over time. In a healthcare data warehouse, this means answering:

  • Where did this member record come from (Epic, Facets, enrollment file)?
  • What transformations were applied (deduplication, standard code mapping, type casting)?
  • What downstream tables or reports consume it?
  • Who accessed it at each stage?
  • When did it change, and what changed?

Lineage operates at two levels:

Column-level lineage: tracks how individual fields propagate through transformations. This is what you need for HIPAA breach scope — specifically, which PHI fields were in scope at the point of exposure.

Table-level lineage: tracks data flow at the table/dataset level. Sufficient for most audit readiness use cases but insufficient for breach investigation.
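
To make the difference between the two levels concrete, here is a minimal Python sketch. The `ColumnEdge` type and every table and column name in it (`raw.enrollment`, `stg.member`, and so on) are hypothetical, invented for illustration. It stores lineage as column-to-column edges, then collapses them to table-level pairs, which shows what the table-level view loses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnEdge:
    """One column-level lineage edge: the source column feeds the target column."""
    src_table: str
    src_column: str
    dst_table: str
    dst_column: str


# Hypothetical edges, for illustration only.
EDGES = [
    ColumnEdge("raw.enrollment", "ssn", "stg.member", "ssn_hash"),
    ColumnEdge("raw.enrollment", "dob", "stg.member", "birth_date"),
    ColumnEdge("stg.member", "birth_date", "rpt.age_bands", "age_band"),
]


def table_level(edges):
    """Collapse column edges to (source table, target table) pairs."""
    return {(e.src_table, e.dst_table) for e in edges}


def columns_reaching(edges, table):
    """Return every (table, column) that feeds any column of `table`, transitively."""
    # Start from the columns that feed `table` directly, then walk upstream.
    frontier = [(e.src_table, e.src_column) for e in edges if e.dst_table == table]
    seen = set()
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(
            (e.src_table, e.src_column)
            for e in edges
            if (e.dst_table, e.dst_column) == node
        )
    return seen
```

In this sketch the table-level view only says that `rpt.age_bands` sits downstream of `raw.enrollment`. The column-level view shows that `dob` reaches the report and `ssn` does not, which is the distinction a breach-scope determination depends on.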


Why HIPAA Breach Reporting Requires Lineage

Under HIPAA's Breach Notification Rule, a covered entity must notify affected individuals without unreasonable delay, and no later than 60 calendar days after discovering a breach. To meet that requirement, your team must be able to:

  1. Identify the set of PHI that was exposed (which records, which fields)
  2. Determine who had access to those records
  3. Establish the timeline of exposure

Without column-level lineage, step 1 becomes a manual investigation that can take weeks. With lineage, you can query your graph: "Show me every table and user that touched member records from this source between these dates."
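
The kind of graph query described above might look like the following sketch. It assumes lineage and access events have already been captured. The `DOWNSTREAM` map, the `ACCESS_LOG` rows, and all user and table names are hypothetical stand-ins for whatever your lineage store and audit log actually hold.

```python
from collections import deque
from datetime import datetime

# Hypothetical downstream lineage: table -> tables that read from it.
DOWNSTREAM = {
    "raw.enrollment": ["stg.member"],
    "stg.member": ["mart.member_360", "rpt.age_bands"],
    "mart.member_360": [],
    "rpt.age_bands": [],
}

# Hypothetical access log rows: (user, table, access time).
ACCESS_LOG = [
    ("alice", "stg.member", datetime(2026, 3, 2, 9, 0)),
    ("bob", "mart.member_360", datetime(2026, 3, 5, 14, 30)),
    ("carol", "rpt.claims", datetime(2026, 3, 6, 8, 0)),
    ("bob", "raw.enrollment", datetime(2026, 1, 10, 11, 0)),
]


def downstream_tables(source):
    """Breadth-first walk returning the source and every table derived from it."""
    seen, queue = {source}, deque([source])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def exposure_scope(source, start, end):
    """Return access rows that touched the source's lineage between start and end."""
    tables = downstream_tables(source)
    return [row for row in ACCESS_LOG if row[1] in tables and start <= row[2] <= end]
```

Calling `exposure_scope("raw.enrollment", datetime(2026, 3, 1), datetime(2026, 3, 31))` on this sample data returns the March accesses by alice and bob. Carol's query is excluded because `rpt.claims` is not derived from the source, and bob's January access falls outside the window.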


Why RADV Audits Require Lineage

CMS's Risk Adjustment Data Validation (RADV) audits verify that diagnosis codes submitted for risk adjustment are supported by medical record documentation. Auditors sample enrollees and check each sampled HCC against the medical record, which means tracing submitted codes back to their source.

Your data team needs to answer: "Show me the transformation from raw 837 claim through condition normalization to the final HCC submission." Without lineage, that reconstruction is manual and error-prone. With lineage, it is a query.
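
As a rough illustration of "it is a query", the sketch below walks upstream from a final submission table to the raw claim and returns each hop with the transformation that produced it. The `STEPS` map, the table names, and the step descriptions are hypothetical. The sketch also assumes a single linear path, while a real pipeline is usually a DAG with several inputs per step.

```python
# Hypothetical lineage records: each output table maps to (input table, step description).
STEPS = {
    "submission.hcc_final": ("mart.member_hcc", "aggregate HCCs per member"),
    "mart.member_hcc": ("stg.dx_normalized", "map ICD-10 codes to HCC categories"),
    "stg.dx_normalized": ("raw.claims_837", "normalize diagnosis codes"),
}


def trace_to_source(table):
    """Walk upstream from `table`, returning (output, input, step) hops in order."""
    path = []
    seen = {table}
    while table in STEPS:
        parent, step = STEPS[table]
        path.append((table, parent, step))
        if parent in seen:  # guard against a cycle in malformed lineage
            break
        seen.add(parent)
        table = parent
    return path


def format_evidence(path):
    """Render the hops as plain-text lines suitable for an audit export."""
    return [f"{inp} -> {out}: {step}" for out, inp, step in path]
```

Each line from `format_evidence(trace_to_source("submission.hcc_final"))` names one transformation, so the reconstruction an auditor asks for becomes an export rather than an investigation.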


Implementing Lineage in Snowflake

Snowflake's Access History and Object Dependencies features provide foundational lineage data. For column-level lineage, Snowflake's ACCESS_HISTORY view in the SNOWFLAKE.ACCOUNT_USAGE schema tracks which columns were read in each query.

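-- Find every query that read columns of member_demographics in the last 30 days.
-- The table name is illustrative; substitute your own PHI tables.
-- ACCESS_HISTORY has no role_name column; join QUERY_HISTORY on query_id if you need the role.
SELECT
  ah.query_id,
  ah.user_name,
  ah.query_start_time,
  obj.value:objectName::STRING AS object_name,
  col.value:columnName::STRING AS column_name
FROM snowflake.account_usage.access_history AS ah,
  LATERAL FLATTEN(input => ah.base_objects_accessed) AS obj,
  LATERAL FLATTEN(input => obj.value:columns) AS col
WHERE ah.query_start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  AND obj.value:objectName::STRING ILIKE '%member_demographics%'
ORDER BY ah.query_start_time DESC;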

For full pipeline lineage (not just query lineage), pair Snowflake Access History with a lineage tool that reads your dbt DAG. dbt's manifest.json encodes the full transformation graph — every model, its SQL, its upstream dependencies, and its downstream consumers.
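
A minimal sketch of reading that graph from the manifest follows. It relies on the manifest's top-level `parent_map` key, which maps each node's unique ID to the IDs it depends on. The `SAMPLE` fragment and its project and model names are hand-written for illustration, not output from a real project.

```python
import json


def load_manifest(path="target/manifest.json"):
    """Read a dbt manifest from disk (dbt writes it under target/ by default)."""
    with open(path) as f:
        return json.load(f)


def upstream_of(manifest, unique_id):
    """Return every node or source that `unique_id` depends on, transitively."""
    parent_map = manifest["parent_map"]
    seen, stack = set(), [unique_id]
    while stack:
        for parent in parent_map.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# A hand-written fragment shaped like a manifest, for illustration only.
SAMPLE = {
    "parent_map": {
        "model.warehouse.member_hcc": ["model.warehouse.stg_claims"],
        "model.warehouse.stg_claims": ["source.warehouse.raw.claims_837"],
        "source.warehouse.raw.claims_837": [],
    }
}
```

Note that `parent_map` describes dependencies between models and sources, so this gives table-level lineage. Column-level lineage still needs SQL parsing or the platform's access history.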


Implementing Lineage in Databricks

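Databricks Unity Catalog provides column-level lineage natively as of Runtime 11.3+ (verify the minimum runtime for your workload type in the Databricks documentation). Lineage is captured for tables registered in Unity Catalog, and you can inspect the graph through the Unity Catalog UI or the REST API.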

For pipeline lineage in Spark jobs:

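# Use Delta Lake's history API to track changes to a PHI table.
# Assumes `spark` is the active SparkSession (predefined in Databricks notebooks).
from delta.tables import DeltaTable

delta_table = DeltaTable.forName(spark, "phi.member_demographics")
history = delta_table.history(50)  # last 50 operations

# userName records who made each change; userMetadata holds any commit note.
history.select(
    "version", "timestamp", "userName", "operation",
    "operationParameters", "userMetadata"
).show(truncate=False)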

Delta's time travel combined with Unity Catalog lineage gives you both "what changed" and "who read it" — the two questions every breach investigation starts with.


Lineage Architecture Pattern

A practical lineage architecture for a healthcare data warehouse has three components:

1. Capture layer: Automated scanners that extract lineage from SQL transformations (dbt manifest, Spark query plans, stored procedure ASTs), pipeline orchestrators (Airflow DAGs), and cloud platform APIs (Snowflake Access History, BigQuery Lineage).

2. Storage layer: A lineage graph database, or a purpose-built catalog such as Atlan, DataHub, or an OpenLineage-compatible store, that holds nodes (tables, columns, pipelines) and edges (upstream/downstream relationships).

3. Query layer: An interface — ideally API-accessible — that lets compliance, audit, and engineering teams trace data paths, identify PHI exposure scope, and export audit evidence.
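
The storage and query layers can be prototyped with very little machinery. The sketch below is one possible shape, not a recommendation over a real catalog: it keeps edges in an in-memory SQLite table and answers "what is downstream of this node?" with a recursive CTE. The capture layer is stubbed out with hard-coded, hypothetical edges.

```python
import sqlite3


def build_store():
    """Storage layer: an in-memory edge table seeded with hypothetical lineage."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (upstream TEXT NOT NULL, downstream TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO edges VALUES (?, ?)",
        [
            ("raw.claims_837", "stg.dx_normalized"),
            ("stg.dx_normalized", "mart.member_hcc"),
            ("mart.member_hcc", "submission.hcc_final"),
            ("raw.enrollment", "stg.member"),
        ],
    )
    return conn


def downstream(conn, node):
    """Query layer: every node reachable downstream of `node`, sorted by name."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(name) AS (
            SELECT downstream FROM edges WHERE upstream = ?
            UNION
            SELECT e.downstream FROM edges e JOIN reach r ON e.upstream = r.name
        )
        SELECT name FROM reach ORDER BY name
        """,
        (node,),
    ).fetchall()
    return [name for (name,) in rows]
```

Using `UNION` rather than `UNION ALL` deduplicates rows, so the recursion terminates even if a bad capture run introduces a cycle. Wrapping `downstream` in an HTTP endpoint would give compliance and audit teams the API-accessible interface described above.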


The Schema Drift Problem

Lineage graphs break when schemas change without notice. A dropped column, a renamed table, or a type change silently severs lineage edges. For PHI columns, this is particularly dangerous — a renamed PHI field can fall out of your masking policy if your catalog does not track the rename.

Gate schema changes with a diff tool. Use the Schema Diff tool to generate a precise diff of any DDL change before it deploys, so your governance team can see structural changes that affect lineage and masking coverage.
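
To show the kind of check such a gate performs, here is a small sketch that compares two schema snapshots and flags PHI columns affected by the change. It is not the Schema Diff tool itself. The `diff_schema` function, the snapshot format (`{column: type}`), and the column names are assumptions made for illustration.

```python
def diff_schema(before, after, phi_columns):
    """Compare two {column: type} snapshots and flag changes to PHI columns."""
    dropped = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    retyped = sorted(c for c in set(before) & set(after) if before[c] != after[c])
    return {
        "dropped": dropped,
        "added": added,
        "retyped": retyped,
        # A dropped PHI column may really be a rename; a reviewer should confirm
        # that the masking policy follows the column to its new name.
        "phi_at_risk": [c for c in dropped + retyped if c in phi_columns],
    }


# Hypothetical snapshots of one table before and after a DDL change.
BEFORE = {"member_id": "NUMBER", "ssn": "VARCHAR", "dob": "DATE"}
AFTER = {"member_id": "NUMBER", "social_security_no": "VARCHAR", "dob": "VARCHAR"}
RESULT = diff_schema(BEFORE, AFTER, phi_columns={"ssn", "dob"})
```

Here `ssn` shows up as dropped and `social_security_no` as added, which is how a rename looks to a name-based diff. A deployment gate could block the change until someone confirms the new column is covered by masking and the lineage edges are re-pointed.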


Key Takeaways

  • Column-level lineage is required for HIPAA breach scope determination. Table-level lineage alone is insufficient.
  • RADV audits require traceable transformation paths from raw claim through HCC submission. Lineage makes this a query, not a manual reconstruction.
  • Snowflake Access History and Databricks Unity Catalog provide native lineage capture. Pair them with the dbt manifest to complete the transformation graph.
  • Lineage graphs break when schemas change silently. Use Schema Diff to gate DDL changes and protect lineage coverage.
  • Start with your PHI tables. Full-warehouse lineage is aspirational; PHI-complete lineage is achievable and defensible.

