Data Architecture

Healthcare Data Lakehouse Architecture: Building on Delta Lake for Payers

The data lakehouse pattern — combining the scalability of a data lake with the ACID guarantees of a warehouse — is a natural fit for payer data. Here is how to build it on Delta Lake, layer by layer.

mdatool Team·April 21, 2026·9 min read
data lakehouse, Delta Lake, healthcare data architecture, payer, bronze silver gold, Databricks

Introduction

The medallion architecture — bronze, silver, gold — has become the standard lakehouse pattern for good reason. For healthcare payers, it maps cleanly to the reality of claims data: messy, vendor-specific raw files arrive at the bronze layer; standardized, validated datasets sit in silver; [HEDIS](/terms/HEDIS) measures, [HCC](/terms/HCC) risk scores, and actuarial outputs live in gold. Delta Lake provides the ACID transactions, time travel, and schema enforcement that make this pattern production-grade for PHI-bearing healthcare data.

This guide walks through the architecture layer by layer, with specific attention to healthcare payer data — 837 and 835 files, member enrollment, provider registry, and the quality and risk models that consume them.


Why Delta Lake for Healthcare Data

Before getting into schema design, it is worth being explicit about why Delta Lake is the right storage layer for healthcare payers:

ACID transactions: Healthcare pipelines process 835 remittance files that arrive late, out of order, and sometimes in duplicate. ACID transactions ensure that a partial load of an 835 file does not corrupt the claims payment history: either the transaction commits completely or it does not commit at all.

Schema enforcement: Delta Lake enforces schema at write time. A malformed 837 file with an unexpected column type fails loudly at the bronze layer rather than silently corrupting the silver table.
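Outside Spark, the contract is easy to illustrate. Below is a minimal pure-Python sketch (a hypothetical helper, not the Delta API) of rejecting a write whose column types do not match the declared bronze schema:

```python
from datetime import datetime

# Declared bronze schema: every claim field is a string; only pipeline
# metadata carries a real type. (Illustrative subset of the 837P columns.)
BRONZE_837P_SCHEMA = {
    "file_name": str,
    "member_id_raw": str,
    "charge_amount": str,
    "ingested_at": datetime,
}

def enforce_schema(record: dict, schema: dict) -> dict:
    """Fail loudly on any column whose type does not match the declaration."""
    unexpected = set(record) - set(schema)
    if unexpected:
        raise ValueError(f"unexpected columns: {sorted(unexpected)}")
    for col, expected_type in schema.items():
        value = record.get(col)
        if value is not None and not isinstance(value, expected_type):
            raise TypeError(f"{col}: expected {expected_type.__name__}, "
                            f"got {type(value).__name__}")
    return record

# A well-formed record passes; a numeric charge_amount would raise TypeError
good = {"file_name": "837p_001.edi", "member_id_raw": "M123",
        "charge_amount": "145.00", "ingested_at": datetime.now()}
assert enforce_schema(good, BRONZE_837P_SCHEMA) is good
```

The point is the failure mode: a malformed file stops at the bronze boundary instead of propagating bad types downstream.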

Time travel: [RADV](/terms/RADV) audits require you to produce the data as it existed at a specific point in time. Delta's time travel (via version number or timestamp) makes this a query, not a reconstruction:

# Reconstruct HCC submission data as of the RADV audit cutoff date
hcc_at_audit = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-06-30T23:59:59Z")
    .load("/mnt/delta/gold/hcc_submission")
)

Schema evolution: Healthcare data evolves. New 837 loop segments get added. CMS introduces new encounter data elements. Delta Lake handles schema evolution without rewriting historical data.


Bronze Layer: Raw Ingestion

The bronze layer stores raw data exactly as it arrives — no transformation, no standardization, no column renaming. The goal is fidelity and auditability.

837 Professional Claims (Bronze)

CREATE TABLE bronze.raw_837p (
  file_name          STRING,           -- source file identifier
  transaction_set    STRING,           -- 837P
  interchange_id     STRING,           -- ISA control number
  billing_npi        STRING,           -- NM1*85 loop
  rendering_npi      STRING,           -- NM1*82 loop
  member_id_raw      STRING,           -- NM1*IL loop — unvalidated
  date_of_service    STRING,           -- CLM / SV1 — raw format, not DATE
  icd10_dx_1         STRING,           -- HI*ABK — raw [ICD-10](/terms/ICD-10) value
  icd10_dx_2         STRING,
  cpt_code           STRING,           -- SV1 procedure code
  units              STRING,
  charge_amount      STRING,           -- raw string, not DECIMAL
  raw_segment_data   STRING,           -- full EDI segment for audit trail
  ingested_at        TIMESTAMP,
  source_partner_id  STRING,
  ingest_date        DATE GENERATED ALWAYS AS (CAST(ingested_at AS DATE))
)
USING DELTA
PARTITIONED BY (ingest_date)
LOCATION '/mnt/delta/bronze/claims_837p';

Note: all fields are STRING in the bronze layer. Type casting and validation happen in silver.
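The promotion logic that turns those strings into typed silver columns can be sketched in pure Python. In production this would run as a Spark transformation; the function name and the exact validation rules here are illustrative:

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

def promote_837p_row(raw: dict) -> dict:
    """Cast a bronze 837P row to silver types and flag validation failures.

    Pure-Python sketch of a bronze-to-silver promotion step; column names
    follow the DDL above."""
    notes = []
    out = {}

    # CLM/SV1 dates arrive as CCYYMMDD strings in the raw EDI
    try:
        out["service_from_date"] = datetime.strptime(
            raw.get("date_of_service", ""), "%Y%m%d").date()
    except ValueError:
        out["service_from_date"] = None
        notes.append("unparseable date_of_service")

    # Monetary fields are raw strings in bronze; silver uses DECIMAL(12,2)
    try:
        out["charge_amount"] = Decimal(raw.get("charge_amount", "")).quantize(
            Decimal("0.01"))
    except InvalidOperation:
        out["charge_amount"] = None
        notes.append("non-numeric charge_amount")

    # Failed rows are kept but flagged, mirroring valid_flag / validation_notes
    out["valid_flag"] = not notes
    out["validation_notes"] = "; ".join(notes) or None
    return out
```

Keeping invalid rows flagged rather than dropped preserves the audit trail: the row count in silver reconciles against bronze, and the validation_notes column explains every exclusion.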

835 Remittance (Bronze)

CREATE TABLE bronze.raw_835 (
  file_name          STRING,
  payer_id           STRING,           -- N1*PR loop
  payee_npi          STRING,           -- N1*PE loop
  check_number       STRING,           -- BPR segment
  payment_date       STRING,
  claim_id_raw       STRING,           -- CLP segment
  claim_status_code  STRING,           -- CLP02: 1=paid, 4=denied
  paid_amount        STRING,
  allowed_amount     STRING,
  patient_control_no STRING,
  adjustment_code    STRING,           -- CAS group code + reason code
  ingested_at        TIMESTAMP,
  source_payer_id    STRING,
  ingest_date        DATE GENERATED ALWAYS AS (CAST(ingested_at AS DATE))
)
USING DELTA
PARTITIONED BY (ingest_date)
LOCATION '/mnt/delta/bronze/claims_835';
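On promotion to silver, the coded fields above need decoding. A minimal sketch, assuming the bronze adjustment_code field carries the CAS group and reason codes joined with `*` as in the raw segment; the status map covers only the two CLP02 values noted in the DDL comments, while the full X12 code set is larger:

```python
# Partial CLP02 claim-status map (only the values noted in the bronze DDL
# comments; the full X12 CLP02 code set has more entries)
CLP02_STATUS = {"1": "PAID", "4": "DENIED"}

def decode_835_claim(raw: dict) -> dict:
    """Decode a bronze 835 row's coded fields for the silver layer (sketch)."""
    status = CLP02_STATUS.get(raw.get("claim_status_code"), "UNKNOWN")
    # Assumed layout: "<group>*<reason>", e.g. "CO*45" for a CO-45 adjustment
    group, _, reason = (raw.get("adjustment_code") or "").partition("*")
    return {
        "adjudication_status": status,
        "adjustment_group": group or None,
        "adjustment_reason": reason or None,
    }
```

Mapping to an explicit UNKNOWN rather than failing keeps late-arriving files with unrecognized status codes visible in silver for triage.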

Silver Layer: Standardized and Validated

Silver layer data has been type-cast, validated, deduplicated, and enriched with standard reference data. PHI masking is applied at this layer for non-privileged consumers.

Claims Header (Silver)

CREATE TABLE silver.claim_header (
  claim_id               STRING      NOT NULL,    -- surrogate key
  enterprise_member_id   STRING      NOT NULL,    -- joined from MPI crosswalk
  billing_provider_npi   CHAR(10),
  rendering_provider_npi CHAR(10),
  service_from_date      DATE        NOT NULL,
  service_to_date        DATE,
  icd10_primary_dx       STRING,                  -- validated against ICD-10 reference
  icd10_dx_2             STRING,
  icd10_dx_3             STRING,
  primary_cpt_code       STRING,
  claim_type             STRING,                  -- PROFESSIONAL, INSTITUTIONAL
  place_of_service_code  STRING,
  adjudication_status    STRING,                  -- PAID, DENIED, PENDED
  paid_amount            DECIMAL(12,2),
  allowed_amount         DECIMAL(12,2),
  member_liability       DECIMAL(12,2),
  denial_reason_code     STRING,
  data_source            STRING,                  -- CLEARINGHOUSE_A, DIRECT_B
  valid_flag             BOOLEAN,                 -- passed silver validation rules
  validation_notes       STRING,
  silver_load_ts         TIMESTAMP,
  service_year_month     STRING GENERATED ALWAYS AS (DATE_FORMAT(service_from_date, 'yyyy-MM'))
)
USING DELTA
PARTITIONED BY (service_year_month)
LOCATION '/mnt/delta/silver/claim_header'
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

Enable Change Data Feed on silver tables — it allows downstream gold models to process only changed records, dramatically reducing computation cost for incremental updates.
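The consuming side of that pattern can be sketched without Spark. Assuming each change row carries Delta's `_change_type` metadata column, a hypothetical helper that picks out the members whose gold rows need recomputing:

```python
def members_to_recompute(change_rows: list[dict]) -> set[str]:
    """Given silver change-feed rows (each carrying Delta's _change_type
    metadata column), return the members whose gold rows need recomputing.
    Pure-Python sketch of the consuming side of the CDF pattern."""
    relevant = {"insert", "update_postimage", "delete"}
    return {
        row["enterprise_member_id"]
        for row in change_rows
        if row["_change_type"] in relevant   # skip update_preimage rows
    }
```

Filtering out update_preimage rows matters: Delta emits both the before and after image of an update, and only the post-image identifies the new state to recompute against.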


Gold Layer: Aggregated Business Products

Gold layer tables are the analytics-facing outputs. They are rebuilt from silver on a defined schedule and represent business-defined metrics, quality measures, and risk scores.

HCC Risk Score Summary (Gold)

CREATE TABLE gold.hcc_risk_summary (
  enterprise_member_id   STRING      NOT NULL,
  measurement_year       INT         NOT NULL,
  raf_score              DECIMAL(8,4),
  hcc_count              INT,
  top_hcc_code           STRING,
  top_hcc_desc           STRING,
  hcc_list               ARRAY<STRING>,
  data_sources_used      ARRAY<STRING>,  -- CLAIMS, ENCOUNTER, LAB
  score_computed_at      TIMESTAMP,
  cms_model_version      STRING          -- v28, v24, etc.
)
USING DELTA
PARTITIONED BY (measurement_year)
LOCATION '/mnt/delta/gold/hcc_risk_summary';

Use the HCC Calculator to validate RAF score logic independently before running your full population through the gold layer pipeline — catching model version mismatches early.
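The rollup from a member's HCC list to the summary columns above can be sketched as follows; the coefficients are illustrative placeholders, not real CMS model values, and real RAF scoring also applies demographic and interaction terms:

```python
# Illustrative HCC coefficients only -- NOT real CMS model values
DEMO_COEFFICIENTS = {"HCC18": 0.302, "HCC85": 0.331, "HCC111": 0.335}

def summarize_member_hccs(hccs: list[str],
                          coefficients: dict[str, float]) -> dict:
    """Roll a member's HCC list up into the gold summary columns (sketch).

    Deduplicates repeated HCCs, sums the coefficients, and picks the
    highest-weighted code as top_hcc_code."""
    scored = {h: coefficients[h] for h in set(hccs) if h in coefficients}
    top = max(scored, key=scored.get) if scored else None
    return {
        "raf_score": round(sum(scored.values()), 4),
        "hcc_count": len(scored),
        "top_hcc_code": top,
        "hcc_list": sorted(scored),
    }
```

Deduplicating before summing reflects how HCCs work: a condition coded on ten claims contributes its coefficient once per measurement year, not ten times.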

HEDIS Measure Tracking (Gold)

CREATE TABLE gold.hedis_measure_tracking (
  enterprise_member_id   STRING      NOT NULL,
  measurement_year       INT         NOT NULL,
  measure_id             STRING      NOT NULL,   -- CBP, CDC, BCS, etc.
  in_denominator         BOOLEAN,
  in_numerator           BOOLEAN,
  exclusion_flag         BOOLEAN,
  exclusion_reason       STRING,
  gap_open_flag          BOOLEAN,                -- in denominator, not in numerator
  last_compliant_date    DATE,
  data_source            STRING,                 -- CLAIMS, LAB, PHARMACY
  computed_at            TIMESTAMP
)
USING DELTA
PARTITIONED BY (measurement_year, measure_id)
LOCATION '/mnt/delta/gold/hedis_measure_tracking';
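The gap_open_flag derivation noted in the column comment is simple enough to state as code (a sketch of the rule, not NCQA's full specification):

```python
def gap_open(in_denominator: bool, in_numerator: bool,
             exclusion_flag: bool) -> bool:
    """gap_open_flag: the member counts toward the measure but is not yet
    compliant and has no valid exclusion."""
    return in_denominator and not in_numerator and not exclusion_flag
```

Making the rule explicit (rather than leaving it implicit in a join) keeps care-gap outreach lists consistent across every report that reads this table.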

Delta Lake Operational Features

Optimize and Z-Order: Run OPTIMIZE with Z-ORDER on high-cardinality query columns to improve query performance on large claims tables:

OPTIMIZE silver.claim_header
ZORDER BY (enterprise_member_id, service_from_date);

Vacuum and retention: VACUUM permanently removes data files older than the retention window, which caps how far back time travel can reach. For RADV audit purposes, extend both retention properties to at least 90 days (the defaults are 7 days for deleted files and 30 days for the transaction log) before scheduling VACUUM:

ALTER TABLE silver.claim_header
SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 90 days',
  'delta.logRetentionDuration'         = 'interval 90 days'
);

Key Takeaways

  • The bronze layer stores raw EDI data as strings — no type casting, no standardization. Fidelity and auditability are the goals.
  • Silver is where type casting, validation, and MPI joining happen. Enable Change Data Feed on silver tables for efficient incremental gold processing.
  • Time travel in Delta Lake directly addresses RADV audit requirements. Preserve at least 90 days of history on HCC and claims tables.
  • Z-ORDER on enterprise_member_id and service_from_date is the single most impactful query optimization for claims analytics workloads.
  • Use the DDL Converter to translate Delta Lake DDL to Snowflake or BigQuery syntax when moving between platforms.

mdatool Team

The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.
