Introduction
The medallion architecture — bronze, silver, gold — has become the standard lakehouse pattern for good reason. For healthcare payers, it maps cleanly to the reality of claims data: messy, vendor-specific raw files arrive at the bronze layer; standardized, validated datasets sit in silver; [HEDIS](/terms/HEDIS) measures, [HCC](/terms/HCC) risk scores, and actuarial outputs live in gold. Delta Lake provides the ACID transactions, time travel, and schema enforcement that make this pattern production-grade for PHI-bearing healthcare data.
This guide walks through the architecture layer by layer, with specific attention to healthcare payer data — 837 and 835 files, member enrollment, provider registry, and the quality and risk models that consume them.
Why Delta Lake for Healthcare Data
Before getting into schema design, it is worth being explicit about why Delta Lake is the right storage layer for healthcare payers:
ACID transactions: Healthcare pipelines process 835 remittance files that arrive late, out of order, and sometimes in duplicate. ACID transactions ensure that a partial load of an 835 file does not corrupt the claims payment history. Either the transaction commits completely or it does not commit at all.
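In practice, duplicate and re-delivered 835 files are handled with an idempotent upsert (a Delta `MERGE` keyed on remittance identifiers), so a replayed file converges to the same state. A minimal pure-Python sketch of that key-based upsert semantics, assuming `check_number` plus `patient_control_no` identify a remittance line (an illustrative key choice, not an X12 mandate):

```python
# Illustrative only: simulates the idempotent upsert a Delta MERGE provides,
# using a dict keyed on remittance identifiers. Key choice is an assumption.
def upsert_remittances(table, incoming):
    """Apply a batch of 835 lines; re-delivered duplicates overwrite in place."""
    for row in incoming:
        key = (row["check_number"], row["patient_control_no"])
        table[key] = row  # last write wins; never a partial row
    return table

payments = {}
batch = [
    {"check_number": "CHK001", "patient_control_no": "PCN9", "paid_amount": "125.00"},
    {"check_number": "CHK001", "patient_control_no": "PCN9", "paid_amount": "125.00"},  # same file, re-delivered
]
upsert_remittances(payments, batch)
print(len(payments))  # the duplicate collapses to a single payment row
```

The same key would drive the `ON` clause of the real `MERGE INTO` statement against the silver payments table.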
Schema enforcement: Delta Lake enforces schema at write time. A malformed 837 file with an unexpected column type fails loudly at the bronze layer rather than silently corrupting the silver table.
Time travel: [RADV](/terms/RADV) audits require you to produce the data as it existed at a specific point in time. Delta's time travel (via version number or timestamp) makes this a query, not a reconstruction:
```python
# Reconstruct HCC submission data as of the RADV audit cutoff date
hcc_at_audit = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-06-30T23:59:59Z")
    .load("/mnt/delta/gold/hcc_submission")
)
```
Schema evolution: Healthcare data evolves. New 837 loop segments get added. CMS introduces new encounter data elements. Delta Lake handles schema evolution without rewriting historical data.
Bronze Layer: Raw Ingestion
The bronze layer stores raw data exactly as it arrives — no transformation, no standardization, no column renaming. The goal is fidelity and auditability.
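To make "no transformation" concrete, here is a sketch of a bronze ingestion helper: raw fields pass through untouched, and only lineage metadata is stamped on. The function name and shape are illustrative; the field names mirror the bronze DDL in this section.

```python
from datetime import datetime, timezone

# Sketch of bronze ingestion: raw EDI-derived fields pass through exactly
# as received; only lineage metadata (file name, ingest timestamp, partner
# id) is added. No casting, no renaming, no cleanup.
def to_bronze_row(raw_fields: dict, file_name: str, partner_id: str) -> dict:
    row = dict(raw_fields)  # raw values preserved byte-for-byte
    row["file_name"] = file_name
    row["ingested_at"] = datetime.now(timezone.utc).isoformat()
    row["source_partner_id"] = partner_id
    return row
```

Even an obviously malformed value (trailing whitespace, zero-padded amounts) lands in bronze as-is; correcting it here would destroy the audit trail.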
837 Professional Claims (Bronze)
```sql
CREATE TABLE bronze.raw_837p (
  file_name         STRING,     -- source file identifier
  transaction_set   STRING,     -- 837P
  interchange_id    STRING,     -- ISA control number
  billing_npi       STRING,     -- NM1*85 loop
  rendering_npi     STRING,     -- NM1*82 loop
  member_id_raw     STRING,     -- NM1*IL loop — unvalidated
  date_of_service   STRING,     -- CLM / SV1 — raw format, not DATE
  icd10_dx_1        STRING,     -- HI*ABK — raw [ICD-10](/terms/ICD-10) value
  icd10_dx_2        STRING,
  cpt_code          STRING,     -- SV1 procedure code
  units             STRING,
  charge_amount     STRING,     -- raw string, not DECIMAL
  raw_segment_data  STRING,     -- full EDI segment for audit trail
  ingested_at       TIMESTAMP,
  source_partner_id STRING,
  ingest_date       DATE GENERATED ALWAYS AS (CAST(ingested_at AS DATE))
)
USING DELTA
PARTITIONED BY (ingest_date)
LOCATION '/mnt/delta/bronze/claims_837p';
```
Note: all fields are STRING in the bronze layer. Type casting and validation happen in silver.
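A sketch of what that bronze-to-silver casting step looks like: raw strings become typed values, and any failure sets `valid_flag` and `validation_notes` rather than dropping the row. The `YYYYMMDD` date format is an assumption about the extract; a real pipeline would key the format off `source_partner_id`.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Sketch of silver casting: failures flag the row instead of discarding it,
# so validation_notes can be reviewed and the source partner remediated.
def cast_claim(raw: dict) -> dict:
    out, notes = {}, []
    try:  # assumed raw format: YYYYMMDD
        out["service_from_date"] = datetime.strptime(raw["date_of_service"], "%Y%m%d").date()
    except (ValueError, KeyError):
        out["service_from_date"] = None
        notes.append("bad date_of_service")
    try:
        out["charge_amount"] = Decimal(raw["charge_amount"])
    except (InvalidOperation, KeyError):
        out["charge_amount"] = None
        notes.append("bad charge_amount")
    out["valid_flag"] = not notes
    out["validation_notes"] = "; ".join(notes)
    return out
```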
835 Remittance (Bronze)
```sql
CREATE TABLE bronze.raw_835 (
  file_name          STRING,
  payer_id           STRING,     -- N1*PR loop
  payee_npi          STRING,     -- N1*PE loop
  check_number       STRING,     -- BPR segment
  payment_date       STRING,
  claim_id_raw       STRING,     -- CLP segment
  claim_status_code  STRING,     -- CLP02: 1=paid, 4=denied
  paid_amount        STRING,
  allowed_amount     STRING,
  patient_control_no STRING,
  adjustment_code    STRING,     -- CAS group code + reason code
  ingested_at        TIMESTAMP,
  source_payer_id    STRING,
  ingest_date        DATE GENERATED ALWAYS AS (CAST(ingested_at AS DATE))
)
USING DELTA
PARTITIONED BY (ingest_date)
LOCATION '/mnt/delta/bronze/claims_835';
```
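The `claim_status_code` comment above hints at the silver-layer mapping. CLP02 codes 1, 2, and 3 (processed as primary, secondary, tertiary), 4 (denied), and 22 (reversal of previous payment) are standard 835 values; the bucketing into the silver `adjudication_status` vocabulary below is an illustrative assumption about this model, not an X12 rule.

```python
# Sketch: map raw CLP02 claim status codes to the silver adjudication
# vocabulary. Unknown codes fall through to PENDED for manual review.
CLP02_STATUS = {
    "1": "PAID",      # processed as primary
    "2": "PAID",      # processed as secondary
    "3": "PAID",      # processed as tertiary
    "4": "DENIED",
    "22": "REVERSED", # reversal of previous payment
}

def adjudication_status(clp02: str) -> str:
    return CLP02_STATUS.get(clp02, "PENDED")
```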
Silver Layer: Standardized and Validated
Silver layer data has been type-cast, validated, deduplicated, and enriched with standard reference data. PHI masking is applied at this layer for non-privileged consumers.
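One common masking approach for non-privileged consumers is a keyed hash: deterministic, so masked tables still join on member, but not reversible without the key. A minimal sketch (in production the secret lives in a secret manager, never in code, and the approach must be validated against your HIPAA de-identification requirements):

```python
import hashlib
import hmac

# Sketch of deterministic PHI masking: an HMAC-SHA256 of the member id.
# Same input + same key -> same token, so joins survive masking.
def mask_member_id(member_id: str, secret: bytes) -> str:
    return hmac.new(secret, member_id.encode("utf-8"), hashlib.sha256).hexdigest()
```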
Claims Header (Silver)
```sql
CREATE TABLE silver.claim_header (
  claim_id               STRING NOT NULL,   -- surrogate key
  enterprise_member_id   STRING NOT NULL,   -- joined from MPI crosswalk
  billing_provider_npi   CHAR(10),
  rendering_provider_npi CHAR(10),
  service_from_date      DATE NOT NULL,
  service_to_date        DATE,
  icd10_primary_dx       STRING,            -- validated against ICD-10 reference
  icd10_dx_2             STRING,
  icd10_dx_3             STRING,
  primary_cpt_code       STRING,
  claim_type             STRING,            -- PROFESSIONAL, INSTITUTIONAL
  place_of_service_code  STRING,
  adjudication_status    STRING,            -- PAID, DENIED, PENDED
  paid_amount            DECIMAL(12,2),
  allowed_amount         DECIMAL(12,2),
  member_liability       DECIMAL(12,2),
  denial_reason_code     STRING,
  data_source            STRING,            -- CLEARINGHOUSE_A, DIRECT_B
  valid_flag             BOOLEAN,           -- passed silver validation rules
  validation_notes       STRING,
  service_year_month     STRING,            -- format: YYYY-MM
  silver_load_ts         TIMESTAMP
)
USING DELTA
PARTITIONED BY (service_year_month)
LOCATION '/mnt/delta/silver/claim_header'
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
```
Enable Change Data Feed on silver tables — it allows downstream gold models to process only changed records, dramatically reducing computation cost for incremental updates.
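To make the incremental pattern concrete, here is a pure-Python sketch of how a gold job folds Change Data Feed rows into its state. The `_change_type` values (`insert`, `update_preimage`, `update_postimage`, `delete`) are the actual Delta CDF metadata column values; the handler itself is illustrative. A real job would read the feed with `spark.read.format("delta").option("readChangeFeed", "true")` plus a starting version.

```python
# Sketch of a gold consumer applying CDF rows keyed by claim_id.
# Upserts apply inserts and post-update images; deletes remove the key.
def apply_cdf_rows(gold: dict, cdf_rows: list, key: str = "claim_id") -> dict:
    for row in cdf_rows:
        change = row["_change_type"]
        if change in ("insert", "update_postimage"):
            gold[row[key]] = row
        elif change == "delete":
            gold.pop(row[key], None)
        # "update_preimage" rows carry prior values; not needed for upserts
    return gold
```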
Gold Layer: Aggregated Business Products
Gold layer tables are the analytics-facing outputs. They are rebuilt from silver on a defined schedule and represent business-defined metrics, quality measures, and risk scores.
HCC Risk Score Summary (Gold)
```sql
CREATE TABLE gold.hcc_risk_summary (
  enterprise_member_id STRING NOT NULL,
  measurement_year     INT NOT NULL,
  raf_score            DECIMAL(8,4),
  hcc_count            INT,
  top_hcc_code         STRING,
  top_hcc_desc         STRING,
  hcc_list             ARRAY<STRING>,
  data_sources_used    ARRAY<STRING>,   -- CLAIMS, ENCOUNTER, LAB
  score_computed_at    TIMESTAMP,
  cms_model_version    STRING           -- v28, v24, etc.
)
USING DELTA
PARTITIONED BY (measurement_year)
LOCATION '/mnt/delta/gold/hcc_risk_summary';
```
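At its core, the `raf_score` column is an additive model: a demographic factor plus the coefficient for each HCC the member carries, under the model version recorded in `cms_model_version`. A sketch of that aggregation, with made-up placeholder coefficients (NOT actual CMS-HCC weights; a real pipeline loads the published coefficient tables for the model version in use, including interaction terms and hierarchies this sketch omits):

```python
# Placeholder coefficients for illustration only -- not CMS-HCC weights.
FAKE_HCC_COEFS = {"HCC18": 0.302, "HCC85": 0.331, "HCC96": 0.268}

def raf_score(demographic_factor: float, hcc_codes: list) -> float:
    """Additive RAF sketch: demographic factor + sum of HCC coefficients."""
    return round(demographic_factor + sum(FAKE_HCC_COEFS.get(h, 0.0) for h in hcc_codes), 4)
```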
Use the HCC Calculator to validate RAF score logic independently before running your full population through the gold layer pipeline — catching model version mismatches early.
HEDIS Measure Tracking (Gold)
```sql
CREATE TABLE gold.hedis_measure_tracking (
  enterprise_member_id STRING NOT NULL,
  measurement_year     INT NOT NULL,
  measure_id           STRING NOT NULL,  -- CBP, CDC, BCS, etc.
  in_denominator       BOOLEAN,
  in_numerator         BOOLEAN,
  exclusion_flag       BOOLEAN,
  exclusion_reason     STRING,
  gap_open_flag        BOOLEAN,          -- in denominator, not in numerator
  last_compliant_date  DATE,
  data_source          STRING,           -- CLAIMS, LAB, PHARMACY
  computed_at          TIMESTAMP
)
USING DELTA
PARTITIONED BY (measurement_year, measure_id)
LOCATION '/mnt/delta/gold/hedis_measure_tracking';
```
Delta Lake Operational Features
Optimize and Z-Order: Run OPTIMIZE with Z-ORDER on high-cardinality query columns to improve query performance on large claims tables:
```sql
OPTIMIZE silver.claim_header
ZORDER BY (enterprise_member_id, service_from_date);
```
Vacuum: Retain 30 days of history for RADV audit purposes (not the default 7-day Delta retention):
```sql
ALTER TABLE silver.claim_header
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days');
```
Key Takeaways
- The bronze layer stores raw EDI data as strings — no type casting, no standardization. Fidelity and auditability are the goals.
- Silver is where type casting, validation, and MPI joining happen. Enable Change Data Feed on silver tables for efficient incremental gold processing.
- Time travel in Delta Lake directly addresses RADV audit requirements. Extend file retention well beyond the 7-day default on HCC and claims tables; the configuration above uses 30 days.
- Z-ORDER on `enterprise_member_id` and `service_from_date` is the single most impactful query optimization for claims analytics workloads.
- Use the DDL Converter to translate Delta Lake DDL to Snowflake or BigQuery syntax when moving between platforms.
mdatool Team
The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.