Introduction
The medallion architecture — bronze, silver, gold — has become the standard lakehouse pattern for good reason. For healthcare payers, it maps cleanly to the reality of claims data: messy, vendor-specific raw files arrive at the bronze layer; standardized, validated datasets sit in silver; [HEDIS](/terms/HEDIS) measures, [HCC](/terms/HCC) risk scores, and actuarial outputs live in gold. Delta Lake provides the ACID transactions, time travel, and schema enforcement that make this pattern production-grade for PHI-bearing healthcare data.
📊Free Tool
Calculate RAF scores with our free HCC Calculator →
This guide walks through the architecture layer by layer, with specific attention to healthcare payer data — 837 and 835 files, member enrollment, provider registry, and the quality and risk models that consume them.
Why Delta Lake for Healthcare Data
Before getting into schema design, it is worth being explicit about why Delta Lake is the right storage layer for healthcare payers:
ACID transactions: Healthcare pipelines process 835 remittance files that arrive late, out of order, and sometimes duplicate. ACID transactions ensure that a partial load of a 835 file does not corrupt the claims payment history. Either the transaction commits completely or it does not commit at all.
Schema enforcement: Delta Lake enforces schema at write time. A malformed 837 file with an unexpected column type fails loudly at the bronze layer rather than silently corrupting the silver table.
Time travel: [RADV](/terms/RADV) audits require you to produce the data as it existed at a specific point in time. Delta's time travel (via version number or timestamp) makes this a query, not a reconstruction:
# Reconstruct HCC submission data as of the RADV audit cutoff date
hcc_at_audit = spark.read.format("delta") .option("timestampAsOf", "2024-06-30T23:59:59Z") .load("/mnt/delta/gold/hcc_submission")
Schema evolution: Healthcare data evolves. New 837 loop segments get added. CMS introduces new encounter data elements. Delta Lake handles schema evolution without rewriting historical data.
Bronze Layer: Raw Ingestion
The bronze layer stores raw data exactly as it arrives — no transformation, no standardization, no column renaming. The goal is fidelity and auditability.
837 Professional Claims (Bronze)
CREATE TABLE bronze.raw_837p (
file_name STRING, -- source file identifier
transaction_set STRING, -- 837P
interchange_id STRING, -- ISA control number
billing_npi STRING, -- NM1*85 loop
rendering_npi STRING, -- NM1*82 loop
member_id_raw STRING, -- NM1*IL loop — unvalidated
date_of_service STRING, -- CLM / SV1 — raw format, not DATE
icd10_dx_1 STRING, -- HI*ABK — raw [ICD-10](/terms/icd-10) value
icd10_dx_2 STRING,
cpt_code STRING, -- SV1 procedure code
units STRING,
charge_amount STRING, -- raw string, not DECIMAL
raw_segment_data STRING, -- full EDI segment for audit trail
ingested_at TIMESTAMP,
source_partner_id STRING
)
USING DELTA
PARTITIONED BY (DATE(ingested_at))
LOCATION '/mnt/delta/bronze/claims_837p';
Note: all fields are STRING in the bronze layer. Type casting and validation happen in silver.
835 Remittance (Bronze)
CREATE TABLE bronze.raw_835 (
file_name STRING,
payer_id STRING, -- N1*PR loop
payee_npi STRING, -- N1*PE loop
check_number STRING, -- BPR segment
payment_date STRING,
claim_id_raw STRING, -- CLP segment
claim_status_code STRING, -- CLP02: 1=paid, 4=denied
paid_amount STRING,
allowed_amount STRING,
patient_control_no STRING,
adjustment_code STRING, -- CAS group code + reason code
ingested_at TIMESTAMP,
source_payer_id STRING
)
USING DELTA
PARTITIONED BY (DATE(ingested_at))
LOCATION '/mnt/delta/bronze/claims_835';
Silver Layer: Standardized and Validated
Silver layer data has been type-cast, validated, deduplicated, and enriched with standard reference data. PHI masking is applied at this layer for non-privileged consumers.
Claims Header (Silver)
CREATE TABLE silver.claim_header (
claim_id STRING NOT NULL, -- surrogate key
enterprise_member_id STRING NOT NULL, -- joined from MPI crosswalk
billing_provider_npi CHAR(10),
rendering_provider_npi CHAR(10),
service_from_date DATE NOT NULL,
service_to_date DATE,
icd10_primary_dx STRING, -- validated against ICD-10 reference
icd10_dx_2 STRING,
icd10_dx_3 STRING,
primary_cpt_code STRING,
claim_type STRING, -- PROFESSIONAL, INSTITUTIONAL
place_of_service_code STRING,
adjudication_status STRING, -- PAID, DENIED, PENDED
paid_amount DECIMAL(12,2),
allowed_amount DECIMAL(12,2),
member_liability DECIMAL(12,2),
denial_reason_code STRING,
data_source STRING, -- CLEARINGHOUSE_A, DIRECT_B
valid_flag BOOLEAN, -- passed silver validation rules
validation_notes STRING,
silver_load_ts TIMESTAMP
)
USING DELTA
PARTITIONED BY (service_year_month STRING) -- format: YYYY-MM
LOCATION '/mnt/delta/silver/claim_header'
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
Enable Change Data Feed on silver tables — it allows downstream gold models to process only changed records, dramatically reducing computation cost for incremental updates.
Gold Layer: Aggregated Business Products
Gold layer tables are the analytics-facing outputs. They are rebuilt from silver on a defined schedule and represent business-defined metrics, quality measures, and risk scores.
HCC Risk Score Summary (Gold)
CREATE TABLE gold.hcc_risk_summary (
enterprise_member_id STRING NOT NULL,
measurement_year INT NOT NULL,
raf_score DECIMAL(8,4),
hcc_count INT,
top_hcc_code STRING,
top_hcc_desc STRING,
hcc_list ARRAY<STRING>,
data_sources_used ARRAY<STRING>, -- CLAIMS, ENCOUNTER, LAB
score_computed_at TIMESTAMP,
cms_model_version STRING -- v28, v24, etc.
)
USING DELTA
PARTITIONED BY (measurement_year)
LOCATION '/mnt/delta/gold/hcc_risk_summary';
Use the HCC Calculator to validate RAF score logic independently before running your full population through the gold layer pipeline — catching model version mismatches early.
HEDIS Measure Tracking (Gold)
CREATE TABLE gold.hedis_measure_tracking (
enterprise_member_id STRING NOT NULL,
measurement_year INT NOT NULL,
measure_id STRING NOT NULL, -- CBP, CDC, BCS, etc.
in_denominator BOOLEAN,
in_numerator BOOLEAN,
exclusion_flag BOOLEAN,
exclusion_reason STRING,
gap_open_flag BOOLEAN, -- in denominator, not in numerator
last_compliant_date DATE,
data_source STRING, -- CLAIMS, LAB, PHARMACY
computed_at TIMESTAMP
)
USING DELTA
PARTITIONED BY (measurement_year, measure_id)
LOCATION '/mnt/delta/gold/hedis_measure_tracking';
Delta Lake Operational Features
Optimize and Z-Order: Run OPTIMIZE with Z-ORDER on high-cardinality query columns to improve query performance on large claims tables:
OPTIMIZE silver.claim_header
ZORDER BY (enterprise_member_id, service_from_date);
Vacuum: Retain 30 days of history for RADV audit purposes (not the default 7-day Delta retention):
ALTER TABLE silver.claim_header
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days');
Key Takeaways
- The bronze layer stores raw EDI data as strings — no type casting, no standardization. Fidelity and auditability are the goals.
- Silver is where type casting, validation, and MPI joining happen. Enable Change Data Feed on silver tables for efficient incremental gold processing.
- Time travel in Delta Lake directly addresses RADV audit requirements. Preserve at least 90 days of history on HCC and claims tables.
- Z-ORDER on
enterprise_member_idandservice_from_dateis the single most impactful query optimization for claims analytics workloads. - Use the DDL Converter to translate Delta Lake DDL to Snowflake or BigQuery syntax when moving between platforms.
mdatool Team
The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.
Related Guides
Key Terms in This Article
More in Data Architecture
Azure Synapse vs Snowflake for Healthcare Data Architecture: Which Platform Fits Your Team?
Azure Synapse Analytics and Snowflake both promise a unified cloud data platform — but they make different architectural bets that matter enormously in healthcare. This guide compares them across HIPAA compliance, FHIR integration, PHI governance, cost model, and team fit, with concrete SQL examples and a decision framework built for healthcare data engineers.
Read moreOracle vs Databricks for Healthcare Data Architecture: Which Platform Should You Choose?
Oracle brings four decades of enterprise database maturity, deep EHR integration, and a proven HIPAA compliance story. Databricks brings a unified lakehouse, native AI/ML pipelines, and the ability to handle FHIR, HL7, and unstructured clinical data at scale. This guide breaks down which platform wins in each healthcare scenario — and when you need both.
Read moreTelehealth Data Architecture: Complete Guide for Data Engineers (2026)
A complete guide to building a telehealth data architecture — core schema design, HL7 and FHIR integration, HIPAA compliance, HCC risk adjustment, and the common mistakes that cause claim denials.
Read moreReady to improve your data architecture?
Free tools for DDL conversion, SQL analysis, naming standards, and more.
Get weekly healthcare data engineering tips
Practical guides on data modeling, SQL standards, and healthcare domain conventions — straight to your inbox.
No spam. Unsubscribe any time.