Introduction
A healthcare data warehouse schema is not a generic engineering problem. It carries regulatory baggage — HIPAA's 18 PHI identifiers, CMS reporting requirements, ISO-11179 column naming standards, and platform-specific DDL quirks across Snowflake, BigQuery, and Databricks. A senior data engineer who knows all of that can take three to four weeks designing a production-ready Medicare Advantage claims schema from scratch. Most engineers take longer.
While ERwin requires a complex setup for schema generation, you can generate clean DDL in seconds using our free converter.
Convert your first 5 DDLs — No Credit Card RequiredAI data modeling compresses that timeline to seconds. Not because it skips the compliance details — but because it knows them. A purpose-built healthcare modeling tool understands that a clm_paid_amt column should be NUMBER(12,2), that member identifiers need HIPAA-compliant handling, and that a CMS-mandated reporting table has a different grain than an operational claims table.
This guide walks through how AI data modeling works for healthcare, shows real generated schemas, and helps you choose the right architecture for your use case.
What Makes Healthcare Data Modeling Different
Healthcare schemas fail in ways that general-purpose data models do not. Here are the four constraints that make healthcare modeling uniquely difficult:
HIPAA Requirements Affect Schema Design
HIPAA does not just govern access — it shapes schema design. The 18 PHI identifiers (member name, DOB, ZIP, MRN, NPI, and 13 others) must be identifiable in your schema so you can apply column-level masking, row-level security, and audit logging. A schema where mbr_dob and mbr_zip_cd are hidden inside a VARIANT column is a HIPAA audit waiting to happen.
Production healthcare schemas label PHI columns explicitly, implement column-level encryption for high-sensitivity identifiers, and structure tables so that de-identification can be applied without breaking downstream joins.
CMS Reporting Requirements Drive Table Structure
Medicare Advantage plans submit monthly RAPS/EDPS encounter data to CMS. That submission requires a specific data grain: one row per diagnosis per encounter per member per period. If your claims schema was designed for internal analytics (one row per claim line), you need a transformation layer to produce CMS output — or you design the underlying schema to support both grains from day one.
The same tension exists for HEDIS, Stars measure reporting, and 837 transaction generation. A schema that works for internal reporting frequently breaks at the regulatory submission boundary.
ISO-11179 Column Naming Standards
ISO-11179 defines how healthcare column names should be structured: object class + property + representation. mbr_dob_dt, clm_paid_amt, prvdr_npi_nbr. The standard matters not because any regulator checks your DDL, but because consistent naming is how teams avoid the column proliferation problem — where claim_date, clm_dt, service_date, and svc_dt all exist in the same warehouse and mean different things.
See the claims glossary and clinical glossary for the full standard abbreviation sets used in production healthcare schemas.
Real Example: Claims Table Column Names
A standards-compliant claims header row uses names like these:
clm_id VARCHAR(50) -- claim identifier
mbr_id VARCHAR(50) -- member identifier (PHI)
clm_svc_dt DATE -- service date
clm_paid_amt NUMBER(12,2) -- paid amount
clm_billed_amt NUMBER(12,2) -- billed amount
prvdr_npi VARCHAR(10) -- rendering provider NPI
clm_sts_cd VARCHAR(20) -- claim status code
icd_diag_cd_1 VARCHAR(7) -- primary diagnosis ([ICD-10](/terms/icd-10)-CM)
Non-standard names — claim_amount, diagnosis, status — create ambiguity at scale and fail ISO-11179 validation.
How AI Data Modeling Works for Healthcare
The AI data modeling tool takes your inputs — domain, architecture, platform, and use case — and generates DDL that is already compliant with the constraints above.
Domain-specific context: The tool knows that a pharmacy claims table needs NDC code columns (ndc_cd VARCHAR(11)), that a clinical encounters table needs place of service codes, and that a member eligibility table needs effective date ranges with explicit gap handling.
Architecture patterns: You choose Star Schema, Data Vault 2.0, or Lakehouse (Delta Lake), and the tool generates the correct structural pattern — fact and dimension tables for Star Schema, hubs, links, and satellites for Data Vault, and medallion-layer Bronze/Silver/Gold tables for Lakehouse.
Platform-specific DDL: Snowflake, BigQuery, and Databricks each have DDL syntax differences that matter in production. Snowflake uses TIMESTAMP_NTZ for timezone-free timestamps and NUMBER(12,2) for currency. BigQuery uses NUMERIC and DATETIME. Databricks uses DECIMAL and supports COMMENT clauses in CREATE TABLE. The generated DDL is immediately executable on your target platform without manual correction.
Free Tool
Convert this DDL to Snowflake, BigQuery, or PostgreSQL instantly →
Input → Output example: Entering "Medicare Advantage claims fact table, Snowflake, Star Schema" produces a complete, runnable Snowflake DDL block — no hallucinated column types, no missing foreign keys, no non-standard names.
You can convert existing schemas between platforms using the DDL Converter if you are migrating from one warehouse to another.
Star Schema Example — Medicare Advantage Claims
Star Schema remains the standard for traditional healthcare data warehouses and reporting environments. Here is a generated Star Schema for Medicare Advantage claims — the kind you would present in a payer data warehouse design review.
-- ── Fact table: one row per claim header ─────────────────────────────
CREATE TABLE fact_claim_header (
clm_sk NUMBER NOT NULL, -- surrogate key
mbr_sk NUMBER NOT NULL, -- FK → dim_member
prvdr_sk NUMBER NOT NULL, -- FK → dim_provider
dt_sk NUMBER NOT NULL, -- FK → dim_date (service date)
clm_id VARCHAR(50) NOT NULL,
clm_svc_dt DATE NOT NULL,
clm_paid_dt DATE,
clm_billed_amt NUMBER(12,2),
clm_allowed_amt NUMBER(12,2),
clm_paid_amt NUMBER(12,2),
clm_sts_cd VARCHAR(20),
icd_diag_cd_1 VARCHAR(7),
icd_diag_cd_2 VARCHAR(7),
icd_diag_cd_3 VARCHAR(7),
drg_cd VARCHAR(10),
clm_type_cd VARCHAR(20),
load_dt TIMESTAMP_NTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT pk_fact_claim_header PRIMARY KEY (clm_sk)
);
-- ── Dimension: member ─────────────────────────────────────────────────
CREATE TABLE dim_member (
mbr_sk NUMBER NOT NULL,
mbr_id VARCHAR(50) NOT NULL, -- PHI: mask in non-prod
mbr_last_nm VARCHAR(100), -- PHI
mbr_first_nm VARCHAR(100), -- PHI
mbr_dob_dt DATE, -- PHI
mbr_gndr_cd CHAR(1),
mbr_zip_cd VARCHAR(10), -- PHI
mbr_cnty_cd VARCHAR(10),
mbr_state_cd CHAR(2),
plan_id VARCHAR(20),
eff_start_dt DATE NOT NULL,
eff_end_dt DATE,
curr_row_ind BOOLEAN NOT NULL DEFAULT TRUE, -- SCD Type 2
CONSTRAINT pk_dim_member PRIMARY KEY (mbr_sk)
);
-- ── Dimension: provider ───────────────────────────────────────────────
CREATE TABLE dim_provider (
prvdr_sk NUMBER NOT NULL,
prvdr_npi VARCHAR(10),
prvdr_last_nm VARCHAR(100),
prvdr_first_nm VARCHAR(100),
prvdr_org_nm VARCHAR(255),
prvdr_type_cd VARCHAR(20),
prvdr_spclty_cd VARCHAR(10),
prvdr_state_cd CHAR(2),
prvdr_zip_cd VARCHAR(10),
eff_start_dt DATE NOT NULL,
eff_end_dt DATE,
curr_row_ind BOOLEAN NOT NULL DEFAULT TRUE,
CONSTRAINT pk_dim_provider PRIMARY KEY (prvdr_sk)
);
-- ── Dimension: date ───────────────────────────────────────────────────
CREATE TABLE dim_date (
dt_sk NUMBER NOT NULL,
full_dt DATE NOT NULL,
yr_nbr SMALLINT,
qtr_nbr SMALLINT,
mo_nbr SMALLINT,
wk_nbr SMALLINT,
day_of_wk_nbr SMALLINT,
CONSTRAINT pk_dim_date PRIMARY KEY (dt_sk)
);
Tired of legacy complexity and high pricing?
mdatool offers instant DDL conversion, HL7 support, and AI-driven data modeling for a fraction of the cost of ER/Studio or ERwin.
Try mdatool for FreeDesign decisions worth noting: dim_member uses SCD Type 2 (eff_start_dt, eff_end_dt, curr_row_ind) because member demographics change — ZIP codes update, plan assignments change — and payers need to reconstruct who held which coverage at the time a claim was paid. fact_claim_header stores three diagnosis codes as flat columns rather than a separate claim-diagnosis table because the vast majority of reporting queries need the primary diagnosis in the same row scan, and a separate diagnosis table adds a join at no analytical benefit for most use cases.
For a deeper look at the claims data model, see Healthcare Claims Data Model: Complete SQL Schema.
Data Vault 2.0 for HIPAA Audit Trails
When your primary requirements are HIPAA compliance, full audit history, and source-system traceability — rather than query performance — Data Vault 2.0 is the better architectural choice. Its immutable loading pattern and hub-link-satellite structure make it naturally suited to HIPAA audit trail requirements.
-- ── Hub: member master record ─────────────────────────────────────────
CREATE TABLE hub_member (
mbr_hash_key CHAR(32) NOT NULL, -- MD5 of mbr_id
mbr_id VARCHAR(50) NOT NULL,
load_dt TIMESTAMP_NTZ NOT NULL,
rec_src VARCHAR(50) NOT NULL, -- source system
CONSTRAINT pk_hub_member PRIMARY KEY (mbr_hash_key)
);
-- ── Hub: claim ────────────────────────────────────────────────────────
CREATE TABLE hub_claim (
clm_hash_key CHAR(32) NOT NULL,
clm_id VARCHAR(50) NOT NULL,
load_dt TIMESTAMP_NTZ NOT NULL,
rec_src VARCHAR(50) NOT NULL,
CONSTRAINT pk_hub_claim PRIMARY KEY (clm_hash_key)
);
-- ── Link: member ↔ claim relationship ────────────────────────────────
CREATE TABLE lnk_member_claim (
lnk_hash_key CHAR(32) NOT NULL,
mbr_hash_key CHAR(32) NOT NULL,
clm_hash_key CHAR(32) NOT NULL,
load_dt TIMESTAMP_NTZ NOT NULL,
rec_src VARCHAR(50) NOT NULL,
CONSTRAINT pk_lnk_member_claim PRIMARY KEY (lnk_hash_key)
);
-- ── Satellite: member demographics (full history, append-only) ────────
CREATE TABLE sat_member_demographics (
mbr_hash_key CHAR(32) NOT NULL,
load_dt TIMESTAMP_NTZ NOT NULL,
load_end_dt TIMESTAMP_NTZ, -- NULL = current
hash_diff CHAR(32) NOT NULL, -- detects row changes
mbr_last_nm VARCHAR(100), -- PHI
mbr_first_nm VARCHAR(100), -- PHI
mbr_dob_dt DATE, -- PHI
mbr_gndr_cd CHAR(1),
mbr_zip_cd VARCHAR(10), -- PHI
rec_src VARCHAR(50) NOT NULL,
CONSTRAINT pk_sat_member_demo PRIMARY KEY (mbr_hash_key, load_dt)
);
Why Data Vault for healthcare compliance: Every row in every satellite is immutable — no UPDATEs, only INSERTs. load_dt and load_end_dt provide a complete timeline of every demographic change for every member. rec_src tags where the data originated, which is critical during HIPAA breach investigations when you need to prove exactly when a PHI value entered your system and from which source feed.
Hash keys (mbr_hash_key) are generated from business keys using MD5 or SHA-1. They make Data Vault load pipelines deterministic and idempotent — you can re-run any load without creating duplicate records.
For risk adjustment context on how member data feeds into HCC scoring, see HCC Risk Adjustment Data Model.
📊Free Tool
Calculate RAF scores with our free HCC Calculator →
Choosing the Right Architecture
No single architecture is right for every healthcare data use case. Use this decision table:
| Use Case | Recommended Architecture | Platform |
|---|---|---|
| Traditional payer DW, BI reporting | Star Schema | Snowflake, Redshift |
| HIPAA audit trail, source traceability | Data Vault 2.0 | Snowflake, Synapse |
| Streaming claims, real-time eligibility | Lakehouse (Bronze/Silver/Gold) | Databricks Delta Lake |
| FHIR-native clinical data | Document model | MongoDB, Cosmos DB |
| Medicare Advantage risk adjustment | Star Schema + HCC mart | Snowflake, BigQuery |
| Multi-source member MDM | Data Vault 2.0 | Snowflake |
Star Schema is the right default for health plans building their first data warehouse or migrating from a legacy SAS/flat-file environment. Query performance is predictable, BI tools (Tableau, Power BI, Looker) connect cleanly to fact-dimension structures, and the model is straightforward to explain in a data governance review.
Data Vault 2.0 earns its complexity when you have strict HIPAA audit requirements, multiple conflicting source systems feeding the same entities, or a history of data quality investigations that turned into compliance findings because you could not reconstruct where bad data came from.
Lakehouse on Databricks is the right architecture when you need streaming claims ingestion alongside batch batch processing, or when your analytics team is already running machine learning pipelines on Delta Lake and you want a single platform rather than two separate systems.
The Healthcare Analytics Guide covers how these architectures connect to downstream analytics workflows.
Frequently Asked Questions
What is AI healthcare data modeling?
AI healthcare data modeling is the use of AI to generate production-ready database schemas for healthcare data — claims, clinical, pharmacy, member, and provider domains. Unlike general-purpose schema generators, a purpose-built healthcare tool applies domain-specific rules: ISO-11179 column naming standards, HIPAA PHI labeling, correct SQL data types for healthcare fields (DATE for service dates, NUMBER(12,2) for financial amounts, VARCHAR(10) for NPI numbers), and platform-specific DDL syntax for Snowflake, BigQuery, and Databricks.
Which database platforms are supported?
The AI data modeling tool generates DDL for Snowflake, BigQuery, Databricks (Delta Lake), PostgreSQL, and SQL Server. Each output uses platform-correct syntax — Snowflake's TIMESTAMP_NTZ, BigQuery's NUMERIC, Databricks COMMENT clauses — so the DDL is immediately executable without manual corrections.
Does it follow HIPAA naming standards?
Yes. Generated schemas follow ISO-11179 naming conventions and explicitly label PHI columns (member name, DOB, ZIP, MRN, NPI) with inline comments in the DDL output. The tool does not apply encryption itself — that is a Snowflake Dynamic Data Masking or BigQuery policy tag decision — but it marks which columns require those controls so your security team can apply them consistently.
What architectures can it generate?
The tool generates four architecture patterns: Star Schema (fact + dimension tables with SCD Type 2), Data Vault 2.0 (hubs, links, and satellites with hash keys and immutable loading), Lakehouse medallion (Bronze raw ingestion, Silver curated, Gold aggregated), and operational normalized schemas for transactional systems. You select the pattern and domain; the tool handles the structural rules for each.
How accurate is AI-generated healthcare DDL?
AI-generated DDL is accurate for structure, naming, and data types — the deterministic parts of schema design. It correctly knows that a clm_paid_amt is NUMBER(12,2) not VARCHAR, that an NPI is VARCHAR(10), and that a Star Schema fact table needs surrogate keys and foreign keys to dimension tables. Where human review is still needed: business-rule constraints (what makes a claim eligible for your specific HCC model), organization-specific primary key strategies, and custom audit trail requirements that go beyond standard patterns. Treat AI-generated DDL as a high-quality first draft that eliminates the blank-page problem, not a final artifact that bypasses schema review.
mdatool Team
The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.
Related Guides
More in Healthcare Data Modeling
Enforcing ISO-11179 Healthcare Naming Standards in dbt Projects
Every healthcare data warehouse eventually develops naming drift — DOB in one model, birth_dt in another, member_birth_date in a third. The dbt-healthcare-standards package brings ISO-11179 column naming directly into your dbt project as installable macros and schema tests.
Read moreHealthcare Claims Data Model: Complete SQL Schema for Payers
The complete SQL schema for a payer-side healthcare claims data warehouse — including claim header, claim line, adjudication, remittance, and provider tables with DDL for Snowflake and BigQuery.
Read moreHEDIS Measure Data Architecture: Building a Compliant Quality Reporting Pipeline
HEDIS measure calculation is deceptively complex. NCQA technical specifications are hundreds of pages long, source data comes from four different systems, and one logic error can affect thousands of members. Here is how to build a pipeline that holds up to NCQA audit.
Read moreFree Tools
Ready to improve your data architecture?
Free tools for DDL conversion, SQL analysis, naming standards, and more.
Get weekly healthcare data engineering tips
Practical guides on data modeling, SQL standards, and healthcare domain conventions — straight to your inbox.
No spam. Unsubscribe any time.