Introduction
A health plan with 2 million members has, on average, between 1.5 and 3 million member records — because the same person enrolled, churned, re-enrolled, changed names, moved, and got entered differently in three different source systems. That is not a data quality problem. It is a master data management failure. MDM in healthcare is not a nice-to-have data engineering exercise; it is the foundation that makes population health analytics, risk adjustment, and care management operationally possible.
This guide covers the architecture and data modeling patterns for the three domains that matter most: patient identity, provider registry, and payer/plan master data.
Domain 1: Patient Identity — The Master Patient Index (MPI)
The Master Patient Index is the authoritative, deduplicated record of every patient or member across all source systems. Every downstream use case — HEDIS measures, HCC risk scores, care gap analysis — depends on an accurate MPI.
The Deduplication Problem
Patient records duplicate for predictable reasons:
- Name variations (Maria vs. María, Jr. vs. omitted suffix)
- Address changes that create new enrollment records
- Multiple coverage periods under different member IDs
- Data entry errors at point of care
A naive approach (match on SSN) fails because SSN is missing, incorrect, or deliberately obscured in roughly 10–15% of records in most health plan systems.
MPI Data Model
CREATE TABLE mdm.patient_master (
enterprise_patient_id VARCHAR(36) NOT NULL, -- golden record ID
first_name VARCHAR(100),
last_name VARCHAR(100),
date_of_birth DATE,
gender_code CHAR(1),
ssn_hash VARCHAR(64), -- SHA-256, not plaintext
primary_address_id BIGINT,
current_coverage_flag BOOLEAN,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL,
match_confidence_score DECIMAL(5,4), -- 0.0000–1.0000
source_count INT, -- number of source records matched
PRIMARY KEY (enterprise_patient_id)
);
CREATE TABLE mdm.patient_crosswalk (
crosswalk_id BIGINT GENERATED ALWAYS AS IDENTITY,
enterprise_patient_id VARCHAR(36) NOT NULL,
source_system VARCHAR(50) NOT NULL, -- EPIC, FACETS, CLAIMS, etc.
source_patient_id VARCHAR(100) NOT NULL,
match_type VARCHAR(20), -- EXACT, PROBABILISTIC, MANUAL
match_score DECIMAL(5,4),
linked_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (crosswalk_id),
UNIQUE (source_system, source_patient_id)
);
Matching Strategy
Use a probabilistic matching algorithm (Fellegi-Sunter is the standard; commercial tools like IBM InfoSphere QualityStage implement it). Weight matching fields:
| Field | Weight | Notes |
|---|---|---|
| SSN | 0.30 | High weight but often missing/incorrect |
| DOB | 0.25 | Reliable for enrollment; less so for EHR |
| Last name + First name | 0.20 | Account for phonetic variants (Soundex/Metaphone) |
| Address | 0.15 | Use USPS standardized form |
| Phone | 0.10 | High churn; low weight |
Match pairs above a confidence threshold (typically 0.85+) are auto-linked. Pairs in a review band (0.70–0.85) queue for manual stewardship review.
Domain 2: Provider Registry
A provider registry is the authoritative directory of every clinician, facility, and organization that delivers care within your network or claims ecosystem. Broken provider data causes misdirected payments, inaccurate directories, and failed credentialing processes.
Provider Registry Data Model
CREATE TABLE mdm.provider_master (
enterprise_provider_id VARCHAR(36) NOT NULL,
npi CHAR(10) NOT NULL, -- National Provider Identifier
provider_type VARCHAR(20) NOT NULL, -- INDIVIDUAL, ORGANIZATION
first_name VARCHAR(100), -- null for org
last_name VARCHAR(100), -- null for org
organization_name VARCHAR(200), -- null for individual
primary_specialty_code VARCHAR(10), -- NUCC taxonomy
primary_taxonomy_desc VARCHAR(200),
license_state CHAR(2),
license_number VARCHAR(50),
credential_status VARCHAR(20), -- ACTIVE, EXPIRED, SUSPENDED
network_status VARCHAR(20), -- IN_NETWORK, OUT_OF_NETWORK
effective_date DATE,
termination_date DATE,
PRIMARY KEY (enterprise_provider_id),
UNIQUE (npi)
);
The NPI is the universal provider identifier — use it as the deduplication key across source systems. The NPI Lookup tool can validate NPI numbers and retrieve taxonomy classifications from the NPPES registry directly.
Common MDM Failures in Provider Data
- Group NPI vs. Individual NPI confusion: Claims paid to group NPI 1234567890 but attributed to rendering NPI 9876543210. MDM must maintain both and link them.
- Taxonomy code drift: A provider changes specialty but the taxonomy update doesn't propagate from NPPES to the internal registry. Creates incorrect network status.
- Credentialing lag: Provider onboarded to claims before credentialing completes. MDM should gate network activation on credentialing status.
Domain 3: Payer and Plan Master Data
Plan master data governs the products your organization offers or interacts with: benefit plans, formularies, network configurations, and contract terms. In a health plan, inaccurate plan data is a root cause of incorrect member cost-sharing, failed prior authorization checks, and misapplied formulary rules.
Plan Master Data Model
CREATE TABLE mdm.plan_master (
plan_id VARCHAR(36) NOT NULL,
plan_name VARCHAR(200) NOT NULL,
plan_type VARCHAR(20) NOT NULL, -- HMO, PPO, EPO, HDHP, MA, PDP
lob VARCHAR(30) NOT NULL, -- COMMERCIAL, MEDICARE, MEDICAID
cms_contract_number VARCHAR(10), -- for Medicare Advantage
cms_plan_id VARCHAR(10), -- H, R, or S number
effective_date DATE NOT NULL,
termination_date DATE,
formulary_id VARCHAR(36),
network_id VARCHAR(36),
service_area_state CHAR(2),
metal_tier VARCHAR(10), -- BRONZE, SILVER, GOLD, PLATINUM
is_active BOOLEAN NOT NULL DEFAULT TRUE,
PRIMARY KEY (plan_id)
);
MDM Architecture Patterns
Registry style (virtual MDM): A crosswalk table links source IDs to a master ID without physically consolidating records. Fast to implement; requires all consumers to join through the crosswalk.
Consolidation style (physical golden record): A golden record is created by merging source attributes. Best match attribute wins per field. More reliable for analytics but requires ongoing merge maintenance.
Coexistence style: Both source records and the golden record exist and are synchronized in both directions. Most common in healthcare because source systems (Epic, Facets) remain the system of record for clinical and operational data.
Key Takeaways
- Patient MDM is the foundation of population health analytics. Start with probabilistic matching, not exact match on SSN.
- The crosswalk table pattern is the most portable MDM architecture — it works regardless of whether you consolidate records or not.
- NPI is the universal deduplication key for providers. Use the NPI Lookup to validate and enrich provider records from NPPES.
- Plan master data failures cause downstream errors in cost-sharing, prior auth, and formulary logic — not just data quality dashboards.
- MDM is a process, not a one-time project. Merge confidence scores must be reviewed continuously as new records flow in.
mdatool Team
The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.
Related Guides
Key Terms in This Article
More in Healthcare Data Modeling
AI-Powered Healthcare Data Modeling: Generate Production-Ready Schemas in 30 Seconds
Healthcare data architects spend days designing schemas from scratch — Medicare Advantage claims warehouses alone require 20+ tables, hundreds of columns, and platform-specific syntax. The mdatool AI Data Modeling tool generates a production-ready Star Schema for Snowflake, BigQuery, or Databricks in 30 seconds, with ISO-11179 standard column names built in.
Read moreEnforcing ISO-11179 Healthcare Naming Standards in dbt Projects
Every healthcare data warehouse eventually develops naming drift — DOB in one model, birth_dt in another, member_birth_date in a third. The dbt-healthcare-standards package brings ISO-11179 column naming directly into your dbt project as installable macros and schema tests.
Read moreAI-Powered Healthcare Data Modeling: Generate Production-Ready Schemas in Seconds
Healthcare data models take weeks to design manually — HIPAA constraints, CMS reporting requirements, and ISO-11179 naming standards all have to be applied correctly from the start. AI data modeling changes that. Here is how to generate production-ready schemas for Snowflake, BigQuery, and Databricks in seconds.
Read moreFree Tools
Ready to improve your data architecture?
Free tools for DDL conversion, SQL analysis, naming standards, and more.
Get weekly healthcare data engineering tips
Practical guides on data modeling, SQL standards, and healthcare domain conventions — straight to your inbox.
No spam. Unsubscribe any time.