Introduction
A health plan with 2 million members has, on average, between 1.5 and 3 million member records — because the same person enrolled, churned, re-enrolled, changed names, moved, and got entered differently in three different source systems. That is not a data quality problem. It is a master data management failure. MDM in healthcare is not a nice-to-have data engineering exercise; it is the foundation that makes population health analytics, risk adjustment, and care management operationally possible.
This guide covers the architecture and data modeling patterns for the three domains that matter most: patient identity, provider registry, and payer/plan master data.
Domain 1: Patient Identity — The Master Patient Index (MPI)
The Master Patient Index is the authoritative, deduplicated record of every patient or member across all source systems. Every downstream use case — HEDIS measures, HCC risk scores, care gap analysis — depends on an accurate MPI.
The Deduplication Problem
Patient records duplicate for predictable reasons:
- Name variations (Maria vs. María, Jr. vs. omitted suffix)
- Address changes that create new enrollment records
- Multiple coverage periods under different member IDs
- Data entry errors at point of care
A naive approach (match on SSN) fails because SSN is missing, incorrect, or deliberately obscured in roughly 10–15% of records in most health plan systems.
MPI Data Model
CREATE TABLE mdm.patient_master ( enterprise_patient_id VARCHAR(36) NOT NULL, -- golden record ID first_name VARCHAR(100), last_name VARCHAR(100), date_of_birth DATE, gender_code CHAR(1), ssn_hash VARCHAR(64), -- SHA-256, not plaintext primary_address_id BIGINT, current_coverage_flag BOOLEAN, created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, updated_at TIMESTAMP NOT NULL, match_confidence_score DECIMAL(5,4), -- 0.0000–1.0000 source_count INT, -- number of source records matched PRIMARY KEY (enterprise_patient_id) ); CREATE TABLE mdm.patient_crosswalk ( crosswalk_id BIGINT GENERATED ALWAYS AS IDENTITY, enterprise_patient_id VARCHAR(36) NOT NULL, source_system VARCHAR(50) NOT NULL, -- EPIC, FACETS, CLAIMS, etc. source_patient_id VARCHAR(100) NOT NULL, match_type VARCHAR(20), -- EXACT, PROBABILISTIC, MANUAL match_score DECIMAL(5,4), linked_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (crosswalk_id), UNIQUE (source_system, source_patient_id) );
Matching Strategy
Use a probabilistic matching algorithm (Fellegi-Sunter is the standard; commercial tools like IBM InfoSphere QualityStage implement it). Weight matching fields:
| Field | Weight | Notes |
|---|---|---|
| SSN | 0.30 | High weight but often missing/incorrect |
| DOB | 0.25 | Reliable for enrollment; less so for EHR |
| Last name + First name | 0.20 | Account for phonetic variants (Soundex/Metaphone) |
| Address | 0.15 | Use USPS standardized form |
| Phone | 0.10 | High churn; low weight |
Match pairs above a confidence threshold (typically 0.85+) are auto-linked. Pairs in a review band (0.70–0.85) queue for manual stewardship review.
Domain 2: Provider Registry
A provider registry is the authoritative directory of every clinician, facility, and organization that delivers care within your network or claims ecosystem. Broken provider data causes misdirected payments, inaccurate directories, and failed credentialing processes.
Provider Registry Data Model
CREATE TABLE mdm.provider_master ( enterprise_provider_id VARCHAR(36) NOT NULL, npi CHAR(10) NOT NULL, -- National Provider Identifier provider_type VARCHAR(20) NOT NULL, -- INDIVIDUAL, ORGANIZATION first_name VARCHAR(100), -- null for org last_name VARCHAR(100), -- null for org organization_name VARCHAR(200), -- null for individual primary_specialty_code VARCHAR(10), -- NUCC taxonomy primary_taxonomy_desc VARCHAR(200), license_state CHAR(2), license_number VARCHAR(50), credential_status VARCHAR(20), -- ACTIVE, EXPIRED, SUSPENDED network_status VARCHAR(20), -- IN_NETWORK, OUT_OF_NETWORK effective_date DATE, termination_date DATE, PRIMARY KEY (enterprise_provider_id), UNIQUE (npi) );
The NPI is the universal provider identifier — use it as the deduplication key across source systems. The NPI Lookup tool can validate NPI numbers and retrieve taxonomy classifications from the NPPES registry directly.
Common MDM Failures in Provider Data
- Group NPI vs. Individual NPI confusion: Claims paid to group NPI 1234567890 but attributed to rendering NPI 9876543210. MDM must maintain both and link them.
- Taxonomy code drift: A provider changes specialty but the taxonomy update doesn't propagate from NPPES to the internal registry. Creates incorrect network status.
- Credentialing lag: Provider onboarded to claims before credentialing completes. MDM should gate network activation on credentialing status.
Domain 3: Payer and Plan Master Data
Plan master data governs the products your organization offers or interacts with: benefit plans, formularies, network configurations, and contract terms. In a health plan, inaccurate plan data is a root cause of incorrect member cost-sharing, failed prior authorization checks, and misapplied formulary rules.
Plan Master Data Model
CREATE TABLE mdm.plan_master ( plan_id VARCHAR(36) NOT NULL, plan_name VARCHAR(200) NOT NULL, plan_type VARCHAR(20) NOT NULL, -- HMO, PPO, EPO, HDHP, MA, PDP lob VARCHAR(30) NOT NULL, -- COMMERCIAL, MEDICARE, MEDICAID cms_contract_number VARCHAR(10), -- for Medicare Advantage cms_plan_id VARCHAR(10), -- H, R, or S number effective_date DATE NOT NULL, termination_date DATE, formulary_id VARCHAR(36), network_id VARCHAR(36), service_area_state CHAR(2), metal_tier VARCHAR(10), -- BRONZE, SILVER, GOLD, PLATINUM is_active BOOLEAN NOT NULL DEFAULT TRUE, PRIMARY KEY (plan_id) );
MDM Architecture Patterns
Registry style (virtual MDM): A crosswalk table links source IDs to a master ID without physically consolidating records. Fast to implement; requires all consumers to join through the crosswalk.
Consolidation style (physical golden record): A golden record is created by merging source attributes. Best match attribute wins per field. More reliable for analytics but requires ongoing merge maintenance.
Coexistence style: Both source records and the golden record exist and are synchronized in both directions. Most common in healthcare because source systems (Epic, Facets) remain the system of record for clinical and operational data.
Key Takeaways
- Patient MDM is the foundation of population health analytics. Start with probabilistic matching, not exact match on SSN.
- The crosswalk table pattern is the most portable MDM architecture — it works regardless of whether you consolidate records or not.
- NPI is the universal deduplication key for providers. Use the NPI Lookup to validate and enrich provider records from NPPES.
- Plan master data failures cause downstream errors in cost-sharing, prior auth, and formulary logic — not just data quality dashboards.
- MDM is a process, not a one-time project. Merge confidence scores must be reviewed continuously as new records flow in.
mdatool Team
The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.
Related Guides
More in Healthcare Data Modeling
HEDIS Measure Data Architecture: Building a Compliant Quality Reporting Pipeline
HEDIS measure calculation is deceptively complex. NCQA technical specifications are hundreds of pages long, source data comes from four different systems, and one logic error can affect thousands of members. Here is how to build a pipeline that holds up to NCQA audit.
Read morePrior Authorization Data Modeling: End-to-End Architecture Guide
Prior authorization is one of the most operationally complex workflows in healthcare — and one of the most data-intensive. Here is the end-to-end data model, from PA request through appeal.
Read moreHow to Model Social Determinants of Health (SDOH) Data in Your Warehouse
SDOH data is increasingly required for quality reporting, care management, and value-based contracts — but most warehouses treat it as an afterthought. Here is a practical data model that makes SDOH analytically useful.
Read moreReady to improve your data architecture?
Free tools for DDL conversion, SQL analysis, naming standards, and more.