Introduction
Most healthcare data dictionaries fail silently. They are built once during an implementation, left on a SharePoint drive, and quietly abandoned when the team that wrote them moves on. Two years later, clm_pd_amt means paid amount to the claims team and pending amount to the finance team. A HEDIS submission fails because the denominator logic joined on mbr_id in one table and member_key in another — and nobody documented that they are not the same field. An HCC audit produces incorrect RAF scores because icd_diag_cd_1 in the silver layer does not mean the same thing it did in the source system.
Free Tool
Calculate RAF scores with our free HCC Calculator →
A data dictionary is not documentation overhead. It is the contract that makes every downstream use of healthcare data reliable. This guide is about building one that stays alive — connected to your actual schema, maintained through a governance process, and useful to both engineers and business analysts.
What Is a Healthcare Data Dictionary?
A healthcare data dictionary is a structured catalog of every data element in your warehouse — its name, definition, data type, source system, business meaning, and governance rules. Each entry answers the questions that engineers and analysts ask when they encounter a field they have never seen before: What does this column mean? Where did it come from? What are the valid values? What does NULL mean here?
Data Dictionary vs. Data Catalog
These terms are often used interchangeably, but they are different things:
Data catalog: A system-of-record for data assets — tables, files, APIs, pipelines. A catalog tells you what data exists, where it lives, and who owns it. Examples: Collibra, Alation, Microsoft Purview.
Data dictionary: A term-level reference for individual data elements and their definitions. A dictionary tells you what a specific field means, what it contains, and how to use it correctly. A dictionary lives inside or alongside a catalog.
Healthcare organizations need both. The catalog provides lineage and discovery. The dictionary provides the semantic precision that healthcare data requires — because in healthcare, the difference between clm_pd_amt and clm_alwd_amt is not obvious from the column name, and confusing the two in a revenue analytics query produces materially wrong results.
Why Healthcare Is Uniquely Complex
Healthcare data dictionaries have to cover terrain that no other industry demands:
- Multiple coding systems: ICD-10-CM, ICD-10-PCS, CPT, HCPCS, NDC, NPI, NUCC taxonomy — each with its own structure, versioning, and valid value set
- HIPAA PHI identifiers: 18 specific data elements that require classification, access control, and masking — and they must be identifiable in the dictionary
- CMS reporting requirements: Columns that feed RAPS/EDPS submissions, HEDIS measures, and Stars calculations have precise definitions set by CMS and NCQA, not by your organization
- Multi-source integration: A single
mbr_idfield may represent different populations depending on whether it came from the claims system, the enrollment system, or the EHR — and a dictionary that does not capture this distinction causes silent join failures
Free Tool
Look up any NPI number instantly →
Why Healthcare Data Dictionaries Fail
Understanding failure modes is the fastest way to avoid them.
Built once, never maintained. A dictionary that was accurate at go-live is wrong within six months. Schema changes, new source systems, and coding standard updates all invalidate entries. Without a maintenance process — a review cadence and an owner for each domain — the dictionary becomes a liability instead of an asset.
Too technical for business users. Definitions written for engineers ("SHA-256 hash of the concatenated source system identifier and member enrollment sequence number") are unusable by the finance analyst trying to understand why two reports produce different member counts. Definitions need to work at both levels.
Too vague for engineers. Definitions written for business users ("the member's unique identifier") do not answer the questions engineers actually have: Is this nullable? What happens when a member re-enrolls and gets a new ID? Does this join to the provider table? Vague definitions produce incorrect queries.
No standard abbreviations. When DOB, birth_dt, member_birth_date, and pt_dob all exist in the same warehouse and all mean date of birth, the dictionary has no way to establish which one is canonical. Without ISO-11179 naming standards as the foundation, the dictionary documents inconsistency instead of resolving it.
Not linked to the actual schema. A dictionary in a spreadsheet that is not connected to the live column list falls out of sync immediately. New columns get added without entries. Renamed columns leave orphan definitions. The dictionary needs to be schema-aware — generated from or synchronized with the information schema.
Real consequences: An undocumented difference between clm_pd_dt (date payment was issued) and clm_adj_dt (date the claim was adjudicated) caused one payer's accounts payable reconciliation to report a $4M discrepancy that required a three-week investigation. The root cause was a single missing definition.
The 7 Essential Elements of a Data Dictionary Entry
A useful dictionary entry answers seven questions. Here is a complete example for one of the most commonly misused fields in claims data:
Term: Claim Paid Amount
Abbreviation: clm_pd_amt
Definition: The total dollar amount paid by the insurance payer to the provider or member for a specific claim, after applying contractual adjustments, deductibles, copayments, and coinsurance. Distinct from clm_alwd_amt (the contractually allowed amount before member liability is deducted) and clm_chrg_amt (the provider's original billed charge). A value of 0.00 indicates the claim was adjudicated but no payment was issued — typically a fully denied or zero-pay claim.
Data Type: NUMBER(12,2) in Snowflake; NUMERIC in BigQuery; DECIMAL(12,2) in Databricks
Source System: Claims adjudication system (Facets, QNXT, or TriZetto). Populated after adjudication completes. NULL in pre-adjudication staging tables.
Category: Claims
Example Values: 125.50, 0.00, 1250.00, -45.00 (negative values represent clawbacks or recoupments)
Null Handling: NULL indicates the claim has not yet been adjudicated. A value of 0.00 is meaningfully different from NULL — zero means adjudicated with no payment; NULL means adjudication is pending. Queries that treat NULL and 0.00 as equivalent will undercount pending claims.
Related Terms: clm_alwd_amt, clm_chrg_amt, clm_mbr_liab_amt, clm_pd_dt
These seven elements — term, abbreviation, definition, data type, source system, category, example values, null handling, and related terms — are the minimum viable entry. The definition and null handling sections are where most dictionaries fail. A one-sentence definition is almost never sufficient for healthcare fields.
Standard Abbreviations — The Foundation of a Good Data Dictionary
A data dictionary built on inconsistent naming will never achieve the clarity it needs. Before writing definitions, establish the naming standard. For healthcare data warehouses, that standard is ISO-11179.
ISO-11179 structures every column name as three parts: object class + property + representation. mbr_birth_dt = member (object class) + birth (property) + date (representation). For a full walkthrough of the standard and platform-specific DDL examples, see ISO-11179 Naming Convention Complete Guide.
The most important suffixes:
| Suffix | Meaning | Example |
|---|---|---|
_dt | Calendar date | clm_svc_dt, mbr_birth_dt |
_ts | Timestamp with time | load_ts, adj_ts |
_amt | Monetary amount | clm_pd_amt, rx_copay_amt |
_cd | Code value | icd_diag_cd, clm_sts_cd |
_id | Unique identifier | mbr_id, clm_id |
_nm | Name | prvdr_nm, mbr_last_nm |
_nbr | Non-identity number | clm_ln_nbr, subscr_grp_nbr |
_flg | Boolean flag | elig_actv_flg, drug_gnrc_flg |
_cnt | Count | diag_cd_cnt, rx_days_sup_cnt |
_pct | Percentage | copay_pct |
_txt | Free text | denial_rsn_txt |
Use the Naming Auditor to check your existing schema against these standards and generate a prioritized list of violations before you start writing definitions.
Building Your Healthcare Data Dictionary — Step by Step
Step 1: Inventory Your Tables and Columns
Start with what exists. Run this query against your Snowflake information schema to export every column in your healthcare warehouse as the raw material for your dictionary:
-- Snowflake: export full column inventory for dictionary build
SELECT
c.table_schema,
c.table_name,
c.column_name,
c.data_type,
c.character_maximum_length,
c.numeric_precision,
c.numeric_scale,
c.is_nullable,
c.column_default,
t.comment AS table_comment,
c.comment AS column_comment
FROM information_schema.columns c
JOIN information_schema.tables t
ON c.table_schema = t.table_schema
AND c.table_name = t.table_name
WHERE c.table_schema NOT IN ('INFORMATION_SCHEMA', 'PUBLIC')
AND t.table_type = 'BASE TABLE'
ORDER BY c.table_schema, c.table_name, c.ordinal_position;
Export this to a spreadsheet or load it into your dictionary tool. The comment columns are often empty on legacy tables — that is your backlog.
For BigQuery, query INFORMATION_SCHEMA.COLUMN_FIELD_PATHS; for Databricks, use DESCRIBE TABLE EXTENDED or the Unity Catalog information_schema.columns view.
Step 2: Categorize by Domain
Assign every table and column to a domain. Healthcare data warehouses have five standard domains:
- Claims: Adjudicated professional, institutional, and pharmacy claims; remittance data
- Clinical: Encounters, diagnoses, procedures, lab results, EHR data
- Pharmacy: Prescription fill records, NDC codes, PBM adjudication data
- Member: Enrollment, eligibility, demographics, plan assignment
- Provider: NPI registry, credentialing, network participation, specialty data
Domains determine which business stakeholders own each section of the dictionary and which abbreviation prefixes apply. A column with no domain assignment is a governance gap.
Browse the mdatool glossary for the full standard term set organized by domain — it covers 100,000+ healthcare terms with definitions, abbreviations, and data type guidance across claims, clinical, member, pharmacy, and provider domains.
Step 3: Standardize Abbreviations
Before writing definitions, audit and standardize every column name. Non-standard names — DOB, PatientID, CLAIM_NUMBER — make the dictionary impossible to keep synchronized with the schema because the same concept appears under different names in different tables.
Run the Naming Auditor to identify violations. For columns that need renaming, use views to decouple the migration from application changes: create a view with ISO-11179 compliant names, migrate consumers to the view, then rename the base column in a maintenance window.
Step 4: Write Definitions
Write definitions that work for two audiences simultaneously: the data engineer joining tables and the business analyst writing a report. A definition that only satisfies one audience will be abandoned by the other.
Requirements for every definition:
- Minimum 150 characters: One-sentence definitions are almost always insufficient for healthcare fields. Explain the field's role in the data model, not just what it represents.
- Distinguish from related fields: For fields that are commonly confused (
clm_pd_amtvs.clm_alwd_amt), the definition must state explicitly how they differ. - Healthcare context required: Define what the field means in a healthcare business context, not just its technical format. "The date the claim was paid" is weaker than "The date the adjudication system issued the payment to the provider, which may be 30–60 days after the service date."
- State the unit: Monetary amounts should specify the currency. Date fields should specify whether they capture the start of a period or the event date. Counts should specify what is being counted.
Step 5: Add Data Quality Rules
Each dictionary entry should include the quality rules that govern it. These are not aspirational — they should match the validation logic actually running in your ETL pipeline:
- Valid value ranges:
clm_pd_amtmust be >= -100000 and <= 10000000;mbr_gndr_cdmust be in (M, F, U, X) - Null handling: Document when NULL is a valid business state vs. a data quality failure.
clm_pd_dtis NULL for pending claims — expected.clm_idis NULL — never acceptable. - Referential integrity:
mbr_idin the claims fact table must resolve to a row indim_member. Document the lookup key and the expected match rate (100% for adjudicated claims; lower for eligibility gaps in some pipelines). - Format constraints:
prvdr_npiis always exactly 10 digits and passes the Luhn check.ndc_cdis always 11 digits with no hyphens.
Step 6: Link to Source Systems
Every field in your dictionary should trace back to a source system and a transformation rule. This is what makes the dictionary useful during an audit or a data quality investigation:
- Source system: Claims adjudication (Facets, QNXT), enrollment (Facets, MMIS), EHR (Epic, Cerner), clearinghouse (Change Healthcare, Availity)
- Source field name: What the field is called in the source system before transformation
- Transformation rule: Whether the field is passed through unchanged, cast from a different type, derived from multiple source fields, or defaulted when the source is NULL
Fields with no documented source and no transformation rule are the ones that fail during a RADV audit when CMS asks you to trace a specific HCC submission back to the originating encounter.
Step 7: Maintain and Govern
A data dictionary without a governance process degrades within 90 days. Establish the minimum viable maintenance program:
- Domain stewards: One named person per domain (claims, clinical, member, pharmacy, provider) is responsible for reviewing their section quarterly and approving new additions
- Schema change process: Any DDL change that adds, removes, or renames a column triggers a mandatory dictionary update before the change deploys to production — enforced as a PR gate
- Review cadence: Full dictionary review annually against CMS code set updates (ICD-10-CM updates annually on October 1; NDC, CPT, and HCPCS update throughout the year)
- Deprecation policy: Columns removed from the schema are marked deprecated in the dictionary for 12 months before their entries are archived — so that historical queries can still be explained
Healthcare Data Dictionary by Domain — Examples
Claims Domain
The claims domain contains the highest-stakes data in a payer warehouse. These five entries are the most commonly missing or incorrect:
| Column | Abbreviation | Data Type | Definition Summary | Null Handling |
|---|---|---|---|---|
| Claim Identifier | clm_id | VARCHAR(50) | Unique identifier assigned by the adjudication system at claim receipt. Immutable after assignment. | Never NULL after adjudication. NULL in pre-adjudication staging only. |
| Paid Amount | clm_pd_amt | NUMBER(12,2) | Dollar amount paid to provider after contractual adjustments and member liability. Zero-pay adjudicated claims have value 0.00, not NULL. | NULL = pending adjudication. 0.00 = adjudicated, no payment issued. |
| Service Date | clm_svc_dt | DATE | First date of service for the claim. For inpatient, the admission date. For professional, the date of the primary procedure. | NOT NULL for adjudicated claims. |
| Claim Status Code | clm_sts_cd | VARCHAR(20) | Adjudication status: PAID, DENIED, PENDED, VOID, ADJUSTED. Drives downstream financial reporting and appeals workflows. | NOT NULL. Default PENDED on receipt. |
| Primary Diagnosis | icd_diag_cd_1 | VARCHAR(7) | ICD-10-CM diagnosis code in the primary position. Validated against the CMS ICD-10-CM reference table effective as of the service date — not the processing date. | Nullable for pharmacy claims. Required for professional and institutional. |
See Healthcare Claims Data Model: Complete SQL Schema for the full Star Schema these columns belong to.
Member Domain
The member domain is the most common source of join failures in healthcare analytics. Members re-enroll, change plans, and have multiple IDs across source systems.
| Column | Abbreviation | Data Type | Definition Summary | Null Handling |
|---|---|---|---|---|
| Member Identifier | mbr_id | VARCHAR(50) | Health plan-assigned member ID. PHI. Does not change within an enrollment period; may be reassigned on re-enrollment. Use enterprise_mbr_id for cross-period joins. | NOT NULL. |
| Effective Date | mbr_eff_dt | DATE | First date the member's current coverage period is active. Used as the lower bound in date-range eligibility checks. | NOT NULL. |
| Termination Date | mbr_term_dt | DATE | Last date of coverage for this enrollment record. NULL indicates currently active enrollment — do not interpret NULL as missing data. | NULL = currently active. Date = terminated. |
| Date of Birth | mbr_birth_dt | DATE | Member date of birth. PHI. Used in age-based HEDIS denominator calculations and CMS risk adjustment age brackets. | NOT NULL for CMS reporting. May be NULL in pre-enrollment feeds. |
| Gender Code | mbr_gndr_cd | CHAR(1) | Member gender: M (Male), F (Female), U (Unknown), X (Non-binary). Used in gender-stratified HEDIS measures. Sourced from enrollment application; may differ from clinical gender identity in EHR. | NOT NULL for CMS reporting. |
Clinical Domain
The clinical domain intersects with risk adjustment, quality measurement, and care management — making definitional precision especially consequential.
| Column | Abbreviation | Data Type | Definition Summary | Null Handling |
|---|---|---|---|---|
| Encounter Identifier | enc_id | VARCHAR(50) | Unique identifier for a clinical encounter. May differ from the claim ID — one encounter can generate multiple claims (split billing) or one claim can span multiple encounters (bundled payment). | NOT NULL. |
| Admit Date | enc_adm_dt | DATE | Date the patient was admitted for inpatient encounters. For outpatient, equivalent to the service date. Used as the encounter start date in longitudinal analysis. | NOT NULL for inpatient. |
| HCC Condition Code | hcc_cd | VARCHAR(10) | CMS-HCC model condition category code. Derived from ICD-10-CM diagnosis codes using the CMS grouper algorithm. Version must be tracked (hcc_model_ver) — V28 and V24 produce different mappings. | NULL if no HCC-qualifying diagnosis on the encounter. |
| RAF Score | raf_score_amt | NUMBER(8,4) | Risk Adjustment Factor score for the member, computed from demographic factors and HCC conditions. Named with _amt suffix per ISO-11179 because it is a continuous numeric measure, not a code or count. | NOT NULL for members in risk-bearing arrangements. |
| Discharge Status Code | dsch_sts_cd | VARCHAR(2) | UB-04 discharge status code indicating patient disposition: 01 (home), 02 (short-term hospital), 20 (expired), etc. Required for inpatient institutional claims. | NULL for outpatient encounters. NOT NULL for inpatient. |
Tools for Building and Maintaining Your Dictionary
mdatool Glossary — 100,000+ standardized healthcare data terms with definitions, abbreviations, data types, and domain categorization. Use it as the canonical reference when writing definitions for claims, clinical, member, pharmacy, and provider fields. Search for any term to see its standard abbreviation, definition, and related terms.
Naming Auditor — Paste your schema or column list and get an instant report of ISO-11179 compliance violations with suggested corrections. Start here before writing a single definition — a dictionary built on non-standard names requires complete rework.
Column Name Generator — Generate ISO-11179 compliant column names from plain-English descriptions. Useful when adding new fields to an existing schema and needing to maintain naming consistency across thousands of existing columns.
AI Data Modeling — Generate production-ready schema DDL with ISO-11179 compliant naming for any healthcare domain and architecture pattern (Star Schema, Data Vault, Lakehouse). The generated DDL provides a compliant starting point that the dictionary can document rather than needing to correct.
Dictionary Build Checklist
Before calling your dictionary production-ready, verify each item:
- Full column inventory exported from information schema — no tables or columns missing
- Every column assigned to a domain (claims, clinical, member, pharmacy, provider)
- ISO-11179 naming audit complete — violations corrected or documented for remediation
- Every entry has a definition of 150+ characters that covers both technical and business meaning
- PHI columns flagged and HIPAA classification documented
- Null handling documented for every nullable column
- Data quality rules match the validation logic running in the ETL pipeline
- Every field traces to a source system and a transformation rule
- Domain stewards assigned — one named owner per domain
- Schema change PR gate in place — no DDL merge without a dictionary update
- Annual review scheduled against CMS ICD-10-CM update cycle
Key Takeaways
- A data dictionary built on inconsistent naming cannot be maintained. Establish ISO-11179 naming standards before writing a single definition — see the ISO-11179 Complete Guide.
- The definition and null handling sections are where most healthcare dictionaries fail. NULL and 0.00 mean different things in claims data. One sentence is almost never sufficient.
- PHI fields must be explicitly identified in the dictionary — not just in your access control system. If a HIPAA audit asks which columns contain date of birth, the dictionary should answer that question instantly.
- A dictionary without a maintenance process is a liability. Assign domain stewards, enforce dictionary updates as a PR gate, and schedule reviews against the CMS ICD-10-CM annual update cycle.
- Start with the mdatool glossary for definitions — 100,000+ terms already defined, abbreviated, and categorized by domain. Use the Healthcare Analytics Guide for how these terms connect into analytics workflows.
Frequently Asked Questions
What is a healthcare data dictionary?
A healthcare data dictionary is a structured reference that defines every data element in a healthcare data warehouse — the column name, abbreviation, definition, data type, source system, valid values, null handling rules, and related fields. It is the authoritative source of meaning for every field, used by data engineers writing queries, business analysts interpreting reports, compliance teams responding to HIPAA audits, and CMS reporting teams validating submissions. A dictionary that is incomplete, inconsistent, or out of sync with the live schema is worse than no dictionary — it creates false confidence.
How is a data dictionary different from a data catalog?
A data catalog inventories data assets — tables, files, pipelines, APIs — and provides lineage, ownership, and discovery capabilities. A data dictionary defines individual data elements — what a specific column means, what values it can contain, and how it relates to other fields. The catalog answers "what data exists and where"; the dictionary answers "what does this field mean and how do I use it correctly." Healthcare organizations need both: the catalog for governance and discovery, the dictionary for semantic precision.
What should a healthcare data dictionary include?
At minimum, each entry needs: the term name, its standard abbreviation following ISO-11179 conventions, a definition of 150+ characters covering both technical format and business meaning, the data type (platform-specific), the source system, the domain category, example values, null handling rules (NULL vs. a default value is often a meaningful distinction in claims data), and related terms that are commonly confused with this field. PHI fields must be explicitly flagged. Fields that feed CMS submissions (RAPS/EDPS, HEDIS, Stars) need additional documentation of the CMS specification they fulfill.
How do I standardize column names in my healthcare data warehouse?
Start with an ISO-11179 audit. Run the Naming Auditor against your information schema to identify non-compliant names — mixed case, full words instead of approved abbreviations, missing type suffixes. For legacy tables, create views with compliant names over the non-compliant base tables, migrate downstream consumers to the views, then rename base tables in a maintenance window. This decouples the naming migration from application deployments and gives you a rollback path. The ISO-11179 Naming Convention Complete Guide covers the full standard with platform-specific DDL examples for Snowflake, BigQuery, and Databricks.
What is the ISO-11179 naming standard?
ISO/IEC 11179 is an international standard for naming and identifying data elements. Its naming convention structures every column name as three parts: object class (what domain the field belongs to: mbr, clm, prvdr), property (what characteristic is captured: birth, paid, svc), and representation (how it is stored: dt for date, amt for amount, cd for code). The result — mbr_birth_dt, clm_pd_amt, prvdr_npi — is a column name that communicates its meaning without a data dictionary lookup. Healthcare adopted ISO-11179 because it aligns with how EDI transactions, FHIR resources, and CMS reporting specifications already organize data. For a complete guide with examples across all five healthcare domains, see ISO-11179 Naming Convention Complete Guide.
mdatool Team
The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.
More in Data Governance
Healthcare Data Contracts: How to Enforce Schema Standards Across Teams
A data contract is a formal agreement between the team that produces data and the teams that consume it — specifying schema, quality rules, SLAs, and ownership. In healthcare, where a schema change in the claims pipeline can break downstream HEDIS calculations, data contracts are a stability mechanism, not a formality.
Read moreSOC 2 Type II for Healthcare Data Platforms: What Engineers Need to Know
SOC 2 Type II is increasingly a vendor requirement and a customer expectation for healthcare data platforms. Here is what engineers need to implement — beyond what the auditors tell you.
Read more21st Century Cures Act: Data Architecture Requirements for Health IT Teams
The 21st Century Cures Act is not just a compliance checkbox — it mandates specific technical capabilities around open APIs, information blocking prohibition, and patient data access. Here is what your data architecture must deliver.
Read moreReady to improve your data architecture?
Free tools for DDL conversion, SQL analysis, naming standards, and more.
Get weekly healthcare data engineering tips
Practical guides on data modeling, SQL standards, and healthcare domain conventions — straight to your inbox.
No spam. Unsubscribe any time.