Healthcare claims data is the backbone of payer analytics. Every time a patient receives care — a physician visit, a hospital stay, a prescription fill — a claim is generated and submitted to the insurance payer for reimbursement. For health plans, PBMs, and provider organizations, claims data is the primary source of truth for financial reporting, quality measurement, risk adjustment, and population health analytics.
This guide covers everything a healthcare data architect, data modeler, or data engineer needs to know about claims data — from the core data elements and EDI transaction standards to production-ready schema design for Snowflake, Databricks, and BigQuery.
What Is Healthcare Claims Data?
A healthcare claim is a formal request for reimbursement submitted by a healthcare provider to an insurance payer after delivering covered services to a member. Claims data is the structured, coded representation of that request — capturing who provided the service, who received it, what was done, when and where it happened, and how much was charged.
Claims data is generated across three claim types:
Professional claims (CMS-1500 / EDI 837P) are submitted by physicians, nurse practitioners, physician assistants, and other professional providers for outpatient and office-based services. Professional claims identify the rendering provider by NPI, describe services using CPT and HCPCS procedure codes, report diagnoses using ICD-10-CM codes, and specify the place of service.
Institutional claims (UB-04 / EDI 837I) are submitted by hospitals, skilled nursing facilities, ambulatory surgery centers, and other facility providers. Institutional claims use revenue codes to identify service categories, report DRG assignments for inpatient stays, and include detailed service line information for each billable item.
Pharmacy claims (NCPDP / EDI 837D) are submitted by retail pharmacies, mail-order pharmacies, and specialty pharmacies for prescription drug dispensing. Pharmacy claims identify drugs by National Drug Code, report days supply and quantity dispensed, and capture formulary tier and prior authorization information.
Core Claims Data Elements
Understanding the key data elements in claims data is essential before designing any claims data warehouse. Here are the most important fields every healthcare data team works with:
The claim identifier (clm_id) is the unique key assigned to each claim transaction. It links all service lines within a claim and serves as the primary join key across claims processing tables.
The claim type code (clm_typ_cd) distinguishes professional, institutional, and pharmacy claims and drives which adjudication rules and fee schedules apply.
Diagnosis codes (diag_cd) captured as ICD-10-CM codes identify the medical conditions that justify the services billed. Claims may carry up to 25 diagnosis codes, with the principal diagnosis in the first position driving DRG assignment for inpatient claims.
Procedure codes (proc_cd) identify the specific services performed using CPT codes for professional services and HCPCS Level II codes for supplies, drugs, and non-physician services.
The claim paid amount (clm_pd_amt) is the total dollar amount the payer reimbursed for the claim after applying contractual adjustments, member cost-sharing, and any other reductions from billed charges.
The claim billed amount (clm_bill_amt) is the provider standard charge before any adjustments, representing the gross charge submitted to the payer.
The claim allowed amount (clm_alwd_amt) is the maximum the payer will pay based on the contracted fee schedule, serving as the basis for member cost-sharing calculations.
The claim status code (clm_sts_cd) indicates whether the claim was paid, denied, pending, or adjusted and is the primary field for tracking adjudication outcomes.
The service from date (svc_from_dt) and service to date (svc_to_dt) identify when care was delivered and anchor all date-based analytics including eligibility verification and quality measure assignment.
The rendering provider NPI (prvdr_rndrg_npi) identifies the individual clinician who delivered the service, enabling provider-level performance measurement.
The member identifier (mbr_id) links the claim to the member enrollment record, enabling longitudinal analytics across all care settings for a given member.
Claims Data Standards and EDI Transactions
Healthcare claims data flows through standardized electronic data interchange formats mandated under HIPAA. Every healthcare data engineer working with claims data needs to understand these transaction standards:
EDI 837 is the HIPAA standard electronic claims transaction. The 837P is used for professional claims, 837I for institutional claims, and 837D for dental claims. Understanding 837 loop and segment structure is essential for parsing raw claims feeds from clearinghouses.
EDI 835 is the electronic remittance advice transaction that communicates how a claim was paid or denied. The 835 contains Claim Adjustment Reason Codes and Remittance Advice Remark Codes that explain payment decisions. Healthcare data teams process 835 transactions to populate denial reason fields and calculate contractual adjustment amounts.
ICD-10-CM is the diagnosis code standard used on all claim types since October 2015, replacing ICD-9. With over 70,000 codes organized in a hierarchical structure, ICD-10-CM provides the clinical specificity needed for risk adjustment, quality measurement, and population health analytics. Search our ICD-10 code library to find diagnosis codes for any condition.
CPT codes maintained by the American Medical Association identify physician procedures and evaluation and management services. CPT codes drive professional claim reimbursement under the Medicare Physician Fee Schedule and commercial fee schedules.
Revenue codes are four-digit codes used on institutional claims to classify service categories including room and board, operating room, laboratory, radiology, and pharmacy services.
DRG codes are the Medicare inpatient payment classification system. Each inpatient admission is assigned to one of over 750 MS-DRGs based on principal diagnosis, secondary diagnoses, procedures, age, and discharge status, with a fixed payment weight determining reimbursement.
Claims Data Model Design
Designing an effective claims data warehouse requires careful consideration of grain, dimensionality, and the analytical use cases the schema needs to support. Here is a production-ready Star Schema for healthcare claims analytics.
Fact Table: FACT_CLAIM_HEADER
The claim header fact table captures one row per claim with summary financial amounts and key dimension foreign keys:
-- Snowflake DDL — generated with mdatool AI Data Modeling
CREATE TABLE FACT_CLAIM_HEADER (
CLM_ID VARCHAR(50) NOT NULL,
MBR_KEY INTEGER NOT NULL,
PRVDR_KEY INTEGER NOT NULL,
PLAN_KEY INTEGER NOT NULL,
SVC_FROM_DT_KEY INTEGER NOT NULL,
SVC_TO_DT_KEY INTEGER NOT NULL,
CLM_TYP_CD VARCHAR(10),
CLM_STAT_CD VARCHAR(10),
TOT_CHRG_AMT DECIMAL(18,2),
TOT_ALWD_AMT DECIMAL(18,2),
TOT_PD_AMT DECIMAL(18,2),
TOT_MBR_LIAB_AMT DECIMAL(18,2),
TOT_COINS_AMT DECIMAL(18,2),
TOT_COPAY_AMT DECIMAL(18,2),
TOT_DED_AMT DECIMAL(18,2),
DIAG_CD_1 VARCHAR(10),
DIAG_CD_2 VARCHAR(10),
DIAG_CD_3 VARCHAR(10),
DRG_CD VARCHAR(10),
ADMIT_DT DATE,
DSCH_DT DATE,
DSCH_DISP_CD VARCHAR(5),
LOAD_DT TIMESTAMP_NTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT PK_FACT_CLAIM_HEADER PRIMARY KEY (CLM_ID)
);
Fact Table: FACT_CLAIM_LINE
The claim line fact table captures one row per service line with procedure-level detail:
CREATE TABLE FACT_CLAIM_LINE (
CLM_ID VARCHAR(50) NOT NULL,
CLM_LINE_NBR SMALLINT NOT NULL,
PRVDR_KEY INTEGER NOT NULL,
SVC_DT_KEY INTEGER NOT NULL,
PROC_CD VARCHAR(10),
PROC_MOD_CD_1 VARCHAR(2),
PROC_MOD_CD_2 VARCHAR(2),
REV_CD VARCHAR(4),
POS_CD VARCHAR(2),
DIAG_CD VARCHAR(10),
UNITS_QTY DECIMAL(10,3),
CHRG_AMT DECIMAL(18,2),
ALWD_AMT DECIMAL(18,2),
PD_AMT DECIMAL(18,2),
DENIAL_RSN_CD VARCHAR(10),
LOAD_DT TIMESTAMP_NTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT PK_FACT_CLAIM_LINE
PRIMARY KEY (CLM_ID, CLM_LINE_NBR),
CONSTRAINT FK_CLAIM_LINE_HEADER
FOREIGN KEY (CLM_ID) REFERENCES FACT_CLAIM_HEADER(CLM_ID)
);
Dimension Tables
CREATE TABLE DIM_MEMBER (
MBR_KEY INTEGER NOT NULL,
MBR_ID VARCHAR(50) NOT NULL,
MBR_FIRST_NM VARCHAR(100),
MBR_LAST_NM VARCHAR(100),
BIRTH_DT DATE,
SEX_CD CHAR(1),
STATE_CD CHAR(2),
ZIP_CD VARCHAR(10),
DUAL_ELIG_IND CHAR(1),
EFF_START_DT DATE NOT NULL,
EFF_END_DT DATE,
CURR_ROW_IND BOOLEAN NOT NULL DEFAULT TRUE,
CONSTRAINT PK_DIM_MEMBER PRIMARY KEY (MBR_KEY)
);
CREATE TABLE DIM_PROVIDER (
PRVDR_KEY INTEGER NOT NULL,
PRVDR_NPI VARCHAR(10),
PRVDR_FIRST_NM VARCHAR(100),
PRVDR_LAST_NM VARCHAR(100),
PRVDR_ORG_NM VARCHAR(255),
PRVDR_TYP_CD VARCHAR(20),
PRVDR_SPCLTY_CD VARCHAR(10),
PRVDR_STATE_CD CHAR(2),
PRVDR_ZIP_CD VARCHAR(10),
EFF_START_DT DATE NOT NULL,
EFF_END_DT DATE,
CURR_ROW_IND BOOLEAN NOT NULL DEFAULT TRUE,
CONSTRAINT PK_DIM_PROVIDER PRIMARY KEY (PRVDR_KEY)
);
CREATE TABLE DIM_DATE (
DT_KEY INTEGER NOT NULL,
FULL_DT DATE NOT NULL,
YR_NBR SMALLINT,
QTR_NBR SMALLINT,
MO_NBR SMALLINT,
MO_NM VARCHAR(20),
WK_NBR SMALLINT,
DAY_OF_WK_NBR SMALLINT,
DAY_OF_WK_NM VARCHAR(20),
HLDY_IND BOOLEAN,
CONSTRAINT PK_DIM_DATE PRIMARY KEY (DT_KEY)
);
Generate this entire schema automatically using the mdatool AI Data Modeling tool — select Claims and Adjudication as your domain, Star Schema as your architecture, and your target platform to get production-ready DDL in 30 seconds.
Common Claims Analytics Use Cases
Healthcare data teams use claims data for a wide range of analytical workflows:
Member Total Cost of Care aggregates all claim payments across medical, pharmacy, and behavioral health to calculate the total cost of care for attributed member populations, used in value-based care program performance measurement and actuarial modeling.
HEDIS Quality Measure Calculation uses claims data to identify eligible members and measure care gap events across dozens of preventive care and chronic disease management measures. Claims-based HEDIS measures include colorectal cancer screening, diabetes care, and follow-up after hospitalization.
HCC Risk Adjustment processes diagnosis codes from claims to identify Hierarchical Condition Category assignments that drive RAF score calculations for Medicare Advantage capitation payments. Use our HCC Calculator to estimate RAF scores from ICD-10 diagnosis codes.
Provider Performance Measurement aggregates claims data at the provider level to calculate cost efficiency, quality measure rates, and utilization patterns used in value-based contract performance evaluation and network management.
Fraud, Waste, and Abuse Detection applies statistical analysis and predictive modeling to claims data to identify billing anomalies including upcoding, unbundling, duplicate billing, and services inconsistent with the member documented clinical condition.
Pharmacy Utilization Management analyzes pharmacy claims by drug category, formulary tier, and therapeutic class to measure medication adherence, identify high-cost specialty drug utilization, and evaluate pharmacy benefit design effectiveness.
Claims Data Quality Considerations
Claims data quality directly affects every downstream analytics use case. Healthcare data teams must address these common data quality challenges:
Duplicate claims occur when the same service is billed multiple times across different claim versions or resubmissions. Deduplication logic must identify the definitive paid version of each claim by selecting the most recent paid or final adjudicated claim per service line.
Claim versions are created each time a claim is adjusted or reprocessed. Analytics pipelines must track claim version history and select the appropriate version for reporting — typically the final paid version for financial analytics and all versions for denial management analytics.
Provider identity resolution is challenging when the same provider bills under multiple NPIs, tax IDs, or group affiliations across different claim sources. Provider master data management processes must resolve these identities to enable accurate provider-level analytics.
Diagnosis code completeness varies significantly across providers and claim types. Claims-based risk adjustment analytics must account for providers who systematically under-document secondary diagnoses, producing artificially low HCC risk scores.
Date field inconsistencies between paid date, processed date, service date, and incurred date require careful attention when building time-series analytics. Always anchor financial analytics to paid date for cash basis reporting and service date for incurred basis reporting.
Claims Data Tools
mdatool provides several free tools specifically designed for healthcare claims data work:
- SQL Linter — Validate your claims SQL against healthcare naming standards and catch common query errors before they reach production
- DDL Converter — Convert claims schema DDL between Snowflake, BigQuery, Databricks, and other platforms instantly
- AI Data Modeling — Generate a complete claims data warehouse schema for any platform in 30 seconds
- ICD-10 Code Search — Search 70,000 plus ICD-10-CM diagnosis codes used in claims data
- HCC Calculator — Calculate Medicare Advantage RAF scores from ICD-10 diagnosis codes
- Data Model Canvas — Visualize your claims schema as an interactive ER diagram
Frequently Asked Questions
What is the difference between a claim header and a claim line? The claim header captures summary information for the entire claim including total amounts, member and provider identifiers, and principal diagnosis. The claim line captures service-level detail for each individual procedure or revenue code billed within the claim. A single claim header may have one to hundreds of associated claim lines depending on the claim type and services rendered.
What is the difference between billed, allowed, and paid amounts on a claim? The billed amount is the provider standard charge before any adjustments. The allowed amount is the maximum the payer will pay based on the contracted fee schedule. The paid amount is what the payer actually paid after applying member cost-sharing including deductibles, copays, and coinsurance. The difference between billed and allowed is the contractual adjustment, and the difference between allowed and paid is the member liability.
How are claim denials tracked in a claims data warehouse? Claim denials are identified by claim status codes indicating non-payment and denial reason codes from the EDI 835 remittance advice. The CARC code explains why the claim was not paid and the RARC code provides additional detail. Claims data warehouses typically maintain a denial fact table or denial flags on the claim line fact table to support denial analytics and denial management workflows.
What is the grain of a claims fact table? Most claims data warehouses maintain two fact tables at different grains — a claim header fact at the one-row-per-claim grain for summary financial analytics and a claim line fact at the one-row-per-service-line grain for procedure-level analytics. Some organizations also maintain a claim diagnosis fact table at the one-row-per-diagnosis grain to support HCC risk adjustment pipelines.
How do you handle retroactive claims adjustments in a claims data warehouse? Retroactive adjustments require tracking claim versions with effective dates and selecting the correct version for reporting. Common approaches include maintaining all claim versions with a version number and adjustment indicator, using a slowly changing dimension approach on the claim fact table, or maintaining a separate adjustment fact table that records changes from the original claim.
What EDI transaction sets are used for healthcare claims? The primary claims EDI transactions are 837P for professional claims, 837I for institutional claims, and 837D for dental claims. The 835 electronic remittance advice returns payment information. The 270/271 transaction set is used for eligibility inquiry and response. The 276/277 transaction set supports claim status inquiry and response.
How is claims data used in HCC risk adjustment? Diagnosis codes from claims are mapped to Hierarchical Condition Category codes using CMS mapping files. Each HCC has a coefficient representing its contribution to the RAF score. The sum of all HCC coefficients plus demographic adjustments for age, gender, and enrollment status produces the member RAF score used to calculate Medicare Advantage capitation payments. Use our HCC Calculator to explore RAF score calculations from ICD-10 codes.
mdatool Team
The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.
Related Guides
Key Terms in This Article
Free Tools
Free HCC RAF Score Calculator
Calculate CMS-HCC V28 RAF scores from ICD-10 codes. No account required.
Try it freeFree ICD-10 Code Search
Search 70,000+ ICD-10-CM diagnosis codes by description or code prefix.
Try it freeFree SQL Linter
Catch SQL bugs, performance issues, and naming violations before production.
Try it freeReady to improve your data architecture?
Free tools for DDL conversion, SQL analysis, naming standards, and more.
Get weekly healthcare data engineering tips
Practical guides on data modeling, SQL standards, and healthcare domain conventions — straight to your inbox.
No spam. Unsubscribe any time.