Introduction
A poorly written DDL script is a slow-motion data quality problem. The VARCHAR(255) on an NPI column allows a 50-character typo to load without error. The missing NOT NULL on service_date lets null dates propagate into HEDIS measure calculations. The absent unique constraint on member_id lets a deduplication bug create duplicate member records that survive for months before anyone notices.
Free Tool
Look up any NPI number instantly →
In healthcare data warehouses, DDL is a correctness control. The schema you define determines what the database will and will not accept — which means schema decisions made in 20 minutes at the start of a project enforce (or fail to enforce) data quality for the life of the warehouse.
This guide covers practical DDL patterns for healthcare data warehouses: data type selection, constraint design, indexing strategy, PHI field handling, and the structural conventions that make a schema maintainable at scale.
Data Type Selection for Healthcare Fields
Choosing the right data type for each column is the highest-leverage decision in your DDL. Wrong data types cause silent truncation, failed joins, and performance problems that are expensive to fix after data is loaded.
Dates and Times
Healthcare data has three distinct temporal concepts that require different types:
-- Service dates: DATE is almost always correct
-- Never use VARCHAR for a date column
service_date DATE NOT NULL,
discharge_date DATE,
-- Timestamps for audit fields and real-time events
created_datetime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_datetime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
🔄Free Tool
Convert this DDL to Snowflake, BigQuery, or PostgreSQL instantly →
-- Fiscal and reporting periods: store as DATE, not VARCHAR('2026Q1') reporting_period_date DATE NOT NULL,
Avoid storing dates as VARCHAR or INTEGER (YYYYMMDD format). Dates stored as strings do not sort correctly, do not support date arithmetic, and accept invalid values like `'2026-02-30'`.
### Monetary Amounts
```sql
-- Use DECIMAL/NUMERIC for monetary amounts — never FLOAT
-- FLOAT has rounding errors that accumulate in aggregate queries
paid_amount DECIMAL(12, 2) NOT NULL DEFAULT 0.00,
billed_amount DECIMAL(12, 2) NOT NULL DEFAULT 0.00,
allowed_amount DECIMAL(12, 2),
copay_amount DECIMAL(10, 2) NOT NULL DEFAULT 0.00,
DECIMAL(12, 2) gives you up to $9.99 billion with two decimal places — sufficient for claim-level amounts. For capitation and actuarial calculations that require more precision, use DECIMAL(18, 6).
Code Fields
Healthcare code fields have well-defined formats. Encode the format in the data type:
-- NPI: exactly 10 digits, stored as CHAR not VARCHAR
npi_id CHAR(10) NOT NULL,
-- [ICD-10](/terms/icd-10) codes: up to 7 characters (e.g., 'A15.0', 'Z23')
diagnosis_code VARCHAR(10) NOT NULL,
-- [CPT](/terms/cpt) codes: 5 characters (numeric or alphanumeric)
procedure_code VARCHAR(5),
-- [NDC](/terms/ndc) codes: 11 digits
ndc_code CHAR(11),
-- Plan IDs and member IDs vary by payer — use VARCHAR with a reasonable cap
member_id VARCHAR(50) NOT NULL,
plan_id VARCHAR(20) NOT NULL,
Use CHAR for fixed-length codes (NPI, NDC). It communicates the expected length and in some platforms improves storage efficiency for fixed-width values.
Constraint Design
Constraints are the most underused feature in healthcare warehouse DDL. Many teams write CREATE TABLE with no constraints other than a PRIMARY KEY. This is a mistake.
Primary Keys
-- Surrogate key pattern — preferred for warehouse tables
CREATE TABLE fct_claims (
claim_key BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
claim_id VARCHAR(50) NOT NULL,
claim_line_number SMALLINT NOT NULL,
CONSTRAINT pk_fct_claims PRIMARY KEY (claim_key),
CONSTRAINT uq_fct_claims_natural_key UNIQUE (claim_id, claim_line_number)
);
Name every constraint. A named constraint appears in error messages and migration scripts by name — making debugging and schema diffs readable.
NOT NULL Constraints
Apply NOT NULL to every column that should not be null. In healthcare data, that is most columns:
CREATE TABLE dim_member (
member_key BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
member_id VARCHAR(50) NOT NULL,
member_first_name VARCHAR(100) NOT NULL,
member_last_name VARCHAR(100) NOT NULL,
member_dob_date DATE NOT NULL,
member_gender_code CHAR(1) NOT NULL,
plan_id VARCHAR(20) NOT NULL,
effective_date DATE NOT NULL,
termination_date DATE, -- NULL means currently active
is_deleted_flag BOOLEAN NOT NULL DEFAULT FALSE,
created_datetime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_datetime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT pk_dim_member PRIMARY KEY (member_key),
CONSTRAINT uq_dim_member_id UNIQUE (member_id)
);
Check Constraints
Check constraints enforce business rules at the database level:
ALTER TABLE fct_claims
ADD CONSTRAINT chk_claims_paid_amount
CHECK (paid_amount >= 0),
ADD CONSTRAINT chk_claims_billed_ge_paid
CHECK (billed_amount >= paid_amount OR billed_amount IS NULL),
ADD CONSTRAINT chk_claims_service_date_range
CHECK (service_date >= '2000-01-01'),
ADD CONSTRAINT chk_claims_gender
CHECK (member_gender_code IN ('M', 'F', 'U'));
Foreign Keys
Foreign keys are often omitted in analytical warehouses for performance reasons — but they should be present in your DDL as documentation and enforced in staging layers:
ALTER TABLE fct_claims
ADD CONSTRAINT fk_claims_member
FOREIGN KEY (member_key) REFERENCES dim_member (member_key),
ADD CONSTRAINT fk_claims_provider
FOREIGN KEY (rendering_provider_key) REFERENCES dim_provider (provider_key);
If your warehouse does not enforce foreign keys at runtime (Snowflake and Redshift support declarative FK definitions without enforcement), define them anyway. ERD tools, lineage platforms, and documentation generators read them.
Indexing Strategy for Healthcare Queries
Healthcare analytical queries have predictable access patterns. Your index strategy should reflect them.
Claims Tables
Claims tables are the most frequently queried objects in a payer warehouse. Index for the three most common access patterns:
-- Primary filter: member + date range
CREATE INDEX idx_claims_member_service_date
ON fct_claims (member_key, service_date);
-- Secondary filter: provider + date range
CREATE INDEX idx_claims_provider_service_date
ON fct_claims (rendering_provider_key, service_date);
-- Lookups by claim ID
CREATE INDEX idx_claims_claim_id
ON fct_claims (claim_id);
-- Diagnosis code lookups for clinical analytics and [HCC](/terms/hcc)
CREATE INDEX idx_claims_diagnosis_code
ON fct_claims (primary_diagnosis_code);
Eligibility Tables
Eligibility queries almost always filter on a date range to find active coverage:
-- Point-in-time eligibility lookup
CREATE INDEX idx_eligibility_member_dates
ON fct_member_eligibility (member_id, effective_date, termination_date);
Code Reference Tables
Reference tables (ICD-10, CPT, NDC) are small but queried on every claim load. Index the code column:
CREATE UNIQUE INDEX uq_ref_icd10_code ON ref_icd10 (icd10_code);
CREATE UNIQUE INDEX uq_ref_cpt_code ON ref_cpt (cpt_code);
Handling PHI Fields in DDL
PHI columns require explicit handling at the DDL level — not just in application code.
Column-Level Comments
Document PHI fields in your DDL with column comments that your governance tooling can read:
-- Snowflake column comment syntax
ALTER TABLE dim_member MODIFY COLUMN member_ssn_phi
COMMENT 'PHI: Social Security Number. Masked for all roles except compliance_admin.';
ALTER TABLE dim_member MODIFY COLUMN member_dob_date
COMMENT 'PHI: Date of Birth. Generalized to birth year for roles below analyst_phi tier.';
Data Classification Tags
In platforms that support it, tag PHI columns at the DDL stage:
-- Snowflake object tagging
ALTER TABLE dim_member MODIFY COLUMN member_ssn_phi
SET TAG governance.phi_classification = 'SSN';
ALTER TABLE dim_member MODIFY COLUMN member_dob_date
SET TAG governance.phi_classification = 'DOB';
This approach means masking policies can target the tag rather than hardcoding column names — surviving column renames.
DDL for Slowly Changing Dimensions in Healthcare
Member enrollment, provider rosters, and plan configurations change over time. Use Type 2 SCD patterns:
CREATE TABLE dim_member_scd2 (
member_key BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY,
member_id VARCHAR(50) NOT NULL,
member_first_name VARCHAR(100) NOT NULL,
member_last_name VARCHAR(100) NOT NULL,
member_dob_date DATE NOT NULL,
plan_id VARCHAR(20) NOT NULL,
-- SCD2 control columns
effective_date DATE NOT NULL,
expiration_date DATE NOT NULL DEFAULT '9999-12-31',
is_current_flag BOOLEAN NOT NULL DEFAULT TRUE,
source_hash CHAR(64),
created_datetime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT pk_dim_member_scd2 PRIMARY KEY (member_key),
CONSTRAINT uq_dim_member_scd2_natural UNIQUE (member_id, effective_date)
);
-- Index for point-in-time lookups
CREATE INDEX idx_member_scd2_current
ON dim_member_scd2 (member_id, is_current_flag);
Versioning DDL Scripts
DDL scripts need version control the same way application code does. Structure your migration files so they are replayable in order:
migrations/
V001__create_dim_member.sql
V002__create_fct_claims.sql
V003__add_phi_tags_dim_member.sql
V004__add_index_claims_provider.sql
V005__alter_claims_add_drg_code.sql
Each migration file should be idempotent where possible — using CREATE TABLE IF NOT EXISTS, CREATE INDEX IF NOT EXISTS, and ALTER TABLE ... ADD COLUMN IF NOT EXISTS.
Frequently Asked Questions
Should warehouse tables use surrogate keys or natural keys as primary keys?
Use surrogate keys (auto-incrementing integers or GUIDs) as primary keys in analytical layer tables. Natural keys belong in a separate unique constraint. This pattern keeps join performance predictable and handles natural key changes (like NPI deactivations) without cascading updates.
How long should VARCHAR columns be?
Define VARCHAR lengths based on the domain, not a default of 255. Unconstrained VARCHAR(255) accepts invalid data that your application would reject. For code fields, use CHAR(n) for fixed-length codes and VARCHAR with a realistic maximum for variable-length fields.
When should I use BOOLEAN vs CHAR(1) for flag columns?
Use BOOLEAN where your platform supports it (Snowflake, PostgreSQL, Databricks). Use CHAR(1) with a check constraint on platforms where BOOLEAN has limited support or where your ETL source uses character flags.
Operationalizing DDL Quality with mdatool
For healthcare data engineering teams writing and reviewing DDL at scale, mdatool provides tooling at every stage. The mdatool DDL Converter translates DDL across warehouse dialects — converting Snowflake CREATE TABLE scripts to BigQuery, PostgreSQL, or Redshift syntax while preserving data types, constraints, and index definitions. The mdatool Naming Auditor validates every column name in a DDL script against your team's representation term standards before the schema reaches production. The mdatool SQL Linter catches DDL patterns that create downstream query problems — missing NOT NULL constraints, implicit data type coercions, and SELECT star in view definitions. The mdatool Schema Diff generates a precise diff between two DDL versions, making schema review and migration planning tractable at scale.
mdatool Team
The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.
More in Data Engineering
SQL vs Python for Healthcare Data Analysis, Load, and Storage: A Practical Guide for Data Engineers
SQL and Python are not competitors in healthcare data engineering — they are partners with clearly different responsibilities. SQL owns the warehouse: aggregations, HEDIS queries, claims analysis, and regulatory reporting. Python owns the pipeline: FHIR ingestion, PHI masking, ML model training, and clinical NLP. This guide shows exactly where each language wins, with real healthcare code examples for both.
Read moreTop SQL Linters for Healthcare Data Teams in 2026
SQL linters catch naming violations, style inconsistencies, and structural anti-patterns before they reach production. For healthcare data teams writing claims queries, FHIR pipelines, and risk adjustment models, we ranked the best SQL linters available in 2026.
Read moredbt vs Raw SQL for Healthcare Data Pipelines
Healthcare data teams face a critical choice: write raw SQL directly against source systems, or adopt dbt as a transformation layer. This guide breaks down both approaches — auditability, testability, maintainability — so you can make the right call.
Read moreReady to improve your data architecture?
Free tools for DDL conversion, SQL analysis, naming standards, and more.