mdatool
LibraryBlogPricing
mdatool
mdatool

Healthcare data architecture platform for data engineers, architects, and analysts building modern health systems.

HIPAA-AlignedEnterprise Ready

Tools

  • SQL Linter
  • DDL Converter
  • Bulk Sanitizer
  • Naming Auditor
  • Name Generator
  • AI Data Modeling
  • HCC Calculator

Library

  • Glossary
  • Guides
  • Blog

Company

  • About
  • Contact
  • Pricing

Account

  • Sign Up Free
  • Sign In
  • Upgrade to Pro
  • Dashboard

Legal

  • Privacy Policy
  • Terms of Service

© 2026 mdatool. All rights reserved.

Built for healthcare data engineers & architects.

BlogData EngineeringHow to Write a DDL Script for Healthcare Data Warehouses: Best Practices and Examples
Data Engineering

How to Write a DDL Script for Healthcare Data Warehouses: Best Practices and Examples

A DDL script is more than CREATE TABLE. In a healthcare data warehouse, it is your schema contract — defining data types, constraints, and indexes that determine whether your claims pipeline loads clean or fails silently.

mdatool Team·April 29, 2026·11 min read
DDLdata warehouseCREATE TABLEhealthcare dataschema design

Introduction

A poorly written DDL script is a slow-motion data quality problem. The VARCHAR(255) on an NPI column allows a 50-character typo to load without error. The missing NOT NULL on service_date lets null dates propagate into HEDIS measure calculations. The absent unique constraint on member_id lets a deduplication bug create duplicate member records that survive for months before anyone notices.

🏥

Free Tool

Look up any NPI number instantly →

In healthcare data warehouses, DDL is a correctness control. The schema you define determines what the database will and will not accept — which means schema decisions made in 20 minutes at the start of a project enforce (or fail to enforce) data quality for the life of the warehouse.

This guide covers practical DDL patterns for healthcare data warehouses: data type selection, constraint design, indexing strategy, PHI field handling, and the structural conventions that make a schema maintainable at scale.


Data Type Selection for Healthcare Fields

Choosing the right data type for each column is the highest-leverage decision in your DDL. Wrong data types cause silent truncation, failed joins, and performance problems that are expensive to fix after data is loaded.

Dates and Times

Healthcare data has three distinct temporal concepts that require different types:

-- Service dates: DATE is almost always correct
-- Never use VARCHAR for a date column
service_date            DATE         NOT NULL,
discharge_date          DATE,

-- Timestamps for audit fields and real-time events
created_datetime        TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_datetime        TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
🔄

Free Tool

Convert this DDL to Snowflake, BigQuery, or PostgreSQL instantly →

-- Fiscal and reporting periods: store as DATE, not VARCHAR('2026Q1') reporting_period_date DATE NOT NULL,

Avoid storing dates as VARCHAR or INTEGER (YYYYMMDD format). Dates stored as strings do not sort correctly, do not support date arithmetic, and accept invalid values like `'2026-02-30'`. ### Monetary Amounts ```sql -- Use DECIMAL/NUMERIC for monetary amounts — never FLOAT -- FLOAT has rounding errors that accumulate in aggregate queries paid_amount DECIMAL(12, 2) NOT NULL DEFAULT 0.00, billed_amount DECIMAL(12, 2) NOT NULL DEFAULT 0.00, allowed_amount DECIMAL(12, 2), copay_amount DECIMAL(10, 2) NOT NULL DEFAULT 0.00,

DECIMAL(12, 2) gives you up to $9.99 billion with two decimal places — sufficient for claim-level amounts. For capitation and actuarial calculations that require more precision, use DECIMAL(18, 6).

Code Fields

Healthcare code fields have well-defined formats. Encode the format in the data type:

-- NPI: exactly 10 digits, stored as CHAR not VARCHAR
npi_id                  CHAR(10)         NOT NULL,

-- [ICD-10](/terms/icd-10) codes: up to 7 characters (e.g., 'A15.0', 'Z23')
diagnosis_code          VARCHAR(10)      NOT NULL,

-- [CPT](/terms/cpt) codes: 5 characters (numeric or alphanumeric)
procedure_code          VARCHAR(5),

-- [NDC](/terms/ndc) codes: 11 digits
ndc_code                CHAR(11),

-- Plan IDs and member IDs vary by payer — use VARCHAR with a reasonable cap
member_id               VARCHAR(50)      NOT NULL,
plan_id                 VARCHAR(20)      NOT NULL,

Use CHAR for fixed-length codes (NPI, NDC). It communicates the expected length and in some platforms improves storage efficiency for fixed-width values.


Constraint Design

Constraints are the most underused feature in healthcare warehouse DDL. Many teams write CREATE TABLE with no constraints other than a PRIMARY KEY. This is a mistake.

Primary Keys

-- Surrogate key pattern — preferred for warehouse tables
CREATE TABLE fct_claims (
    claim_key           BIGINT           NOT NULL GENERATED ALWAYS AS IDENTITY,
    claim_id            VARCHAR(50)      NOT NULL,
    claim_line_number   SMALLINT         NOT NULL,

    CONSTRAINT pk_fct_claims PRIMARY KEY (claim_key),
    CONSTRAINT uq_fct_claims_natural_key UNIQUE (claim_id, claim_line_number)
);

Name every constraint. A named constraint appears in error messages and migration scripts by name — making debugging and schema diffs readable.

NOT NULL Constraints

Apply NOT NULL to every column that should not be null. In healthcare data, that is most columns:

CREATE TABLE dim_member (
    member_key          BIGINT           NOT NULL GENERATED ALWAYS AS IDENTITY,
    member_id           VARCHAR(50)      NOT NULL,
    member_first_name   VARCHAR(100)     NOT NULL,
    member_last_name    VARCHAR(100)     NOT NULL,
    member_dob_date     DATE             NOT NULL,
    member_gender_code  CHAR(1)          NOT NULL,
    plan_id             VARCHAR(20)      NOT NULL,
    effective_date      DATE             NOT NULL,
    termination_date    DATE,                         -- NULL means currently active
    is_deleted_flag     BOOLEAN          NOT NULL DEFAULT FALSE,
    created_datetime    TIMESTAMP        NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_datetime    TIMESTAMP        NOT NULL DEFAULT CURRENT_TIMESTAMP,

    CONSTRAINT pk_dim_member PRIMARY KEY (member_key),
    CONSTRAINT uq_dim_member_id UNIQUE (member_id)
);

Check Constraints

Check constraints enforce business rules at the database level:

ALTER TABLE fct_claims
    ADD CONSTRAINT chk_claims_paid_amount
        CHECK (paid_amount >= 0),
    ADD CONSTRAINT chk_claims_billed_ge_paid
        CHECK (billed_amount >= paid_amount OR billed_amount IS NULL),
    ADD CONSTRAINT chk_claims_service_date_range
        CHECK (service_date >= '2000-01-01'),
    ADD CONSTRAINT chk_claims_gender
        CHECK (member_gender_code IN ('M', 'F', 'U'));

Foreign Keys

Foreign keys are often omitted in analytical warehouses for performance reasons — but they should be present in your DDL as documentation and enforced in staging layers:

ALTER TABLE fct_claims
    ADD CONSTRAINT fk_claims_member
        FOREIGN KEY (member_key) REFERENCES dim_member (member_key),
    ADD CONSTRAINT fk_claims_provider
        FOREIGN KEY (rendering_provider_key) REFERENCES dim_provider (provider_key);

If your warehouse does not enforce foreign keys at runtime (Snowflake and Redshift support declarative FK definitions without enforcement), define them anyway. ERD tools, lineage platforms, and documentation generators read them.


Indexing Strategy for Healthcare Queries

Healthcare analytical queries have predictable access patterns. Your index strategy should reflect them.

Claims Tables

Claims tables are the most frequently queried objects in a payer warehouse. Index for the three most common access patterns:

-- Primary filter: member + date range
CREATE INDEX idx_claims_member_service_date
    ON fct_claims (member_key, service_date);

-- Secondary filter: provider + date range
CREATE INDEX idx_claims_provider_service_date
    ON fct_claims (rendering_provider_key, service_date);

-- Lookups by claim ID
CREATE INDEX idx_claims_claim_id
    ON fct_claims (claim_id);

-- Diagnosis code lookups for clinical analytics and [HCC](/terms/hcc)
CREATE INDEX idx_claims_diagnosis_code
    ON fct_claims (primary_diagnosis_code);

Eligibility Tables

Eligibility queries almost always filter on a date range to find active coverage:

-- Point-in-time eligibility lookup
CREATE INDEX idx_eligibility_member_dates
    ON fct_member_eligibility (member_id, effective_date, termination_date);

Code Reference Tables

Reference tables (ICD-10, CPT, NDC) are small but queried on every claim load. Index the code column:

CREATE UNIQUE INDEX uq_ref_icd10_code ON ref_icd10 (icd10_code);
CREATE UNIQUE INDEX uq_ref_cpt_code ON ref_cpt (cpt_code);

Handling PHI Fields in DDL

PHI columns require explicit handling at the DDL level — not just in application code.

Column-Level Comments

Document PHI fields in your DDL with column comments that your governance tooling can read:

-- Snowflake column comment syntax
ALTER TABLE dim_member MODIFY COLUMN member_ssn_phi
    COMMENT 'PHI: Social Security Number. Masked for all roles except compliance_admin.';

ALTER TABLE dim_member MODIFY COLUMN member_dob_date
    COMMENT 'PHI: Date of Birth. Generalized to birth year for roles below analyst_phi tier.';

Data Classification Tags

In platforms that support it, tag PHI columns at the DDL stage:

-- Snowflake object tagging
ALTER TABLE dim_member MODIFY COLUMN member_ssn_phi
    SET TAG governance.phi_classification = 'SSN';

ALTER TABLE dim_member MODIFY COLUMN member_dob_date
    SET TAG governance.phi_classification = 'DOB';

This approach means masking policies can target the tag rather than hardcoding column names — surviving column renames.


DDL for Slowly Changing Dimensions in Healthcare

Member enrollment, provider rosters, and plan configurations change over time. Use Type 2 SCD patterns:

CREATE TABLE dim_member_scd2 (
    member_key              BIGINT       NOT NULL GENERATED ALWAYS AS IDENTITY,
    member_id               VARCHAR(50)  NOT NULL,
    member_first_name       VARCHAR(100) NOT NULL,
    member_last_name        VARCHAR(100) NOT NULL,
    member_dob_date         DATE         NOT NULL,
    plan_id                 VARCHAR(20)  NOT NULL,

    -- SCD2 control columns
    effective_date          DATE         NOT NULL,
    expiration_date         DATE         NOT NULL DEFAULT '9999-12-31',
    is_current_flag         BOOLEAN      NOT NULL DEFAULT TRUE,
    source_hash             CHAR(64),

    created_datetime        TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,

    CONSTRAINT pk_dim_member_scd2 PRIMARY KEY (member_key),
    CONSTRAINT uq_dim_member_scd2_natural UNIQUE (member_id, effective_date)
);

-- Index for point-in-time lookups
CREATE INDEX idx_member_scd2_current
    ON dim_member_scd2 (member_id, is_current_flag);

Versioning DDL Scripts

DDL scripts need version control the same way application code does. Structure your migration files so they are replayable in order:

migrations/ V001__create_dim_member.sql V002__create_fct_claims.sql V003__add_phi_tags_dim_member.sql V004__add_index_claims_provider.sql V005__alter_claims_add_drg_code.sql

Each migration file should be idempotent where possible — using CREATE TABLE IF NOT EXISTS, CREATE INDEX IF NOT EXISTS, and ALTER TABLE ... ADD COLUMN IF NOT EXISTS.


Frequently Asked Questions

Should warehouse tables use surrogate keys or natural keys as primary keys?

Use surrogate keys (auto-incrementing integers or GUIDs) as primary keys in analytical layer tables. Natural keys belong in a separate unique constraint. This pattern keeps join performance predictable and handles natural key changes (like NPI deactivations) without cascading updates.

How long should VARCHAR columns be?

Define VARCHAR lengths based on the domain, not a default of 255. Unconstrained VARCHAR(255) accepts invalid data that your application would reject. For code fields, use CHAR(n) for fixed-length codes and VARCHAR with a realistic maximum for variable-length fields.

When should I use BOOLEAN vs CHAR(1) for flag columns?

Use BOOLEAN where your platform supports it (Snowflake, PostgreSQL, Databricks). Use CHAR(1) with a check constraint on platforms where BOOLEAN has limited support or where your ETL source uses character flags.


Operationalizing DDL Quality with mdatool

For healthcare data engineering teams writing and reviewing DDL at scale, mdatool provides tooling at every stage. The mdatool DDL Converter translates DDL across warehouse dialects — converting Snowflake CREATE TABLE scripts to BigQuery, PostgreSQL, or Redshift syntax while preserving data types, constraints, and index definitions. The mdatool Naming Auditor validates every column name in a DDL script against your team's representation term standards before the schema reaches production. The mdatool SQL Linter catches DDL patterns that create downstream query problems — missing NOT NULL constraints, implicit data type coercions, and SELECT star in view definitions. The mdatool Schema Diff generates a precise diff between two DDL versions, making schema review and migration planning tractable at scale.

M

mdatool Team

The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.

More in Data Engineering

SQL vs Python for Healthcare Data Analysis, Load, and Storage: A Practical Guide for Data Engineers

SQL and Python are not competitors in healthcare data engineering — they are partners with clearly different responsibilities. SQL owns the warehouse: aggregations, HEDIS queries, claims analysis, and regulatory reporting. Python owns the pipeline: FHIR ingestion, PHI masking, ML model training, and clinical NLP. This guide shows exactly where each language wins, with real healthcare code examples for both.

Read more

Top SQL Linters for Healthcare Data Teams in 2026

SQL linters catch naming violations, style inconsistencies, and structural anti-patterns before they reach production. For healthcare data teams writing claims queries, FHIR pipelines, and risk adjustment models, we ranked the best SQL linters available in 2026.

Read more

dbt vs Raw SQL for Healthcare Data Pipelines

Healthcare data teams face a critical choice: write raw SQL directly against source systems, or adopt dbt as a transformation layer. This guide breaks down both approaches — auditability, testability, maintainability — so you can make the right call.

Read more

Free Tools

Free DDL Converter

Translate SQL schemas between Snowflake, BigQuery, Oracle, and SQL Server.

Try it free

Ready to improve your data architecture?

Free tools for DDL conversion, SQL analysis, naming standards, and more.

Get Started Free

On this page

  • Introduction
  • Data Type Selection for Healthcare Fields
  • Dates and Times
  • Monetary Amounts
  • Code Fields
  • Constraint Design
  • Primary Keys
  • NOT NULL Constraints
  • Check Constraints
  • Foreign Keys
  • Indexing Strategy for Healthcare Queries
  • Claims Tables
  • Eligibility Tables
  • Code Reference Tables
  • Handling PHI Fields in DDL
  • Column-Level Comments
  • Data Classification Tags
  • DDL for Slowly Changing Dimensions in Healthcare
  • Versioning DDL Scripts
  • Frequently Asked Questions
  • Should warehouse tables use surrogate keys or natural keys as primary keys?
  • How long should VARCHAR columns be?
  • When should I use BOOLEAN vs CHAR(1) for flag columns?
  • Operationalizing DDL Quality with mdatool

Share

Share on XShare on LinkedIn

Engineering Tools

Convert DDL, lint SQL, and audit naming conventions — free.

Explore Tools