
Healthcare Data Models in Snowflake: Architecture, Performance, and Cost Optimization

mdatool Team · March 23, 2026 · 10 min read
Tags: snowflake, healthcare, data warehouse, optimization, clustering, performance

Your healthcare data warehouse migrated to Snowflake six months ago. Storage costs are climbing 40% per month. Queries that should take seconds are timing out. Your claims reconciliation report—the one finance runs every morning—just cost $847 to execute.

Snowflake's promise was scalability and simplicity. What happened?

You lifted-and-shifted your on-prem data model without adapting it for Snowflake's architecture. Row-by-row processing patterns that worked in Oracle are killing performance in a columnar database. Clustering keys are missing. Time-travel is inflating storage. Your warehouse is running 24/7 on XXXL when it should auto-suspend.

Here's how to build healthcare data models optimized for Snowflake's architecture—models that leverage clustering, partitioning, materialized views, and cost controls while maintaining the temporal accuracy and audit trails healthcare requires.


The Snowflake Difference: Why Your Oracle Model Doesn't Work

Traditional healthcare data warehouses (Oracle, SQL Server, Teradata) optimize for:

  • Row-based storage
  • Indexes everywhere
  • Partitioning by date ranges
  • Manual scaling

Snowflake is fundamentally different:

  • Columnar storage (micro-partitions)
  • No indexes (automatic clustering)
  • Metadata-based pruning
  • Auto-scaling compute (virtual warehouses)
  • Time-travel and zero-copy cloning

The mistake: Treating Snowflake like Oracle with auto-scaling. It's not. You need to redesign for columnar storage, leverage clustering keys, and optimize for Snowflake's pricing model.
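Zero-copy cloning is worth a concrete example, since it has no on-prem equivalent. A minimal sketch (the database and table names here are illustrative):

```sql
-- Zero-copy clone: an instant, writable copy that shares storage with the source.
-- You are billed only for data that diverges after the clone is created.
CREATE DATABASE claims_dev CLONE claims_prod;

-- Clones work at the table level too, and can reference a time-travel point:
CREATE TABLE claim_header_dev CLONE claim_header
  AT (OFFSET => -3600);  -- the table's state one hour ago
```

The clone is usable immediately, which makes dev/test environments against full production claims data essentially free to spin up.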


Snowflake Architecture Principles for Healthcare

Principle 1: Separate Storage from Compute

In traditional databases, storage and compute are coupled. In Snowflake, they're independent.

Healthcare implication:

  • Store ALL historical claims (10 years+) without compute costs
  • Spin up warehouses only when running reports
  • Auto-suspend after 60 seconds of inactivity
  • Use different warehouse sizes for different workloads

-- ETL warehouse (loads, transformations)
CREATE WAREHOUSE ETL_WH
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Reporting warehouse (morning finance reports)
CREATE WAREHOUSE REPORTING_WH
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Analytics warehouse (ad-hoc queries)
CREATE WAREHOUSE ANALYTICS_WH
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

Cost impact: A Medium warehouse consumes 4 credits per hour whether or not queries are running; at roughly $2-$4 per credit, leaving it on 24/7 adds up to thousands of dollars a month. One that auto-suspends accrues credits only while it runs.
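If runaway spend is the concern, a resource monitor puts a hard ceiling on it. A sketch, with an illustrative monitor name and quota:

```sql
-- Cap monthly credit consumption; suspend the warehouse at the limit
CREATE RESOURCE MONITOR reporting_budget
  WITH CREDIT_QUOTA = 100        -- illustrative; size to your actual budget
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 80 PERCENT DO NOTIFY      -- warn before the limit
    ON 100 PERCENT DO SUSPEND;   -- stop new queries at the limit

ALTER WAREHOUSE REPORTING_WH SET RESOURCE_MONITOR = reporting_budget;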


Principle 2: Leverage Micro-Partitions, Not Traditional Partitioning

Snowflake automatically divides tables into micro-partitions (50-500 MB of uncompressed data each, stored compressed). You don't create partitions manually.

But you DO define clustering keys to organize micro-partitions for query performance.

Bad Healthcare Model (No Clustering)

CREATE TABLE claim_header (
  clm_id                 VARCHAR(50),
  mmbr_id                VARCHAR(50),
  service_from_dt        DATE,
  paid_dt                DATE,
  current_paid_amt       NUMBER(10,2),
  -- 50 more columns
);

Problem: Queries filtering by service_from_dt scan ALL micro-partitions because data isn't organized by date.

Good Healthcare Model (Clustered)

CREATE TABLE claim_header (
  clm_id                 VARCHAR(50),
  mmbr_id                VARCHAR(50),
  service_from_dt        DATE,
  paid_dt                DATE,
  current_paid_amt       NUMBER(10,2),
  -- 50 more columns
)
CLUSTER BY (service_from_dt, mmbr_id);

Why this works:

  • Most healthcare queries filter by date range (WHERE service_from_dt BETWEEN ...)
  • Secondary filter is often member (AND mmbr_id = '...')
  • Clustering organizes micro-partitions to prune non-matching partitions

Query performance (illustrative, for a table of 10,000 micro-partitions):

  • Without clustering: scans all 10,000 micro-partitions
  • With clustering: scans 23 micro-partitions (99.77% pruned)
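You can verify pruning on your own workload rather than trusting estimates: the account usage view records how many micro-partitions each query actually touched.

```sql
-- Check actual partition pruning for recent queries against claim_header
SELECT
  query_text,
  partitions_scanned,
  partitions_total,
  ROUND(100 * (1 - partitions_scanned / NULLIF(partitions_total, 0)), 2) AS pct_pruned
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%claim_header%'
  AND start_time >= DATEADD(day, -1, CURRENT_TIMESTAMP)
ORDER BY start_time DESC
LIMIT 10;
```

If pct_pruned stays low for your common date-range filters, the clustering key doesn't match your query patterns.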

Principle 3: Use Materialized Views for Expensive Aggregations

Healthcare reports often aggregate millions of claims. Don't recompute every time.

Expensive Query (Runs Every Morning)

-- Finance: Total paid by month (runs on 100M claims)
SELECT 
  DATE_TRUNC('month', paid_dt) as paid_month,
  COUNT(*) as claim_count,
  SUM(current_paid_amt) as total_paid
FROM claim_header
WHERE paid_dt >= '2020-01-01'
GROUP BY DATE_TRUNC('month', paid_dt);

Cost: $12 per run (scans 100M rows)

Optimized: Materialized View

CREATE MATERIALIZED VIEW claim_monthly_summary AS
SELECT 
  DATE_TRUNC('month', paid_dt) as paid_month,
  COUNT(*) as claim_count,
  SUM(current_paid_amt) as total_paid
FROM claim_header
WHERE paid_dt >= '2020-01-01'
GROUP BY DATE_TRUNC('month', paid_dt);

-- Query the materialized view instead
SELECT * FROM claim_monthly_summary
WHERE paid_month = '2024-03-01';

Cost: $0.02 per run (queries pre-aggregated data)
Maintenance: Auto-refreshed by Snowflake when the base table changes. Note that materialized views require Enterprise Edition and background refreshes consume credits, so reserve them for expensive, frequently run aggregations.


Healthcare-Specific Snowflake Patterns

Pattern 1: Temporal Tables with Clustering

Healthcare needs point-in-time queries. Effective-dated rows plus clustering make these fast; Snowflake's time-travel complements the pattern with short-term system history.

CREATE TABLE member_enrollment (
  enrollment_id          VARCHAR(50),
  mmbr_id                VARCHAR(50),
  plan_id                VARCHAR(50),
  coverage_start_dt      DATE,
  coverage_end_dt        DATE,
  
  effective_dt           DATE,
  end_dt                 DATE,
  is_current_flag        BOOLEAN,
  
  load_dttm              TIMESTAMP_NTZ,
  
  CONSTRAINT pk_enrollment PRIMARY KEY (enrollment_id)  -- informational only; Snowflake does not enforce PKs
)
CLUSTER BY (mmbr_id, effective_dt);

-- Query: Who was enrolled on March 15, 2024?
SELECT mmbr_id, plan_id
FROM member_enrollment
WHERE '2024-03-15' BETWEEN coverage_start_dt 
  AND COALESCE(coverage_end_dt, '9999-12-31')
  AND is_current_flag = TRUE;

Clustering on (mmbr_id, effective_dt) ensures:

  • Member-specific queries are fast
  • Date range queries prune effectively
  • Point-in-time queries scan minimal partitions
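Effective dating covers business time (who was covered on a date). For system time—what did the table itself contain yesterday?—Snowflake's time-travel can be queried directly, within the retention window:

```sql
-- System-time query: the table as it existed 24 hours ago
-- (only valid within the table's DATA_RETENTION_TIME_IN_DAYS window)
SELECT mmbr_id, plan_id
FROM member_enrollment
  AT (TIMESTAMP => DATEADD(hour, -24, CURRENT_TIMESTAMP()));
```

This is useful for auditing loads ("what did enrollment look like before last night's ETL?") without maintaining extra history tables.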

Pattern 2: Claim Header/Line with Variant for Semi-Structured Data

EDI 837 claims have 50+ optional fields. Don't create 100 nullable columns. Use VARIANT.

CREATE TABLE claim_header (
  clm_id                 VARCHAR(50),
  mmbr_id                VARCHAR(50),
  service_from_dt        DATE,
  
  -- Core financial fields (structured)
  current_billed_amt     NUMBER(10,2),
  current_paid_amt       NUMBER(10,2),
  
  -- Optional/variable fields (semi-structured)
  additional_data        VARIANT,
  
  load_dttm              TIMESTAMP_NTZ
)
CLUSTER BY (service_from_dt);

-- Insert with VARIANT data. PARSE_JSON is not allowed in a VALUES clause,
-- so use INSERT ... SELECT instead.
INSERT INTO claim_header
SELECT
  'CLM_001',
  'M123',
  '2024-03-15',
  500.00,
  450.00,
  PARSE_JSON('{
    "admission_type": "ELECTIVE",
    "discharge_status": "01",
    "drg_code": "470",
    "attending_physician_npi": "1234567890"
  }'),
  CURRENT_TIMESTAMP;

-- Query variant fields
SELECT 
  clm_id,
  current_paid_amt,
  additional_data:drg_code::STRING as drg_code,
  additional_data:attending_physician_npi::STRING as attending_npi
FROM claim_header
WHERE additional_data:admission_type::STRING = 'ELECTIVE';

Benefits:

  • Schema flexibility (add fields without ALTER TABLE)
  • Storage efficiency (only stores non-null values)
  • Fast JSON parsing (Snowflake optimizes variant queries)
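If the VARIANT holds arrays—claim service lines are the usual case—LATERAL FLATTEN unpacks them into rows. A sketch assuming a hypothetical service_lines array inside additional_data (that field name is not part of the schema above):

```sql
-- Explode a JSON array of service lines into one row per line
SELECT
  clm_id,
  line.value:procedure_cd::STRING        AS procedure_cd,
  line.value:line_paid_amt::NUMBER(10,2) AS line_paid_amt
FROM claim_header,
  LATERAL FLATTEN(input => additional_data:service_lines) line;
```

This gives you relational access to nested 837 detail without modeling every segment as its own table.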

Pattern 3: Incremental Models with Streams and Tasks

Don't reprocess all claims every night. Use Snowflake Streams to track changes.

-- Create stream to track new/changed claims
CREATE STREAM claim_changes_stream 
ON TABLE claim_header;

-- Create task to update materialized summary
CREATE TASK update_claim_summary
  WAREHOUSE = ETL_WH
  SCHEDULE = '60 MINUTE'
WHEN
  SYSTEM$STREAM_HAS_DATA('claim_changes_stream')
AS
  MERGE INTO claim_monthly_summary t
  USING (
    SELECT 
      DATE_TRUNC('month', paid_dt) as paid_month,
      COUNT(*) as claim_count,
      SUM(current_paid_amt) as total_paid
    FROM claim_changes_stream
    WHERE METADATA$ACTION = 'INSERT'  -- new rows only; updated claims also emit DELETE rows a full solution must subtract
    GROUP BY DATE_TRUNC('month', paid_dt)
  ) s
  ON t.paid_month = s.paid_month
  WHEN MATCHED THEN 
    UPDATE SET 
      claim_count = t.claim_count + s.claim_count,
      total_paid = t.total_paid + s.total_paid
  WHEN NOT MATCHED THEN
    INSERT (paid_month, claim_count, total_paid)
    VALUES (s.paid_month, s.claim_count, s.total_paid);

Cost savings: Process only changed data, not entire table.
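One operational gotcha: tasks are created in a suspended state, so nothing runs until you resume them. Enable the task, then confirm it is firing:

```sql
-- Tasks start suspended; enable this one explicitly
ALTER TASK update_claim_summary RESUME;

-- Check recent task runs and any errors
SELECT name, state, scheduled_time, completed_time, error_message
FROM TABLE(information_schema.task_history())
ORDER BY scheduled_time DESC
LIMIT 10;
```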


Cost Optimization Strategies

Strategy 1: Right-Size Warehouses by Workload

-- Heavy ETL loads (nightly)
CREATE WAREHOUSE ETL_WH 
  WAREHOUSE_SIZE = 'XLARGE'
  AUTO_SUSPEND = 60;

-- Morning reports (predictable)
CREATE WAREHOUSE REPORTING_WH 
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60;

-- Ad-hoc analytics (variable)
CREATE WAREHOUSE ANALYTICS_WH 
  WAREHOUSE_SIZE = 'XSMALL'
  MAX_CLUSTER_COUNT = 3  -- auto-scale out under concurrency (multi-cluster requires Enterprise Edition)
  AUTO_SUSPEND = 60;

Strategy 2: Time-Travel Retention

Default time-travel retention is 1 day on all editions; Enterprise Edition allows extending it up to 90 days. For most tables, 1 day is enough.

-- Transactional tables: 1 day is sufficient
CREATE TABLE claim_header (
  -- columns
) DATA_RETENTION_TIME_IN_DAYS = 1;

-- Critical compliance tables: Extended retention
CREATE TABLE member_hcc (
  -- columns  
) DATA_RETENTION_TIME_IN_DAYS = 90;

Storage cost: Time-travel retains whatever data is needed to reconstruct every version within the retention window, so for frequently updated tables, 90-day retention can cost many times more than 1 day.

Strategy 3: Optimize Clustering Maintenance

Clustering isn't free. Monitor and adjust.

-- Check clustering quality
SELECT 
  SYSTEM$CLUSTERING_INFORMATION('claim_header', '(service_from_dt)');

-- Automatic Clustering consumes credits in the background. To control cost,
-- suspend it:
ALTER TABLE claim_header 
  SUSPEND RECLUSTER;

-- Re-enable it when you want Snowflake to catch up (e.g., after bulk loads):
ALTER TABLE claim_header 
  RESUME RECLUSTER;
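Before suspending anything, see what Automatic Clustering is actually costing—the account usage view breaks credits down by table:

```sql
-- Automatic Clustering spend per table, last 30 days
SELECT
  table_name,
  SUM(credits_used)                           AS clustering_credits,
  SUM(num_bytes_reclustered) / POWER(1024, 3) AS gb_reclustered
FROM snowflake.account_usage.automatic_clustering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP)
GROUP BY table_name
ORDER BY clustering_credits DESC;
```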

Healthcare Data Model Architecture in Snowflake

Layer 1: Raw/Staging (Minimal Processing)

CREATE SCHEMA raw;

CREATE TABLE raw.edi_837_claims (
  file_id       VARCHAR(100),
  claim_json    VARIANT,
  received_dt   DATE,
  load_dttm     TIMESTAMP_NTZ
)
CLUSTER BY (received_dt);

Purpose: Land data as-is. Minimal transformation.
Warehouse: Small (loads are simple; storage is cheap regardless of warehouse size)
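Landing data into this layer is typically a COPY with minimal transformation. A sketch, assuming a stage named @edi_stage holding JSON-formatted 837 files (the stage and file format names are illustrative):

```sql
-- Parse files as JSON on load
CREATE FILE FORMAT IF NOT EXISTS edi_json_fmt TYPE = 'JSON';

-- Land each file as-is, tagging it with its source filename and load time
COPY INTO raw.edi_837_claims (file_id, claim_json, received_dt, load_dttm)
FROM (
  SELECT METADATA$FILENAME, $1, CURRENT_DATE, CURRENT_TIMESTAMP
  FROM @edi_stage
)
FILE_FORMAT = (FORMAT_NAME = 'edi_json_fmt');
```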


Layer 2: Conformed (Business Logic Applied)

CREATE SCHEMA conformed;

CREATE TABLE conformed.claim_header (
  clm_id                 VARCHAR(50),
  mmbr_id                VARCHAR(50),
  service_from_dt        DATE,
  current_paid_amt       NUMBER(10,2),
  -- conformed, validated fields
)
CLUSTER BY (service_from_dt, mmbr_id);

Purpose: Apply business rules, deduplicate, validate.
Warehouse: Medium-Large (transformation-heavy)


Layer 3: Presentation (Optimized for Queries)

CREATE SCHEMA presentation;

CREATE MATERIALIZED VIEW presentation.claim_monthly_summary AS
SELECT 
  DATE_TRUNC('month', service_from_dt) as service_month,
  mmbr_id,
  COUNT(*) as claim_count,
  SUM(current_paid_amt) as total_paid
FROM conformed.claim_header
GROUP BY DATE_TRUNC('month', service_from_dt), mmbr_id;

Purpose: Pre-aggregated for fast reporting.
Warehouse: Small (queries pre-computed data)


Real-World Example: Monthly Claims Report

Before Snowflake Optimization:

-- Scans 100M rows, runs for 12 minutes, costs $43
SELECT 
  DATE_TRUNC('month', ch.service_from_dt) as service_month,
  COUNT(DISTINCT ch.clm_id) as claim_count,
  COUNT(DISTINCT ch.mmbr_id) as member_count,
  SUM(ch.current_paid_amt) as total_paid,
  AVG(ch.current_paid_amt) as avg_paid
FROM claim_header ch
WHERE ch.service_from_dt >= '2023-01-01'
GROUP BY DATE_TRUNC('month', ch.service_from_dt)
ORDER BY service_month;

After Snowflake Optimization:

-- Uses materialized view, runs in 2 seconds, costs $0.03
SELECT * 
FROM presentation.claim_monthly_summary
WHERE service_month >= '2023-01-01'
ORDER BY service_month;

Performance: 360x faster
Cost: 1,433x cheaper


Monitoring and Governance

Query Cost Tracking

-- See recent query activity. Note: credits_used_cloud_services covers cloud
-- services only; warehouse compute credits are metered per warehouse, not per
-- query, so bytes_scanned is a better proxy for expensive queries.
SELECT 
  query_id,
  user_name,
  warehouse_name,
  execution_time,
  total_elapsed_time,
  bytes_scanned,
  credits_used_cloud_services
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
ORDER BY bytes_scanned DESC
LIMIT 100;
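The bulk of Snowflake spend is warehouse compute, which is metered per warehouse rather than per query. Track it separately:

```sql
-- Compute credits consumed per warehouse, last 7 days
SELECT
  warehouse_name,
  SUM(credits_used) AS credits_7d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
GROUP BY warehouse_name
ORDER BY credits_7d DESC;
```

If one warehouse dominates, that's where right-sizing or tighter auto-suspend pays off first.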

Storage Cost Monitoring

-- See table storage costs
SELECT 
  table_schema,
  table_name,
  active_bytes / 1024 / 1024 / 1024 as active_gb,
  time_travel_bytes / 1024 / 1024 / 1024 as time_travel_gb,
  failsafe_bytes / 1024 / 1024 / 1024 as failsafe_gb
FROM snowflake.account_usage.table_storage_metrics
WHERE table_schema = 'CONFORMED'
ORDER BY active_bytes DESC;

The Healthcare Snowflake Data Model Checklist

✅ Architecture

  • Separate warehouses by workload (ETL, Reporting, Analytics)
  • Auto-suspend after 60 seconds
  • Right-size for actual usage

✅ Clustering

  • Cluster by date fields (service_dt, paid_dt)
  • Secondary cluster by high-cardinality keys (mmbr_id, prvdr_id)
  • Monitor clustering quality

✅ Storage

  • Use VARIANT for semi-structured EDI data
  • Set time-travel retention based on compliance needs
  • Avoid over-clustering (costs credits)

✅ Performance

  • Materialized views for expensive aggregations
  • Streams + Tasks for incremental processing
  • Layer data: Raw → Conformed → Presentation

✅ Cost Control

  • Monitor query costs weekly
  • Identify expensive queries, optimize
  • Review warehouse sizes monthly

Conclusion

Snowflake isn't just "cloud Oracle." It's a fundamentally different architecture that rewards columnar thinking, separation of storage/compute, and materialized aggregations.

Build for Snowflake by:

  • Clustering on date + high-cardinality keys
  • Using materialized views for reports
  • Auto-suspending warehouses
  • Monitoring costs and optimizing continuously

Your healthcare data model should leverage Snowflake's strengths—not fight against them.

Optimize the architecture. Control the costs.

Ready to improve your data architecture?

Get started with mdatool's free tools for DDL conversion, SQL analysis, and more.
