Your healthcare data warehouse migrated to Snowflake six months ago. Storage costs are climbing 40% per month. Queries that should take seconds are timing out. Your claims reconciliation report—the one finance runs every morning—just cost $847 to execute.
Snowflake's promise was scalability and simplicity. What happened?
You lifted and shifted your on-prem [data model](/tools/modeling) without adapting it for Snowflake's architecture. Row-by-row processing patterns that worked in Oracle are killing performance in a columnar database. Clustering keys are missing. Time-travel is inflating storage. Your warehouse is running 24/7 on XXXL when it should auto-suspend.
Here's how to build healthcare data models optimized for Snowflake's architecture—models that leverage clustering, partitioning, materialized views, and cost controls while maintaining the temporal accuracy and audit trails healthcare requires.
## **The Snowflake Difference: Why Your Oracle Model Doesn't Work**
Traditional healthcare data warehouses (Oracle, SQL Server, Teradata) optimize for:
- Row-based storage
- Indexes everywhere
- Partitioning by date ranges
- Manual scaling
Snowflake is fundamentally different:
- Columnar storage (micro-partitions)
- No indexes (automatic clustering)
- Metadata-based pruning
- Auto-scaling compute (virtual warehouses)
- Time-travel and zero-copy cloning
**The mistake:** Treating Snowflake like Oracle with auto-scaling. It isn't. You need to redesign for columnar storage, leverage clustering keys, and optimize for Snowflake's pricing model.
## **Snowflake Architecture Principles for Healthcare**

### **Principle 1: Separate Storage from Compute**
In traditional databases, storage and compute are coupled. In Snowflake, they're independent.
**Healthcare implication:**
- Store ALL historical claims (10 years+) without compute costs
- Spin up warehouses only when running reports
- Auto-suspend after 60 seconds of inactivity
- Use different warehouse sizes for different workloads
```sql
-- ETL warehouse (loads, transformations)
CREATE WAREHOUSE ETL_WH
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE;
```
```sql
-- Reporting warehouse (morning finance reports)
CREATE WAREHOUSE REPORTING_WH
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE;

-- Analytics warehouse (ad-hoc queries)
CREATE WAREHOUSE ANALYTICS_WH
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE;
```
**Cost impact:** A Medium warehouse consumes 4 credits per hour (roughly $8-16/hour at typical per-credit prices). Always-on, that adds up to thousands per month. One that auto-suspends bills only for the seconds it actually runs.
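The savings are easy to estimate. A minimal Python sketch: the credits-per-hour ladder (XS=1, S=2, M=4, L=8, XL=16) is Snowflake's standard sizing, but the dollars-per-credit figure is an assumption that varies by edition and region.

```python
# Illustrative monthly cost: always-on vs. auto-suspended warehouse.
# DOLLARS_PER_CREDIT is an assumed rate -- check your Snowflake contract.
CREDITS_PER_HOUR = {"XSMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8, "XLARGE": 16}
DOLLARS_PER_CREDIT = 3.00

def monthly_cost(size, active_hours_per_day, days=30):
    """Cost of a warehouse that bills only while running."""
    return CREDITS_PER_HOUR[size] * DOLLARS_PER_CREDIT * active_hours_per_day * days

always_on = monthly_cost("MEDIUM", 24)      # never suspends
auto_suspended = monthly_cost("MEDIUM", 2)  # ~2 busy report hours per day

print(f"Always-on:    ${always_on:,.2f}/month")
print(f"Auto-suspend: ${auto_suspended:,.2f}/month")
```

Even at these assumed rates, the gap between 24 active hours and 2 active hours is an order of magnitude.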
---
### **Principle 2: Leverage Micro-Partitions, Not Traditional Partitioning**
Snowflake automatically divides tables into micro-partitions (50-500MB compressed). You don't create partitions manually.
**But you DO define clustering keys** to organize micro-partitions for query performance.
#### **Bad Healthcare Model (No Clustering)**
```sql
CREATE TABLE claim_header (
clm_id VARCHAR(50),
mmbr_id VARCHAR(50),
service_from_dt DATE,
paid_dt DATE,
current_paid_amt NUMBER(10,2),
-- 50 more columns
);
```

**Problem:** Queries filtering by `service_from_dt` scan ALL micro-partitions because the data isn't organized by date.
#### **Good Healthcare Model (Clustered)**

```sql
CREATE TABLE claim_header (
    clm_id VARCHAR(50),
    mmbr_id VARCHAR(50),
    service_from_dt DATE,
    paid_dt DATE,
    current_paid_amt NUMBER(10,2)
    -- 50 more columns
)
CLUSTER BY (service_from_dt, mmbr_id);
```
**Why this works:**

- Most healthcare queries filter by date range (`WHERE service_from_dt BETWEEN ...`)
- The secondary filter is often the member (`AND mmbr_id = '...'`)
- Clustering organizes micro-partitions so non-matching partitions can be pruned
Query performance:
- Without clustering: Scans 10,000 micro-partitions
- With clustering: Scans 23 micro-partitions (99.77% pruned)
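The pruning arithmetic above is worth sanity-checking. A one-function Python sketch:

```python
# Back-of-envelope pruning math from the numbers above.
def pruned_pct(total_partitions, scanned):
    """Percentage of micro-partitions the query never touched."""
    return 100.0 * (total_partitions - scanned) / total_partitions

print(f"{pruned_pct(10_000, 23):.2f}% pruned")  # 99.77% pruned
```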
### **Principle 3: Use Materialized Views for Expensive Aggregations**
Healthcare reports often aggregate millions of claims. Don't recompute every time.
#### **Expensive Query (Runs Every Morning)**

```sql
-- Finance: total paid by month (runs on 100M claims)
SELECT
    DATE_TRUNC('month', paid_dt) AS paid_month,
    COUNT(*) AS claim_count,
    SUM(current_paid_amt) AS total_paid
FROM claim_header
WHERE paid_dt >= '2020-01-01'
GROUP BY DATE_TRUNC('month', paid_dt);
```
**Cost:** $12 per run (scans 100M rows)

#### **Optimized: Materialized View**

```sql
CREATE MATERIALIZED VIEW claim_monthly_summary AS
SELECT
    DATE_TRUNC('month', paid_dt) AS paid_month,
    COUNT(*) AS claim_count,
    SUM(current_paid_amt) AS total_paid
FROM claim_header
WHERE paid_dt >= '2020-01-01'
GROUP BY DATE_TRUNC('month', paid_dt);

-- Query the materialized view instead
SELECT * FROM claim_monthly_summary
WHERE paid_month = '2024-03-01';
```
**Cost:** $0.02 per run (queries pre-aggregated data)
**Maintenance:** Auto-refreshed by Snowflake when the base table changes. The background refresh consumes credits, so reserve materialized views for genuinely expensive aggregations.
## **Healthcare-Specific Snowflake Patterns**

### **Pattern 1: Temporal Tables with Clustering**
Healthcare needs point-in-time queries. Snowflake's clustering + time-travel enables this.
```sql
CREATE TABLE member_enrollment (
    enrollment_id VARCHAR(50),
    mmbr_id VARCHAR(50),
    plan_id VARCHAR(50),
    coverage_start_dt DATE,
    coverage_end_dt DATE,
    effective_dt DATE,
    end_dt DATE,
    is_current_flag BOOLEAN,
    load_dttm TIMESTAMP_NTZ,
    -- NOTE: Snowflake does not enforce PRIMARY KEY; this is informational only
    CONSTRAINT pk_enrollment PRIMARY KEY (enrollment_id)
)
CLUSTER BY (mmbr_id, effective_dt);

-- Query: who was enrolled on March 15, 2024?
SELECT mmbr_id, plan_id
FROM member_enrollment
WHERE '2024-03-15' BETWEEN coverage_start_dt
      AND COALESCE(coverage_end_dt, '9999-12-31')
  AND is_current_flag = TRUE;
```
Clustering on (mmbr_id, effective_dt) ensures:
- Member-specific queries are fast
- Date range queries prune effectively
- Point-in-time queries scan minimal partitions
### **Pattern 2: Claim Header/Line with VARIANT for Semi-Structured Data**

[EDI 837](/terms/EDI%20837) claims have 50+ optional fields. Don't create 100 nullable columns. Use VARIANT.
```sql
CREATE TABLE claim_header (
    clm_id VARCHAR(50),
    mmbr_id VARCHAR(50),
    service_from_dt DATE,
    -- Core financial fields (structured)
    current_billed_amt NUMBER(10,2),
    current_paid_amt NUMBER(10,2),
    -- Optional/variable fields (semi-structured)
    additional_data VARIANT,
    load_dttm TIMESTAMP_NTZ
)
CLUSTER BY (service_from_dt);

-- Insert with VARIANT. PARSE_JSON is not allowed in a VALUES clause,
-- so use INSERT ... SELECT.
INSERT INTO claim_header
SELECT
    'CLM_001',
    'M123',
    '2024-03-15',
    500.00,
    450.00,
    PARSE_JSON('{
        "admission_type": "ELECTIVE",
        "discharge_status": "01",
        "drg_code": "470",
        "attending_physician_npi": "1234567890"
    }'),
    CURRENT_TIMESTAMP;

-- Query VARIANT fields
SELECT
    clm_id,
    current_paid_amt,
    additional_data:drg_code::STRING AS drg_code,
    additional_data:attending_physician_npi::STRING AS attending_npi
FROM claim_header
WHERE additional_data:admission_type::STRING = 'ELECTIVE';
```
Benefits:
- Schema flexibility (add fields without ALTER TABLE)
- Storage efficiency (only stores non-null values)
- Fast JSON parsing (Snowflake optimizes variant queries)
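Conceptually, VARIANT path access like `additional_data:drg_code::STRING` behaves like a nested lookup that yields NULL rather than erroring when a key is absent. A Python sketch of that semantics; `variant_get` is an illustrative helper, not a Snowflake API:

```python
import json

# A VARIANT column round-trips as JSON; missing keys come back as NULL.
claim_additional_data = json.loads(
    '{"admission_type": "ELECTIVE", "discharge_status": "01", "drg_code": "470"}'
)

def variant_get(doc, path, default=None):
    """Walk a dotted path; return default (NULL-like) on any missing key."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return default
        doc = doc[key]
    return doc

drg = variant_get(claim_additional_data, "drg_code")                 # "470"
npi = variant_get(claim_additional_data, "attending_physician_npi")  # absent -> None
```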
### **Pattern 3: Incremental Models with Streams and Tasks**
Don't reprocess all claims every night. Use Snowflake Streams to track changes.
```sql
-- Create stream to track new/changed claims
CREATE STREAM claim_changes_stream
    ON TABLE claim_header;

-- Create task to update the summary incrementally.
-- NOTE: the MERGE target must be a regular summary table; a true
-- materialized view is maintained by Snowflake and cannot be a DML target.
CREATE TASK update_claim_summary
    WAREHOUSE = ETL_WH
    SCHEDULE = '60 MINUTE'
WHEN
    SYSTEM$STREAM_HAS_DATA('claim_changes_stream')
AS
MERGE INTO claim_monthly_summary t
USING (
    SELECT
        DATE_TRUNC('month', paid_dt) AS paid_month,
        COUNT(*) AS claim_count,
        SUM(current_paid_amt) AS total_paid
    FROM claim_changes_stream
    WHERE METADATA$ACTION = 'INSERT'
    GROUP BY DATE_TRUNC('month', paid_dt)
) s
ON t.paid_month = s.paid_month
WHEN MATCHED THEN
    UPDATE SET
        claim_count = t.claim_count + s.claim_count,
        total_paid = t.total_paid + s.total_paid
WHEN NOT MATCHED THEN
    INSERT (paid_month, claim_count, total_paid)
    VALUES (s.paid_month, s.claim_count, s.total_paid);

-- Tasks are created suspended; enable it
ALTER TASK update_claim_summary RESUME;
```
Cost savings: Process only changed data, not entire table.
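The MERGE's incremental logic is simple to reason about in isolation: fold a batch of newly inserted claims into the running monthly totals instead of recomputing over all history. A Python sketch, with the in-memory `summary` dict as an illustrative stand-in for the summary table:

```python
# summary maps month -> [claim_count, total_paid]
def apply_increment(summary, new_claims):
    """new_claims: iterable of (month, paid_amt) for freshly inserted rows."""
    for month, paid_amt in new_claims:
        row = summary.setdefault(month, [0, 0.0])  # WHEN NOT MATCHED: insert
        row[0] += 1                                # claim_count + s.claim_count
        row[1] += paid_amt                         # total_paid + s.total_paid
    return summary

summary = {"2024-03": [2, 900.0]}
apply_increment(summary, [("2024-03", 450.0), ("2024-04", 100.0)])
# summary is now {"2024-03": [3, 1350.0], "2024-04": [1, 100.0]}
```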
## **Cost Optimization Strategies**

### **Strategy 1: Right-Size Warehouses by Workload**
```sql
-- Heavy ETL loads (nightly); OR REPLACE supersedes the earlier examples
CREATE OR REPLACE WAREHOUSE ETL_WH
    WAREHOUSE_SIZE = 'XLARGE'
    AUTO_SUSPEND = 60;

-- Morning reports (predictable)
CREATE OR REPLACE WAREHOUSE REPORTING_WH
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 60;

-- Ad-hoc analytics (variable)
CREATE OR REPLACE WAREHOUSE ANALYTICS_WH
    WAREHOUSE_SIZE = 'XSMALL'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 3  -- multi-cluster scale-out; requires Enterprise edition
    AUTO_SUSPEND = 60;
```
### **Strategy 2: Time-Travel Retention**

Default time-travel retention is 1 day; Enterprise edition allows up to 90 days. For most tables, 1 day is enough.
```sql
-- Transactional tables: 1 day is sufficient
CREATE TABLE claim_header (
    -- columns
) DATA_RETENTION_TIME_IN_DAYS = 1;

-- Critical compliance tables: extended retention
CREATE TABLE member_hcc (
    -- columns
) DATA_RETENTION_TIME_IN_DAYS = 90;
```
**Storage cost:** Time-travel retains changed and deleted data for the full retention window, so 90-day retention on a high-churn table can cost many times more than 1 day. Size it by compliance need, not by default.
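A rough way to estimate time-travel cost is from daily churn, since retained bytes grow with how much data changes each day, not with a flat multiplier. A sketch with assumed churn figures:

```python
# Steady-state approximation of extra storage held for time-travel.
# The churn figures below are illustrative assumptions.
def time_travel_gb(daily_changed_gb, retention_days):
    """GB of changed/deleted data retained over the window."""
    return daily_changed_gb * retention_days

low_churn = time_travel_gb(0.5, 90)    # quiet reference table: 45 GB held
high_churn = time_travel_gb(20.0, 90)  # heavily restated claims table: 1800 GB held
```

The same 90-day setting costs 40x more on the high-churn table, which is why retention should be set per table.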
### **Strategy 3: Optimize Clustering Maintenance**
Clustering isn't free. Monitor and adjust.
```sql
-- Check clustering quality
SELECT SYSTEM$CLUSTERING_INFORMATION('claim_header', '(service_from_dt)');

-- Reclustering consumes credits. Pause automatic clustering:
ALTER TABLE claim_header SUSPEND RECLUSTER;

-- Re-enable automatic clustering during off-hours
ALTER TABLE claim_header RESUME RECLUSTER;
```
## **Healthcare Data Model Architecture in Snowflake**

### **Layer 1: Raw/Staging (Minimal Processing)**

```sql
CREATE SCHEMA raw;

CREATE TABLE raw.edi_837_claims (
    file_id VARCHAR(100),
    claim_json VARIANT,
    received_dt DATE,
    load_dttm TIMESTAMP_NTZ
)
CLUSTER BY (received_dt);
```

**Purpose:** Land data as-is. Minimal transformation.
**Warehouse:** Small (simple COPY loads; storage cost is the same regardless of warehouse size)
### **Layer 2: Conformed (Business Logic Applied)**

```sql
CREATE SCHEMA conformed;

CREATE TABLE conformed.claim_header (
    clm_id VARCHAR(50),
    mmbr_id VARCHAR(50),
    service_from_dt DATE,
    current_paid_amt NUMBER(10,2)
    -- conformed, validated fields
)
CLUSTER BY (service_from_dt, mmbr_id);
```

**Purpose:** Apply business rules, deduplicate, validate.
**Warehouse:** Medium to Large (transformation-heavy)
### **Layer 3: Presentation (Optimized for Queries)**

```sql
CREATE SCHEMA presentation;

CREATE MATERIALIZED VIEW presentation.claim_monthly_summary AS
SELECT
    DATE_TRUNC('month', service_from_dt) AS service_month,
    mmbr_id,
    COUNT(*) AS claim_count,
    SUM(current_paid_amt) AS total_paid
FROM conformed.claim_header
GROUP BY DATE_TRUNC('month', service_from_dt), mmbr_id;
```

**Purpose:** Pre-aggregated for fast reporting.
**Warehouse:** Small (queries pre-computed data)
## **Real-World Example: Monthly Claims Report**

**Before Snowflake optimization:**

```sql
-- Scans 100M rows, runs for 12 minutes, costs $43
SELECT
    DATE_TRUNC('month', ch.service_from_dt) AS service_month,
    COUNT(DISTINCT ch.clm_id) AS claim_count,
    COUNT(DISTINCT ch.mmbr_id) AS member_count,
    SUM(ch.current_paid_amt) AS total_paid,
    AVG(ch.current_paid_amt) AS avg_paid
FROM claim_header ch
WHERE ch.service_from_dt >= '2023-01-01'
GROUP BY DATE_TRUNC('month', ch.service_from_dt)
ORDER BY service_month;
```
**After Snowflake optimization:**

```sql
-- Uses the materialized view; runs in 2 seconds, costs $0.03
SELECT *
FROM presentation.claim_monthly_summary
WHERE service_month >= '2023-01-01'
ORDER BY service_month;
```
**Performance:** 360x faster
**Cost:** ~1,433x cheaper
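Those ratios follow directly from the figures quoted above:

```python
# Sanity-check the before/after numbers (12 min vs 2 s, $43 vs $0.03).
before_seconds, after_seconds = 12 * 60, 2
before_cost, after_cost = 43.00, 0.03

speedup = before_seconds / after_seconds  # 360.0
cost_ratio = before_cost / after_cost     # ~1433
```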
## **Monitoring and Governance**

### **Query Cost Tracking**

```sql
-- See recent queries by resource use
SELECT
    query_id,
    user_name,
    warehouse_name,
    execution_time,
    total_elapsed_time,
    bytes_scanned,
    credits_used_cloud_services
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
ORDER BY credits_used_cloud_services DESC
LIMIT 100;
```

Note that `credits_used_cloud_services` covers only the cloud services layer; warehouse compute credits appear in `snowflake.account_usage.warehouse_metering_history`.
### **Storage Cost Monitoring**

```sql
-- See table storage by category
SELECT
    table_schema,
    table_name,
    active_bytes / 1024 / 1024 / 1024 AS active_gb,
    time_travel_bytes / 1024 / 1024 / 1024 AS time_travel_gb,
    failsafe_bytes / 1024 / 1024 / 1024 AS failsafe_gb
FROM snowflake.account_usage.table_storage_metrics
WHERE table_schema = 'CONFORMED'
ORDER BY active_bytes DESC;
```
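Once you have those numbers, a quick pass can flag retention candidates. A Python sketch over rows shaped like the query output; the sample figures and the 0.5 threshold are assumptions:

```python
# Flag tables whose time-travel storage rivals active storage -- candidates
# for a shorter DATA_RETENTION_TIME_IN_DAYS.
GB = 1024 ** 3

rows = [
    ("CONFORMED", "CLAIM_HEADER",      500 * GB, 30 * GB),
    ("CONFORMED", "MEMBER_ENROLLMENT",  20 * GB, 45 * GB),  # time-travel > active
]

def retention_review(rows, threshold=0.5):
    flagged = []
    for schema, table, active_bytes, tt_bytes in rows:
        if active_bytes and tt_bytes / active_bytes > threshold:
            flagged.append(f"{schema}.{table}")
    return flagged

print(retention_review(rows))  # ['CONFORMED.MEMBER_ENROLLMENT']
```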
## **The Healthcare Snowflake Data Model Checklist**
✅ Architecture
- Separate warehouses by workload (ETL, Reporting, Analytics)
- Auto-suspend after 60 seconds
- Right-size for actual usage
✅ Clustering
- Cluster by date fields (service_dt, paid_dt)
- Secondary cluster by high-cardinality keys (mmbr_id, prvdr_id)
- Monitor clustering quality
✅ Storage
- Use VARIANT for semi-structured EDI data
- Set time-travel retention based on compliance needs
- Avoid over-clustering (costs credits)
✅ Performance
- Materialized views for expensive aggregations
- Streams + Tasks for incremental processing
- Layer data: Raw → Conformed → Presentation
✅ Cost Control
- Monitor query costs weekly
- Identify expensive queries, optimize
- Review warehouse sizes monthly
## **Conclusion**
Snowflake isn't just "cloud Oracle." It's a fundamentally different architecture that rewards columnar thinking, separation of storage/compute, and materialized aggregations.
Build for Snowflake by:
- Clustering on date + high-cardinality keys
- Using materialized views for reports
- Auto-suspending warehouses
- Monitoring costs and optimizing continuously
Your healthcare data model should leverage Snowflake's strengths—not fight against them.
Optimize the architecture. Control the costs.
## **Related Guides**

- **HCC Risk Adjustment Data Model: Building Accurate Risk Score Pipelines in SQL.** HCC risk adjustment determines how much revenue a health plan receives for each Medicare Advantage member. The data model behind it, from diagnosis code ingestion to RAF score calculation, is one of the most consequential schemas a payer data engineer will build.
- **NPI Number Validation: How to Clean and Enrich Provider Data in Your Database.** Invalid NPI numbers in your provider table silently break claims routing, credentialing workflows, and CMS submissions. This guide covers validation approaches, NPPES enrichment patterns, and SQL queries for catching NPI data quality problems before they reach production.
- **HEDIS Measure Data Architecture: Building a Compliant Quality Reporting Pipeline.** HEDIS measure calculation is deceptively complex. NCQA technical specifications are hundreds of pages long, source data comes from four different systems, and one logic error can affect thousands of members. Here is how to build a pipeline that holds up to NCQA audit.