Healthcare Data Models in Snowflake: Architecture, Performance, and Cost Optimization
Your healthcare data warehouse migrated to Snowflake six months ago. Storage costs are climbing 40% per month. Queries that should take seconds are timing out. Your claims reconciliation report—the one finance runs every morning—just cost $847 to execute.
Snowflake's promise was scalability and simplicity. What happened?
You lifted and shifted your on-prem data model without adapting it to Snowflake's architecture. Row-by-row processing patterns that worked in Oracle kill performance in a columnar database. Clustering keys are missing. Time-travel retention is inflating storage. Your warehouse runs 24/7 at 3X-Large when it should auto-suspend.
Here's how to build healthcare data models optimized for Snowflake's architecture—models that leverage clustering, partitioning, materialized views, and cost controls while maintaining the temporal accuracy and audit trails healthcare requires.
The Snowflake Difference: Why Your Oracle Model Doesn't Work
Traditional healthcare data warehouses (Oracle, SQL Server, Teradata) optimize for:
- Row-based storage
- Indexes everywhere
- Partitioning by date ranges
- Manual scaling
Snowflake is fundamentally different:
- Columnar storage (micro-partitions)
- No indexes (automatic clustering)
- Metadata-based pruning
- Auto-scaling compute (virtual warehouses)
- Time-travel and zero-copy cloning
The mistake: Treating Snowflake like Oracle with auto-scaling. It's not. You need to redesign for columnar storage, leverage clustering keys, and optimize for Snowflake's pricing model.
Snowflake Architecture Principles for Healthcare
Principle 1: Separate Storage from Compute
In traditional databases, storage and compute are coupled. In Snowflake, they're independent.
Healthcare implication:
- Store ALL historical claims (10+ years) cheaply; you pay for compute only when querying
- Spin up warehouses only when running reports
- Auto-suspend after 60 seconds of inactivity
- Use different warehouse sizes for different workloads
-- ETL warehouse (loads, transformations)
CREATE WAREHOUSE ETL_WH
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE;

-- Reporting warehouse (morning finance reports)
CREATE WAREHOUSE REPORTING_WH
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE;

-- Analytics warehouse (ad-hoc queries)
CREATE WAREHOUSE ANALYTICS_WH
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE;
Cost impact: A Medium warehouse consumes 4 credits per hour; at $2-$4 per credit, leaving it on 24/7 runs on the order of $6,000-$12,000 a month. One that auto-suspends accrues credits only while it's actually running.
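To keep an always-on mistake from silently recurring, a resource monitor can cap spend per warehouse. A minimal sketch (the monitor name and quota are illustrative, not from any specific deployment):

```sql
-- Cap the reporting warehouse at 100 credits/month (illustrative quota)
CREATE RESOURCE MONITOR reporting_monitor
    WITH CREDIT_QUOTA = 100
    FREQUENCY = MONTHLY
    START_TIMESTAMP = IMMEDIATELY
    TRIGGERS
        ON 80 PERCENT DO NOTIFY   -- warn before the quota is hit
        ON 100 PERCENT DO SUSPEND; -- stop new queries at the quota

ALTER WAREHOUSE REPORTING_WH SET RESOURCE_MONITOR = reporting_monitor;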
Principle 2: Leverage Micro-Partitions, Not Traditional Partitioning
Snowflake automatically divides tables into micro-partitions (50-500MB of uncompressed data each; smaller on disk after compression). You don't create partitions manually.
But you DO define clustering keys to organize micro-partitions for query performance.
Bad Healthcare Model (No Clustering)
CREATE TABLE claim_header (
    clm_id VARCHAR(50),
    mmbr_id VARCHAR(50),
    service_from_dt DATE,
    paid_dt DATE,
    current_paid_amt NUMBER(10,2)
    -- ... 50 more columns
);
Problem: Queries filtering by service_from_dt scan ALL micro-partitions because data isn't organized by date.
Good Healthcare Model (Clustered)
CREATE TABLE claim_header (
    clm_id VARCHAR(50),
    mmbr_id VARCHAR(50),
    service_from_dt DATE,
    paid_dt DATE,
    current_paid_amt NUMBER(10,2)
    -- ... 50 more columns
) CLUSTER BY (service_from_dt, mmbr_id);
Why this works:
- Most healthcare queries filter by date range (WHERE service_from_dt BETWEEN ...)
- The secondary filter is often the member (AND mmbr_id = '...')
- Clustering organizes micro-partitions so non-matching partitions are pruned
Query performance:
- Without clustering: Scans 10,000 micro-partitions
- With clustering: Scans 23 micro-partitions (99.77% pruned)
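To check whether your own tables prune this well, Snowflake exposes clustering metadata; a sketch using the system function on the example table:

```sql
-- Average clustering depth: how many micro-partitions overlap for the
-- clustering keys. Values near 1 mean well-clustered data and good pruning;
-- large values suggest the table would benefit from reclustering.
SELECT SYSTEM$CLUSTERING_DEPTH('claim_header', '(service_from_dt, mmbr_id)');
```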
Principle 3: Use Materialized Views for Expensive Aggregations
Healthcare reports often aggregate millions of claims. Don't recompute every time.
Expensive Query (Runs Every Morning)
-- Finance: Total paid by month (runs on 100M claims)
SELECT
    DATE_TRUNC('month', paid_dt) AS paid_month,
    COUNT(*) AS claim_count,
    SUM(current_paid_amt) AS total_paid
FROM claim_header
WHERE paid_dt >= '2020-01-01'
GROUP BY DATE_TRUNC('month', paid_dt);
Cost: $12 per run (scans 100M rows)
Optimized: Materialized View
CREATE MATERIALIZED VIEW claim_monthly_summary AS
SELECT
    DATE_TRUNC('month', paid_dt) AS paid_month,
    COUNT(*) AS claim_count,
    SUM(current_paid_amt) AS total_paid
FROM claim_header
WHERE paid_dt >= '2020-01-01'
GROUP BY DATE_TRUNC('month', paid_dt);

-- Query the materialized view instead
SELECT * FROM claim_monthly_summary
WHERE paid_month = '2024-03-01';
Cost: $0.02 per run (queries pre-aggregated data)
Maintenance: Auto-refreshed by Snowflake when the base table changes. Note that materialized views require Enterprise Edition, and the background refresh consumes credits of its own.
Healthcare-Specific Snowflake Patterns
Pattern 1: Temporal Tables with Clustering
Healthcare needs point-in-time queries. Snowflake's clustering + time-travel enables this.
CREATE TABLE member_enrollment (
    enrollment_id VARCHAR(50),
    mmbr_id VARCHAR(50),
    plan_id VARCHAR(50),
    coverage_start_dt DATE,
    coverage_end_dt DATE,
    effective_dt DATE,
    end_dt DATE,
    is_current_flag BOOLEAN,
    load_dttm TIMESTAMP_NTZ,
    -- Snowflake records but does NOT enforce primary key constraints;
    -- this is documentation for tools and humans
    CONSTRAINT pk_enrollment PRIMARY KEY (enrollment_id)
) CLUSTER BY (mmbr_id, effective_dt);

-- Query: Who was enrolled on March 15, 2024?
SELECT mmbr_id, plan_id
FROM member_enrollment
WHERE '2024-03-15' BETWEEN coverage_start_dt
      AND COALESCE(coverage_end_dt, '9999-12-31')
  AND is_current_flag = TRUE;
Clustering on (mmbr_id, effective_dt) ensures:
- Member-specific queries are fast
- Date range queries prune effectively
- Point-in-time queries scan minimal partitions
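Time-travel complements the effective-dated columns: it answers "what did the table itself contain at a past moment," which helps when auditing late-arriving corrections. A sketch (the timestamp is illustrative, and the query only works within the table's retention window):

```sql
-- Enrollment rows exactly as they existed at a prior point in time
SELECT mmbr_id, plan_id
FROM member_enrollment
    AT (TIMESTAMP => '2024-03-15 08:00:00'::TIMESTAMP_NTZ)
WHERE is_current_flag = TRUE;
```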
Pattern 2: Claim Header/Line with Variant for Semi-Structured Data
EDI 837 claims have 50+ optional fields. Don't create 100 nullable columns. Use VARIANT.
CREATE TABLE claim_header (
    clm_id VARCHAR(50),
    mmbr_id VARCHAR(50),
    service_from_dt DATE,
    -- Core financial fields (structured)
    current_billed_amt NUMBER(10,2),
    current_paid_amt NUMBER(10,2),
    -- Optional/variable fields (semi-structured)
    additional_data VARIANT,
    load_dttm TIMESTAMP_NTZ
) CLUSTER BY (service_from_dt);

-- Insert with variant. Note: Snowflake does not allow PARSE_JSON in a
-- VALUES clause, so use INSERT ... SELECT
INSERT INTO claim_header
SELECT
    'CLM_001', 'M123', '2024-03-15', 500.00, 450.00,
    PARSE_JSON('{
        "admission_type": "ELECTIVE",
        "discharge_status": "01",
        "drg_code": "470",
        "attending_physician_npi": "1234567890"
    }'),
    CURRENT_TIMESTAMP;

-- Query variant fields
SELECT
    clm_id,
    current_paid_amt,
    additional_data:drg_code::STRING AS drg_code,
    additional_data:attending_physician_npi::STRING AS attending_npi
FROM claim_header
WHERE additional_data:admission_type::STRING = 'ELECTIVE';
Benefits:
- Schema flexibility (add fields without ALTER TABLE)
- Storage efficiency (only stores non-null values)
- Fast JSON parsing (Snowflake optimizes variant queries)
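When a VARIANT field holds an array, such as the repeating diagnosis codes on an 837, LATERAL FLATTEN expands it into rows. A sketch assuming a hypothetical diagnosis_codes array inside additional_data:

```sql
-- One output row per diagnosis code
-- (diagnosis_codes is a hypothetical array field, for illustration)
SELECT
    ch.clm_id,
    dx.value::STRING AS diagnosis_code
FROM claim_header ch,
    LATERAL FLATTEN(input => ch.additional_data:diagnosis_codes) dx;
```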
Pattern 3: Incremental Models with Streams and Tasks
Don't reprocess all claims every night. Use Snowflake Streams to track changes.
-- Create stream to track new/changed claims
CREATE STREAM claim_changes_stream ON TABLE claim_header;

-- Create task to update a summary table (a regular table maintained by
-- MERGE; a materialized view cannot be a MERGE target)
CREATE TASK update_claim_summary
    WAREHOUSE = ETL_WH
    SCHEDULE = '60 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('claim_changes_stream')
AS
MERGE INTO claim_monthly_summary t
USING (
    SELECT
        DATE_TRUNC('month', paid_dt) AS paid_month,
        COUNT(*) AS claim_count,
        SUM(current_paid_amt) AS total_paid
    FROM claim_changes_stream
    WHERE METADATA$ACTION = 'INSERT'
    GROUP BY DATE_TRUNC('month', paid_dt)
) s
ON t.paid_month = s.paid_month
WHEN MATCHED THEN UPDATE SET
    claim_count = t.claim_count + s.claim_count,
    total_paid = t.total_paid + s.total_paid
WHEN NOT MATCHED THEN INSERT (paid_month, claim_count, total_paid)
    VALUES (s.paid_month, s.claim_count, s.total_paid);

-- Tasks are created suspended; start this one explicitly
ALTER TASK update_claim_summary RESUME;
Cost savings: Process only changed data, not entire table.
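Tasks are created in a suspended state and fail silently if misconfigured, so it's worth confirming the task actually fires. The task history table function shows recent runs:

```sql
-- Recent runs of the summary task, newest first
SELECT name, state, scheduled_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
    TASK_NAME => 'UPDATE_CLAIM_SUMMARY'
))
ORDER BY scheduled_time DESC
LIMIT 20;
```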
Cost Optimization Strategies
Strategy 1: Right-Size Warehouses by Workload
-- Heavy ETL loads (nightly)
CREATE WAREHOUSE ETL_WH
    WAREHOUSE_SIZE = 'XLARGE'
    AUTO_SUSPEND = 60;

-- Morning reports (predictable)
CREATE WAREHOUSE REPORTING_WH
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 60;

-- Ad-hoc analytics (variable; multi-cluster requires Enterprise Edition)
CREATE WAREHOUSE ANALYTICS_WH
    WAREHOUSE_SIZE = 'XSMALL'
    MAX_CLUSTER_COUNT = 3  -- Auto-scale out under concurrency
    AUTO_SUSPEND = 60;
Strategy 2: Time-Travel Retention
Default time-travel retention is 1 day; Enterprise Edition allows up to 90 days. For most tables, 1 day is enough.
-- Transactional tables: 1 day is sufficient
CREATE TABLE claim_header (
    -- columns
) DATA_RETENTION_TIME_IN_DAYS = 1;

-- Critical compliance tables: extended retention
CREATE TABLE member_hcc (
    -- columns
) DATA_RETENTION_TIME_IN_DAYS = 90;
Storage cost: Time-travel retains every changed version of the data for the retention window. For high-churn tables, 90-day retention can cost many times more than 1 day; for append-only tables the difference is modest.
Strategy 3: Optimize Clustering Maintenance
Clustering isn't free. Monitor and adjust.
-- Check clustering quality
SELECT SYSTEM$CLUSTERING_INFORMATION('claim_header', '(service_from_dt)');

-- Reclustering consumes credits. Pause it when cost matters:
ALTER TABLE claim_header SUSPEND RECLUSTER;

-- Re-enable automatic reclustering (e.g., ahead of an off-hours window)
ALTER TABLE claim_header RESUME RECLUSTER;
Healthcare Data Model Architecture in Snowflake
Layer 1: Raw/Staging (Minimal Processing)
CREATE SCHEMA raw;

CREATE TABLE raw.edi_837_claims (
    file_id VARCHAR(100),
    claim_json VARIANT,
    received_dt DATE,
    load_dttm TIMESTAMP_NTZ
) CLUSTER BY (received_dt);
Purpose: Land data as-is. Minimal transformation.
Warehouse: Small (loads are simple, with little transformation compute)
Layer 2: Conformed (Business Logic Applied)
CREATE SCHEMA conformed;

CREATE TABLE conformed.claim_header (
    clm_id VARCHAR(50),
    mmbr_id VARCHAR(50),
    service_from_dt DATE,
    current_paid_amt NUMBER(10,2)
    -- ... conformed, validated fields
) CLUSTER BY (service_from_dt, mmbr_id);
Purpose: Apply business rules, deduplicate, validate.
Warehouse: Medium-Large (transformation-heavy)
Layer 3: Presentation (Optimized for Queries)
CREATE SCHEMA presentation;

CREATE MATERIALIZED VIEW presentation.claim_monthly_summary AS
SELECT
    DATE_TRUNC('month', service_from_dt) AS service_month,
    mmbr_id,
    COUNT(*) AS claim_count,
    SUM(current_paid_amt) AS total_paid
FROM conformed.claim_header
GROUP BY DATE_TRUNC('month', service_from_dt), mmbr_id;
Purpose: Pre-aggregated for fast reporting.
Warehouse: Small (queries pre-computed data)
Real-World Example: Monthly Claims Report
Before Snowflake Optimization:
-- Scans 100M rows, runs for 12 minutes, costs $43
SELECT
    DATE_TRUNC('month', ch.service_from_dt) AS service_month,
    COUNT(DISTINCT ch.clm_id) AS claim_count,
    COUNT(DISTINCT ch.mmbr_id) AS member_count,
    SUM(ch.current_paid_amt) AS total_paid,
    AVG(ch.current_paid_amt) AS avg_paid
FROM claim_header ch
WHERE ch.service_from_dt >= '2023-01-01'
GROUP BY DATE_TRUNC('month', ch.service_from_dt)
ORDER BY service_month;
After Snowflake Optimization:
-- Uses materialized view, runs in 2 seconds, costs $0.03
SELECT *
FROM presentation.claim_monthly_summary
WHERE service_month >= '2023-01-01'
ORDER BY service_month;
Performance: 360x faster
Cost: 1,433x cheaper
Monitoring and Governance
Query Cost Tracking
-- See recent queries and their cloud-services credits
-- (credits_used_cloud_services excludes warehouse compute, which is
-- metered per warehouse, not per query)
SELECT
    query_id,
    user_name,
    warehouse_name,
    execution_time,
    total_elapsed_time,
    bytes_scanned,
    credits_used_cloud_services
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
ORDER BY credits_used_cloud_services DESC
LIMIT 100;
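Warehouse compute, usually the bulk of the bill, is metered per warehouse rather than per query. A companion query against the metering view:

```sql
-- Compute credits consumed per warehouse over the last 7 days
SELECT
    warehouse_name,
    SUM(credits_used) AS credits_7d
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP)
GROUP BY warehouse_name
ORDER BY credits_7d DESC;
```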
Storage Cost Monitoring
-- See table storage costs
SELECT
    table_schema,
    table_name,
    active_bytes / 1024 / 1024 / 1024 AS active_gb,
    time_travel_bytes / 1024 / 1024 / 1024 AS time_travel_gb,
    failsafe_bytes / 1024 / 1024 / 1024 AS failsafe_gb
FROM snowflake.account_usage.table_storage_metrics
WHERE table_schema = 'CONFORMED'
ORDER BY active_bytes DESC;
The Healthcare Snowflake Data Model Checklist
✅ Architecture
- Separate warehouses by workload (ETL, Reporting, Analytics)
- Auto-suspend after 60 seconds
- Right-size for actual usage
✅ Clustering
- Cluster by date fields (service_dt, paid_dt)
- Secondary cluster by high-cardinality keys (mmbr_id, prvdr_id)
- Monitor clustering quality
✅ Storage
- Use VARIANT for semi-structured EDI data
- Set time-travel retention based on compliance needs
- Avoid over-clustering (costs credits)
✅ Performance
- Materialized views for expensive aggregations
- Streams + Tasks for incremental processing
- Layer data: Raw → Conformed → Presentation
✅ Cost Control
- Monitor query costs weekly
- Identify expensive queries, optimize
- Review warehouse sizes monthly
Conclusion
Snowflake isn't just "cloud Oracle." It's a fundamentally different architecture that rewards columnar thinking, separation of storage/compute, and materialized aggregations.
Build for Snowflake by:
- Clustering on date + high-cardinality keys
- Using materialized views for reports
- Auto-suspending warehouses
- Monitoring costs and optimizing continuously
Your healthcare data model should leverage Snowflake's strengths—not fight against them.
Optimize the architecture. Control the costs.