Your CTO just asked: "Should we build a data lake or a data warehouse?"
Your architect says: "We need Delta Lake."
Your business analyst wants: "Just give me a data mart."
Everyone's talking about different things.
Here's the reality: Data lake, Delta Lake, data warehouse, and data mart aren't competing options — they're different layers of a modern data architecture, each solving specific problems.
Choosing the wrong one (or forcing one to do everything) costs millions in re-architecture, months of wasted development, and frustrated users who can't get the data they need.
After building 50+ healthcare data platforms, here's what each actually does, when to use them, and how they work together in a real architecture.
The Quick Answer (Before We Dive Deep)
Data Lake: Raw storage for everything (structured, unstructured, semi-structured)
Cost: $23/TB/month (S3)
Use case: "Dump all our data somewhere cheap, we'll figure it out later"
Delta Lake: Data lake + ACID transactions + versioning
Cost: $23/TB/month (storage) + compute
Use case: "We need a data lake that doesn't corrupt files when 10 people write simultaneously"
Data Warehouse: Structured, optimized tables for analytics
Cost: $40-50/TB/month (Snowflake/Redshift)
Use case: "Business analysts need sub-second SQL queries on 100M claims"
Data Mart: Subset of warehouse focused on one department
Cost: Included in warehouse cost
Use case: "Finance needs their own curated dataset, not the entire warehouse"
Data Lake: The Raw Storage Layer
What It Is
A data lake is a repository that stores massive amounts of raw data in its native format until needed.
Think: S3 buckets full of files (CSV, JSON, Parquet, images, PDFs, videos)
Architecture:
Raw Data Sources
↓
Data Lake (S3/ADLS/GCS)
├── /raw/edi-837/2024/*.json (10M claim files)
├── /raw/member-enrollment/*.csv
├── /raw/medical-images/*.dcm (10TB of X-rays)
├── /raw/clinical-notes/*.pdf
└── /raw/iot-devices/*.parquet (real-time vitals)
↓
[Nothing happens automatically - it's just storage]
Strengths
1. Store Everything Cheap
Cost comparison:
- Data Lake (S3): $23/TB/month
- Data Warehouse (Snowflake): $40/TB/month
- Traditional Database (Oracle): $150+/TB/month (with licensing)
Healthcare example:
- 10 years of medical images: 500TB
- S3 cost: $11,500/month
- Snowflake cost: $20,000/month
- Oracle cost: $75,000+/month
Data lakes win for bulk raw storage.
2. Schema-on-Read Flexibility
Don't need to define schema upfront.
Traditional database:
-- Must define schema first
CREATE TABLE member (
member_id VARCHAR(50),
first_name VARCHAR(100),
-- Define all columns upfront
);
-- Then insert data
INSERT INTO member VALUES (...);
Data lake:
# Just dump files
aws s3 cp member_data_2024.csv s3://data-lake/raw/members/
aws s3 cp claims_*.json s3://data-lake/raw/claims/
aws s3 cp x-rays/*.dcm s3://data-lake/raw/images/
# Figure out schema later when you read it
3. Handle Any Data Type
Structured: CSV, Parquet (claims, members)
Semi-structured: JSON, XML, Avro (EDI files, [HL7](https://www.mdatool.com/glossary#HL7) messages)
Unstructured: PDFs, images, audio, video (medical records, X-rays, doctor notes)
Free Tool
Parse this HL7 message →
Traditional databases can't store images, PDFs, or raw JSON effectively.
Weaknesses
1. Data Swamp Risk
What happens in reality:
Year 1: "Let's dump everything in S3!"
Year 2: "Where's the member data?"
Year 3: "Nobody knows what's in here anymore"
Year 4: "The data team spends 50% of time searching for data"
No automatic cataloging, governance, or quality checks.
2. No ACID Transactions
Problem:
Engineer A: Writes member_enrollment_2024.parquet
Engineer B: Reads member_enrollment_2024.parquet (same time)
Result: Corrupted file, half-written data, errors
S3 doesn't guarantee atomicity, consistency, isolation, durability (ACID).
3. Performance Issues
Reading raw Parquet from S3:
# Query 100M claims
df = spark.read.parquet("s3://lake/raw/claims/")
df.filter(col("service_date") >= "2024-01-01")
# Problem: Scans ALL files (slow, expensive)
# Time: 10-15 minutes
# Cost: $50 per query
No indexes, no optimization, full table scans every time.
When to Use a Data Lake
✅ Store massive volumes of raw data cheaply
✅ Archive historical data (10+ years)
✅ Unstructured data (images, PDFs, audio)
✅ "Land and expand" strategy (figure out schema later)
✅ Data science exploration (try things without schema constraints)
❌ Don't use for: Production analytics, BI dashboards, real-time queries
Cost & Implementation
Storage cost: $23/TB/month (S3 standard)
Implementation:
- Time: 1 week (just create S3 buckets and upload)
- Team: 1 data engineer
- Tools: AWS S3, Azure Data Lake Storage, Google Cloud Storage
Healthcare example:
- 100TB raw data (10 years of claims, images, PDFs)
- Storage: $2,300/month
- Setup: 1 week
- Team: 1 engineer
Delta Lake: Data Lake + Structure
What It Is
Delta Lake adds a transaction layer on top of data lakes (S3/ADLS/GCS) to enable ACID transactions, versioning, and schema enforcement.
Think: Data lake that doesn't corrupt when multiple people write simultaneously
Architecture:
Raw Data Sources
↓
Delta Lake (S3 + Delta Transaction Log)
├── /delta/claims/
│ ├── part-0001.parquet (data files)
│ ├── part-0002.parquet
│ └── _delta_log/ (transaction log - the magic)
│ ├── 00000000000000000000.json (version 0)
│ ├── 00000000000000000001.json (version 1)
│ └── 00000000000000000002.json (version 2)
↓
[Guaranteed consistency, time-travel, no corruption]
Strengths
1. ACID Transactions (No More Data Corruption)
Problem with plain data lake:
10 engineers writing to same table simultaneously
→ File corruption
→ Partial reads
→ Lost data
Delta Lake solution:
# Multiple concurrent writes - no corruption
df1.write.format("delta").mode("append").save("s3://delta/claims")
df2.write.format("delta").mode("append").save("s3://delta/claims")
df3.write.format("delta").mode("append").save("s3://delta/claims")
# Delta Lake transaction log ensures:
# - Atomic writes (all or nothing)
# - Consistent reads (always see complete data)
# - Isolation (writes don't interfere)
# - Durability (committed data never lost)
2. Time Travel (Audit Trail for Free)
Query any historical version:
# Read claims as of yesterday
df = spark.read.format("delta").option("versionAsOf", 0).load("s3://delta/claims")
# Read claims as of specific timestamp
df = spark.read.format("delta").option("timestampAsOf", "2024-03-01").load("s3://delta/claims")
# See full history
deltaTable.history().show()
Healthcare compliance: Built-in audit trail (required for [HIPAA](https://www.mdatool.com/glossary#HIPAA), FDA)
3. Schema Enforcement & Evolution
Problem with plain data lake:
# Day 1: member_id is STRING
df1 = [("M123", "John", "Doe")]
# Day 2: Someone writes member_id as INT
df2 = [(456, "Jane", "Smith")]
# Result: Data corruption, mixed types
Delta Lake solution:
# Schema enforced
df.write.format("delta").save("s3://delta/members")
# Try to write incompatible schema
df_bad.write.format("delta").mode("append").save("s3://delta/members")
# → ERROR: Schema mismatch
# Or evolve schema safely
df.write.format("delta").option("mergeSchema", "true").mode("append").save("s3://delta/members")
4. Optimized Performance
Data compaction:
# Small files problem in data lake
# 10,000 small files = slow queries
# Delta Lake compaction
deltaTable.optimize().executeCompaction()
# Result: 10,000 files → 100 optimized files
# Query time: 10 min → 30 seconds
Z-ordering (clustering):
# Optimize for common query patterns
deltaTable.optimize().executeZOrderBy("service_date", "member_id")
# Queries filtering by service_date or member_id = 10x faster
Weaknesses
1. Still Not a Data Warehouse
Delta Lake:
- No automatic query optimization
- No automatic clustering (you must run OPTIMIZE)
- No separation of storage/compute (pay for compute even when idle)
- No BI tool integrations (Tableau can't query Delta directly)
Snowflake:
- Automatic optimization
- Auto-clustering
- Separate compute (pay only when querying)
- Native BI integrations
2. Requires Spark Expertise
To use Delta Lake effectively:
from delta.tables import *
# Write
df.write.format("delta").mode("append").save("s3://delta/claims")
# Optimize
deltaTable = DeltaTable.forPath(spark, "s3://delta/claims")
deltaTable.optimize().executeCompaction()
deltaTable.optimize().executeZOrderBy("service_date")
# Vacuum old files
deltaTable.vacuum(168) # Keep 7 days of history
Your team needs PySpark skills, not just SQL.
3. Cost = Storage + Compute
Cost breakdown:
- Storage: $23/TB/month (S3)
- Compute: Databricks clusters ($0.40/DBU + EC2 costs)
Running queries:
Medium cluster: $8/hour (compute)
Query time: 10 minutes
Cost per query: $1.33
vs Snowflake:
Medium warehouse: $8/hour
Query time: 10 seconds
Cost per query: $0.02
Delta Lake is cheaper for storage, more expensive for ad-hoc queries.
When to Use Delta Lake
✅ Building a lakehouse (data lake + warehouse features)
✅ Need ACID transactions on data lake
✅ Large-scale ETL pipelines (Databricks/Spark)
✅ ML/AI workloads on raw data
✅ Time-travel/audit requirements
✅ Team has Spark expertise
❌ Don't use for: BI dashboards (use warehouse), simple analytics (overkill)
Cost & Implementation
Storage cost: $23/TB/month (S3)
Compute cost: $0.40/DBU + cloud compute
Implementation:
- Time: 2-4 weeks (setup Databricks, migrate from data lake)
- Team: 2-3 data engineers (Spark expertise required)
- Tools: Databricks, Delta Lake, Spark
Healthcare example:
- 50TB curated data in Delta Lake
- Storage: $1,150/month
- Compute (ETL pipelines): $3,000-5,000/month
- Total: ~$4,500/month
- Setup: 3-4 weeks
- Team: 2 Spark engineers
Data Warehouse: Structured Analytics Engine
What It Is
A data warehouse is a centralized repository of structured, cleaned, transformed data optimized for analytics and BI.
Think: SQL database on steroids, designed for fast queries on billions of rows
Architecture:
Raw Data (Data Lake)
↓
ETL Pipeline (Transform, Clean, Model)
↓
Data Warehouse (Snowflake/Redshift/BigQuery)
├── Star Schema
│ ├── dim_member (members table)
│ ├── dim_provider (providers table)
│ ├── dim_date (date dimension)
│ └── fact_claim (100M claim records)
↓
BI Tools (Tableau, Looker, Power BI)
↓
Business Users (SQL queries, dashboards)
Strengths
1. Blazing Fast SQL Queries
Same query:
Data Lake (Parquet on S3):
-- Scan 100M claims
SELECT COUNT(*) FROM claims
WHERE service_date >= '2024-01-01';
Time: 10 minutes
Cost: $5
Delta Lake (Optimized):
-- Scan 100M claims (optimized)
SELECT COUNT(*) FROM claims
WHERE service_date >= '2024-01-01';
Time: 2 minutes
Cost: $1
Data Warehouse (Snowflake):
-- Scan 100M claims (auto-clustered)
SELECT COUNT(*) FROM claims
WHERE service_date >= '2024-01-01';
Time: 2 seconds
Cost: $0.01
100x faster, 100x cheaper for analytical queries.
2. Automatic Optimization
Snowflake example:
-- No manual optimization needed
CREATE TABLE claim_header (
claim_id VARCHAR(50),
service_date DATE,
paid_amount NUMBER(10,2)
);
-- Snowflake automatically:
-- ✓ Clusters data by insertion order
-- ✓ Creates micro-partitions
-- ✓ Compresses data
-- ✓ Optimizes query plans
-- ✓ Caches results
-- You just query
SELECT * FROM claim_header WHERE service_date = '2024-03-01';
-- → Instant (2 seconds)
Delta Lake:
# Manual optimization required
deltaTable.optimize().executeCompaction()
deltaTable.optimize().executeZOrderBy("service_date")
deltaTable.vacuum(168)
# Run these weekly or queries slow down
3. Separation of Storage and Compute
Snowflake architecture:
Storage Layer (always running, always available)
↓
Virtual Warehouse 1 (ETL) - runs 6am-8am
Virtual Warehouse 2 (Reporting) - runs 8am-6pm
Virtual Warehouse 3 (Analytics) - runs as needed
Pay only when virtual warehouses are running
Cost savings:
Traditional database: $50K/month (always running)
Snowflake: $15K/month (auto-suspend when idle)
Savings: $35K/month (70%)
4. Native BI Tool Integration
Tableau → Snowflake:
- One click connection
- Live queries (not extracts)
- Sub-second dashboard refreshes
- No ETL needed
Tableau → Delta Lake:
- Set up JDBC/ODBC connection (complex)
- Start Spark cluster (slow, expensive)
- Query performance inconsistent
- Requires Databricks SQL endpoint ($$$)
5. Time-Travel and Zero-Copy Cloning
Snowflake time-travel:
-- Query data as of 1 hour ago
SELECT * FROM claims AT (OFFSET => -3600);
-- Restore accidentally deleted data
CREATE TABLE claims_restored CLONE claims AT (OFFSET => -3600);
-- Zero-copy clone (instant, no storage duplication)
CREATE TABLE claims_dev CLONE claims;
-- → Dev environment in 5 seconds, 0 additional storage cost
Weaknesses
1. Higher Storage Cost
Cost comparison (1TB):
- S3 (Data Lake): $23/month
- Snowflake: $40/month
- Redshift: $50/month
For 100TB:
- S3: $2,300/month
- Snowflake: $4,000/month
- Difference: $1,700/month
But: Query performance often justifies the extra cost
2. Not Great for Unstructured Data
Snowflake is designed for tables:
-- Great
CREATE TABLE claim_header (...);
-- Possible but awkward
CREATE TABLE medical_images (
image_id VARCHAR(50),
image_data BINARY -- 10MB X-ray image
);
-- → Expensive, slow, not what Snowflake is designed for
Unstructured data belongs in data lake, not warehouse.
3. Vendor Lock-In
Snowflake proprietary features:
- Time-travel
- Zero-copy cloning
- Streams and tasks
- Snowpipe
Migrating off Snowflake = rewrite all these features.
Delta Lake = open-source standard (portable)
When to Use a Data Warehouse
✅ Primary workload is SQL analytics
✅ BI dashboards for business users
✅ Hundreds of analysts running queries
✅ Need sub-second query performance
✅ Team is SQL-based (not Spark engineers)
✅ Ad-hoc analysis requirements
❌ Don't use for: Unstructured data storage, ML training on raw data
Cost & Implementation
Storage cost: $40-50/TB/month (Snowflake/Redshift)
Compute cost: $2-100/hour per warehouse (based on size)
Implementation:
- Time: 8-12 weeks (data modeling, ETL, testing)
- Team: 3-5 data engineers, 1 analytics engineer, 1 data modeler
- Tools: Snowflake/Redshift/BigQuery, dbt, Fivetran/Airbyte
Healthcare example:
- 20TB modeled data (claims, members, providers)
- Storage: $800/month
- Compute (3 warehouses): $8,000/month
- Total: ~$9,000/month
- Setup: 10-12 weeks
- Team: 4 engineers + 1 architect
Data Mart: Focused Subset
What It Is
A data mart is a subset of a data warehouse focused on a specific business function, department, or use case.
Think: Finance's view of the warehouse (only claims, payments, revenue — no clinical data)
Architecture:
Data Warehouse (Enterprise)
├── All Claims (100M records)
├── All Members (10M records)
├── All Providers (500K records)
├── All Encounters (50M records)
└── All Lab Results (200M records)
↓
Data Marts (Departmental Views)
├── Finance Mart
│ ├── Claims (financial columns only)
│ ├── Payments
│ └── Revenue
│
├── Clinical Mart
│ ├── Encounters
│ ├── Diagnoses
│ └── Lab Results
│
└── Operations Mart
├── Provider Performance
├── Network Utilization
└── Capacity Planning
Strengths
1. Simplified Access
Problem:
-- Finance user queries entire warehouse (overwhelming)
SELECT * FROM dim_member
JOIN fact_claim ON ...
JOIN dim_provider ON ...
JOIN fact_encounter ON ...
JOIN dim_diagnosis ON ...
-- → 50 tables, 847 columns, nobody knows what to query
Solution: Finance Data Mart
-- Pre-built views for finance
SELECT * FROM finance.monthly_revenue;
SELECT * FROM finance.claims_aging;
SELECT * FROM finance.provider_payments;
-- → 10 tables, 50 columns, all relevant to finance
2. Performance Optimization
Data mart strategies:
A) Materialized views (pre-aggregated):
-- Warehouse: Aggregate on every query (slow)
SELECT
DATE_TRUNC('month', service_date) as month,
SUM(paid_amount) as total_paid
FROM fact_claim
GROUP BY DATE_TRUNC('month', service_date);
-- → 10 seconds each time
-- Data Mart: Pre-aggregated materialized view
CREATE MATERIALIZED VIEW finance.monthly_paid AS
SELECT
DATE_TRUNC('month', service_date) as month,
SUM(paid_amount) as total_paid
FROM fact_claim
GROUP BY DATE_TRUNC('month', service_date);
-- → Query takes 0.5 seconds (20x faster)
B) Denormalized tables:
-- Warehouse: Normalized (many joins)
SELECT
m.member_name,
p.provider_name,
c.paid_amount
FROM fact_claim c
JOIN dim_member m ON c.member_id = m.member_id
JOIN dim_provider p ON c.provider_id = p.provider_id;
-- → 5 second query
-- Data Mart: Denormalized (no joins)
SELECT
member_name,
provider_name,
paid_amount
FROM finance.claims_denormalized;
-- → 0.2 second query (25x faster)
3. Security & Governance
Row-level security:
-- Finance sees only their data
CREATE VIEW finance.claims_view AS
SELECT * FROM fact_claim
WHERE line_of_business = 'Commercial';
-- Clinical sees different data
CREATE VIEW clinical.claims_view AS
SELECT * FROM fact_claim
WHERE has_clinical_data = TRUE;
-- Finance can't see clinical data, vice versa
4. Cost Efficiency
Data mart implementation:
- No additional storage (views on warehouse)
- No separate infrastructure
- Reuses existing warehouse compute
Cost: $0 additional (just warehouse cost)
Weaknesses
1. Maintenance Overhead
10 data marts = 10× maintenance:
Update warehouse schema
→ Update finance mart
→ Update clinical mart
→ Update operations mart
→ Update quality mart
→ Update compliance mart
→ ... (repeat for each mart)
Each mart adds complexity.
2. Data Duplication (If Physically Separate)
Problem:
Warehouse: 20TB
Finance Mart: 5TB (copied from warehouse)
Clinical Mart: 8TB (copied from warehouse)
Operations Mart: 3TB (copied from warehouse)
Total storage: 36TB (80% duplication!)
Solution: Use views, not copies
3. Potential Stale Data
Materialized views:
-- Created Monday 9am
CREATE MATERIALIZED VIEW finance.monthly_paid AS ...
-- Tuesday 10am: Warehouse updated, mart not refreshed
-- Finance users see Monday's data (stale)
-- Must refresh
REFRESH MATERIALIZED VIEW finance.monthly_paid;
When to Use a Data Mart
✅ Department needs focused dataset
✅ Performance optimization for specific queries
✅ Simplify access for business users
✅ Security/governance requirements
✅ Different departments need different views
❌ Don't use if: Warehouse performance is good enough
Cost & Implementation
Cost: $0 additional (uses warehouse infrastructure)
Implementation:
- Time: 1-2 weeks per mart
- Team: 1 analytics engineer
- Tools: dbt (for views/materialized views)
Healthcare example:
- Finance data mart (5 views, 3 materialized views)
- Setup: 1 week
- Team: 1 analytics engineer
- Cost: $0 (reuses warehouse)
The Complete Modern Architecture
How They Work Together
┌─────────────────────────────────────────────────────────┐
│ RAW DATA SOURCES │
│ EDI Claims • Member Files • Provider Data • Images │
└────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ DATA LAKE (S3) │
│ Store everything raw • $23/TB/month • Unstructured OK │
│ /raw/claims/*.json /raw/images/*.dcm /raw/pdf/*.pdf │
└────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ DELTA LAKE (Databricks + S3) │
│ ACID transactions • Time-travel • ML/AI workloads │
│ /delta/claims /delta/members /delta/providers │
│ Used for: ETL processing, ML model training │
└────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ DATA WAREHOUSE (Snowflake) │
│ Fast SQL queries • BI dashboards • Business analytics │
│ dim_member dim_provider fact_claim fact_encounter │
└────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ DATA MARTS │
│ ├── Finance: Claims, Payments, Revenue │
│ ├── Clinical: Encounters, Diagnoses, Quality │
│ └── Operations: Network, Utilization, Capacity │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ END USERS │
│ Analysts • Data Scientists • Business Users • Execs │
└─────────────────────────────────────────────────────────┘
Data Flow Example (Healthcare Claim)
Step 1: Raw ingestion → Data Lake
[[EDI 837](/terms/edi-837)](/terms/EDI%20837) claim file arrives
↓
Store in S3 (Data Lake)
→ s3://data-lake/raw/edi-837/2024-03-25/claim_001.json
→ Cost: $0.023 per GB
→ Time: Instant (just upload)
Step 2: Processing → Delta Lake
# Databricks job reads from S3
raw_claims = spark.read.json("s3://data-lake/raw/edi-837/")
# Parse, clean, validate
claims_clean = raw_claims \
.withColumn("service_date", to_date(col("service_dt"), "yyyyMMdd")) \
.withColumn("paid_amount", col("charges").cast("decimal(10,2)")) \
.filter(col("claim_status") == "ACCEPTED")
# Write to Delta Lake
claims_clean.write.format("delta").mode("append").save("s3://delta/claims")
# Cost: $5/hour cluster
# Time: 30 minutes for 1M claims
Step 3: Modeling → Data Warehouse
-- Load from Delta Lake into Snowflake
COPY INTO claim_header
FROM @delta_stage/claims/
FILE_FORMAT = (TYPE = PARQUET);
-- Transform into star schema
INSERT INTO fact_claim
SELECT
c.claim_id,
m.member_key, -- Lookup dimension key
p.provider_key, -- Lookup dimension key
c.service_date,
c.paid_amount
FROM claim_header c
JOIN dim_member m ON c.member_id = m.member_id
JOIN dim_provider p ON c.provider_id = p.provider_id;
-- Cost: $0.50 for 1M claims
-- Time: 5 minutes
Step 4: Data Marts
-- Finance Mart (materialized view)
CREATE MATERIALIZED VIEW finance.daily_revenue AS
SELECT
service_date,
COUNT(*) as claim_count,
SUM(paid_amount) as total_paid
FROM fact_claim
GROUP BY service_date;
-- Cost: $0 (no additional storage)
-- Time: 30 seconds to create
Step 5: Business User Query
-- Finance analyst queries data mart
SELECT * FROM finance.daily_revenue
WHERE service_date = '2024-03-25';
-- Query time: 0.5 seconds
-- Cost: $0.01
Cost Comparison (Large Health Plan)
Scenario: 5M Members, 50M Claims/Year
Data volumes:
- Raw data: 100TB (10 years)
- Delta Lake: 30TB (curated)
- Warehouse: 10TB (modeled)
Monthly costs:
Architecture 1: Data Lake Only (Cheapest, Worst Performance)
S3 storage: 100TB × $23 = $2,300/month
Compute (Spark): $5,000/month (ETL + queries)
Total: $7,300/month
Performance:
- Queries: 5-10 minutes
- BI dashboards: Unusable
- User satisfaction: 2/10
Architecture 2: Data Lake + Warehouse (Balanced)
S3 storage: 100TB × $23 = $2,300/month
Snowflake storage: 10TB × $40 = $400/month
Snowflake compute: $10,000/month
Total: $12,700/month
Performance:
- Queries: 1-5 seconds
- BI dashboards: Fast
- User satisfaction: 9/10
Architecture 3: Full Modern Stack (Best, Most Expensive)
S3 storage: 100TB × $23 = $2,300/month
Delta Lake compute: $8,000/month (Databricks)
Snowflake storage: 10TB × $40 = $400/month
Snowflake compute: $10,000/month
Total: $20,700/month
Performance:
- Data Lake queries: 2-5 minutes (Spark)
- Warehouse queries: 1-5 seconds
- ML workloads: Optimized (Delta)
- BI dashboards: Fast
- User satisfaction: 10/10
Value: Best architecture for mixed workloads
Decision Framework
Choose Data Lake When:
✅ Storing massive raw data (TBs-PBs)
✅ Unstructured data (images, PDFs, audio)
✅ Archive/backup requirements
✅ Future use cases undefined
✅ Cost is primary concern
✅ Performance less critical
Healthcare fit: Medical image storage, document archives, raw EDI files
Cost: $23/TB/month
Team: 1 engineer
Time: 1 week
Choose Delta Lake When:
✅ Need ACID transactions on data lake
✅ Building lakehouse architecture
✅ Large-scale Spark/ETL pipelines
✅ ML/AI workloads on raw data
✅ Time-travel/audit requirements
✅ Team has Spark expertise
Healthcare fit: Claims processing pipelines, ML model training, clinical data ETL
Cost: $23/TB/month (storage) + $5-10K/month (compute)
Team: 2-3 Spark engineers
Time: 3-4 weeks
Choose Data Warehouse When:
✅ Primary workload is SQL analytics
✅ BI dashboards for business users
✅ Sub-second query performance needed
✅ Team is SQL-based (not Spark)
✅ Hundreds of concurrent users
✅ Ad-hoc analysis requirements
Healthcare fit: Financial reporting, [HEDIS](/terms/HEDIS) analytics, executive dashboards, operational reports
Cost: $40/TB/month (storage) + $5-15K/month (compute)
Team: 3-5 engineers + analyst
Time: 10-12 weeks
Choose Data Marts When:
✅ Department needs focused dataset
✅ Performance optimization needed
✅ Simplify access for specific users
✅ Security/governance requirements
✅ Different views for different teams
Healthcare fit: Finance views, clinical quality views, operations views
Cost: $0 additional (uses warehouse)
Team: 1 analytics engineer
Time: 1-2 weeks per mart
Real Healthcare Implementation
Scenario: Regional Health Plan (2M Members)
Phase 1: Data Lake (Week 1-2)
Setup S3 buckets
Ingest 5 years of historical data (50TB)
Cost: $1,150/month storage
Team: 1 engineer
Phase 2: Delta Lake (Week 3-6)
Setup Databricks workspace
Migrate critical datasets to Delta
Build ETL pipelines (claims, members)
Cost: $4,500/month (storage + compute)
Team: 2 Spark engineers
Phase 3: Data Warehouse (Week 7-18)
Design star schema (claims, members, providers)
Build dbt models
Set up Snowflake
Load 2 years of data (5TB)
Cost: $6,000/month (storage + compute)
Team: 3 engineers + 1 architect
Phase 4: Data Marts (Week 19-22)
Finance mart (claims, payments)
Clinical mart (encounters, quality)
Operations mart (network, utilization)
Cost: $0 additional
Team: 1 analytics engineer
Total implementation:
- Timeline: 22 weeks (~5 months)
- Team: 5-7 people
- Monthly ongoing cost: ~$11,650
- One-time cost: ~$500K (labor)
Conclusion
Data Lake, Delta Lake, Data Warehouse, and Data Mart aren't competing options — they're layers of a modern data architecture.
Data Lake: Cheap raw storage
Delta Lake: Structured data lake with ACID
Data Warehouse: Fast SQL analytics
Data Mart: Focused departmental views
Most healthcare organizations need all four:
- Data Lake for raw storage ($2K/month)
- Delta Lake for ETL/ML ($5K/month)
- Data Warehouse for analytics ($6K/month)
- Data Marts for departments ($0 additional)
Total: ~$13K/month for enterprise-grade data platform
Choose based on workload, not vendor marketing.
The right architecture uses each layer for what it does best.
Related Guides
More in Data Architecture
Azure Synapse vs Snowflake for Healthcare Data Architecture: Which Platform Fits Your Team?
Azure Synapse Analytics and Snowflake both promise a unified cloud data platform — but they make different architectural bets that matter enormously in healthcare. This guide compares them across HIPAA compliance, FHIR integration, PHI governance, cost model, and team fit, with concrete SQL examples and a decision framework built for healthcare data engineers.
Read moreOracle vs Databricks for Healthcare Data Architecture: Which Platform Should You Choose?
Oracle brings four decades of enterprise database maturity, deep EHR integration, and a proven HIPAA compliance story. Databricks brings a unified lakehouse, native AI/ML pipelines, and the ability to handle FHIR, HL7, and unstructured clinical data at scale. This guide breaks down which platform wins in each healthcare scenario — and when you need both.
Read moreTelehealth Data Architecture: Complete Guide for Data Engineers (2026)
A complete guide to building a telehealth data architecture — core schema design, HL7 and FHIR integration, HIPAA compliance, HCC risk adjustment, and the common mistakes that cause claim denials.
Read moreFree Tools
Ready to improve your data architecture?
Free tools for DDL conversion, SQL analysis, naming standards, and more.
Get weekly healthcare data engineering tips
Practical guides on data modeling, SQL standards, and healthcare domain conventions — straight to your inbox.
No spam. Unsubscribe any time.