Data Lake vs Delta Lake vs Data Warehouse vs Data Mart: Complete Guide
Your CTO just asked: "Should we build a data lake or a data warehouse?"
Your architect says: "We need Delta Lake."
Your business analyst wants: "Just give me a data mart."
Everyone's talking about different things.
Here's the reality: Data lake, Delta Lake, data warehouse, and data mart aren't competing options — they're different layers of a modern data architecture, each solving specific problems.
Choosing the wrong one (or forcing one to do everything) costs millions in re-architecture, months of wasted development, and frustrated users who can't get the data they need.
After building 50+ healthcare data platforms, I can tell you what each actually does, when to use them, and how they work together in a real architecture.
The Quick Answer (Before We Dive Deep)
Data Lake: Raw storage for everything (structured, unstructured, semi-structured)
Cost: $23/TB/month (S3)
Use case: "Dump all our data somewhere cheap, we'll figure it out later"
Delta Lake: Data lake + ACID transactions + versioning
Cost: $23/TB/month (storage) + compute
Use case: "We need a data lake that doesn't corrupt files when 10 people write simultaneously"
Data Warehouse: Structured, optimized tables for analytics
Cost: $40-50/TB/month (Snowflake/Redshift)
Use case: "Business analysts need sub-second SQL queries on 100M claims"
Data Mart: Subset of warehouse focused on one department
Cost: Included in warehouse cost
Use case: "Finance needs their own curated dataset, not the entire warehouse"
Data Lake: The Raw Storage Layer
What It Is
A data lake is a repository that stores massive amounts of raw data in its native format until needed.
Think: S3 buckets full of files (CSV, JSON, Parquet, images, PDFs, videos)
Architecture:
Raw Data Sources
↓
Data Lake (S3/ADLS/GCS)
├── /raw/edi-837/2024/*.json (10M claim files)
├── /raw/member-enrollment/*.csv
├── /raw/medical-images/*.dcm (10TB of X-rays)
├── /raw/clinical-notes/*.pdf
└── /raw/iot-devices/*.parquet (real-time vitals)
↓
[Nothing happens automatically - it's just storage]
Strengths
1. Store Everything Cheap
Cost comparison:
- Data Lake (S3): $23/TB/month
- Data Warehouse (Snowflake): $40/TB/month
- Traditional Database (Oracle): $150+/TB/month (with licensing)
Healthcare example:
- 10 years of medical images: 500TB
- S3 cost: $11,500/month
- Snowflake cost: $20,000/month
- Oracle cost: $75,000+/month
Data lakes win for bulk raw storage.
2. Schema-on-Read Flexibility
Don't need to define schema upfront.
Traditional database:
-- Must define schema first
CREATE TABLE member (
    member_id VARCHAR(50),
    first_name VARCHAR(100)
    -- Define all columns upfront
);

-- Then insert data
INSERT INTO member VALUES (...);
Data lake:
# Just dump files
aws s3 cp member_data_2024.csv s3://data-lake/raw/members/
aws s3 cp claims_*.json s3://data-lake/raw/claims/
aws s3 cp x-rays/*.dcm s3://data-lake/raw/images/

# Figure out schema later when you read it
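To see schema-on-read in action, here's a minimal PySpark sketch (assuming the bucket above and standard S3 credentials): the schema is discovered when the files are read, not declared beforehand.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# No CREATE TABLE anywhere: Spark infers the schema at read time
members = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://data-lake/raw/members/"))

members.printSchema()  # discovered from the files themselves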
3. Handle Any Data Type
Structured: CSV, Parquet (claims, members)
Semi-structured: JSON, XML, Avro (EDI files, HL7 messages)
Unstructured: PDFs, images, audio, video (medical records, X-rays, doctor notes)
Traditional databases can't store images, PDFs, or raw JSON effectively.
Weaknesses
1. Data Swamp Risk
What happens in reality:
Year 1: "Let's dump everything in S3!"
Year 2: "Where's the member data?"
Year 3: "Nobody knows what's in here anymore"
Year 4: "The data team spends 50% of time searching for data"
No automatic cataloging, governance, or quality checks.
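A common mitigation is to bolt a catalog on top of the bucket. As a sketch (not part of the lake itself), an AWS Glue crawler can scan a prefix and register table definitions; the crawler name and IAM role below are hypothetical:

import boto3

glue = boto3.client("glue")

# Crawl the raw claims prefix and register discovered schemas in the catalog
glue.create_crawler(
    Name="raw-claims-crawler",                       # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueRole",  # hypothetical role
    DatabaseName="data_lake_raw",
    Targets={"S3Targets": [{"Path": "s3://data-lake/raw/edi-837/"}]},
)
glue.start_crawler(Name="raw-claims-crawler")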
2. No ACID Transactions
Problem:
Engineer A: Writes member_enrollment_2024.parquet
Engineer B: Reads member_enrollment_2024.parquet (same time)
Result: Corrupted file, half-written data, errors
S3 offers no multi-file transactions, so concurrent writes to the same dataset get no atomicity, consistency, or isolation guarantees — there is no table-level ACID.
3. Performance Issues
Reading raw Parquet from S3:
from pyspark.sql.functions import col

# Query 100M claims
df = spark.read.parquet("s3://lake/raw/claims/")
df.filter(col("service_date") >= "2024-01-01")

# Problem: Scans ALL files (slow, expensive)
# Time: 10-15 minutes
# Cost: $50 per query
No indexes, no optimization, full table scans every time.
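Partitioning the files by the columns you filter on mitigates (but doesn't eliminate) the full-scan problem; a minimal sketch, assuming the claims DataFrame above:

# Write once, partitioned by service_date
df.write.partitionBy("service_date").parquet("s3://lake/raw/claims_partitioned/")

# A date filter now prunes partitions instead of scanning every file
recent = (spark.read.parquet("s3://lake/raw/claims_partitioned/")
    .filter(col("service_date") >= "2024-01-01"))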
When to Use a Data Lake
✅ Store massive volumes of raw data cheaply
✅ Archive historical data (10+ years)
✅ Unstructured data (images, PDFs, audio)
✅ "Land and expand" strategy (figure out schema later)
✅ Data science exploration (try things without schema constraints)
❌ Don't use for: Production analytics, BI dashboards, real-time queries
Cost & Implementation
Storage cost: $23/TB/month (S3 standard)
Implementation:
- Time: 1 week (just create S3 buckets and upload)
- Team: 1 data engineer
- Tools: AWS S3, Azure Data Lake Storage, Google Cloud Storage
Healthcare example:
- 100TB raw data (10 years of claims, images, PDFs)
- Storage: $2,300/month
- Setup: 1 week
- Team: 1 engineer
Delta Lake: Data Lake + Structure
What It Is
Delta Lake adds a transaction layer on top of data lakes (S3/ADLS/GCS) to enable ACID transactions, versioning, and schema enforcement.
Think: Data lake that doesn't corrupt when multiple people write simultaneously
Architecture:
Raw Data Sources
↓
Delta Lake (S3 + Delta Transaction Log)
├── /delta/claims/
│ ├── part-0001.parquet (data files)
│ ├── part-0002.parquet
│ └── _delta_log/ (transaction log - the magic)
│ ├── 00000000000000000000.json (version 0)
│ ├── 00000000000000000001.json (version 1)
│ └── 00000000000000000002.json (version 2)
↓
[Guaranteed consistency, time-travel, no corruption]
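The _delta_log directory is plain JSON, so you can inspect the commits yourself; a minimal sketch using the paths from the diagram above:

# Each commit file records which data files were added or removed
log = spark.read.json("s3://delta/claims/_delta_log/*.json")

log.select("commitInfo.operation", "add.path", "remove.path").show(truncate=False)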
Strengths
1. ACID Transactions (No More Data Corruption)
Problem with plain data lake:
10 engineers writing to same table simultaneously
→ File corruption
→ Partial reads
→ Lost data
Delta Lake solution:
# Multiple concurrent writes - no corruption
df1.write.format("delta").mode("append").save("s3://delta/claims")
df2.write.format("delta").mode("append").save("s3://delta/claims")
df3.write.format("delta").mode("append").save("s3://delta/claims")

# Delta Lake transaction log ensures:
# - Atomic writes (all or nothing)
# - Consistent reads (always see complete data)
# - Isolation (writes don't interfere)
# - Durability (committed data never lost)
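The same transaction log makes in-place upserts safe, something a plain data lake cannot do at all. A minimal sketch using Delta's MERGE API (the corrections DataFrame and the claim_id join key are assumptions):

from delta.tables import DeltaTable

claims = DeltaTable.forPath(spark, "s3://delta/claims")

# Transactionally upsert corrected claims by claim_id
(claims.alias("t")
    .merge(corrections.alias("s"), "t.claim_id = s.claim_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())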
2. Time Travel (Audit Trail for Free)
Query any historical version:
from delta.tables import DeltaTable

# Read claims as of an earlier version
df = spark.read.format("delta").option("versionAsOf", 0).load("s3://delta/claims")

# Read claims as of a specific timestamp
df = spark.read.format("delta").option("timestampAsOf", "2024-03-01").load("s3://delta/claims")

# See full history
deltaTable = DeltaTable.forPath(spark, "s3://delta/claims")
deltaTable.history().show()
Healthcare compliance: Built-in audit trail (required for HIPAA, FDA)
3. Schema Enforcement & Evolution
Problem with plain data lake:
# Day 1: member_id is STRING
df1 = [("M123", "John", "Doe")]

# Day 2: Someone writes member_id as INT
df2 = [(456, "Jane", "Smith")]

# Result: Data corruption, mixed types
Delta Lake solution:
# Schema enforced
df.write.format("delta").save("s3://delta/members")

# Try to write an incompatible schema
df_bad.write.format("delta").mode("append").save("s3://delta/members")
# → ERROR: Schema mismatch

# Or evolve schema safely
df.write.format("delta").option("mergeSchema", "true").mode("append").save("s3://delta/members")
4. Optimized Performance
Data compaction:
# Small files problem in data lake:
# 10,000 small files = slow queries

# Delta Lake compaction
deltaTable.optimize().executeCompaction()

# Result: 10,000 files → 100 optimized files
# Query time: 10 min → 30 seconds
Z-ordering (clustering):
# Optimize for common query patterns
deltaTable.optimize().executeZOrderBy("service_date", "member_id")

# Queries filtering by service_date or member_id = 10x faster
Weaknesses
1. Still Not a Data Warehouse
Delta Lake:
- No automatic query optimization
- No automatic clustering (you must run OPTIMIZE)
- Queries need a running Spark cluster or SQL endpoint (you pay while it's up, even between queries)
- Weak native BI integrations (Tableau can't read Delta files directly; it needs a SQL endpoint in front)
Snowflake:
- Automatic optimization
- Auto-clustering
- Separate compute (pay only when querying)
- Native BI integrations
2. Requires Spark Expertise
To use Delta Lake effectively:
from delta.tables import *

# Write
df.write.format("delta").mode("append").save("s3://delta/claims")

# Optimize
deltaTable = DeltaTable.forPath(spark, "s3://delta/claims")
deltaTable.optimize().executeCompaction()
deltaTable.optimize().executeZOrderBy("service_date")

# Vacuum old files
deltaTable.vacuum(168)  # Keep 7 days of history
Your team needs PySpark skills, not just SQL.
3. Cost = Storage + Compute
Cost breakdown:
- Storage: $23/TB/month (S3)
- Compute: Databricks clusters ($0.40/DBU + EC2 costs)
Running queries:
Medium cluster: $8/hour (compute)
Query time: 10 minutes
Cost per query: $1.33
vs Snowflake:
Medium warehouse: $8/hour
Query time: 10 seconds
Cost per query: $0.02
Delta Lake is cheaper for storage, more expensive for ad-hoc queries.
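The per-query math is simply the hourly rate prorated by runtime; a tiny helper reproduces the numbers above:

def cost_per_query(hourly_rate: float, runtime_seconds: float) -> float:
    # Hourly compute rate prorated by query runtime
    return hourly_rate * runtime_seconds / 3600

print(cost_per_query(8.0, 10 * 60))  # Delta Lake, 10 min: ~$1.33
print(cost_per_query(8.0, 10))       # Snowflake, 10 sec: ~$0.02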
When to Use Delta Lake
✅ Building a lakehouse (data lake + warehouse features)
✅ Need ACID transactions on data lake
✅ Large-scale ETL pipelines (Databricks/Spark)
✅ ML/AI workloads on raw data
✅ Time-travel/audit requirements
✅ Team has Spark expertise
❌ Don't use for: BI dashboards (use warehouse), simple analytics (overkill)
Cost & Implementation
Storage cost: $23/TB/month (S3)
Compute cost: $0.40/DBU + cloud compute
Implementation:
- Time: 2-4 weeks (setup Databricks, migrate from data lake)
- Team: 2-3 data engineers (Spark expertise required)
- Tools: Databricks, Delta Lake, Spark
Healthcare example:
- 50TB curated data in Delta Lake
- Storage: $1,150/month
- Compute (ETL pipelines): $3,000-5,000/month
- Total: ~$4,500/month
- Setup: 3-4 weeks
- Team: 2 Spark engineers
Data Warehouse: Structured Analytics Engine
What It Is
A data warehouse is a centralized repository of structured, cleaned, transformed data optimized for analytics and BI.
Think: SQL database on steroids, designed for fast queries on billions of rows
Architecture:
Raw Data (Data Lake)
↓
ETL Pipeline (Transform, Clean, Model)
↓
Data Warehouse (Snowflake/Redshift/BigQuery)
├── Star Schema
│ ├── dim_member (members table)
│ ├── dim_provider (providers table)
│ ├── dim_date (date dimension)
│ └── fact_claim (100M claim records)
↓
BI Tools (Tableau, Looker, Power BI)
↓
Business Users (SQL queries, dashboards)
Strengths
1. Blazing Fast SQL Queries
Same query:
Data Lake (Parquet on S3):
-- Scan 100M claims
SELECT COUNT(*) FROM claims WHERE service_date >= '2024-01-01';

Time: 10 minutes
Cost: $5
Delta Lake (Optimized):
-- Scan 100M claims (optimized)
SELECT COUNT(*) FROM claims WHERE service_date >= '2024-01-01';

Time: 2 minutes
Cost: $1
Data Warehouse (Snowflake):
-- Scan 100M claims (auto-clustered)
SELECT COUNT(*) FROM claims WHERE service_date >= '2024-01-01';

Time: 2 seconds
Cost: $0.01
That's roughly two orders of magnitude faster and cheaper for analytical queries.
2. Automatic Optimization
Snowflake example:
-- No manual optimization needed
CREATE TABLE claim_header (
    claim_id VARCHAR(50),
    service_date DATE,
    paid_amount NUMBER(10,2)
);

-- Snowflake automatically:
-- ✓ Clusters data by insertion order
-- ✓ Creates micro-partitions
-- ✓ Compresses data
-- ✓ Optimizes query plans
-- ✓ Caches results

-- You just query
SELECT * FROM claim_header WHERE service_date = '2024-03-01';
-- → Instant (2 seconds)
Delta Lake:
# Manual optimization required
deltaTable.optimize().executeCompaction()
deltaTable.optimize().executeZOrderBy("service_date")
deltaTable.vacuum(168)

# Run these weekly or queries slow down
3. Separation of Storage and Compute
Snowflake architecture:
Storage Layer (always running, always available)
↓
Virtual Warehouse 1 (ETL) - runs 6am-8am
Virtual Warehouse 2 (Reporting) - runs 8am-6pm
Virtual Warehouse 3 (Analytics) - runs as needed
Pay only when virtual warehouses are running
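Auto-suspend is what makes "pay only while running" real. A sketch via the Snowflake Python connector (credentials and the warehouse name are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",  # placeholder credentials
    user="my_user",
    password="...",
)

# Suspend after 60 idle seconds; resume automatically on the next query
conn.cursor().execute("""
    CREATE WAREHOUSE IF NOT EXISTS reporting_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE
""")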
Cost savings:
Traditional database: $50K/month (always running)
Snowflake: $15K/month (auto-suspend when idle)
Savings: $35K/month (70%)
4. Native BI Tool Integration
Tableau → Snowflake:
- One click connection
- Live queries (not extracts)
- Sub-second dashboard refreshes
- No ETL needed
Tableau → Delta Lake:
- Set up JDBC/ODBC connection (complex)
- Start Spark cluster (slow, expensive)
- Query performance inconsistent
- Requires Databricks SQL endpoint ($$$)
5. Time-Travel and Zero-Copy Cloning
Snowflake time-travel:
-- Query data as of 1 hour ago
SELECT * FROM claims AT (OFFSET => -3600);

-- Restore accidentally deleted data
CREATE TABLE claims_restored CLONE claims AT (OFFSET => -3600);

-- Zero-copy clone (instant, no storage duplication)
CREATE TABLE claims_dev CLONE claims;
-- → Dev environment in 5 seconds, 0 additional storage cost
Weaknesses
1. Higher Storage Cost
Cost comparison (1TB):
- S3 (Data Lake): $23/month
- Snowflake: $40/month
- Redshift: $50/month
For 100TB:
- S3: $2,300/month
- Snowflake: $4,000/month
- Difference: $1,700/month
But: Query performance often justifies the extra cost
2. Not Great for Unstructured Data
Snowflake is designed for tables:
-- Great
CREATE TABLE claim_header (...);

-- Possible but awkward
CREATE TABLE medical_images (
    image_id VARCHAR(50),
    image_data BINARY  -- a 10MB X-ray won't even fit (BINARY caps at 8MB)
);
-- → Expensive, slow, not what Snowflake is designed for
Unstructured data belongs in data lake, not warehouse.
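The standard pattern is to keep the file in the lake and store only a pointer plus queryable metadata in the warehouse; a hedged sketch (bucket, key, and row layout are assumptions):

import boto3

s3 = boto3.client("s3")

# The image itself stays in the data lake...
s3.upload_file("xray_001.dcm", "data-lake", "raw/medical-images/xray_001.dcm")

# ...while the warehouse row holds only the URI plus searchable metadata
image_row = {
    "image_id": "IMG001",
    "member_id": "M123",
    "modality": "XR",
    "s3_uri": "s3://data-lake/raw/medical-images/xray_001.dcm",
}
# Load image_row into the warehouse with your usual loader (COPY, dbt, connector)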
3. Vendor Lock-In
Snowflake proprietary features:
- Time-travel
- Zero-copy cloning
- Streams and tasks
- Snowpipe
Migrating off Snowflake = rewrite all these features.
Delta Lake = open-source standard (portable)
When to Use a Data Warehouse
✅ Primary workload is SQL analytics
✅ BI dashboards for business users
✅ Hundreds of analysts running queries
✅ Need sub-second query performance
✅ Team is SQL-based (not Spark engineers)
✅ Ad-hoc analysis requirements
❌ Don't use for: Unstructured data storage, ML training on raw data
Cost & Implementation
Storage cost: $40-50/TB/month (Snowflake/Redshift)
Compute cost: $2-100/hour per warehouse (based on size)
Implementation:
- Time: 8-12 weeks (data modeling, ETL, testing)
- Team: 3-5 data engineers, 1 analytics engineer, 1 data modeler
- Tools: Snowflake/Redshift/BigQuery, dbt, Fivetran/Airbyte
Healthcare example:
- 20TB modeled data (claims, members, providers)
- Storage: $800/month
- Compute (3 warehouses): $8,000/month
- Total: ~$9,000/month
- Setup: 10-12 weeks
- Team: 4 engineers + 1 architect
Data Mart: Focused Subset
What It Is
A data mart is a subset of a data warehouse focused on a specific business function, department, or use case.
Think: Finance's view of the warehouse (only claims, payments, revenue — no clinical data)
Architecture:
Data Warehouse (Enterprise)
├── All Claims (100M records)
├── All Members (10M records)
├── All Providers (500K records)
├── All Encounters (50M records)
└── All Lab Results (200M records)
↓
Data Marts (Departmental Views)
├── Finance Mart
│ ├── Claims (financial columns only)
│ ├── Payments
│ └── Revenue
│
├── Clinical Mart
│ ├── Encounters
│ ├── Diagnoses
│ └── Lab Results
│
└── Operations Mart
├── Provider Performance
├── Network Utilization
└── Capacity Planning
Strengths
1. Simplified Access
Problem:
-- Finance user queries entire warehouse (overwhelming)
SELECT * FROM dim_member
JOIN fact_claim ON ...
JOIN dim_provider ON ...
JOIN fact_encounter ON ...
JOIN dim_diagnosis ON ...
-- → 50 tables, 847 columns, nobody knows what to query
Solution: Finance Data Mart
-- Pre-built views for finance
SELECT * FROM finance.monthly_revenue;
SELECT * FROM finance.claims_aging;
SELECT * FROM finance.provider_payments;
-- → 10 tables, 50 columns, all relevant to finance
2. Performance Optimization
Data mart strategies:
A) Materialized views (pre-aggregated):
-- Warehouse: Aggregate on every query (slow)
SELECT DATE_TRUNC('month', service_date) as month,
       SUM(paid_amount) as total_paid
FROM fact_claim
GROUP BY DATE_TRUNC('month', service_date);
-- → 10 seconds each time

-- Data Mart: Pre-aggregated materialized view
CREATE MATERIALIZED VIEW finance.monthly_paid AS
SELECT DATE_TRUNC('month', service_date) as month,
       SUM(paid_amount) as total_paid
FROM fact_claim
GROUP BY DATE_TRUNC('month', service_date);
-- → Query takes 0.5 seconds (20x faster)
B) Denormalized tables:
-- Warehouse: Normalized (many joins)
SELECT m.member_name, p.provider_name, c.paid_amount
FROM fact_claim c
JOIN dim_member m ON c.member_id = m.member_id
JOIN dim_provider p ON c.provider_id = p.provider_id;
-- → 5 second query

-- Data Mart: Denormalized (no joins)
SELECT member_name, provider_name, paid_amount
FROM finance.claims_denormalized;
-- → 0.2 second query (25x faster)
3. Security & Governance
Row-level security:
-- Finance sees only their data
CREATE VIEW finance.claims_view AS
SELECT * FROM fact_claim
WHERE line_of_business = 'Commercial';

-- Clinical sees different data
CREATE VIEW clinical.claims_view AS
SELECT * FROM fact_claim
WHERE has_clinical_data = TRUE;

-- Finance can't see clinical data, and vice versa
4. Cost Efficiency
Data mart implementation:
- No additional storage (views on warehouse)
- No separate infrastructure
- Reuses existing warehouse compute
Cost: $0 additional (just warehouse cost)
Weaknesses
1. Maintenance Overhead
10 data marts = 10× maintenance:
Update warehouse schema
→ Update finance mart
→ Update clinical mart
→ Update operations mart
→ Update quality mart
→ Update compliance mart
→ ... (repeat for each mart)
Each mart adds complexity.
2. Data Duplication (If Physically Separate)
Problem:
Warehouse: 20TB
Finance Mart: 5TB (copied from warehouse)
Clinical Mart: 8TB (copied from warehouse)
Operations Mart: 3TB (copied from warehouse)
Total storage: 36TB (16TB of duplicated data, an 80% overhead on the 20TB warehouse!)
Solution: Use views, not copies
3. Potential Stale Data
Materialized views:
-- Created Monday 9am
CREATE MATERIALIZED VIEW finance.monthly_paid AS ...

-- Tuesday 10am: Warehouse updated, mart not refreshed
-- Finance users see Monday's data (stale)

-- Must refresh (Redshift/Postgres syntax; Snowflake maintains
-- materialized views automatically)
REFRESH MATERIALIZED VIEW finance.monthly_paid;
When to Use a Data Mart
✅ Department needs focused dataset
✅ Performance optimization for specific queries
✅ Simplify access for business users
✅ Security/governance requirements
✅ Different departments need different views
❌ Don't use if: Warehouse performance is good enough
Cost & Implementation
Cost: $0 additional (uses warehouse infrastructure)
Implementation:
- Time: 1-2 weeks per mart
- Team: 1 analytics engineer
- Tools: dbt (for views/materialized views)
Healthcare example:
- Finance data mart (5 views, 3 materialized views)
- Setup: 1 week
- Team: 1 analytics engineer
- Cost: $0 (reuses warehouse)
The Complete Modern Architecture
How They Work Together
┌─────────────────────────────────────────────────────────┐
│ RAW DATA SOURCES │
│ EDI Claims • Member Files • Provider Data • Images │
└────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ DATA LAKE (S3) │
│ Store everything raw • $23/TB/month • Unstructured OK │
│ /raw/claims/*.json /raw/images/*.dcm /raw/pdf/*.pdf │
└────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ DELTA LAKE (Databricks + S3) │
│ ACID transactions • Time-travel • ML/AI workloads │
│ /delta/claims /delta/members /delta/providers │
│ Used for: ETL processing, ML model training │
└────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ DATA WAREHOUSE (Snowflake) │
│ Fast SQL queries • BI dashboards • Business analytics │
│ dim_member dim_provider fact_claim fact_encounter │
└────────────────────┬────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ DATA MARTS │
│ ├── Finance: Claims, Payments, Revenue │
│ ├── Clinical: Encounters, Diagnoses, Quality │
│ └── Operations: Network, Utilization, Capacity │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ END USERS │
│ Analysts • Data Scientists • Business Users • Execs │
└─────────────────────────────────────────────────────────┘
Data Flow Example (Healthcare Claim)
Step 1: Raw ingestion → Data Lake
EDI 837 claim file arrives
↓
Store in S3 (Data Lake)
→ s3://data-lake/raw/edi-837/2024-03-25/claim_001.json
→ Cost: $0.023 per GB per month
→ Time: Instant (just upload)
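In code, this step is a single object upload; a minimal boto3 sketch mirroring the path above:

import boto3

s3 = boto3.client("s3")

# Land the raw EDI 837 claim file in the lake, byte-for-byte as received
s3.upload_file(
    "claim_001.json",
    "data-lake",
    "raw/edi-837/2024-03-25/claim_001.json",
)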
Step 2: Processing → Delta Lake
from pyspark.sql.functions import col, to_date

# Databricks job reads from S3
raw_claims = spark.read.json("s3://data-lake/raw/edi-837/")

# Parse, clean, validate
claims_clean = raw_claims \
    .withColumn("service_date", to_date(col("service_dt"), "yyyyMMdd")) \
    .withColumn("paid_amount", col("charges").cast("decimal(10,2)")) \
    .filter(col("claim_status") == "ACCEPTED")

# Write to Delta Lake
claims_clean.write.format("delta").mode("append").save("s3://delta/claims")

# Cost: $5/hour cluster
# Time: 30 minutes for 1M claims
Step 3: Modeling → Data Warehouse
-- Load from Delta Lake into Snowflake
-- (Caveat: reading a Delta directory as raw Parquet can pick up files
-- removed by later commits; in practice, export a snapshot or use an
-- external table.)
COPY INTO claim_header
FROM @delta_stage/claims/
FILE_FORMAT = (TYPE = PARQUET);

-- Transform into star schema
INSERT INTO fact_claim
SELECT c.claim_id,
       m.member_key,    -- Lookup dimension key
       p.provider_key,  -- Lookup dimension key
       c.service_date,
       c.paid_amount
FROM claim_header c
JOIN dim_member m ON c.member_id = m.member_id
JOIN dim_provider p ON c.provider_id = p.provider_id;

-- Cost: $0.50 for 1M claims
-- Time: 5 minutes
Step 4: Data Marts
-- Finance Mart (materialized view)
CREATE MATERIALIZED VIEW finance.daily_revenue AS
SELECT service_date,
       COUNT(*) as claim_count,
       SUM(paid_amount) as total_paid
FROM fact_claim
GROUP BY service_date;

-- Cost: $0 (no additional storage)
-- Time: 30 seconds to create
Step 5: Business User Query
-- Finance analyst queries data mart
SELECT * FROM finance.daily_revenue WHERE service_date = '2024-03-25';

-- Query time: 0.5 seconds
-- Cost: $0.01
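The same query from a Python notebook, via the Snowflake connector (connection parameters are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="analyst", password="...",  # placeholders
    warehouse="reporting_wh", database="analytics", schema="finance",
)

cur = conn.cursor()
cur.execute(
    "SELECT * FROM daily_revenue WHERE service_date = %s",
    ("2024-03-25",),
)
for row in cur.fetchall():
    print(row)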
Cost Comparison (Large Health Plan)
Scenario: 5M Members, 50M Claims/Year
Data volumes:
- Raw data: 100TB (10 years)
- Delta Lake: 30TB (curated)
- Warehouse: 10TB (modeled)
Monthly costs:
Architecture 1: Data Lake Only (Cheapest, Worst Performance)
S3 storage: 100TB × $23 = $2,300/month
Compute (Spark): $5,000/month (ETL + queries)
Total: $7,300/month
Performance:
- Queries: 5-10 minutes
- BI dashboards: Unusable
- User satisfaction: 2/10
Architecture 2: Data Lake + Warehouse (Balanced)
S3 storage: 100TB × $23 = $2,300/month
Snowflake storage: 10TB × $40 = $400/month
Snowflake compute: $10,000/month
Total: $12,700/month
Performance:
- Queries: 1-5 seconds
- BI dashboards: Fast
- User satisfaction: 9/10
Architecture 3: Full Modern Stack (Best, Most Expensive)
S3 storage: 100TB × $23 = $2,300/month
Delta Lake compute: $8,000/month (Databricks)
Snowflake storage: 10TB × $40 = $400/month
Snowflake compute: $10,000/month
Total: $20,700/month
Performance:
- Data Lake queries: 2-5 minutes (Spark)
- Warehouse queries: 1-5 seconds
- ML workloads: Optimized (Delta)
- BI dashboards: Fast
- User satisfaction: 10/10
Value: Best architecture for mixed workloads
Decision Framework
Choose Data Lake When:
✅ Storing massive raw data (TBs-PBs)
✅ Unstructured data (images, PDFs, audio)
✅ Archive/backup requirements
✅ Future use cases undefined
✅ Cost is primary concern
✅ Performance less critical
Healthcare fit: Medical image storage, document archives, raw EDI files
Cost: $23/TB/month
Team: 1 engineer
Time: 1 week
Choose Delta Lake When:
✅ Need ACID transactions on data lake
✅ Building lakehouse architecture
✅ Large-scale Spark/ETL pipelines
✅ ML/AI workloads on raw data
✅ Time-travel/audit requirements
✅ Team has Spark expertise
Healthcare fit: Claims processing pipelines, ML model training, clinical data ETL
Cost: $23/TB/month (storage) + $5-10K/month (compute)
Team: 2-3 Spark engineers
Time: 3-4 weeks
Choose Data Warehouse When:
✅ Primary workload is SQL analytics
✅ BI dashboards for business users
✅ Sub-second query performance needed
✅ Team is SQL-based (not Spark)
✅ Hundreds of concurrent users
✅ Ad-hoc analysis requirements
Healthcare fit: Financial reporting, HEDIS analytics, executive dashboards, operational reports
Cost: $40/TB/month (storage) + $5-15K/month (compute)
Team: 3-5 engineers + analyst
Time: 10-12 weeks
Choose Data Marts When:
✅ Department needs focused dataset
✅ Performance optimization needed
✅ Simplify access for specific users
✅ Security/governance requirements
✅ Different views for different teams
Healthcare fit: Finance views, clinical quality views, operations views
Cost: $0 additional (uses warehouse)
Team: 1 analytics engineer
Time: 1-2 weeks per mart
Real Healthcare Implementation
Scenario: Regional Health Plan (2M Members)
Phase 1: Data Lake (Week 1-2)
Setup S3 buckets
Ingest 5 years of historical data (50TB)
Cost: $1,150/month storage
Team: 1 engineer
Phase 2: Delta Lake (Week 3-6)
Setup Databricks workspace
Migrate critical datasets to Delta
Build ETL pipelines (claims, members)
Cost: $4,500/month (storage + compute)
Team: 2 Spark engineers
Phase 3: Data Warehouse (Week 7-18)
Design star schema (claims, members, providers)
Build dbt models
Set up Snowflake
Load 2 years of data (5TB)
Cost: $6,000/month (storage + compute)
Team: 3 engineers + 1 architect
Phase 4: Data Marts (Week 19-22)
Finance mart (claims, payments)
Clinical mart (encounters, quality)
Operations mart (network, utilization)
Cost: $0 additional
Team: 1 analytics engineer
Total implementation:
- Timeline: 22 weeks (~5 months)
- Team: 5-7 people
- Monthly ongoing cost: ~$11,650
- One-time cost: ~$500K (labor)
Conclusion
Data Lake, Delta Lake, Data Warehouse, and Data Mart aren't competing options — they're layers of a modern data architecture.
Data Lake: Cheap raw storage
Delta Lake: Structured data lake with ACID
Data Warehouse: Fast SQL analytics
Data Mart: Focused departmental views
Most healthcare organizations need all four:
- Data Lake for raw storage ($2K/month)
- Delta Lake for ETL/ML ($5K/month)
- Data Warehouse for analytics ($6K/month)
- Data Marts for departments ($0 additional)
Total: ~$13K/month for enterprise-grade data platform
Choose based on workload, not vendor marketing.
The right architecture uses each layer for what it does best.