mdatool
Healthcare Data Dictionary for the Modern Data Stack
LibraryBlogPricing
mdatool
mdatool

The healthcare data dictionary for dbt, Snowflake, Databricks, and BigQuery. 100,000+ ISO-11179 standard terms, free SQL tools, and AI data modeling.

HIPAA-AlignedEnterprise Ready

Tools

  • SQL Linter
  • DDL Converter
  • Bulk Sanitizer
  • Naming Auditor
  • Name Generator
  • AI Data Modeling
  • HCC Calculator
  • Data Model Canvas

Library

  • Glossary
  • Guides
  • Blog

Company

  • About
  • Contact
  • Pricing

Account

  • Sign Up Free
  • Sign In
  • Upgrade to Pro
  • Dashboard

Legal

  • Privacy Policy
  • Terms of Service

© 2026 mdatool. All rights reserved.

Built for healthcare data engineers & architects.

HomeBlogData ArchitectureData Lake vs Delta Lake vs Data Warehouse vs Data Mart: Complete Guide
Data Architecture

Data Lake vs Delta Lake vs Data Warehouse vs Data Mart: Complete Guide

Your CTO asks: "Data lake or data warehouse?" Your architect says: "Delta Lake." Your analyst wants: "Just a data mart." Everyone''s confused. Here''s what each actually does, when to use them, and how they work together—with real costs, timelines, and healthcare examples.

mdatool Team·March 25, 2026·15 min read
data lakedelta lakedata warehousedata martarchitecturesnowflakedatabrickshealthcare

Your CTO just asked: "Should we build a data lake or a data warehouse?"

Your architect says: "We need Delta Lake."

Your business analyst wants: "Just give me a data mart."

Everyone's talking about different things.

Here's the reality: Data lake, Delta Lake, data warehouse, and data mart aren't competing options — they're different layers of a modern data architecture, each solving specific problems.

Choosing the wrong one (or forcing one to do everything) costs millions in re-architecture, months of wasted development, and frustrated users who can't get the data they need.

After building 50+ healthcare data platforms, here's what each actually does, when to use them, and how they work together in a real architecture.


The Quick Answer (Before We Dive Deep)

Data Lake: Raw storage for everything (structured, unstructured, semi-structured)
Cost: $23/TB/month (S3)
Use case: "Dump all our data somewhere cheap, we'll figure it out later"

Delta Lake: Data lake + ACID transactions + versioning
Cost: $23/TB/month (storage) + compute
Use case: "We need a data lake that doesn't corrupt files when 10 people write simultaneously"

Data Warehouse: Structured, optimized tables for analytics
Cost: $40-50/TB/month (Snowflake/Redshift)
Use case: "Business analysts need sub-second SQL queries on 100M claims"

Data Mart: Subset of warehouse focused on one department
Cost: Included in warehouse cost
Use case: "Finance needs their own curated dataset, not the entire warehouse"


Data Lake: The Raw Storage Layer

What It Is

A data lake is a repository that stores massive amounts of raw data in its native format until needed.

Think: S3 buckets full of files (CSV, JSON, Parquet, images, PDFs, videos)

Architecture:

Raw Data Sources ↓ Data Lake (S3/ADLS/GCS) ├── /raw/edi-837/2024/*.json (10M claim files) ├── /raw/member-enrollment/*.csv ├── /raw/medical-images/*.dcm (10TB of X-rays) ├── /raw/clinical-notes/*.pdf └── /raw/iot-devices/*.parquet (real-time vitals) ↓ [Nothing happens automatically - it's just storage]

Strengths

1. Store Everything Cheap

Cost comparison:

  • Data Lake (S3): $23/TB/month
  • Data Warehouse (Snowflake): $40/TB/month
  • Traditional Database (Oracle): $150+/TB/month (with licensing)

Healthcare example:

  • 10 years of medical images: 500TB
  • S3 cost: $11,500/month
  • Snowflake cost: $20,000/month
  • Oracle cost: $75,000+/month

Data lakes win for bulk raw storage.

2. Schema-on-Read Flexibility

Don't need to define schema upfront.

Traditional database:

-- Must define schema first
CREATE TABLE member (
  member_id VARCHAR(50),
  first_name VARCHAR(100),
  -- Define all columns upfront
);

-- Then insert data
INSERT INTO member VALUES (...);

Data lake:

# Just dump files
aws s3 cp member_data_2024.csv s3://data-lake/raw/members/
aws s3 cp claims_*.json s3://data-lake/raw/claims/
aws s3 cp x-rays/*.dcm s3://data-lake/raw/images/

# Figure out schema later when you read it

3. Handle Any Data Type

Structured: CSV, Parquet (claims, members)
Semi-structured: JSON, XML, Avro (EDI files, [HL7](https://www.mdatool.com/glossary#HL7) messages)
Unstructured: PDFs, images, audio, video (medical records, X-rays, doctor notes)

📋

Free Tool

Parse this HL7 message →

Traditional databases can't store images, PDFs, or raw JSON effectively.

Weaknesses

1. Data Swamp Risk

What happens in reality:

Year 1: "Let's dump everything in S3!" Year 2: "Where's the member data?" Year 3: "Nobody knows what's in here anymore" Year 4: "The data team spends 50% of time searching for data"

No automatic cataloging, governance, or quality checks.

2. No ACID Transactions

Problem:

Engineer A: Writes member_enrollment_2024.parquet Engineer B: Reads member_enrollment_2024.parquet (same time) Result: Corrupted file, half-written data, errors

S3 doesn't guarantee atomicity, consistency, isolation, durability (ACID).

3. Performance Issues

Reading raw Parquet from S3:

# Query 100M claims
df = spark.read.parquet("s3://lake/raw/claims/")
df.filter(col("service_date") >= "2024-01-01")

# Problem: Scans ALL files (slow, expensive)
# Time: 10-15 minutes
# Cost: $50 per query

No indexes, no optimization, full table scans every time.

When to Use a Data Lake

✅ Store massive volumes of raw data cheaply
✅ Archive historical data (10+ years)
✅ Unstructured data (images, PDFs, audio)
✅ "Land and expand" strategy (figure out schema later)
✅ Data science exploration (try things without schema constraints)

❌ Don't use for: Production analytics, BI dashboards, real-time queries

Cost & Implementation

Storage cost: $23/TB/month (S3 standard)

Implementation:

  • Time: 1 week (just create S3 buckets and upload)
  • Team: 1 data engineer
  • Tools: AWS S3, Azure Data Lake Storage, Google Cloud Storage

Healthcare example:

  • 100TB raw data (10 years of claims, images, PDFs)
  • Storage: $2,300/month
  • Setup: 1 week
  • Team: 1 engineer

Delta Lake: Data Lake + Structure

What It Is

Delta Lake adds a transaction layer on top of data lakes (S3/ADLS/GCS) to enable ACID transactions, versioning, and schema enforcement.

Think: Data lake that doesn't corrupt when multiple people write simultaneously

Architecture:

Raw Data Sources ↓ Delta Lake (S3 + Delta Transaction Log) ├── /delta/claims/ │ ├── part-0001.parquet (data files) │ ├── part-0002.parquet │ └── _delta_log/ (transaction log - the magic) │ ├── 00000000000000000000.json (version 0) │ ├── 00000000000000000001.json (version 1) │ └── 00000000000000000002.json (version 2) ↓ [Guaranteed consistency, time-travel, no corruption]

Strengths

1. ACID Transactions (No More Data Corruption)

Problem with plain data lake:

10 engineers writing to same table simultaneously → File corruption → Partial reads → Lost data

Delta Lake solution:

# Multiple concurrent writes - no corruption
df1.write.format("delta").mode("append").save("s3://delta/claims")
df2.write.format("delta").mode("append").save("s3://delta/claims")
df3.write.format("delta").mode("append").save("s3://delta/claims")

# Delta Lake transaction log ensures:
# - Atomic writes (all or nothing)
# - Consistent reads (always see complete data)
# - Isolation (writes don't interfere)
# - Durability (committed data never lost)

2. Time Travel (Audit Trail for Free)

Query any historical version:

# Read claims as of yesterday
df = spark.read.format("delta").option("versionAsOf", 0).load("s3://delta/claims")

# Read claims as of specific timestamp
df = spark.read.format("delta").option("timestampAsOf", "2024-03-01").load("s3://delta/claims")

# See full history
deltaTable.history().show()

Healthcare compliance: Built-in audit trail (required for [HIPAA](https://www.mdatool.com/glossary#HIPAA), FDA)

3. Schema Enforcement & Evolution

Problem with plain data lake:

# Day 1: member_id is STRING
df1 = [("M123", "John", "Doe")]

# Day 2: Someone writes member_id as INT
df2 = [(456, "Jane", "Smith")]

# Result: Data corruption, mixed types

Delta Lake solution:

# Schema enforced
df.write.format("delta").save("s3://delta/members")

# Try to write incompatible schema
df_bad.write.format("delta").mode("append").save("s3://delta/members")
# → ERROR: Schema mismatch

# Or evolve schema safely
df.write.format("delta").option("mergeSchema", "true").mode("append").save("s3://delta/members")

4. Optimized Performance

Data compaction:

# Small files problem in data lake
# 10,000 small files = slow queries

# Delta Lake compaction
deltaTable.optimize().executeCompaction()

# Result: 10,000 files → 100 optimized files
# Query time: 10 min → 30 seconds

Z-ordering (clustering):

# Optimize for common query patterns
deltaTable.optimize().executeZOrderBy("service_date", "member_id")

# Queries filtering by service_date or member_id = 10x faster

Weaknesses

1. Still Not a Data Warehouse

Delta Lake:

  • No automatic query optimization
  • No automatic clustering (you must run OPTIMIZE)
  • No separation of storage/compute (pay for compute even when idle)
  • No BI tool integrations (Tableau can't query Delta directly)

Snowflake:

  • Automatic optimization
  • Auto-clustering
  • Separate compute (pay only when querying)
  • Native BI integrations

2. Requires Spark Expertise

To use Delta Lake effectively:

from delta.tables import *

# Write
df.write.format("delta").mode("append").save("s3://delta/claims")

# Optimize
deltaTable = DeltaTable.forPath(spark, "s3://delta/claims")
deltaTable.optimize().executeCompaction()
deltaTable.optimize().executeZOrderBy("service_date")

# Vacuum old files
deltaTable.vacuum(168)  # Keep 7 days of history

Your team needs PySpark skills, not just SQL.

3. Cost = Storage + Compute

Cost breakdown:

  • Storage: $23/TB/month (S3)
  • Compute: Databricks clusters ($0.40/DBU + EC2 costs)

Running queries:

Medium cluster: $8/hour (compute) Query time: 10 minutes Cost per query: $1.33 vs Snowflake: Medium warehouse: $8/hour Query time: 10 seconds Cost per query: $0.02

Delta Lake is cheaper for storage, more expensive for ad-hoc queries.

When to Use Delta Lake

✅ Building a lakehouse (data lake + warehouse features)
✅ Need ACID transactions on data lake
✅ Large-scale ETL pipelines (Databricks/Spark)
✅ ML/AI workloads on raw data
✅ Time-travel/audit requirements
✅ Team has Spark expertise

❌ Don't use for: BI dashboards (use warehouse), simple analytics (overkill)

Cost & Implementation

Storage cost: $23/TB/month (S3)
Compute cost: $0.40/DBU + cloud compute

Implementation:

  • Time: 2-4 weeks (setup Databricks, migrate from data lake)
  • Team: 2-3 data engineers (Spark expertise required)
  • Tools: Databricks, Delta Lake, Spark

Healthcare example:

  • 50TB curated data in Delta Lake
  • Storage: $1,150/month
  • Compute (ETL pipelines): $3,000-5,000/month
  • Total: ~$4,500/month
  • Setup: 3-4 weeks
  • Team: 2 Spark engineers

Data Warehouse: Structured Analytics Engine

What It Is

A data warehouse is a centralized repository of structured, cleaned, transformed data optimized for analytics and BI.

Think: SQL database on steroids, designed for fast queries on billions of rows

Architecture:

Raw Data (Data Lake) ↓ ETL Pipeline (Transform, Clean, Model) ↓ Data Warehouse (Snowflake/Redshift/BigQuery) ├── Star Schema │ ├── dim_member (members table) │ ├── dim_provider (providers table) │ ├── dim_date (date dimension) │ └── fact_claim (100M claim records) ↓ BI Tools (Tableau, Looker, Power BI) ↓ Business Users (SQL queries, dashboards)

Strengths

1. Blazing Fast SQL Queries

Same query:

Data Lake (Parquet on S3):

-- Scan 100M claims
SELECT COUNT(*) FROM claims 
WHERE service_date >= '2024-01-01';

Time: 10 minutes
Cost: $5

Delta Lake (Optimized):

-- Scan 100M claims (optimized)
SELECT COUNT(*) FROM claims 
WHERE service_date >= '2024-01-01';

Time: 2 minutes
Cost: $1

Data Warehouse (Snowflake):

-- Scan 100M claims (auto-clustered)
SELECT COUNT(*) FROM claims 
WHERE service_date >= '2024-01-01';

Time: 2 seconds
Cost: $0.01

100x faster, 100x cheaper for analytical queries.

2. Automatic Optimization

Snowflake example:

-- No manual optimization needed
CREATE TABLE claim_header (
  claim_id VARCHAR(50),
  service_date DATE,
  paid_amount NUMBER(10,2)
);

-- Snowflake automatically:
-- ✓ Clusters data by insertion order
-- ✓ Creates micro-partitions
-- ✓ Compresses data
-- ✓ Optimizes query plans
-- ✓ Caches results

-- You just query
SELECT * FROM claim_header WHERE service_date = '2024-03-01';
-- → Instant (2 seconds)

Delta Lake:

# Manual optimization required
deltaTable.optimize().executeCompaction()
deltaTable.optimize().executeZOrderBy("service_date")
deltaTable.vacuum(168)

# Run these weekly or queries slow down

3. Separation of Storage and Compute

Snowflake architecture:

Storage Layer (always running, always available) ↓ Virtual Warehouse 1 (ETL) - runs 6am-8am Virtual Warehouse 2 (Reporting) - runs 8am-6pm Virtual Warehouse 3 (Analytics) - runs as needed Pay only when virtual warehouses are running

Cost savings:

Traditional database: $50K/month (always running) Snowflake: $15K/month (auto-suspend when idle) Savings: $35K/month (70%)

4. Native BI Tool Integration

Tableau → Snowflake:

  • One click connection
  • Live queries (not extracts)
  • Sub-second dashboard refreshes
  • No ETL needed

Tableau → Delta Lake:

  • Set up JDBC/ODBC connection (complex)
  • Start Spark cluster (slow, expensive)
  • Query performance inconsistent
  • Requires Databricks SQL endpoint ($$$)

5. Time-Travel and Zero-Copy Cloning

Snowflake time-travel:

-- Query data as of 1 hour ago
SELECT * FROM claims AT (OFFSET => -3600);

-- Restore accidentally deleted data
CREATE TABLE claims_restored CLONE claims AT (OFFSET => -3600);

-- Zero-copy clone (instant, no storage duplication)
CREATE TABLE claims_dev CLONE claims;
-- → Dev environment in 5 seconds, 0 additional storage cost

Weaknesses

1. Higher Storage Cost

Cost comparison (1TB):

  • S3 (Data Lake): $23/month
  • Snowflake: $40/month
  • Redshift: $50/month

For 100TB:

  • S3: $2,300/month
  • Snowflake: $4,000/month
  • Difference: $1,700/month

But: Query performance often justifies the extra cost

2. Not Great for Unstructured Data

Snowflake is designed for tables:

-- Great
CREATE TABLE claim_header (...);

-- Possible but awkward
CREATE TABLE medical_images (
  image_id VARCHAR(50),
  image_data BINARY -- 10MB X-ray image
);
-- → Expensive, slow, not what Snowflake is designed for

Unstructured data belongs in data lake, not warehouse.

3. Vendor Lock-In

Snowflake proprietary features:

  • Time-travel
  • Zero-copy cloning
  • Streams and tasks
  • Snowpipe

Migrating off Snowflake = rewrite all these features.

Delta Lake = open-source standard (portable)

When to Use a Data Warehouse

✅ Primary workload is SQL analytics
✅ BI dashboards for business users
✅ Hundreds of analysts running queries
✅ Need sub-second query performance
✅ Team is SQL-based (not Spark engineers)
✅ Ad-hoc analysis requirements

❌ Don't use for: Unstructured data storage, ML training on raw data

Cost & Implementation

Storage cost: $40-50/TB/month (Snowflake/Redshift)
Compute cost: $2-100/hour per warehouse (based on size)

Implementation:

  • Time: 8-12 weeks (data modeling, ETL, testing)
  • Team: 3-5 data engineers, 1 analytics engineer, 1 data modeler
  • Tools: Snowflake/Redshift/BigQuery, dbt, Fivetran/Airbyte

Healthcare example:

  • 20TB modeled data (claims, members, providers)
  • Storage: $800/month
  • Compute (3 warehouses): $8,000/month
  • Total: ~$9,000/month
  • Setup: 10-12 weeks
  • Team: 4 engineers + 1 architect

Data Mart: Focused Subset

What It Is

A data mart is a subset of a data warehouse focused on a specific business function, department, or use case.

Think: Finance's view of the warehouse (only claims, payments, revenue — no clinical data)

Architecture:

Data Warehouse (Enterprise) ├── All Claims (100M records) ├── All Members (10M records) ├── All Providers (500K records) ├── All Encounters (50M records) └── All Lab Results (200M records) ↓ Data Marts (Departmental Views) ├── Finance Mart │ ├── Claims (financial columns only) │ ├── Payments │ └── Revenue │ ├── Clinical Mart │ ├── Encounters │ ├── Diagnoses │ └── Lab Results │ └── Operations Mart ├── Provider Performance ├── Network Utilization └── Capacity Planning

Strengths

1. Simplified Access

Problem:

-- Finance user queries entire warehouse (overwhelming)
SELECT * FROM dim_member
JOIN fact_claim ON ...
JOIN dim_provider ON ...
JOIN fact_encounter ON ...
JOIN dim_diagnosis ON ...
-- → 50 tables, 847 columns, nobody knows what to query

Solution: Finance Data Mart

-- Pre-built views for finance
SELECT * FROM finance.monthly_revenue;
SELECT * FROM finance.claims_aging;
SELECT * FROM finance.provider_payments;
-- → 10 tables, 50 columns, all relevant to finance

2. Performance Optimization

Data mart strategies:

A) Materialized views (pre-aggregated):

-- Warehouse: Aggregate on every query (slow)
SELECT 
  DATE_TRUNC('month', service_date) as month,
  SUM(paid_amount) as total_paid
FROM fact_claim
GROUP BY DATE_TRUNC('month', service_date);
-- → 10 seconds each time

-- Data Mart: Pre-aggregated materialized view
CREATE MATERIALIZED VIEW finance.monthly_paid AS
SELECT 
  DATE_TRUNC('month', service_date) as month,
  SUM(paid_amount) as total_paid
FROM fact_claim
GROUP BY DATE_TRUNC('month', service_date);
-- → Query takes 0.5 seconds (20x faster)

B) Denormalized tables:

-- Warehouse: Normalized (many joins)
SELECT 
  m.member_name,
  p.provider_name,
  c.paid_amount
FROM fact_claim c
JOIN dim_member m ON c.member_id = m.member_id
JOIN dim_provider p ON c.provider_id = p.provider_id;
-- → 5 second query

-- Data Mart: Denormalized (no joins)
SELECT 
  member_name,
  provider_name,
  paid_amount
FROM finance.claims_denormalized;
-- → 0.2 second query (25x faster)

3. Security & Governance

Row-level security:

-- Finance sees only their data
CREATE VIEW finance.claims_view AS
SELECT * FROM fact_claim
WHERE line_of_business = 'Commercial';

-- Clinical sees different data
CREATE VIEW clinical.claims_view AS
SELECT * FROM fact_claim
WHERE has_clinical_data = TRUE;

-- Finance can't see clinical data, vice versa

4. Cost Efficiency

Data mart implementation:

  • No additional storage (views on warehouse)
  • No separate infrastructure
  • Reuses existing warehouse compute

Cost: $0 additional (just warehouse cost)

Weaknesses

1. Maintenance Overhead

10 data marts = 10× maintenance:

Update warehouse schema → Update finance mart → Update clinical mart → Update operations mart → Update quality mart → Update compliance mart → ... (repeat for each mart)

Each mart adds complexity.

2. Data Duplication (If Physically Separate)

Problem:

Warehouse: 20TB Finance Mart: 5TB (copied from warehouse) Clinical Mart: 8TB (copied from warehouse) Operations Mart: 3TB (copied from warehouse) Total storage: 36TB (80% duplication!)

Solution: Use views, not copies

3. Potential Stale Data

Materialized views:

-- Created Monday 9am
CREATE MATERIALIZED VIEW finance.monthly_paid AS ...

-- Tuesday 10am: Warehouse updated, mart not refreshed
-- Finance users see Monday's data (stale)

-- Must refresh
REFRESH MATERIALIZED VIEW finance.monthly_paid;

When to Use a Data Mart

✅ Department needs focused dataset
✅ Performance optimization for specific queries
✅ Simplify access for business users
✅ Security/governance requirements
✅ Different departments need different views

❌ Don't use if: Warehouse performance is good enough

Cost & Implementation

Cost: $0 additional (uses warehouse infrastructure)

Implementation:

  • Time: 1-2 weeks per mart
  • Team: 1 analytics engineer
  • Tools: dbt (for views/materialized views)

Healthcare example:

  • Finance data mart (5 views, 3 materialized views)
  • Setup: 1 week
  • Team: 1 analytics engineer
  • Cost: $0 (reuses warehouse)

The Complete Modern Architecture

How They Work Together

┌─────────────────────────────────────────────────────────┐ │ RAW DATA SOURCES │ │ EDI Claims • Member Files • Provider Data • Images │ └────────────────────┬────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────┐ │ DATA LAKE (S3) │ │ Store everything raw • $23/TB/month • Unstructured OK │ │ /raw/claims/*.json /raw/images/*.dcm /raw/pdf/*.pdf │ └────────────────────┬────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────┐ │ DELTA LAKE (Databricks + S3) │ │ ACID transactions • Time-travel • ML/AI workloads │ │ /delta/claims /delta/members /delta/providers │ │ Used for: ETL processing, ML model training │ └────────────────────┬────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────┐ │ DATA WAREHOUSE (Snowflake) │ │ Fast SQL queries • BI dashboards • Business analytics │ │ dim_member dim_provider fact_claim fact_encounter │ └────────────────────┬────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────┐ │ DATA MARTS │ │ ├── Finance: Claims, Payments, Revenue │ │ ├── Clinical: Encounters, Diagnoses, Quality │ │ └── Operations: Network, Utilization, Capacity │ └─────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────┐ │ END USERS │ │ Analysts • Data Scientists • Business Users • Execs │ └─────────────────────────────────────────────────────────┘

Data Flow Example (Healthcare Claim)

Step 1: Raw ingestion → Data Lake

[[EDI 837](/terms/edi-837)](/terms/EDI%20837) claim file arrives ↓ Store in S3 (Data Lake) → s3://data-lake/raw/edi-837/2024-03-25/claim_001.json → Cost: $0.023 per GB → Time: Instant (just upload)

Step 2: Processing → Delta Lake

# Databricks job reads from S3
raw_claims = spark.read.json("s3://data-lake/raw/edi-837/")

# Parse, clean, validate
claims_clean = raw_claims \
  .withColumn("service_date", to_date(col("service_dt"), "yyyyMMdd")) \
  .withColumn("paid_amount", col("charges").cast("decimal(10,2)")) \
  .filter(col("claim_status") == "ACCEPTED")

# Write to Delta Lake
claims_clean.write.format("delta").mode("append").save("s3://delta/claims")

# Cost: $5/hour cluster
# Time: 30 minutes for 1M claims

Step 3: Modeling → Data Warehouse

-- Load from Delta Lake into Snowflake
COPY INTO claim_header
FROM @delta_stage/claims/
FILE_FORMAT = (TYPE = PARQUET);

-- Transform into star schema
INSERT INTO fact_claim
SELECT 
  c.claim_id,
  m.member_key,  -- Lookup dimension key
  p.provider_key,  -- Lookup dimension key
  c.service_date,
  c.paid_amount
FROM claim_header c
JOIN dim_member m ON c.member_id = m.member_id
JOIN dim_provider p ON c.provider_id = p.provider_id;

-- Cost: $0.50 for 1M claims
-- Time: 5 minutes

Step 4: Data Marts

-- Finance Mart (materialized view)
CREATE MATERIALIZED VIEW finance.daily_revenue AS
SELECT 
  service_date,
  COUNT(*) as claim_count,
  SUM(paid_amount) as total_paid
FROM fact_claim
GROUP BY service_date;

-- Cost: $0 (no additional storage)
-- Time: 30 seconds to create

Step 5: Business User Query

-- Finance analyst queries data mart
SELECT * FROM finance.daily_revenue
WHERE service_date = '2024-03-25';

-- Query time: 0.5 seconds
-- Cost: $0.01

Cost Comparison (Large Health Plan)

Scenario: 5M Members, 50M Claims/Year

Data volumes:

  • Raw data: 100TB (10 years)
  • Delta Lake: 30TB (curated)
  • Warehouse: 10TB (modeled)

Monthly costs:

Architecture 1: Data Lake Only (Cheapest, Worst Performance)

S3 storage: 100TB × $23 = $2,300/month Compute (Spark): $5,000/month (ETL + queries) Total: $7,300/month Performance: - Queries: 5-10 minutes - BI dashboards: Unusable - User satisfaction: 2/10

Architecture 2: Data Lake + Warehouse (Balanced)

S3 storage: 100TB × $23 = $2,300/month Snowflake storage: 10TB × $40 = $400/month Snowflake compute: $10,000/month Total: $12,700/month Performance: - Queries: 1-5 seconds - BI dashboards: Fast - User satisfaction: 9/10

Architecture 3: Full Modern Stack (Best, Most Expensive)

S3 storage: 100TB × $23 = $2,300/month Delta Lake compute: $8,000/month (Databricks) Snowflake storage: 10TB × $40 = $400/month Snowflake compute: $10,000/month Total: $20,700/month Performance: - Data Lake queries: 2-5 minutes (Spark) - Warehouse queries: 1-5 seconds - ML workloads: Optimized (Delta) - BI dashboards: Fast - User satisfaction: 10/10 Value: Best architecture for mixed workloads

Decision Framework

Choose Data Lake When:

✅ Storing massive raw data (TBs-PBs)
✅ Unstructured data (images, PDFs, audio)
✅ Archive/backup requirements
✅ Future use cases undefined
✅ Cost is primary concern
✅ Performance less critical

Healthcare fit: Medical image storage, document archives, raw EDI files

Cost: $23/TB/month
Team: 1 engineer
Time: 1 week


Choose Delta Lake When:

✅ Need ACID transactions on data lake
✅ Building lakehouse architecture
✅ Large-scale Spark/ETL pipelines
✅ ML/AI workloads on raw data
✅ Time-travel/audit requirements
✅ Team has Spark expertise

Healthcare fit: Claims processing pipelines, ML model training, clinical data ETL

Cost: $23/TB/month (storage) + $5-10K/month (compute)
Team: 2-3 Spark engineers
Time: 3-4 weeks


Choose Data Warehouse When:

✅ Primary workload is SQL analytics
✅ BI dashboards for business users
✅ Sub-second query performance needed
✅ Team is SQL-based (not Spark)
✅ Hundreds of concurrent users
✅ Ad-hoc analysis requirements

Healthcare fit: Financial reporting, [HEDIS](/terms/HEDIS) analytics, executive dashboards, operational reports

Cost: $40/TB/month (storage) + $5-15K/month (compute)
Team: 3-5 engineers + analyst
Time: 10-12 weeks


Choose Data Marts When:

✅ Department needs focused dataset
✅ Performance optimization needed
✅ Simplify access for specific users
✅ Security/governance requirements
✅ Different views for different teams

Healthcare fit: Finance views, clinical quality views, operations views

Cost: $0 additional (uses warehouse)
Team: 1 analytics engineer
Time: 1-2 weeks per mart


Real Healthcare Implementation

Scenario: Regional Health Plan (2M Members)

Phase 1: Data Lake (Week 1-2)

Setup S3 buckets Ingest 5 years of historical data (50TB) Cost: $1,150/month storage Team: 1 engineer

Phase 2: Delta Lake (Week 3-6)

Setup Databricks workspace Migrate critical datasets to Delta Build ETL pipelines (claims, members) Cost: $4,500/month (storage + compute) Team: 2 Spark engineers

Phase 3: Data Warehouse (Week 7-18)

Design star schema (claims, members, providers) Build dbt models Set up Snowflake Load 2 years of data (5TB) Cost: $6,000/month (storage + compute) Team: 3 engineers + 1 architect

Phase 4: Data Marts (Week 19-22)

Finance mart (claims, payments) Clinical mart (encounters, quality) Operations mart (network, utilization) Cost: $0 additional Team: 1 analytics engineer

Total implementation:

  • Timeline: 22 weeks (~5 months)
  • Team: 5-7 people
  • Monthly ongoing cost: ~$11,650
  • One-time cost: ~$500K (labor)

Conclusion

Data Lake, Delta Lake, Data Warehouse, and Data Mart aren't competing options — they're layers of a modern data architecture.

Data Lake: Cheap raw storage
Delta Lake: Structured data lake with ACID
Data Warehouse: Fast SQL analytics
Data Mart: Focused departmental views

Most healthcare organizations need all four:

  • Data Lake for raw storage ($2K/month)
  • Delta Lake for ETL/ML ($5K/month)
  • Data Warehouse for analytics ($6K/month)
  • Data Marts for departments ($0 additional)

Total: ~$13K/month for enterprise-grade data platform

Choose based on workload, not vendor marketing.

The right architecture uses each layer for what it does best.

Related Guides

Healthcare Analytics

Population health analytics, data warehousing, and clinical intelligence.

Read Guide

More in Data Architecture

Azure Synapse vs Snowflake for Healthcare Data Architecture: Which Platform Fits Your Team?

Azure Synapse Analytics and Snowflake both promise a unified cloud data platform — but they make different architectural bets that matter enormously in healthcare. This guide compares them across HIPAA compliance, FHIR integration, PHI governance, cost model, and team fit, with concrete SQL examples and a decision framework built for healthcare data engineers.

Read more

Oracle vs Databricks for Healthcare Data Architecture: Which Platform Should You Choose?

Oracle brings four decades of enterprise database maturity, deep EHR integration, and a proven HIPAA compliance story. Databricks brings a unified lakehouse, native AI/ML pipelines, and the ability to handle FHIR, HL7, and unstructured clinical data at scale. This guide breaks down which platform wins in each healthcare scenario — and when you need both.

Read more

Telehealth Data Architecture: Complete Guide for Data Engineers (2026)

A complete guide to building a telehealth data architecture — core schema design, HL7 and FHIR integration, HIPAA compliance, HCC risk adjustment, and the common mistakes that cause claim denials.

Read more

Free Tools

Free SQL Linter

Catch SQL bugs, performance issues, and naming violations before production.

Try it free

Free DDL Converter

Translate SQL schemas between Snowflake, BigQuery, Oracle, and SQL Server.

Try it free

Ready to improve your data architecture?

Free tools for DDL conversion, SQL analysis, naming standards, and more.

Get Started Free

Get weekly healthcare data engineering tips

Practical guides on data modeling, SQL standards, and healthcare domain conventions — straight to your inbox.

No spam. Unsubscribe any time.

On this page

  • The Quick Answer (Before We Dive Deep)
  • Data Lake: The Raw Storage Layer
  • **What It Is**
  • **Strengths**
  • **Weaknesses**
  • **When to Use a Data Lake**
  • **Cost & Implementation**
  • Delta Lake: Data Lake + Structure
  • **What It Is**
  • **Strengths**
  • **Weaknesses**
  • **When to Use Delta Lake**
  • **Cost & Implementation**
  • Data Warehouse: Structured Analytics Engine
  • **What It Is**
  • **Strengths**
  • **Weaknesses**
  • **When to Use a Data Warehouse**
  • **Cost & Implementation**
  • Data Mart: Focused Subset
  • **What It Is**
  • **Strengths**
  • **Weaknesses**
  • **When to Use a Data Mart**
  • **Cost & Implementation**
  • The Complete Modern Architecture
  • **How They Work Together**
  • **Data Flow Example (Healthcare Claim)**
  • Cost Comparison (Large Health Plan)
  • **Scenario: 5M Members, 50M Claims/Year**
  • Decision Framework
  • **Choose Data Lake When:**
  • **Choose Delta Lake When:**
  • **Choose Data Warehouse When:**
  • **Choose Data Marts When:**
  • Real Healthcare Implementation
  • **Scenario: Regional Health Plan (2M Members)**
  • Conclusion

Share

Share on XShare on LinkedIn

Engineering Tools

Convert DDL, lint SQL, and audit naming conventions — free.

Explore Tools