Introduction
Most healthcare data governance programs are built backwards. A breach happens, the HHS Office for Civil Rights (OCR) sends a letter, and suddenly the organization is scrambling to produce a data inventory it never built, access logs it never retained, and policies it never enforced. The HIPAA data governance framework described here is built forward: it starts with inventory and ends with a program that can survive an audit.
This is not a compliance checklist disguised as a guide. It is a practical architecture for governing healthcare data the way a senior engineer would design it: systematic, auditable, and maintainable without a dedicated governance team of ten.
Step 1: Build Your Data Inventory
You cannot govern what you cannot find. Before classifying PHI or assigning stewards, you need a complete inventory of every data store in your environment.
What to capture per asset
For each data store (table, file, API endpoint, data feed), capture:
- Asset name and location (database, schema, table; S3 bucket and prefix; API endpoint)
- Data type (structured, semi-structured, unstructured)
- Source system (Epic, Facets, 835 clearinghouse, lab vendor, etc.)
- Owner (system team, business unit)
- PHI flag (does this asset contain any of HIPAA's 18 PHI identifiers?)
- Access controls (who has access today, through what mechanism)
- Retention requirement (6 years for HIPAA-required compliance documentation; state medical-record retention rules are often longer)
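The attributes listed above map naturally onto a small record type. The following is a minimal sketch, not a prescribed schema: the class name `DataAsset`, its field names, and the `inventory_gaps` helper are all hypothetical, chosen only to mirror the bullet list.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Text attributes every inventory entry must carry (names are illustrative).
REQUIRED_TEXT_FIELDS = ("name", "location", "data_type", "source_system", "owner")


@dataclass
class DataAsset:
    name: str
    location: str            # e.g. database.schema.table, bucket/prefix, or endpoint
    data_type: str           # "structured", "semi-structured", or "unstructured"
    source_system: str
    owner: str
    contains_phi: Optional[bool] = None          # None means "not yet assessed"
    access_groups: List[str] = field(default_factory=list)
    retention_years: Optional[int] = None


def inventory_gaps(asset: DataAsset) -> List[str]:
    """Return the names of attributes that are still missing for this asset."""
    gaps = [f for f in REQUIRED_TEXT_FIELDS if not getattr(asset, f).strip()]
    if asset.contains_phi is None:
        gaps.append("contains_phi")
    if not asset.access_groups:
        gaps.append("access_groups")
    if asset.retention_years is None:
        gaps.append("retention_years")
    return gaps


claims = DataAsset(
    name="claims_header",
    location="warehouse.claims.claims_header",
    data_type="structured",
    source_system="Facets",
    owner="",
    contains_phi=True,
)
print(inventory_gaps(claims))  # ['owner', 'access_groups', 'retention_years']
```

Treating "not yet assessed" as a distinct state (`None`) rather than defaulting the PHI flag to false keeps unreviewed assets visible as gaps.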
How to build it
Start with your cloud infrastructure. Run automated discovery (BigID, Microsoft Purview Scan, or open-source tools like Apache Atlas + custom scanners) across your warehouses, lakes, and databases. Do not rely on manual inventory — it will be incomplete within 90 days.
Then layer in your HL7 and [FHIR](/terms/FHIR) feeds, your EDI transaction stores (837P, 837I, 835), and your operational databases. These are frequently missed in catalog scans because they sit behind integration middleware.
Step 2: Classify PHI and Sensitivity Levels
Once inventoried, every asset needs a classification. Define at minimum four tiers:
| Classification | Definition | Example |
|---|---|---|
| PHI — Restricted | Contains one or more HIPAA identifiers | Member SSN, DOB + diagnosis combination |
| PHI — Sensitive | De-identified but re-identification risk exists | Zip code + age + rare condition code |
| Internal | No PHI, but not public | Actuarial models, provider contract rates |
| Public | Safe for external access | Aggregated quality metrics |
HIPAA defines 18 specific identifiers that make data PHI. Build a classification policy that checks for these explicitly — do not rely on field names alone. A column named member_key can be PHI if it is linkable to a person.
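One way to avoid trusting field names is to sample column values and test them against identifier patterns. The sketch below is illustrative only: it covers three of the 18 identifier types, the regular expressions are simplified, and the function names, the `threshold` parameter, and the `PHI_RESTRICTED` / `NEEDS_REVIEW` labels are assumptions rather than part of any standard.

```python
import re

# Simplified patterns for three identifier types. A real scanner would need
# broader coverage and validation; these are illustrative assumptions.
VALUE_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$"),
}


def detect_identifiers(values, threshold=0.5):
    """Return identifier types matched by at least `threshold` of non-empty values."""
    sample = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not sample:
        return set()
    found = set()
    for label, pattern in VALUE_PATTERNS.items():
        hits = sum(1 for v in sample if pattern.match(v))
        if hits / len(sample) >= threshold:
            found.add(label)
    return found


def classify_column(values, linkable_to_person=False):
    """Classify on content and linkability, never on the column name."""
    if detect_identifiers(values) or linkable_to_person:
        return "PHI_RESTRICTED"
    return "NEEDS_REVIEW"  # no pattern hit is not proof of absence


# A column with an innocuous name but SSN-shaped values is still flagged.
print(detect_identifiers(["123-45-6789", "987-65-4321", None]))  # {'ssn'}
# A surrogate key has no recognizable pattern but is linkable to a person.
print(classify_column([1001, 1002], linkable_to_person=True))    # PHI_RESTRICTED
```

The `linkable_to_person` flag has to come from a human or from lineage metadata; value patterns alone cannot detect a surrogate key such as member_key.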
Step 3: Define Access Controls
Classification drives access. For each PHI tier, define:
- Who can access it: Role-based access control (RBAC) groups, not individuals
- How they access it: Query interface, direct DB access, API, file export
- Under what conditions: Minimum necessary standard; no wildcard SELECT * on PHI tables
In Snowflake, this looks like:
    -- Grant access to the de-identified view only, not the base table
    GRANT SELECT ON VIEW analytics.member_deidentified TO ROLE analyst_role;
    REVOKE SELECT ON TABLE phi.member_demographics FROM ROLE analyst_role;
Implement column-level masking on PHI fields for roles that need partial access (e.g., a fraud analyst needs member state but not full address):
    CREATE MASKING POLICY phi_ssn_mask AS (val STRING) RETURNS STRING ->
      CASE
        -- CURRENT_ROLE() returns unquoted role names in uppercase
        WHEN CURRENT_ROLE() IN ('PHI_ADMIN_ROLE') THEN val
        ELSE '***-**-' || RIGHT(val, 4)
      END;

    ALTER TABLE phi.member_demographics MODIFY COLUMN ssn SET MASKING POLICY phi_ssn_mask;
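The "no wildcard SELECT * on PHI tables" condition can also be checked before a query ever runs, for example in a code-review hook. The sketch below is a regex heuristic, not a SQL parser. It assumes two-part `schema.table` names and a hypothetical set of PHI schema names (`PHI_SCHEMAS`); a production check would more likely rely on a real parser or the warehouse's own query metadata.

```python
import re

# Hypothetical: schemas that hold PHI base tables in this environment.
PHI_SCHEMAS = {"phi"}

# Matches "SELECT *", "SELECT DISTINCT *", and "SELECT alias.*",
# but not aggregate forms such as COUNT(*).
WILDCARD_RE = re.compile(r"\bselect\s+(?:distinct\s+)?(?:\w+\.)?\*", re.IGNORECASE)

# Captures schema and table from two-part names after FROM or JOIN.
TABLE_RE = re.compile(r"\b(?:from|join)\s+([A-Za-z_]\w*)\.([A-Za-z_]\w*)", re.IGNORECASE)


def violates_minimum_necessary(sql: str) -> bool:
    """True if the query uses a wildcard select and reads from a PHI schema."""
    if not WILDCARD_RE.search(sql):
        return False
    schemas = {m.group(1).lower() for m in TABLE_RE.finditer(sql)}
    return bool(schemas & PHI_SCHEMAS)


print(violates_minimum_necessary("SELECT * FROM phi.member_demographics"))        # True
print(violates_minimum_necessary("SELECT state FROM phi.member_demographics"))    # False
print(violates_minimum_necessary("SELECT * FROM analytics.member_deidentified"))  # False
```

A check like this complements grants and masking policies; it does not replace them, since it only sees queries that pass through the hook.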
Step 4: Implement Audit Logging
HIPAA's Security Rule requires audit controls — mechanisms that record and examine activity in systems containing ePHI. Audit logs must answer:
- Who accessed PHI?
- When did they access it?
- What did they access?
- What actions did they take (read, modify, export)?
What to log
- All queries against PHI tables (Snowflake Query History, BigQuery Data Access Logs)
- All data exports or file downloads
- All access control changes (role grants, policy modifications)
- All failed access attempts
Retain logs for a minimum of 6 years. Store them in a separate, tamper-evident location — not the same system they are logging.
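One common way to make a log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below illustrates the idea in memory; the function names and the entry layout are assumptions, and it is not tied to any particular logging product.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(chain: list, event: dict) -> None:
    """Append an event, linking it to the hash of the entry before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev_hash": prev, "hash": entry_hash(prev, event)})


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev or entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


log = []
append_event(log, {"user_id": "jdoe", "action": "SELECT", "asset_name": "phi.member_demographics"})
append_event(log, {"user_id": "jdoe", "action": "EXPORT", "asset_name": "phi.member_demographics"})
print(verify_chain(log))  # True

log[0]["event"]["action"] = "LOGIN_FAIL"  # simulate after-the-fact tampering
print(verify_chain(log))  # False
```

A chain only detects tampering if the latest hash is also recorded somewhere the log writer cannot modify, which is one reason to keep logs outside the system they describe.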
A minimal audit log schema
    CREATE TABLE governance.phi_access_log (
      log_id        BIGINT GENERATED ALWAYS AS IDENTITY,
      event_ts      TIMESTAMP NOT NULL,
      user_id       VARCHAR(100) NOT NULL,
      user_role     VARCHAR(100),
      asset_name    VARCHAR(500) NOT NULL,  -- schema.table or file path
      action        VARCHAR(50) NOT NULL,   -- SELECT, UPDATE, EXPORT, LOGIN_FAIL
      rows_accessed INT,
      client_ip     VARCHAR(45),
      query_hash    VARCHAR(64),            -- SHA-256 of the query text
      session_id    VARCHAR(200),
      PRIMARY KEY (log_id)
    );
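A row for the schema above might be assembled along the following lines before being written by whatever loader you use. This is a sketch under assumptions: the `build_log_row` helper and the `ALLOWED_ACTIONS` set are hypothetical (the set simply mirrors the action values named in the schema comment), and `log_id` is omitted because the table generates it.

```python
import hashlib
from datetime import datetime, timezone

# Mirrors the action values listed in the schema comment.
ALLOWED_ACTIONS = {"SELECT", "UPDATE", "EXPORT", "LOGIN_FAIL"}


def build_log_row(user_id, user_role, asset_name, action, query_text,
                  rows_accessed=None, client_ip=None, session_id=None, event_ts=None):
    """Build one phi_access_log row as a dict, hashing the query text with SHA-256."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return {
        "event_ts": event_ts or datetime.now(timezone.utc),
        "user_id": user_id,
        "user_role": user_role,
        "asset_name": asset_name,
        "action": action,
        "rows_accessed": rows_accessed,
        "client_ip": client_ip,
        # 64 hex characters, which fits the VARCHAR(64) column
        "query_hash": hashlib.sha256(query_text.encode("utf-8")).hexdigest(),
        "session_id": session_id,
    }


row = build_log_row(
    user_id="jdoe",
    user_role="analyst_role",
    asset_name="analytics.member_deidentified",
    action="SELECT",
    query_text="SELECT state FROM analytics.member_deidentified",
    rows_accessed=120,
)
print(len(row["query_hash"]))  # 64
```

Storing a hash rather than the raw query text keeps literal values that may themselves be PHI (for example, an SSN in a WHERE clause) out of the log table.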
Step 5: Define Policies and Enforce Them
A policy that lives in a PDF is not a governance control. Every policy must have a technical enforcement mechanism.
| Policy | Technical Control |
|---|---|
| No PHI in non-production environments | Data masking in ETL pipelines; automated scan on dev/staging |
| Minimum necessary access | Column-level masking; view-based access over base tables |
| PHI retention schedule | Automated delete jobs; lifecycle policies on S3 |
| Schema change approval | DDL review in CI/CD; PR gates on PHI tables |
| Naming standards for PHI columns | Pre-deployment naming audit |
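As an illustration of the retention row in the table, the date arithmetic behind an automated delete job can be sketched as follows. The function names and the idea of tracking a "last required" date per asset are assumptions; the 6-year default follows the figure used in this guide and should be raised wherever a longer state-specific rule applies.

```python
from datetime import date


def retention_cutoff(today: date, years: int = 6) -> date:
    """Date before which records fall outside the retention window."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # today is Feb 29 and the target year is not a leap year
        return today.replace(year=today.year - years, day=28)


def eligible_for_deletion(assets: dict, today: date, years: int = 6) -> list:
    """Names of assets whose last-required date is older than the cutoff."""
    cutoff = retention_cutoff(today, years)
    return sorted(name for name, last_required in assets.items() if last_required < cutoff)


# Hypothetical export files keyed by path, with the date each was last required.
exports = {
    "exports/2018/claims.csv": date(2018, 12, 31),
    "exports/2021/claims.csv": date(2021, 6, 30),
}
print(retention_cutoff(date(2025, 3, 1)))                # 2019-03-01
print(eligible_for_deletion(exports, date(2025, 3, 1)))  # ['exports/2018/claims.csv']
```

Producing a list for review, rather than deleting directly, makes the job easier to test and leaves room for legal holds.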
Step 6: Assign Data Stewardship Roles
Governance fails when nobody owns the data. Define three roles explicitly:
- Data Owner: Business executive accountable for a data domain (e.g., VP of Claims Operations owns claims data). Approves access requests. Does not need to be technical.
- Data Steward: Operational manager responsible for quality, definitions, and policy compliance within the domain. Maintains the business glossary. Reviews access anomalies.
- Data Custodian: The technical team (data engineering, platform engineering) responsible for implementing the controls the owner and steward define.
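Role assignments are easy to verify mechanically once they are recorded per domain. The sketch below assumes a simple mapping of domain name to role holders; the domain names, the people and teams, and the `missing_roles` helper are hypothetical.

```python
REQUIRED_ROLES = ("owner", "steward", "custodian")


def missing_roles(assignments: dict) -> dict:
    """Map each domain to the roles that have no one assigned."""
    gaps = {}
    for domain, roles in assignments.items():
        absent = [role for role in REQUIRED_ROLES if not roles.get(role)]
        if absent:
            gaps[domain] = absent
    return gaps


assignments = {
    "claims": {
        "owner": "VP of Claims Operations",
        "steward": "Claims Data Manager",
        "custodian": "data-engineering",
    },
    "eligibility": {
        "owner": "VP of Enrollment",
        "steward": "",  # not yet assigned
        "custodian": "data-engineering",
    },
}
print(missing_roles(assignments))  # {'eligibility': ['steward']}
```

Running a check like this against the inventory's list of PHI domains turns "stewards assigned to every PHI domain" into something that can fail a build rather than a line in a document.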
Governance Checklist
Before calling your framework production-ready, verify:
- Data inventory completed and reviewed for every data store
- PHI classification applied to all assets
- RBAC defined and enforced; no individual user has direct PHI access
- Column-level masking active on all 18 PHI identifier fields
- Audit logging active on all PHI tables; retained for 6 years
- Non-production environments masked or synthetic
- Data owners and stewards assigned to every PHI domain
- Schema change approval process defined (PR review, naming standards gate)
- Retention delete jobs scheduled and tested
- BAAs signed with every data vendor and tool vendor with PHI access
Key Takeaways
- Build the data inventory first — you cannot classify, control, or audit what you have not found.
- Every policy needs a technical enforcement mechanism. Documented policies without controls fail audits.
- Audit logging must be tamper-evident and retained for 6 years minimum.
- Stewardship roles must be assigned at the domain level, not the table level — the granularity becomes unmanageable.
- Schema governance is part of HIPAA compliance. Use the Naming Auditor to enforce PHI column naming standards before data reaches your governed environment.
mdatool Team
The mdatool team builds free engineering tools for healthcare data architects, analysts, and engineers working across payer, provider, and life sciences data.