Case Study: Designing a GDPR-Compliant Data Platform for a Healthcare Analytics Company

The Problem

A UK-based healthcare analytics company had built a promising product: an AI-driven platform that helped NHS trusts and private healthcare providers identify patients at risk of hospital readmission within 30 days of discharge.

The clinical model worked. In pilot studies, it reduced readmission rates by 18%. The problem was everything underneath it.

Their data infrastructure had been built by a small team of data scientists who, understandably, had optimised for model accuracy rather than production readiness. The result:

  • Patient data was stored in plain text across multiple systems — S3 buckets, local PostgreSQL databases, and Jupyter notebook outputs. There was no encryption at rest, no access logging, and no data retention policy.
  • No data lineage. When a clinical lead asked "where does this risk score come from?", the answer required manually tracing through Python scripts, CSV exports, and undocumented database views. For an organisation processing NHS patient data, this was untenable.
  • Compliance was manual. Subject Access Requests (SARs) took 2–3 weeks to fulfil because no one could quickly identify everywhere a specific patient's data existed. The UK GDPR requires SARs to be completed within one month, and the clock was ticking.
  • No environments. Development, testing, and production all shared the same database. A data scientist running an experimental query could — and once did — accidentally modify production patient records.

The company had just secured Series A funding and was preparing to onboard three new NHS trusts. Their investors and prospective clients both required evidence of robust data governance before contracts could be signed.

The CEO told us: "Our model saves lives. But if we can't prove our data practices are sound, no trust will let us near their patient records."

Why They Were Hesitant

Healthcare data is uniquely sensitive, and this client had concerns that went beyond the typical data engineering engagement:

"We can't slow down the data science team"

The data scientists were the company's core asset. Any new architecture that added friction to their workflow — requiring them to learn new tools, wait for data access approvals, or change how they accessed training data — risked slowing down the very thing that made the company valuable.

"We need to pass an NHS Data Security and Protection Toolkit (DSPT) assessment"

The DSPT is the NHS's mandatory information governance framework. Failing it would mean losing access to NHS data entirely. The assessment covers 42 assertions across 10 security standards, and the company needed to demonstrate compliance within 4 months.

"Patient data cannot leave the UK"

UK GDPR and the NHS's data processing agreements require that patient data is stored and processed within the UK. This ruled out several cloud regions and some SaaS tools that could not guarantee UK-only data residency.

"How do we know you won't have access to patient data?"

The client was rightly cautious about giving an external consultancy access to sensitive health data. They needed an architecture that allowed us to build and test the platform without ever seeing real patient data.

Our Approach

We designed the engagement around a principle we call "compliance by architecture" — building governance into the data platform's structure rather than bolting it on afterwards.

Week 1–2: Architecture Design & Synthetic Data

Before touching any real data, we did two things:

1. Designed the full architecture on paper. Every component was assessed against three criteria: UK GDPR compliance, NHS DSPT requirements, and operational practicality.

2. Built a synthetic data generator. We created a Python-based tool that produced realistic but entirely fake patient data — matching the schema, distributions, and edge cases of real NHS data. This allowed us to build, test, and validate the entire platform without accessing a single real patient record.

The synthetic data approach solved the access concern completely. We worked exclusively with fake data. The client's internal team handled the one-time migration of real data into the production system.
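
To illustrate the idea, a generator of this kind can be sketched in a few lines of Python. The field names, value ranges, and diagnosis codes below are hypothetical, not the client's actual schema; a production generator would also reproduce real distributions and edge cases such as the NHS number check digit:

```python
import random
import datetime

def generate_synthetic_patient(rng: random.Random) -> dict:
    """Produce one fake patient record matching a hypothetical NHS-style schema."""
    # Fake 10-digit "NHS number" -- real NHS numbers include a check digit,
    # which a fuller generator would compute as well.
    nhs_number = "".join(str(rng.randint(0, 9)) for _ in range(10))
    admission = datetime.date(2023, 1, 1) + datetime.timedelta(days=rng.randint(0, 364))
    return {
        "nhs_number": nhs_number,
        "age": rng.randint(18, 95),
        "admission_date": admission.isoformat(),
        "length_of_stay_days": rng.randint(1, 30),
        "primary_diagnosis_code": rng.choice(["I50.0", "J44.1", "E11.9", "N18.3"]),
        "readmitted_within_30_days": rng.random() < 0.18,
    }

def generate_cohort(n: int, seed: int = 42) -> list[dict]:
    """Generate n fake patients; seeded so test fixtures are reproducible."""
    rng = random.Random(seed)
    return [generate_synthetic_patient(rng) for _ in range(n)]
```

Seeding the generator means the same synthetic cohort can be regenerated on demand, so tests and pipeline runs are repeatable without storing any fixture data.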

Week 3–5: Platform Build

The architecture we implemented:

Data Sources (NHS Trusts, GP Systems, Hospital PAS)
    │
    ▼
Secure Ingestion Layer
    ├── Fivetran (UK-hosted, SOC 2 Type II, ISO 27001)
    ├── Encrypted in transit (TLS 1.3)
    └── Landing in encrypted S3 bucket (AES-256, UK region only)
    │
    ▼
AWS (eu-west-2, London region only)
    ├── S3 (encrypted at rest, versioned, lifecycle policies)
    ├── Snowflake (UK-hosted, Business Critical edition)
    │   ├── Bronze: raw patient data (encrypted, access-logged)
    │   ├── Silver: pseudonymised data (NHS number replaced with hash)
    │   └── Gold: aggregated, de-identified analytics models
    └── Audit logging (CloudTrail + Snowflake Access History)
    │
    ▼
dbt (transformations with built-in governance)
    ├── Pseudonymisation applied at Silver layer
    ├── Data minimisation enforced (only necessary fields promoted)
    ├── Retention policies applied (automated deletion after period)
    └── Full column-level lineage tracked
    │
    ▼
Metabase (dashboards, role-based access control)
    ├── Data scientists see pseudonymised Silver data
    ├── Clinical users see aggregated Gold data only
    └── No direct database access — all queries go through views

The Three Pillars of Compliance

Pillar 1: Pseudonymisation at the Silver Layer

Real patient identifiers (NHS number, name, date of birth, postcode) were replaced with irreversible hashes at the Silver layer. The mapping between real identifiers and pseudonyms was stored in a separate, heavily restricted Snowflake database that only two named individuals could access.

This meant:

  • Data scientists could work with realistic, linked patient journeys without seeing who the patients were
  • Clinical dashboards showed population-level trends, never individual patient records
  • A data breach at the analytics layer would expose pseudonymised data only — significantly reducing the severity under UK GDPR's breach notification requirements

The pseudonymisation was implemented as a dbt macro, ensuring it was applied consistently and automatically every time data was processed. No human could forget to apply it.
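
The macro itself lives in dbt, but the underlying operation can be illustrated in Python. A keyed HMAC rather than a bare hash matters here: the space of valid NHS numbers is small enough to enumerate, so a plain hash could be reversed by brute force. The key name below is hypothetical:

```python
import hmac
import hashlib

def pseudonymise(nhs_number: str, secret_key: bytes) -> str:
    """Replace an NHS number with an irreversible but consistent pseudonym.

    HMAC-SHA256 with a secret key (held only alongside the restricted
    mapping database) is used instead of a plain hash: without the key,
    an attacker cannot rebuild the mapping by hashing every possible
    NHS number.
    """
    return hmac.new(secret_key, nhs_number.encode(), hashlib.sha256).hexdigest()
```

Because the function is deterministic for a given key, the same patient always maps to the same pseudonym, so records remain linkable across tables without exposing identity.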

Pillar 2: Role-Based Access Control (RBAC)

We implemented four access tiers in Snowflake:

Role             | Can Access                              | Cannot Access             | Use Case
-----------------|-----------------------------------------|---------------------------|-------------------------
data_engineer    | Bronze + Silver + Gold (no PII columns) | Raw PII fields            | Pipeline maintenance
data_scientist   | Silver (pseudonymised) + Gold           | Bronze, PII mapping table | Model training
clinical_analyst | Gold only (aggregated)                  | Bronze, Silver, PII       | Clinical dashboards
dpo_admin        | PII mapping table (read-only)           | Bronze raw data           | Subject Access Requests

Every query was logged. Snowflake's Access History feature recorded who queried what data, when, and from which IP address. These logs were retained for 2 years, satisfying the DSPT's audit trail requirements.

Pillar 3: Data Minimisation and Retention

UK GDPR's data minimisation principle requires that you only process the personal data you actually need. We enforced this architecturally:

  • At ingestion: Fivetran was configured to sync only the specific tables and columns needed. We did not ingest entire database snapshots.
  • At transformation: dbt models in the Silver layer explicitly selected only the columns required for downstream analytics. Unnecessary fields were dropped, not just hidden.
  • At retention: Automated Snowflake tasks deleted patient-level data older than the agreed retention period (24 months for clinical data, 6 months for operational data). Aggregated Gold-layer data was retained indefinitely as it contained no personal data.
  • At access: Column-level security in Snowflake masked PII fields for any role that didn't explicitly require them. Even if a query ran SELECT *, the PII columns returned masked values.
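
The retention rule is simple calendar arithmetic. In production it ran as a scheduled Snowflake task using DATEADD; the logic can be sketched in Python (function names are illustrative):

```python
import datetime

# Retention periods agreed in the engagement; applies to patient-level
# data only -- aggregated Gold-layer data is retained indefinitely.
RETENTION_MONTHS = {"clinical": 24, "operational": 6}

def months_ago(today: datetime.date, months: int) -> datetime.date:
    """Calendar-accurate 'N months before today' (day clamped to 28 for simplicity)."""
    month = today.month - months
    year = today.year + (month - 1) // 12
    month = (month - 1) % 12 + 1
    return datetime.date(year, month, min(today.day, 28))

def is_expired(record_date: datetime.date, category: str,
               today: datetime.date) -> bool:
    """True once a patient-level record passes its retention cutoff."""
    return record_date < months_ago(today, RETENTION_MONTHS[category])
```

The deletion job then removes every row where is_expired returns True, so retention never depends on anyone remembering to run a clean-up.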

Week 6–7: Testing and Validation

We ran three categories of tests:

Data Quality Tests (dbt):

  • Uniqueness on patient pseudonym IDs
  • Not-null constraints on clinical fields (diagnosis codes, admission dates)
  • Referential integrity between admissions and discharge records
  • Accepted value ranges for clinical scores
  • Freshness tests ensuring no source was more than 6 hours stale
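
Tests of this kind are declared in a dbt schema.yml file. The sketch below uses hypothetical model, source, and column names, not the client's actual project, and the range test assumes the dbt-utils package:

```yaml
# schema.yml sketch -- all names here are illustrative
version: 2

sources:
  - name: trust_feed
    tables:
      - name: admissions
        loaded_at_field: _loaded_at             # hypothetical ingestion timestamp
        freshness:
          error_after: {count: 6, period: hour} # no source more than 6 hours stale

models:
  - name: slv_admissions
    columns:
      - name: patient_pseudonym_id
        tests: [unique, not_null]               # uniqueness on pseudonym IDs
      - name: diagnosis_code
        tests: [not_null]
      - name: discharge_id
        tests:
          - relationships:                      # referential integrity to discharges
              to: ref('slv_discharges')
              field: discharge_id
      - name: clinical_score
        tests:
          - dbt_utils.accepted_range:           # requires the dbt-utils package
              min_value: 0
              max_value: 100
```

Declaring the tests alongside the models means every dbt run can fail fast on bad data, rather than relying on a separate QA step.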

Security Tests:

  • Attempted to access PII columns from each RBAC role — confirmed denial
  • Attempted to query Bronze data from the clinical_analyst role — confirmed denial
  • Verified audit logs captured every access attempt, including denied ones
  • Confirmed data at rest encryption across all S3 buckets and Snowflake tables
  • Verified no data was stored or processed outside eu-west-2

Compliance Tests (against DSPT assertions):

  • Data lineage traceable from dashboard metric to source record
  • Subject Access Request fulfilment simulated end-to-end in under 2 hours
  • Data retention policies verified — records older than retention period confirmed deleted
  • Breach response simulated — time from detection to notification measured at 45 minutes
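
The sub-two-hour SAR time follows from the pseudonym mapping giving a single lookup point. A simplified sketch, where the in-memory "warehouse" dict and table names stand in for the real Snowflake queries:

```python
import hmac
import hashlib

def pseudonymise(nhs_number: str, key: bytes) -> str:
    """The same keyed hash applied at the Silver layer (see Pillar 1)."""
    return hmac.new(key, nhs_number.encode(), hashlib.sha256).hexdigest()

def locate_patient_data(nhs_number: str, key: bytes,
                        warehouse: dict[str, list[dict]]) -> dict[str, list[dict]]:
    """Return every row, per table, holding this patient's pseudonym.

    `warehouse` maps table names to their rows; the production version
    issued one query per Silver-layer table instead.
    """
    pid = pseudonymise(nhs_number, key)
    return {
        table: [row for row in rows if row.get("patient_pseudonym_id") == pid]
        for table, rows in warehouse.items()
        if any(row.get("patient_pseudonym_id") == pid for row in rows)
    }
```

Because every patient-level table carries the same pseudonym column, fulfilling a SAR reduces to one hash plus one scan per table, which is what collapses weeks of manual tracing into hours.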

Week 8–9: DSPT Preparation

We worked alongside the client's Data Protection Officer to compile evidence for the 42 DSPT assertions. For each assertion, we provided:

  • The architectural control that addressed it
  • The automated test or monitoring that verified it
  • The documentation or runbook that described the process

The DSPT assessment was submitted at the end of week 9. It was approved without additional queries — a rare outcome that the client attributed directly to the strength of the architectural evidence.

Week 10: Knowledge Transfer

As with every engagement, we spent the final phase ensuring the client's team could operate independently:

  • Security runbook covering incident response, access provisioning, and audit log review
  • dbt model documentation with descriptions for every model, column, and test
  • Architecture Decision Records explaining every compliance-related design choice
  • SAR fulfilment guide with step-by-step instructions for responding to Subject Access Requests
  • Pair programming sessions with their data engineer on adding new data sources within the governance framework

Results

Metric                          | Before        | After
--------------------------------|---------------|-------------------------
SAR fulfilment time             | 2–3 weeks     | < 2 hours
Data lineage coverage           | 0%            | 100% (column-level)
PII exposure in analytics layer | Full access   | Zero (pseudonymised)
Environments                    | 1 (shared)    | 3 (dev, staging, prod)
Automated compliance tests      | 0             | 34
DSPT assessment                 | Not attempted | Passed first submission
Time to onboard new NHS trust   | ~3 months     | ~2 weeks

The Business Impact

With the DSPT approved and the data platform in place, the client signed contracts with three new NHS trusts within 60 days. The total contract value exceeded £1.2 million — a direct return on the platform investment.

More importantly, the architecture scaled. Each new trust required only configuring a new Fivetran connector and running the existing dbt models. The onboarding time dropped from approximately 3 months of custom integration work to 2 weeks of configuration.

The clinical model's performance also improved. With cleaner, more complete data flowing through the platform, the readmission prediction accuracy increased from 74% to 81% — a clinically significant improvement that the data science team attributed to better data quality, not model changes.

Key Takeaways

For organisations processing sensitive data — healthcare, financial, legal, or otherwise:

  1. Build compliance into the architecture, not the process. If compliance depends on people remembering to do things correctly, it will fail. Pseudonymisation, access control, retention, and minimisation should all be enforced by the platform itself.

  2. Synthetic data unlocks safe development. By building a realistic synthetic data generator, we eliminated the need for external parties to access real patient data — and gave the internal team a safe environment for testing and experimentation.

  3. Pseudonymisation is not anonymisation. Under UK GDPR, pseudonymised data is still personal data. But it significantly reduces risk, simplifies compliance, and enables powerful analytics without exposing identities. The key is ensuring the re-identification mapping is stored separately with strict access controls.

  4. The DSPT is achievable with the right architecture. Many organisations view the DSPT as a bureaucratic hurdle. In practice, if your data platform is well-designed, most assertions are satisfied by technical controls that you'd want in place anyway.

  5. Compliance is a competitive advantage. The client's competitors were still struggling to pass DSPT assessments. By investing in a compliant platform early, they were able to sign contracts that competitors could not — turning governance from a cost centre into a revenue driver.

The client's CEO told us: "We used to see compliance as something that slowed us down. Now it's the reason we're winning contracts. Our data governance is a selling point, not a constraint."