
Case Study: Migrating a Legacy ETL Pipeline to a Modern Data Stack in 6 Weeks

The Problem

A fast-growing UK fintech came to us with a problem that will sound familiar to many data leaders: their data pipeline was held together with duct tape.

Over three years, the company had grown from 5 to 80 employees. Their data infrastructure had not kept pace. What started as a handful of Python scripts pulling data from a PostgreSQL database into Google Sheets had evolved into a tangled web of 40+ scheduled scripts, cron jobs, and manual CSV exports — none of which were version-controlled, tested, or monitored.

The consequences were serious:

  • Data incidents every week. Reports would break silently. The finance team discovered a £230,000 revenue discrepancy caused by a script that had been failing for 11 days without anyone noticing.
  • 20+ hours per week spent by engineers manually fixing pipelines, re-running jobs, and answering "why doesn't this number match?" questions from stakeholders.
  • No single source of truth. Different teams maintained their own spreadsheets with conflicting numbers. Board reports were compiled manually and took 3 days to produce.
  • Regulatory risk. As an FCA-regulated firm, they needed auditable data lineage. Their current setup made it impossible to trace how any number was calculated.

The CTO told us: "We know our data is wrong. We just don't know which parts."

Why They Were Hesitant

Before engaging us, the client had three major concerns — ones we hear regularly from companies in similar positions:

"We can't afford downtime during migration"

Their existing pipelines, fragile as they were, powered daily regulatory reports and real-time fraud detection dashboards. A migration that disrupted these systems was not an option. The FCA doesn't accept "we were upgrading our infrastructure" as an excuse for a late report.

"We've been burned by consultants before"

A previous vendor had proposed a 6-month, £180,000 migration to a custom-built Airflow cluster. Three months in, nothing was in production. The project was abandoned, and the team lost confidence in external partners.

"Our team doesn't have the skills to maintain a modern stack"

The data team consisted of two analysts with strong SQL skills but no experience with tools like dbt, Terraform, or CI/CD pipelines. Any solution needed to be maintainable by their existing team — not just by us.

Our Approach

We designed a 6-week migration programme with three non-negotiable principles:

  1. Zero downtime. The old and new pipelines would run in parallel until the new system was validated.
  2. Incremental delivery. We would migrate one data source per sprint, proving value every two weeks.
  3. Knowledge transfer built in. Every decision was documented. Every model was pair-programmed with their analysts.

Week 1–2: Foundation

We stood up the core infrastructure:

  • Fivetran for automated data ingestion from PostgreSQL, Stripe, Intercom, and HubSpot. This replaced 28 custom Python scripts.
  • Snowflake as the cloud data warehouse, replacing a self-managed PostgreSQL replica that was running out of disk space every month.
  • dbt for transformations, with a clear Medallion Architecture: staging models (Bronze), cleaned and joined models (Silver), and business-ready reporting models (Gold).
  • GitHub Actions for CI/CD — every dbt model change went through code review and automated testing before deployment.

We also implemented Terraform for infrastructure-as-code, ensuring the entire environment could be recreated from scratch if needed.

Week 3–4: Core Models

We migrated the three most critical data domains:

Revenue & Billing — Stripe payment data joined with internal subscription records to produce a single, reconciled revenue figure. We added dbt tests for:

  • Uniqueness on transaction IDs (catching the duplicate that caused the £230,000 discrepancy)
  • Not-null constraints on amount fields
  • Referential integrity between subscriptions and payments
  • Accepted value ranges for currency codes
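In dbt, checks like these are declared in the model's YAML schema file. Purely for illustration, the same four rules can be sketched in plain Python — the sample rows, column names, and accepted currency set below are hypothetical, not the client's actual schema:

```python
from decimal import Decimal

# Hypothetical sample rows standing in for the staged Stripe payments model.
payments = [
    {"txn_id": "tx_1", "subscription_id": "sub_1", "amount": Decimal("49.00"), "currency": "GBP"},
    {"txn_id": "tx_2", "subscription_id": "sub_2", "amount": Decimal("99.00"), "currency": "GBP"},
    {"txn_id": "tx_2", "subscription_id": "sub_2", "amount": Decimal("99.00"), "currency": "GBP"},  # duplicate
]
subscriptions = {"sub_1", "sub_2"}
ACCEPTED_CURRENCIES = {"GBP", "EUR", "USD"}

def run_checks(rows, subs):
    """Apply the four data-quality rules; return (rule, txn_id) pairs for failures."""
    failures = []
    seen = set()
    for r in rows:
        if r["txn_id"] in seen:                       # uniqueness
            failures.append(("unique_txn_id", r["txn_id"]))
        seen.add(r["txn_id"])
        if r["amount"] is None:                       # not-null
            failures.append(("not_null_amount", r["txn_id"]))
        if r["subscription_id"] not in subs:          # referential integrity
            failures.append(("referential_integrity", r["txn_id"]))
        if r["currency"] not in ACCEPTED_CURRENCIES:  # accepted values
            failures.append(("accepted_currency", r["txn_id"]))
    return failures

print(run_checks(payments, subscriptions))  # → [('unique_txn_id', 'tx_2')]
```

The duplicate transaction surfaces as a uniqueness failure on the second `tx_2` row — exactly the class of error that silently inflated revenue in the legacy pipeline.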

Customer Data — HubSpot CRM data joined with product usage data to create a unified customer profile. This replaced a manually maintained Google Sheet that three teams were editing independently.

Regulatory Reporting — FCA-mandated metrics calculated in dbt with full lineage. Every number could now be traced back to the raw source data, satisfying their compliance team's requirements.

Week 5–6: Validation & Cutover

We ran both systems in parallel for two weeks. A reconciliation dashboard compared outputs from the old and new pipelines side by side:

  • Revenue figures matched to the penny after we corrected the duplicate transaction bug in the legacy system.
  • Customer counts matched within 0.3% — the difference was traced to the legacy system double-counting customers with multiple email addresses.
  • Regulatory metrics matched exactly, giving the compliance team confidence to approve the cutover.
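Conceptually, the dashboard's comparison reduces to a per-day diff between the two pipelines' outputs. A minimal sketch (the dates, figures, and tolerance are hypothetical):

```python
from decimal import Decimal

# Hypothetical daily revenue figures from the legacy and new pipelines.
legacy = {"2023-04-01": Decimal("12450.00"), "2023-04-02": Decimal("9981.50")}
modern = {"2023-04-01": Decimal("12450.00"), "2023-04-02": Decimal("9981.50")}

def reconcile(a, b, tolerance=Decimal("0.00")):
    """Return the days where the two pipelines disagree beyond the tolerance."""
    mismatches = {}
    for day in sorted(set(a) | set(b)):
        diff = abs(a.get(day, Decimal("0")) - b.get(day, Decimal("0")))
        if diff > tolerance:
            mismatches[day] = diff
    return mismatches

print(reconcile(legacy, modern))  # → {} (no mismatches: matched to the penny)
```

Using `Decimal` rather than floats matters here: a penny-level reconciliation of monetary values should never be subject to binary floating-point rounding.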

At the end of the two-week parallel run, we switched off the legacy scripts. No one noticed — which was exactly the point.

The Architecture

Sources (PostgreSQL, Stripe, Intercom, HubSpot)
    │
    ▼
Fivetran (automated ingestion, schema change detection)
    │
    ▼
Snowflake (cloud data warehouse)
    │
    ▼
dbt (transformations)  ◄── GitHub Actions (CI/CD, automated testing on every commit)
    ├── staging/      → stg_stripe__payments, stg_hubspot__contacts
    ├── intermediate/ → int_revenue__reconciled
    └── marts/        → fct_daily_revenue, dim_customers, rpt_fca_monthly
    │
    ▼
Metabase (dashboards & self-serve analytics)

What We Tested — And Why It Matters

Data quality testing was the single most impactful part of the migration. We implemented 47 dbt tests across the project:

  • Schema tests — uniqueness, not-null, accepted values, relationships
  • Custom data tests — revenue reconciliation checks, row count anomaly detection, freshness thresholds
  • Source freshness tests — alerts if any source data is more than 4 hours stale
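The source-freshness rule above is configured declaratively in dbt, but the underlying check is simple: compare each source's latest-loaded watermark against a staleness threshold. A sketch with hypothetical source names and timestamps:

```python
from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLD = timedelta(hours=4)

def stale_sources(latest_loaded_at, now=None):
    """Return the sources whose most recent record is older than the threshold."""
    now = now or datetime.now(timezone.utc)
    return [name for name, ts in latest_loaded_at.items()
            if now - ts > STALENESS_THRESHOLD]

# Hypothetical per-source load watermarks.
now = datetime(2023, 4, 1, 12, 0, tzinfo=timezone.utc)
watermarks = {
    "stripe":   datetime(2023, 4, 1, 11, 30, tzinfo=timezone.utc),  # 30 min old: fresh
    "intercom": datetime(2023, 4, 1, 6, 0, tzinfo=timezone.utc),    # 6 h old: stale
}
print(stale_sources(watermarks, now))  # → ['intercom']
```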

These tests run automatically on every pipeline execution. In the first week after go-live, they caught 3 issues that would have previously gone undetected for days or weeks:

  1. A Stripe webhook that started sending duplicate events after a Stripe API upgrade
  2. A HubSpot contact merge that created orphaned deal records
  3. A time zone mismatch in Intercom event timestamps

Each issue was caught within minutes, not days. The team fixed them before any downstream report was affected.
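A common remedy for the duplicate-webhook class of issue is idempotent ingestion keyed on the provider's event ID — Stripe event IDs are stable across redeliveries, so a later deduplication step can keep only the first occurrence. In the project this lived in a dbt deduplication model; the sketch below shows the same idea in Python with hypothetical event payloads:

```python
def dedupe_events(events, seen=None):
    """Keep only the first occurrence of each webhook event ID (idempotent ingestion)."""
    seen = seen if seen is not None else set()
    unique = []
    for e in events:
        if e["id"] not in seen:
            seen.add(e["id"])
            unique.append(e)
    return unique

# Hypothetical duplicated Stripe webhook delivery.
batch = [{"id": "evt_1", "type": "invoice.paid"},
         {"id": "evt_1", "type": "invoice.paid"},  # redelivered duplicate
         {"id": "evt_2", "type": "invoice.paid"}]
print([e["id"] for e in dedupe_events(batch)])  # → ['evt_1', 'evt_2']
```

Passing a persistent `seen` set (or, in the warehouse, a window over previously loaded IDs) extends the same logic across batches.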

Results

Six weeks after kickoff, the client had:

Metric                            Before       After         Change
Data incidents per month          16           1             -94%
Pipeline maintenance hours/week   20           1.5           -92%
Time to produce board report      3 days       15 minutes    -99%
Data sources integrated           4            4 (same)      Automated
Automated tests                   0            47            From zero
Time to detect data issues        3–11 days    < 5 minutes   Real-time

Beyond the Numbers

The qualitative impact was equally significant:

  • The finance team trusts the numbers. They stopped maintaining their shadow spreadsheet within two weeks.
  • The compliance team can answer regulator questions in minutes, not days. Full lineage from report → model → source is one click away.
  • Engineers are building features, not fixing pipelines. The 20 hours per week freed up is now spent on product analytics and ML experimentation.
  • Board reports are automated. The CEO receives a refreshed dashboard link every Monday morning — no manual assembly required.

Key Takeaways

For data leaders considering a similar migration:

  1. You don't need to migrate everything at once. Start with the most painful data domain and prove the approach works before expanding.
  2. Run systems in parallel. It costs a few extra weeks but gives stakeholders the confidence to approve the cutover.
  3. Testing is not optional. Automated data quality tests are the difference between a modern data stack and a new set of problems.
  4. Plan for knowledge transfer from day one. If your team can't maintain the system after the engagement, you haven't solved the problem — you've outsourced it.
  5. Legacy doesn't mean wrong. The old system had valuable business logic embedded in those messy scripts. We extracted and documented it, not discarded it.

The client's CTO summed it up: "For the first time in two years, I'm not worried about our data. It just works."