02 · Django practice

Django product engineering & legacy codebase rescue.

For CTOs, VPs of Engineering and founders running scaling SaaS or Fintech platforms — the team that brings nine-year-old Django monoliths back to life without a maintenance window, a death-march rewrite, or burning out the engineers who actually understand the code.

Sound familiar?

  • Average response time has crept from 200ms to 1.4s over two years and nobody knows exactly when
  • The Django upgrade is blocked by three unmaintained third-party packages that nobody dares replace
  • Multi-tenant database is one bad query away from a noisy-neighbour incident
  • "Fat models" file is 2,400 lines, the views file is 3,800, and the test suite skips half of that code
  • OWASP scan flagged eleven mediums and your security team needs a path to "all green" by quarter-end
  • Senior engineer who built the original system is leaving in 90 days

The rescue & stabilisation framework.

Big-bang rewrites are how engineering teams end careers, not how they ship value. Every codebase we've ever rescued has been modernised through small, reversible, telemetry-gated steps — Strangler Fig at the routing layer, Branch by Abstraction at the module layer. The legacy code keeps running until it has zero traffic. Then it gets deleted.

R.01

Stop the bleeding

First two weeks. Find and triage the bugs and queries that are paging your team on weekends. Quick fixes only; no architectural changes yet.

→ on-call quiet
R.02

Map the seams

Static analysis, runtime tracing and dependency-graph extraction to identify the natural boundaries inside the monolith. The map is the contract — we modernise along seams, not against them.

→ seam_map.svg
R.03

Capture ground truth

Production traffic for the affected routes is captured (request, headers, response, timing) for replay. This is the regression harness — not the existing test suite, which is often the source of the original bug.

→ replay-harness
R.04

Strangler Fig routing

A proxy layer (nginx/Envoy or a Django middleware shim) routes a percentage of traffic to the modernised module. Replay-validated. Feature-flagged. Rolled back if metrics drift.

→ canary @ 1%
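
As a rough sketch of the routing decision, assuming traffic is split on a stable key such as a tenant ID: the key is hashed into a bucket so the same tenant always lands on the same side of the canary. The names `canary_bucket` and `use_modern_path` are hypothetical, and `enabled` stands in for whatever feature flag or kill switch the platform already has.

```python
import hashlib


def canary_bucket(key: str) -> int:
    """Map a stable key (for example a tenant ID) to a bucket in 0..9999."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def use_modern_path(key: str, percent: float, enabled: bool = True) -> bool:
    """Return True if this key falls inside the canary slice.

    `percent` is the share of keys to route (1.0 means 1%).
    `enabled` is a stand-in for a feature flag: flipping it off rolls back.
    """
    if not enabled:
        return False
    return canary_bucket(key) < int(percent * 100)


# Roughly 1% of these hypothetical tenants are routed to the modernised module.
routed = sum(use_modern_path(f"tenant-{i}", 1.0) for i in range(100_000))
```

A Django middleware shim or proxy rule would call a check like this per request and dispatch accordingly. Hashing rather than random sampling keeps each tenant's experience consistent between requests.
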
R.05

Branch by Abstraction

Inside the codebase, the legacy module is fronted by an abstraction. The new implementation lives alongside it. We flip implementations behind a flag — never two diverging branches in git.

→ feature flag
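
A minimal sketch of the pattern, using a hypothetical pricing module: callers depend on the `Pricing` abstraction, both implementations live side by side, and a flag (here a plain dict, as an assumption) picks one at runtime.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Line:
    unit_pence: int
    quantity: int


class Pricing(Protocol):
    """The abstraction that fronts the legacy module."""

    def total_pence(self, lines: list[Line]) -> int: ...


class LegacyPricing:
    def total_pence(self, lines: list[Line]) -> int:
        total = 0
        for line in lines:
            total = total + line.unit_pence * line.quantity
        return total


class ModernPricing:
    def total_pence(self, lines: list[Line]) -> int:
        return sum(line.unit_pence * line.quantity for line in lines)


def get_pricing(flags: dict[str, bool]) -> Pricing:
    """Choose the implementation behind the flag; callers never import either class."""
    if flags.get("modern_pricing", False):
        return ModernPricing()
    return LegacyPricing()


basket = [Line(unit_pence=250, quantity=2), Line(unit_pence=999, quantity=1)]
legacy_total = get_pricing({}).total_pence(basket)
modern_total = get_pricing({"modern_pricing": True}).total_pence(basket)
```

Because both paths sit on the main branch, the flag can be flipped back without a revert or a redeploy.
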
R.06

Delete the legacy

Telemetry shows zero traffic on the legacy path for two weeks. We delete it, the abstraction layer, and the feature flag in one PR. The codebase actually gets smaller.

→ net negative LoC

Decoupled architecture & service layers.

"Fat models" and view files that hold business rules feel productive at month three and unmaintainable by year three. We extract domain logic into typed, isolated service layers — Django ORM for persistence, services for business rules, views for HTTP only. The same patterns Django Software Foundation members recommend, applied with discipline.

Concern by concern, fat models / fat views against service layer + thin views:

  • testability
    Fat models / fat views: Every test needs a database, a request and a logged-in user.
    Service layer + thin views: Pure-Python tests for business rules. The HTTP layer barely needs unit tests.
  • reuse
    Fat models / fat views: CLI commands and Celery tasks duplicate the view logic.
    Service layer + thin views: One service function is called by views, management commands, tasks and tests alike.
  • transactions
    Fat models / fat views: Implicit. Easy to corrupt state when a signal fires mid-save.
    Service layer + thin views: Explicit atomic() blocks at the service boundary. Side effects are deferred with transaction.on_commit(), so they run only after the commit succeeds.
  • type safety
    Fat models / fat views: Dictionaries everywhere; renames break at runtime.
    Service layer + thin views: Pydantic / dataclasses for inputs and outputs. mypy --strict on the services package.
  • refactor blast radius
    Fat models / fat views: A field rename touches 40 files across views, serialisers and templates.
    Service layer + thin views: Contained. Domain changes live behind the service interface; consumers don't notice.
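
To make the comparison concrete, here is a small sketch of a service function with typed inputs and outputs. The refund domain, `OrderStore` and `InMemoryStore` are hypothetical; in a real project the store would be backed by the Django ORM and the function body would typically run inside `transaction.atomic()`.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Protocol


@dataclass(frozen=True)
class RefundRequest:
    order_id: int
    amount: Decimal


@dataclass(frozen=True)
class RefundResult:
    order_id: int
    refunded: Decimal
    remaining: Decimal


class OrderStore(Protocol):
    """Persistence boundary; an ORM-backed class would satisfy this in production."""

    def paid_amount(self, order_id: int) -> Decimal: ...
    def record_refund(self, order_id: int, amount: Decimal) -> None: ...


class RefundError(ValueError):
    pass


def refund_order(store: OrderStore, request: RefundRequest) -> RefundResult:
    """Business rule only: no request object, no session, no template."""
    if request.amount <= 0:
        raise RefundError("refund must be positive")
    paid = store.paid_amount(request.order_id)
    if request.amount > paid:
        raise RefundError("refund exceeds amount paid")
    store.record_refund(request.order_id, request.amount)
    return RefundResult(request.order_id, request.amount, paid - request.amount)


class InMemoryStore:
    """Test double, so the rule can be exercised without a database."""

    def __init__(self, paid: dict[int, Decimal]) -> None:
        self.paid = dict(paid)

    def paid_amount(self, order_id: int) -> Decimal:
        return self.paid[order_id]

    def record_refund(self, order_id: int, amount: Decimal) -> None:
        self.paid[order_id] -= amount


store = InMemoryStore({1: Decimal("50.00")})
result = refund_order(store, RefundRequest(order_id=1, amount=Decimal("20.00")))
```

A view, a management command and a Celery task could all call `refund_order` with the same request type, which is the reuse benefit listed above.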

Database optimisations we ship on every engagement

P.01

Kill the N+1

The single biggest Django performance win. We trace every serialiser and template loop, add select_related / prefetch_related with explicit Prefetch() querysets, and lock the wins in with an N+1 detector in CI.

avg −40% p95
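
One way an N+1 detector can work, sketched under the assumption that the test run can hand us the SQL strings it executed (in Django, `CaptureQueriesContext` from `django.test.utils` records them): strip the literals out of each query and flag any query shape that repeats too often. `fingerprint`, `find_n_plus_one` and the threshold of 5 are illustrative choices, not a fixed rule.

```python
import re
from collections import Counter

# Matches single-quoted string literals and standalone integer tokens.
_LITERALS = re.compile(r"'(?:[^']|'')*'|\b\d+\b")


def fingerprint(sql: str) -> str:
    """Replace literals with '?' so queries differing only by value collapse together."""
    return _LITERALS.sub("?", sql).strip()


def find_n_plus_one(queries: list[str], threshold: int = 5) -> dict[str, int]:
    """Return each query shape that ran at least `threshold` times."""
    counts = Counter(fingerprint(q) for q in queries)
    return {shape: n for shape, n in counts.items() if n >= threshold}


# One list query followed by one lookup per row: the classic N+1 shape.
captured = ['SELECT "shop_order"."id" FROM "shop_order"'] + [
    f'SELECT "shop_customer"."name" FROM "shop_customer" WHERE "shop_customer"."id" = {i}'
    for i in range(1, 21)
]
suspects = find_n_plus_one(captured)
```

A CI test would fail whenever `suspects` is non-empty for a given route, which is what keeps a fixed N+1 from quietly coming back.
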
P.02

Indexing strategy

EXPLAIN ANALYZE on the slowest queries from a week of production traffic. Add covering and composite indexes. Drop unused indexes (they slow writes). Use partial indexes where tenancy permits.

postgres
P.03

Multi-tier Redis caching

Per-view fragment cache, per-queryset cache, and a hot-key idempotency cache. Explicit cache invalidation on writes, never TTL-only. Cache stampede protection via a per-key lock around cache.get_or_set(), which does not lock on its own.

redis · django-cachalot
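
The stampede idea can be sketched in-process: on a miss, only one caller recomputes the value while the others wait and then read the result. `SingleFlightCache` is a hypothetical stand-in that uses a dict and `threading.Lock`; across several workers the lock would have to live in the shared cache (for example a Redis lock) instead.

```python
import threading
import time
from typing import Any, Callable


class SingleFlightCache:
    """In-process stand-in for a shared cache with one lock per key."""

    def __init__(self) -> None:
        self._values: dict[str, Any] = {}
        self._locks: dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        if key in self._values:  # fast path: a hit takes no lock
            return self._values[key]
        with self._lock_for(key):
            if key in self._values:  # re-check: another caller may have filled it
                return self._values[key]
            value = compute()
            self._values[key] = value
            return value

    def invalidate(self, key: str) -> None:
        """Explicit invalidation, to be called from the write path."""
        self._values.pop(key, None)


calls = []


def slow_report() -> str:
    calls.append(1)
    time.sleep(0.05)  # simulate an expensive query
    return "report"


cache = SingleFlightCache()
threads = [
    threading.Thread(target=cache.get_or_compute, args=("report:tenant-7", slow_report))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight concurrent callers trigger a single recomputation here; without the per-key lock each of them would run the expensive query.
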
P.04

Connection pooling & PgBouncer

PgBouncer in transaction mode in front of Postgres, CONN_MAX_AGE tuned for your worker model, and a hard cap on concurrent connections. Fewer 5xxs at peak.

pgbouncer
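
A settings sketch for this setup, with the environment variable names and default values as assumptions. The one non-obvious line is `DISABLE_SERVER_SIDE_CURSORS`: Django's documentation calls for it when a pooler runs in transaction mode, because a server-side cursor cannot outlive the transaction that opened it.

```python
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DB_NAME", "app"),
        "USER": os.environ.get("DB_USER", "app"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        # Point Django at PgBouncer, not at Postgres directly.
        "HOST": os.environ.get("PGBOUNCER_HOST", "127.0.0.1"),
        "PORT": os.environ.get("PGBOUNCER_PORT", "6432"),
        # Starting point only: 0 closes the connection after each request and
        # leaves pooling to PgBouncer. Tune per worker model.
        "CONN_MAX_AGE": int(os.environ.get("DB_CONN_MAX_AGE", "0")),
        # Needed with transaction-mode pooling.
        "DISABLE_SERVER_SIDE_CURSORS": True,
    }
}
```
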
P.05

Async views where they earn it

Synchronous views for the 95% that just talk to Postgres. async def for the routes that fan out to multiple third-party APIs. Measured wins only.

django 5 async
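
The fan-out case looks roughly like this. The sketch uses plain `asyncio` with simulated calls; `call_provider`, the provider names and the delays are hypothetical placeholders for real HTTP requests made from an `async def` view.

```python
import asyncio


async def call_provider(name: str, delay: float) -> dict:
    """Stand-in for an HTTP call to a third-party API."""
    await asyncio.sleep(delay)
    return {"provider": name, "ok": True}


async def gather_quotes(timeout: float = 1.0) -> list[dict]:
    """Call every provider concurrently and drop any that time out or fail."""
    providers = {"fx": 0.05, "kyc": 0.05, "credit": 0.05}
    calls = [
        asyncio.wait_for(call_provider(name, delay), timeout)
        for name, delay in providers.items()
    ]
    results = await asyncio.gather(*calls, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]


quotes = asyncio.run(gather_quotes())
```

The three calls overlap, so the route waits for the slowest provider instead of the sum of all three. That difference is the measured win to look for before converting a view.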

Rigorous testing & security hardening.

Production traffic is the only honest test suite. We capture it, replay it, and use it as the ground truth that every refactored module must reproduce. On the security side, we work through the OWASP Top 10 with the same discipline — gated, automated, and verified in the CI pipeline.

T.01

Replay regression harness

One week of production traffic is captured (sanitised of PII) and turned into a deterministic replay suite. Every refactor PR must produce byte-identical responses or explicitly justify the diff.

→ replay.toml
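
The core of such a harness is small. In this sketch, `Captured` is a hypothetical record of one sanitised request and its response, and `handler` is any callable that runs the refactored code for a method and path (a thin wrapper around a test client, for instance) and returns the status and body.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Captured:
    method: str
    path: str
    status: int
    body: bytes


Handler = Callable[[str, str], tuple[int, bytes]]


def replay(captured: list[Captured], handler: Handler) -> list[str]:
    """Return the paths whose replayed response is not byte-identical to the capture."""
    mismatches = []
    for item in captured:
        status, body = handler(item.method, item.path)
        if (status, body) != (item.status, item.body):
            mismatches.append(item.path)
    return mismatches


def refactored_handler(method: str, path: str) -> tuple[int, bytes]:
    """Fake refactored code: /a is unchanged, /b has drifted."""
    return (200, b'{"total": 10}') if path == "/a" else (200, b'{"total": 11}')


traffic = [
    Captured("GET", "/a", 200, b'{"total": 10}'),
    Captured("GET", "/b", 200, b'{"total": 10}'),
]
diffs = replay(traffic, refactored_handler)
```

An empty `diffs` list lets the PR through; anything else has to be fixed or explicitly justified.
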
T.02

Mutation testing on the seam

Once the service layer exists, mutation testing (mutmut) on the domain rules catches the "tests pass but the logic is wrong" failure mode that line-coverage misses entirely.

→ mutmut
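
A tiny illustration of that failure mode, using a hypothetical late-fee rule. The first two checks execute every line, yet they would still pass if `>` were changed to `>=`, which is the kind of small edit a mutation tool makes. Only the boundary checks catch it.

```python
def late_fee_pence(days_overdue: int) -> int:
    """Flat fee once an invoice is more than 30 days overdue."""
    return 1500 if days_overdue > 30 else 0


# Full line coverage, but a `>=` mutant survives both of these.
assert late_fee_pence(10) == 0
assert late_fee_pence(45) == 1500

# Boundary checks: with `>=` the first one fails, so the mutant is killed.
assert late_fee_pence(30) == 0
assert late_fee_pence(31) == 1500
```
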
T.03

OWASP-aligned hardening

Session cookie flags (HttpOnly, Secure, SameSite=Lax), CSP via django-csp, secret rotation, dependency scanning with pip-audit in CI. Concrete checklist, not a slide deck.

→ owasp top 10
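
The cookie flags map directly onto Django settings. The fragment below is a sketch for an HTTPS-only production settings module; the CSRF and HSTS lines go beyond the list above and are included as assumptions about a typical baseline, with the one-year HSTS value as an example only.

```python
# Session cookie flags named in the checklist.
SESSION_COOKIE_HTTPONLY = True
SESSION_COOKIE_SECURE = True
SESSION_COOKIE_SAMESITE = "Lax"

# Assumed additions: the same treatment for the CSRF cookie.
CSRF_COOKIE_SECURE = True
CSRF_COOKIE_SAMESITE = "Lax"

# Assumed additions: transport hardening. Roll HSTS out gradually,
# because browsers remember it for the full duration.
SECURE_SSL_REDIRECT = True
SECURE_HSTS_SECONDS = 31_536_000  # one year, example value
SECURE_CONTENT_TYPE_NOSNIFF = True
```

Running `python manage.py check --deploy` against the production settings reports several of these if they are missing.
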
T.04

Secrets & environment hygiene

Secrets out of settings.py, into a managed store (AWS Parameter Store, Vault, Doppler). django-environ enforces the shape. CI fails the build if a secret slips into a tracked file.

→ vault · gitleaks
T.05

Dependency alerting in CI

Daily pip-audit + safety check runs. New CVEs against your installed packages open a PR with the patched version pinned. Renovate or Dependabot for the cadence.

→ pip-audit
T.06

Observability that pays its bill

Structured logs, traces (Sentry, Honeycomb or OpenTelemetry), and one dashboard per critical user journey. Not 200 dashboards no one looks at — the 5 that on-call actually uses.

→ sentry · honeycomb

What changes after we engage.

Concrete, measurable outcomes. We agree the success metrics before we start and report against them weekly.

01
Modernised, version-current Django

Latest Django LTS, latest Python, all critical third-party packages replaced or pinned and patched. Upgrade path documented for the next two LTS bumps.

02
Measurable performance win

Median p95 improvement of 60–80% on the routes we touch, validated against production traffic — not synthetic benchmarks.

03
Service-layer domain code

Business rules live in typed, tested service modules. Views shrink. Tests get faster. Onboarding a new engineer takes days, not months.

04
CI/CD pipeline you can trust

Replay-based regression harness, mutation tests on the domain, pip-audit in CI, preview environments per PR. Merges land safely without manual QA gauntlets.

05
OWASP-aligned hardening report

Concrete remediation against the OWASP Top 10, with before/after evidence. Suitable for SOC 2, ISO 27001 and Cyber Essentials audit packs.

06
Knowledge handover

Recorded sessions, architecture decision records (ADRs), and a written runbook. Your team owns the code on day one of the engagement — not day one after launch.

Is this the right engagement for you?

Yes, if… (we're a fit)

  • Production Django (or a framework built on top of it, such as DRF) of any age
  • You have a real, in-house engineering team you want to upskill — not replace
  • You can pay for replay capture (we never touch production without telemetry)
  • Business-critical: the cost of an outage materially affects revenue
  • 3+ months of runway for the engagement; we don't sprint-and-bounce

Probably not, if… (consider others)

  • You're pre-product-market-fit and the problem is "build the thing faster"
  • You want a fixed-price rewrite from scratch with no production traffic to migrate
  • You expect a 60-page tender response before a discovery conversation
  • Your stack is FastAPI or Flask only — we love them but we specialise in Django

Common questions.

How do you upgrade Django without taking the platform offline?

The Strangler Fig pattern. A thin proxy routes traffic between the legacy codebase and the modernised modules. New routes ship behind feature flags, get validated against captured production traffic, then take over the route entirely. The legacy module is deleted only when telemetry shows zero traffic for two weeks. No big-bang cutover, no scheduled maintenance window.

Our Django app is slow. What's usually the cause?

In 80% of audits, the top three culprits are: N+1 queries hidden inside template loops or DRF serialisers, missing composite indexes on hot tables, and synchronous third-party HTTP calls inside request handlers. We profile production traffic for one week, sequence fixes by impact-vs-effort, and ship them behind feature flags so wins (and any regressions) are visible immediately. Our Django performance article walks through the specific patterns.

Do you do rescue work on multi-tenant Django databases?

Yes. We've untangled both schema-per-tenant and row-level-tenancy systems. The most common failure modes are: cross-tenant data leaks via missing tenant filters in custom managers, lock contention from per-tenant migrations, and query-planner regressions as some tenants grow 100× larger than others. We diagnose, contain, and ship a path to safe multi-tenancy without a rewrite.

How do you avoid regressions during refactoring?

Before any change, we build an automated regression harness that captures production inputs and outputs for the affected code paths. Every refactored module must produce byte-identical responses for the captured traffic before it ships. We treat production traffic as the ground truth — not the existing test suite, which is often the source of the original bug. See our Django testing article for the full methodology.

Can you integrate AI features (RAG, agents) into our existing Django platform?

Yes, that's how our Applied AI practice works. We don't recommend a parallel Node service for AI: we build AI features directly into your Django application (pgvector through the ORM, Celery for background work, LangGraph for agent orchestration) so embeddings, prompts and tool calls share the same domain model as the rest of your application.

What's the typical engagement length and team shape?

Most rescue engagements run 3–6 months on a rolling monthly cadence. The default squad is one lead engineer plus one specialist (database / security / async / frontend, depending on the brief). We work alongside your in-house team, not in a silo, so knowledge transfer is continuous rather than dumped in a final handover.

Inherited a Django monolith?
Let's rescue it — safely.

Codebase audit
3–5 days · from £2,500
Rescue squad
Rolling monthly · £15–25k
Discovery sprint
2 weeks · fixed-price · £8k