Published with the client's permission. Identifying details and the most sensitive specific numbers have been generalised; the methodology, fix list, and proportional results are accurate.

011. The client & the bottleneck

A UK-based SaaS in the recruitment-tech space. Series-A funded, ~200k monthly active users, Django 4.2 monolith with DRF, Postgres 15 on AWS RDS (db.m6g.2xlarge), 6× db.r6g.large read replicas, 14× EC2 web workers behind an ALB, 8× Celery workers, ElastiCache Redis for cache and Celery broker.

The brief that landed on our desk:

  • The monthly AWS bill had grown from £8,400 → £12,600 over fourteen months, while MAU had only grown 28%. Spend per user was creeping up.
  • p95 web latency had drifted from 320ms → 1.4s on the dashboard endpoints. Customer support tickets mentioned "slow" weekly.
  • Two Saturday-night incidents in the last quarter had paged the on-call engineer. Both were "RDS CPU at 100%, web workers timing out." Both were resolved by scaling up RDS — solving the symptom, not the cause.
  • The CTO had a finance-team-approved budget to spend on engineering performance work, but wanted an outcome they could put in board minutes — not "we improved a few things."

Engagement shape: a fixed-price 2-week Discovery Sprint to confirm the actual root cause and produce an evidence-backed remediation plan, followed by a 4-week build squad to execute it. Total billable: 6 weeks, £42,000.

The reframing the client needed

The bill wasn't growing because of growth. It was growing because of inefficiency that growth was exposing. Scaling up RDS was paying the operational tax instead of fixing the underlying problem — and the tax was compounding.

022. The profiling methodology

"Optimise Django" is not a brief. "Find what's actually costing you money and fix it in priority order" is. The first week was instrumentation and measurement; we did not change a single line of application code. The goal was to make decisions from data, not intuition.

Step 1 — Cost attribution

AWS Cost Explorer told us the bill breakdown by service: 58% RDS (primary + replicas + storage + IOPS), 19% EC2, 11% data transfer, 6% ElastiCache, 6% miscellaneous. RDS was the obvious leverage point.

Within RDS, pg_stat_statements told us where the actual database time was going. Two queries accounted for 63% of total database time: the dashboard "applications-for-user" query, and a webhook-receiver lookup that ran on every inbound event from the client's ATS integration.

Step 2 — Production traffic capture

We deployed a lightweight Django middleware that sampled 1% of production requests and wrote request + response payloads (PII-redacted) to S3 for one week. By the end of the week we had a representative replay corpus — exactly the traffic the system actually serves, not what the test suite assumes.

This corpus became the regression harness for everything that followed. No refactor shipped unless it produced byte-identical responses against the replay.

Step 3 — Request-level tracing

We instrumented the application with OpenTelemetry, exported traces to a temporary Honeycomb account (the client's existing Sentry covered errors but not detailed query timing). Two hours of representative traffic gave us a per-request flame graph for every endpoint.

The dashboard endpoint's flame graph was a horror story: 187 SQL queries per request, ~600ms of database time, ~250ms of Python ORM overhead, ~100ms rendering. Classic N+1 cascade, dressed up as a feature-rich dashboard.

before · simplified excerpt of the dashboard view python
def dashboard(request):
    apps = Application.objects.filter(user=request.user).order_by('-created')[:50]
    return render(request, 'dashboard.html', {'applications': apps})

# dashboard.html — the per-row template fragment that fired the storm
{% for app in applications %}
  <tr>
    <td>{{ app.candidate.name }}</td>            # 1 query each
    <td>{{ app.job.title }}</td>                 # 1 query each
    <td>{{ app.job.company.name }}</td>          # 1 query each
    <td>{{ app.status.label }}</td>              # 1 query each
    <td>{{ app.notes.count }}</td>               # 1 COUNT each
    <td>{{ app.last_interview.scheduled_at }}</td> # 1 query each
  </tr>
{% endfor %}
# 50 rows × 6 lookups = 300 queries, minus a few cache hits = the 187 we measured

Step 4 — A prioritised list

The final discovery output was a single spreadsheet — every flagged issue with impact on cost, effort to fix in person-days, risk of regression, and dependency. Sorted by impact/effort ratio. The top 7 items accounted for 90% of the projected savings; the bottom 23 we deliberately deferred.

033. ORM optimisation & query reduction

Six concrete fixes landed in the four-week build window, in this order:

Fix 1 — The dashboard query (single biggest win)

Replaced the 187-query cascade with one query using select_related for forward foreign keys and prefetch_related with a custom Prefetch() queryset for the reverse relations:

after · same endpoint, one query (well, two with the prefetch) python
from django.db.models import Count, Prefetch


def dashboard(request):
    latest_interview = Interview.objects.order_by('-scheduled_at')
    apps = (Application.objects
        .filter(user=request.user)
        .select_related('candidate', 'job__company', 'status')
        .prefetch_related(Prefetch('interviews', queryset=latest_interview, to_attr='_recent_interviews'))
        .annotate(notes_count=Count('notes'))
        .order_by('-created')[:50])
    return render(request, 'dashboard.html', {'applications': apps})

Result: 187 queries → 2 queries; p95 dashboard latency 1.4s → 280ms. Just this fix moved the average database CPU on the primary from ~70% to ~25%.

Fix 2 — Composite indexes on the hot tables

pg_stat_statements showed three queries doing sequential scans on million-row tables despite "having indexes." On inspection, the existing indexes covered single columns; the actual queries filtered on (user_id, status, created) composite criteria. Three new composite indexes, created with CONCURRENTLY so production stayed up:

migration · indexes added concurrently in production python
class Migration(migrations.Migration):
    atomic = False   # required for CONCURRENTLY

    operations = [
        migrations.RunSQL(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS "
            "idx_app_user_status_created ON applications_application "
            "(user_id, status_id, created DESC);",
            reverse_sql="DROP INDEX IF EXISTS idx_app_user_status_created;",
        ),
        # …two more, same shape
    ]

Result: the three slowest non-dashboard queries each dropped from ~200ms to <15ms. Replica lag during peak hour fell from ~2s to <100ms — a side benefit nobody had predicted.

Fix 3 — PgBouncer in front of RDS

The web fleet held ~280 concurrent Postgres connections at peak; RDS max_connections was 400. One bad deploy and the platform was 30% away from connection exhaustion. We introduced PgBouncer in transaction mode in front of the primary and replicas; app-side connection pool became effectively infinite; RDS connection count settled at ~45 in steady state.

Result: nominal latency overhead (~1ms per connection acquisition), but the safety margin paid for itself the first time a colleague accidentally shipped a long-running migration during business hours. No incident.

Fix 4 — Multi-tier Redis cache for the webhook receiver

The ATS-integration webhook hit a single read query (lookup applicant by external-system ID) ~600k times per day. Caching that lookup with a 5-minute TTL and explicit invalidation on update reduced webhook database load by 94%. The endpoint p95 latency went from 240ms → 38ms; the integration partner noticed and complimented the SLA.

Fix 5 — Move synchronous fan-out to Celery

The "submit application" endpoint synchronously sent four notifications, generated a PDF receipt, and wrote to two third-party HR systems. p95: 4.2 seconds, blocking a web worker for the duration. We moved all five side effects to Celery tasks with explicit idempotency keys. Endpoint p95: 180ms. Application throughput per worker tripled.

Fix 6 — Killed the dead views

Production access logs revealed that 11 view functions had received zero traffic in the last 90 days. Some had nightly cron jobs hitting them anyway. We removed 4,200 lines of code, decommissioned the cron jobs, and dropped two now-unused database tables (after a 2-week grace period and a confirmation that nothing was reading them).

Result: smaller blast radius, less code to maintain, and a small but real reduction in the application's memory footprint.

044. The infrastructure rightsizing that followed

After the code work, the application was using a small fraction of the database CPU it had been provisioned for. Sustaining the same RDS instance was paying for capacity we no longer needed. We staged a series of rightsizing steps over two weeks:

  • Dropped 3 of the 6 read replicas. The remaining 3 were still under 30% peak CPU.
  • Right-sized RDS primary from db.m6g.2xlarge to db.m6g.xlarge (half the size). Peak CPU rose from ~25% to ~50% — comfortable.
  • Cut web workers from 14 to 10. p99 latency unaffected; CPU headroom intact.
  • Switched Celery workers to mixed Spot instances for non-critical queues. ~70% saving on that portion of the EC2 bill.
  • Reduced ElastiCache cluster size by one node. Working set shrank because we deleted the queries that needed the cached data in the first place.

Each step was rolled out behind a feature flag with observability in place, and rolled back on the one we got wrong (Spot reclamation initially caused a queue-age SLO breach until we added a fallback on-demand worker pool). Total infrastructure changes: net £4,800/month saved on top of the code wins.

055. Tangible business results

The numbers the CTO took to the board. Same six-week window for measurement, before-vs-after, no traffic-growth adjustment:

before / after · representative numbers £ gbp
METRIC                                    BEFORE          AFTER           DELTA
─────────────────────────────────────────────────────────────────────────────
Monthly AWS bill                          £12,600/mo       £7,330/mo        −42%
Annualised saving                         —                £63,240/yr       —
RDS instance size                         db.m6g.2xlarge   db.m6g.xlarge    half
Read replicas                             6                3                −50%
Web worker count                          14               10               −29%

p95 latency (dashboard)                   1,420 ms         280 ms           −80%
p95 latency (submit application)          4,200 ms         180 ms           −96%
p95 latency (webhook receiver)            240 ms           38 ms            −84%

Database queries / dashboard request      187              2                −99%
RDS primary peak CPU                      ~70%             ~50%             healthier
Connection-exhaustion close calls / mo    3                0                eliminated

Saturday-night pages (last 90d)           2                0                quiet
Customer "slow page" tickets / mo         ~24              ~3                −88%

The commercial result

  • Recurring annualised saving: £63,240/year. Net of the engagement fee, payback in <4 months. Compounding from there forward.
  • Headcount unblocked. The CTO had been planning to hire a second platform engineer to "deal with growth." That headcount got redeployed to a revenue-generating product squad instead.
  • SLO restored. The platform now sits inside the 500ms p95 target the customer-success team had quietly given up on. Renewal conversations went easier.
  • On-call quality. Zero out-of-hours pages in the 90 days after delivery. The engineering team's retention conversation got measurably easier.
  • Capacity headroom. The platform now has ~2× the headroom it had before, without spending more. The next growth tier doesn't require infrastructure decisions for at least 12 months.
The thing CTOs find counter-intuitive

Most "performance problems" are not language problems or framework problems. They are measurement problems. Once you can see what's actually slow and what it's costing you, the fixes are usually well-understood patterns applied in priority order. The discovery sprint is where the value gets unlocked; the build phase just collects it.

066. What we'd do differently

Two honest reflections from the post-engagement retrospective:

  • We deployed OpenTelemetry traces during the engagement and kept the Honeycomb integration on a trial account. If we did it again, we'd recommend the client keep the tracing infrastructure long-term — the next regression will hide for weeks otherwise. We've since folded "permanent tracing setup" into the default Discovery Sprint deliverable list.
  • We rightsized infrastructure two weeks after the code work shipped. In hindsight, we should have waited four weeks. Traffic patterns vary by month-end / payroll-cycle in this domain, and we missed one peak entirely. Result: a single tense Friday where the right-sized RDS primary briefly hit 90% CPU. Easy fix (one instance class up for two days, then back down), but it was avoidable.

077. Common questions

"How transferable are these results to other Django platforms?"

Highly. Of the six fixes above, the first three (N+1 elimination, composite indexes, PgBouncer) appear on nearly every Django performance audit we run. The next three are situational but common. The percentage savings vary — we've seen anywhere from 20% to 55% in similar engagements — but the methodology is the same every time. If your platform has been growing for >18 months without dedicated performance work, the leverage is almost certainly there.

"Could we do this ourselves?"

Many teams can, with discipline. The two things that derail most in-house attempts: not having dedicated time (performance work loses every sprint planning to feature work) and not measuring before you optimise (engineers fix what they remember rather than what's actually slow). If you can carve out a focused engineer for 4–6 weeks and commit to measurement-first, the outcome is achievable in-house. If not, a fixed-scope external engagement removes both blockers.

"What does this kind of engagement actually cost?"

For this client: £8k for the 2-week Discovery Sprint plus £34k for the 4-week build squad — £42k total. Payback in under four months. We publish the prices on the Django rescue practice page and the Discovery Sprints page; what we deliver is the same regardless of the contracting framework.

"Did anything not work?"

Two attempted fixes didn't earn their keep. (1) We tried moving the dashboard to a Django async view to overlap external service calls — measured no improvement because the bottleneck was the database, not external I/O. We rolled it back. (2) We experimented with materialised views for two complex aggregates. The maintenance burden of refresh scheduling wasn't worth the marginal speed gain; we replaced them with simpler covering indexes. Both decisions are documented in the architecture decision records the client now owns.

"What's the right time to engage on this kind of work?"

When your monthly infrastructure bill starts to feel "unexpectedly large", when p95 latency on your hot endpoints has drifted past your customer-facing SLO, or when on-call pages have become a recruiting-and-retention liability. Don't wait for a Saturday-night incident to force the conversation. A 2-week Discovery Sprint is structured exactly to give you the evidence you need before committing to a larger engagement.

If your Django platform is showing any of those symptoms and you'd like a sanity check, the Discovery Sprint is fixed-price, written outputs, no obligation. The methodology described here is the one we apply every time.