Every conversation about "scaling Python" eventually arrives at the same place: the language is rarely the bottleneck. The bottleneck is architecture. A well-shaped Django monolith on Python 3.12 will usually outperform a badly-shaped microservices mesh in any language on the same hardware. The patterns below are the ones we reach for on every production engagement: they compose, they're cheap to introduce, and they each pay for themselves inside the first traffic spike.

All five examples below assume Django + Postgres + Redis, because that is the stack we ship most often. They translate cleanly to FastAPI, Flask, or Starlette with minimal change.

1. Modular monolith vs microservices in Python

The default answer in 2026 is still a modular monolith, and the burden of proof is on whoever wants to split it. Microservices are a deploy-time-independence pattern; they cost you network round-trips, distributed-transaction complexity, observability surface, and roughly 3–10× the operational overhead. You take that cost on purpose, when you have a genuine reason, not by default.

What a modular monolith actually looks like in Django

One repo, one deploy, multiple bounded contexts expressed as Django apps with intentionally weak coupling between them. The discipline is in the import graph:

  • Apps own their data. No app reaches into another app's models. Cross-app access goes through an explicit service-layer function or a published interface module.
  • Service layers, not fat models. Business rules live in app/services.py, typed with Pydantic or dataclasses. Views become thin HTTP adapters. Tests run in milliseconds.
  • Linted boundaries. Tools like import-linter enforce the rules. A PR that imports billing.models from users.views fails CI.
  • Single deploy, multiple worker pools. Same image, different command lines — web, celery-default, celery-priority, celery-low. Failure isolation without microservice plumbing.
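The "linted boundaries" rule can be illustrated with a few lines of standard-library Python. This is only a sketch of the idea, not how import-linter works internally: the app names (users, billing, orders), the rule that an app's models module is private, and the helper find_boundary_violations are all assumptions made for the example.

```python
import ast

# Assumption: each top-level package is one Django app, and "<app>.models"
# is private to that app. The app names are illustrative only.
APPS = {"users", "billing", "orders"}


def find_boundary_violations(source: str, current_app: str) -> list[str]:
    """Return imports in `source` that reach into another app's models."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            if node.level:  # relative import: stays inside the current app
                continue
            modules = [f"{node.module}.{alias.name}" for alias in node.names]
        elif isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        else:
            continue
        for module in modules:
            parts = module.split(".")
            owner = parts[0]
            if owner in APPS and owner != current_app and "models" in parts[1:]:
                violations.append(module)
    return violations


bad = "from billing.models import Invoice\n"
good = "from billing.services import issue_invoice\n"
print(find_boundary_violations(bad, current_app="users"))   # ['billing.models.Invoice']
print(find_boundary_violations(good, current_app="users"))  # []
```

In practice a dedicated tool with a declarative contract file is the better choice; the point here is only that the rule is mechanical enough to fail a build.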

When microservices actually earn their keep

  • Independent deploy cadence is a real requirement — e.g. a checkout team needs to ship daily while a payments-ledger team ships monthly with regulator review. Same monolith forces an artificial lockstep.
  • Polyglot or non-Python services exist — ML inference in Go, real-time websockets in Elixir, an existing C++ pricing engine. Network boundaries make sense.
  • Team scale > ~50 engineers where the merge-queue contention on a single repo becomes a daily blocker. Conway's Law: at some scale, the architecture maps to the org chart whether you want it to or not.
  • Compliance / blast-radius isolation — PCI-scoped code lives in its own service with its own audit trail and access controls.

The framing that matters

You do not "graduate" from monolith to microservices. They are different tools with different costs. Monzo runs microservices because they decided to; Instagram famously runs a single Django codebase serving billions of users because they decided not to. Both are correct for their context.

2. Multi-tier caching

Caching is not "add Redis." Caching is a five-tier strategy where every tier deletes work from the tier below it. Get this right and your Python process barely sees the traffic at all.

  • Tier 1 — Browser. Set Cache-Control aggressively on static assets, conservatively on HTML. Versioned filenames (style.HASH.css) let you cache CSS for a year safely.
  • Tier 2 — Edge / CDN. Cloudflare, Fastly, CloudFront. For anonymous traffic on content-led routes (marketing pages, blog, public APIs), the CDN absorbs 95%+ of requests before they hit your origin.
  • Tier 3 — Reverse proxy. Varnish or nginx in front of your app servers. Useful for handling spikes and SSL termination; less critical if you have a strong CDN tier.
  • Tier 4 — Application cache. Redis (or Memcached) for query results, computed views, session data, rate-limit counters. This is where most teams start.
  • Tier 5 — Database buffer cache. Postgres has no query-result cache, but it keeps hot table and index pages in shared buffers (and the OS page cache) automatically. The work is in tuning shared_buffers, effective_cache_size, and giving the box enough RAM.
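As a rough illustration of the Tier 1 and Tier 2 advice, the function below picks a Cache-Control value per path. It is a sketch under stated assumptions: the fingerprint pattern (eight or more hex characters before the extension), the /static/ prefix, and the specific max-age values are ours, not a recommendation for every site.

```python
import re

# Assumption: fingerprinted assets look like "name.<8+ hex chars>.ext".
HASHED_ASSET = re.compile(r"\.[0-9a-f]{8,}\.(css|js|png|jpg|svg|woff2)$")


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control header value for a response path (illustrative policy)."""
    if HASHED_ASSET.search(path):
        # The URL changes whenever the content does, so a year is safe.
        return "public, max-age=31536000, immutable"
    if path.startswith("/static/"):
        # Unversioned asset: the same URL may serve new content, so cache briefly.
        return "public, max-age=300"
    # HTML and everything else: the CDN may hold it for a minute, browsers revalidate.
    return "public, max-age=0, s-maxage=60, must-revalidate"


print(cache_control_for("/static/style.3f9a1c2b.css"))  # public, max-age=31536000, immutable
print(cache_control_for("/blog/scaling-python/"))       # public, max-age=0, s-maxage=60, must-revalidate
```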

The Redis patterns that matter

Three patterns cover 90% of application-cache use cases. Get these right before adding anything exotic.

cache patterns · django + redis python
from django.core.cache import cache
from django.utils import timezone

CACHE_VERSION = 7  # bump to invalidate every entry at once


# Pattern 1 — read-through cache.
# Note: cache.get_or_set is not atomic. Concurrent misses can each run the
# callable, so wrap expensive rebuilds in a lock (e.g. one taken with cache.add).
def homepage_payload():
    return cache.get_or_set(
        key=f'home:v{CACHE_VERSION}',
        default=_build_homepage_payload,   # callable — runs on miss only
        timeout=120,
    )


# Pattern 2 — explicit invalidation on write. TTL is a backstop, not a primary.
def on_article_save(article):
    cache.delete_many([
        f'home:v{CACHE_VERSION}',
        f'article:{article.slug}:v{CACHE_VERSION}',
        f'feed:{article.section_id}:v{CACHE_VERSION}',
    ])


# Pattern 3 — short-lived idempotency for write endpoints.
def claim_idempotency_key(key: str, ttl: int = 60) -> bool:
    # SETNX semantics — returns True if we acquired the lock,
    # False if a duplicate request is already being processed.
    return cache.add(f'idem:{key}', timezone.now().isoformat(), timeout=ttl)

Three rules to live by: name your cache keys with a version so a one-line bump invalidates everything; invalidate on write rather than relying on TTL expiry (TTL is a backstop, not your primary mechanism); and guard expensive cache misses against stampedes. Django's get_or_set does not lock on its own, so pair it with a short-lived lock (cache.add works well), otherwise a popular key expiring will fire n concurrent rebuilds and DDoS your own database.
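One way to get stampede protection is to take a short-lived lock with add before rebuilding. The sketch below is a minimal version of that idea, not a hardened implementation. FakeCache is a hypothetical in-memory stand-in that exposes only the get/set/add/delete calls used here and ignores timeouts, so the example runs without Redis. The helper name get_or_rebuild and its parameters are our own, and None is treated as "missing", so it cannot cache None values.

```python
import threading
import time


class FakeCache:
    """In-memory stand-in for the get/set/add/delete subset used below."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def get(self, key, default=None):
        with self._mutex:
            return self._data.get(key, default)

    def set(self, key, value, timeout=None):
        with self._mutex:
            self._data[key] = value

    def add(self, key, value, timeout=None):
        # Store only if absent; True means this caller won the race.
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        with self._mutex:
            self._data.pop(key, None)


def get_or_rebuild(cache, key, build, timeout=120, lock_ttl=10, wait=0.01, max_wait=5.0):
    """Read-through where only one caller at a time rebuilds a missing key."""
    lock_key = f"lock:{key}"
    deadline = time.monotonic() + max_wait
    while True:
        value = cache.get(key)
        if value is not None:
            return value
        if cache.add(lock_key, "1", timeout=lock_ttl):
            try:
                # Re-check: another caller may have filled the key just before
                # releasing the lock we now hold.
                value = cache.get(key)
                if value is None:
                    value = build()
                    cache.set(key, value, timeout=timeout)
                return value
            finally:
                cache.delete(lock_key)
        if time.monotonic() > deadline:
            return build()  # lock holder looks stuck: degrade rather than hang
        time.sleep(wait)


calls = []


def build_homepage():
    calls.append(1)
    time.sleep(0.05)  # pretend this is an expensive query
    return {"headline": "hello"}


cache = FakeCache()
threads = [
    threading.Thread(target=get_or_rebuild, args=(cache, "home:v7", build_homepage))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))  # 1
```

Against a real cache backend the lock timeout matters: it should comfortably exceed the rebuild time so the lock does not expire while the rebuild is still running.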

3. Asynchronous task queues with Celery

If a request takes more than 100ms and the work doesn't have to happen before you return a response, it belongs in a Celery task. The point isn't "fewer requests" — it's that the request handler stays predictable. Web workers are scarce; task workers are cheap to add and easy to autoscale.

What goes in a task

  • Email, SMS, push notifications (third-party APIs that can fail or rate-limit)
  • Webhook delivery with exponential backoff
  • PDF / report generation, statement runs, exports
  • Image processing, video encoding, file scanning
  • Search-index updates, cache warming, denormalisation
  • Periodic jobs (Celery Beat): nightly reconciliation, cleanup, billing
  • Long-running LLM chains via LangGraph state machines persisted to Postgres

Task design that survives production

A production task isn't a function with @app.task stuck on it. There's a small but mandatory checklist:

tasks/emails.py · production-grade celery task python
from celery import shared_task
from django.contrib.auth import get_user_model
import structlog

# ProviderRateLimited and _deliver_via_provider belong to this project's own
# email-provider wrapper and are assumed to be importable in this module.

log = structlog.get_logger()
User = get_user_model()


@shared_task(
    bind=True,
    queue='priority',                # 1. dedicated queue, not 'default'
    max_retries=5,
    retry_backoff=True,              # 2. exponential: 1s, 2s, 4s, 8s, 16s (upper bounds once jitter is on)
    retry_backoff_max=300,
    retry_jitter=True,               # 3. jitter — avoid thundering herd
    autoretry_for=(ConnectionError, TimeoutError),
    acks_late=True,                  # 4. ack after the task has run, not on receipt
    time_limit=120,                  # 5. hard limit — task gets killed at this
    soft_time_limit=90,              # 6. soft — raises SoftTimeLimitExceeded for graceful exit
)
def send_welcome_email(self, user_id: int):
    try:
        user = User.objects.get(pk=user_id)
        _deliver_via_provider(user)
        log.info('welcome_email.sent', user_id=user_id, attempt=self.request.retries)
    except ProviderRateLimited as exc:
        # Explicit retry with the provider's Retry-After header
        raise self.retry(exc=exc, countdown=exc.retry_after)

  • Dedicated queues by priority and SLO. Send-email goes in priority; nightly-cleanup goes in low. Worker pools scale independently. A backlog in one doesn't starve the other.
  • acks_late=True means the task is acknowledged after it has run rather than when the worker receives it, so if the worker dies mid-execution the broker can redeliver the message. Without it, work that was in flight during a crash is lost.
  • Soft + hard time limits turn "task is taking forever" into a visible failure with a clean stack trace.
  • Idempotent design. Network is unreliable; tasks may run twice. Use idempotency keys, database upserts, and conditional updates rather than blind increments.
  • Observability is not optional. Flower for queue visibility, structured logs with task IDs, Sentry for failures, and an SLO dashboard for backlog age (95th percentile time-from-enqueue-to-start). When the backlog grows, you find out before customers do.
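The "idempotent design" point is easiest to see in code. The sketch below uses sqlite3 purely so it runs anywhere; the table names, the apply_credit helper, and the event key format are hypothetical. The same shape works on Postgres with INSERT ... ON CONFLICT DO NOTHING.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database for the example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE processed_events (idempotency_key TEXT PRIMARY KEY);
        CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
        INSERT INTO accounts (id, balance) VALUES (1, 0);
    """)
    return conn


def apply_credit(conn, idempotency_key: str, account_id: int, amount: int) -> bool:
    """Credit an account at most once per key. Returns True if the credit was applied."""
    with conn:  # one transaction: the key and the credit commit or roll back together
        claimed = conn.execute(
            "INSERT OR IGNORE INTO processed_events (idempotency_key) VALUES (?)",
            (idempotency_key,),
        ).rowcount
        if claimed == 0:
            return False  # duplicate delivery: the work was already done
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, account_id),
        )
        return True


db = make_db()
print(apply_credit(db, "evt-123", account_id=1, amount=500))  # True
print(apply_credit(db, "evt-123", account_id=1, amount=500))  # False
print(db.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0])  # 500
```

Claiming the key and applying the effect in the same transaction is the design choice that matters: if the task crashes halfway, neither is recorded and the redelivered task can safely try again.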

The architectural pattern

Web workers handle the request; Celery handles the consequences. If the consequences themselves get complex (multi-step, human-in-the-loop, retry-with-state), graduate from raw tasks to a LangGraph state machine backed by the same Celery infrastructure.

4. Database tier scaling

For 95% of Python web platforms, the database is the single biggest scaling constraint. Get the tier right and you can defer "rewrite in <Go|Rust|Whatever>" by a decade.

Connection pooling with PgBouncer — non-negotiable

Postgres connections are expensive. Each one consumes 5–10MB of RAM, holds a backend process, and counts against max_connections (typically 100–500 on a managed instance). Django's CONN_MAX_AGE helps within a process, but with multiple workers and ASGI you'll exhaust connections fast.

PgBouncer in transaction mode multiplexes thousands of client connections onto a small pool of real Postgres connections. The numbers we typically see post-introduction: max_connections=200, app-side pool size 800, peak utilisation 40%. The application has effectively infinite connections; the database has the small number it can comfortably handle.
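On the Django side, pointing at PgBouncer is mostly a settings change. The fragment below is a sketch with placeholder values: the host name and database name are hypothetical, and setting CONN_MAX_AGE to 0 is one common choice rather than a requirement. The one setting that is not optional under transaction pooling is DISABLE_SERVER_SIDE_CURSORS, because a server-side cursor cannot survive being handed to a different backend connection between transactions.

```python
# settings.py fragment. Host and database names are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app",
        "HOST": "pgbouncer.internal",  # hypothetical PgBouncer address, not Postgres itself
        "PORT": "6432",                # PgBouncer's default listen port
        "CONN_MAX_AGE": 0,             # assumption: let PgBouncer own connection reuse
        # Required with transaction pooling: server-side cursors would break
        # when consecutive transactions land on different backend connections.
        "DISABLE_SERVER_SIDE_CURSORS": True,
    }
}
```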

Read replicas + Django database routers

Read traffic dominates most workloads. Routing it to one or more read replicas takes load off the primary, which is your single bottleneck for writes. Django ships with a database-router protocol that makes this clean:

db/routers.py · read/write split with replica lag awareness python
import random
from django.db import transaction

REPLICAS = ['replica_a', 'replica_b']


class ReadWriteRouter:
    """Reads go to a random replica; writes and read-your-writes go to primary."""

    def db_for_read(self, model, **hints):
        # If we're inside an explicit atomic() block, force primary —
        # protects against replica lag inside a single business transaction.
        if transaction.get_connection().in_atomic_block:
            return 'default'
        return random.choice(REPLICAS)

    def db_for_write(self, model, **hints):
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        return True          # same logical schema across all replicas

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        return db == 'default'  # migrations only against primary

The subtle bit is read-your-writes safety. If a request creates an object and immediately reads it back, hitting the replica risks seeing stale data because of replication lag (typically <100ms but spikes happen). Forcing reads to primary inside an atomic() block is the simplest correct policy. For trickier scenarios (for example, a write whose response is followed by an unrelated read 200ms later), pin the read to the primary explicitly with .using('default').
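A possible extension, sketched here under assumptions rather than taken from our production router, is to keep reads on the primary for the rest of a request once that request has written anything, even outside atomic(). StickyPrimaryRouter, reset_pin, and the flag name are hypothetical. It only covers reads later in the same request or task; a read in a follow-up request still needs explicit pinning. The flag must be reset at the start of every request (for example from a middleware), because worker threads are reused.

```python
import contextvars
import random

REPLICAS = ["replica_a", "replica_b"]

# True once the current request/task context has written to the primary.
_wrote = contextvars.ContextVar("wrote_to_primary", default=False)


def reset_pin() -> None:
    """Clear the flag. Call at the start of each request, e.g. from a middleware."""
    _wrote.set(False)


class StickyPrimaryRouter:
    """After the first write in a context, keep that context's reads on the primary."""

    def db_for_read(self, model, **hints):
        return "default" if _wrote.get() else random.choice(REPLICAS)

    def db_for_write(self, model, **hints):
        _wrote.set(True)
        return "default"


router = StickyPrimaryRouter()
reset_pin()
print(router.db_for_read(None) in REPLICAS)  # True
router.db_for_write(None)
print(router.db_for_read(None))              # default
reset_pin()
```

The trade-off is that write-heavy requests send all their later reads to the primary, which is usually acceptable because those are exactly the requests where stale reads hurt.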

The other database hygiene wins

  • Indexes — including composite and partial. Profile from a week of production slow-query logs, not from your mental model.
  • select_related / prefetch_related everywhere. An N+1 detector in CI prevents regressions. Our Django performance article goes deep on this.
  • Postgres-specific tools. BRIN indexes for time-series, partial indexes for active-rows-only, materialised views for slow aggregates, pg_stat_statements for ground truth on what's actually slow.
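  • Statement timeouts. Set statement_timeout (for example 5s) so a runaway query takes itself out before it takes down the database. Behind PgBouncer in transaction mode a session-level SET does not reliably stick, so set it per role or per database instead (ALTER ROLE <role> SET statement_timeout = '5s').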
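To make the N+1 point concrete without a Django project, the sketch below counts SQL statements for the same result fetched two ways. sqlite3 stands in for Postgres, and the schema and helper names are hypothetical. In Django the joined version is what select_related produces, and a test can pin the count with assertNumQueries.

```python
import sqlite3
from contextlib import contextmanager


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
        INSERT INTO author VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO article VALUES (1, 'A', 1), (2, 'B', 2), (3, 'C', 1);
    """)
    return conn


@contextmanager
def count_queries(conn):
    """Collect every SQL statement the connection executes inside the block."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        yield statements
    finally:
        conn.set_trace_callback(None)


def titles_n_plus_one(conn):
    # One query for the articles, then one more per article for its author.
    rows = conn.execute("SELECT title, author_id FROM article ORDER BY id").fetchall()
    return [
        (title, conn.execute("SELECT name FROM author WHERE id = ?", (aid,)).fetchone()[0])
        for title, aid in rows
    ]


def titles_joined(conn):
    # The same data in a single statement.
    return conn.execute(
        "SELECT article.title, author.name FROM article "
        "JOIN author ON author.id = article.author_id ORDER BY article.id"
    ).fetchall()


db = make_db()
with count_queries(db) as slow:
    titles_n_plus_one(db)
with count_queries(db) as fast:
    titles_joined(db)
print(len(slow), "statements vs", len(fast))
```

The gap grows linearly with the number of rows, which is why a query-count assertion in CI catches regressions that a small development dataset hides.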

5. Async I/O & stateless workers

The fifth pattern is the modern shift that lets the other four scale linearly: handle I/O-bound work asynchronously, and keep every worker process stateless. The first lets one Python process handle 10× more concurrent requests; the second lets you scale horizontally without coordination.

When async views actually help

The big misconception is "async = faster." Async doesn't make code faster — it makes it concurrent. For a route that does one cheap Postgres query and renders a template, sync is faster (no event-loop overhead). For a route that fans out to three external APIs (sanctions check, fraud score, address lookup), async wins handily because the worker isn't blocked on the network.

views/checkout.py · async fan-out done right python
import asyncio
from django.http import JsonResponse
import httpx


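async def prepare_checkout(request):
    # request.auser() (Django 5.0+) loads the user without a blocking DB call;
    # touching request.user inside an async view can raise SynchronousOnlyOperation.
    user = await request.auser()
    async with httpx.AsyncClient(timeout=2.0) as client:
        # Three independent third-party calls — run in parallel, not sequentially.
        # _sanctions_check, _fraud_score and _validate_address are project helpers.
        sanctions, fraud, address = await asyncio.gather(
            _sanctions_check(client, user),
            _fraud_score(client, request),
            _validate_address(client, request.POST.get('address_id')),
            return_exceptions=True,
        )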

    # A partial failure shouldn't block the whole checkout — degrade gracefully
    payload = {
        'sanctions_ok': not isinstance(sanctions, Exception) and sanctions.cleared,
        'fraud_score': getattr(fraud, 'score', None),
        'address_valid': not isinstance(address, Exception) and address.valid,
    }
    return JsonResponse(payload)

Written as three sequential blocking calls in a sync view, this route takes ~600ms (200ms × 3). With asyncio.gather it takes ~210ms: the slowest call plus a little overhead. Served by an ASGI worker (Uvicorn, Daphne, or Granian), the process is also free to handle other requests while it waits on the network. Same database, one architecture decision, roughly a third of the latency on this route.
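The arithmetic behind those figures can be reproduced with the standard library alone. In this sketch asyncio.sleep stands in for each third-party call, and the 200ms delay is the assumed latency from the example above. The function names are ours.

```python
import asyncio
import time

DELAY = 0.2  # assumed latency of each third-party call, in seconds
CALLS = ("sanctions", "fraud", "address")


async def fake_call(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for an HTTP round-trip
    return name


async def sequential(delay: float = DELAY) -> list[str]:
    # Each call waits for the previous one to finish.
    return [await fake_call(name, delay) for name in CALLS]


async def concurrent(delay: float = DELAY) -> list[str]:
    # All three calls are in flight at once; total time is about the slowest one.
    return list(await asyncio.gather(*(fake_call(name, delay) for name in CALLS)))


def timed(coro_fn, delay: float = DELAY):
    start = time.perf_counter()
    result = asyncio.run(coro_fn(delay))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for fn in (sequential, concurrent):
        _, elapsed = timed(fn)
        print(f"{fn.__name__}: {elapsed:.2f}s")
```

On a typical machine the sequential version should land near 0.6s and the concurrent one near 0.2s; exact figures vary with scheduler overhead.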

Stateless workers — the prerequisite for horizontal scaling

The other half of the pattern: any worker can serve any request. No local file state, no in-memory session data, no "sticky session" routing. The instant a worker holds state, your scale-out story breaks — autoscalers cannot churn workers without losing user sessions, deploys become risky, and a single worker death takes affected users with it.

  • Sessions in Redis or signed cookies. Never in-process memory; never local file storage.
  • Uploaded files in S3 (or equivalent) immediately, not on the worker's disk. Use presigned URLs to let clients upload direct.
  • Configuration in environment variables, not in a worker-local file. django-environ or Pydantic Settings.
  • Idempotency keys at the API boundary. Clients can retry; you de-duplicate. Critical for any payment, write, or "create" endpoint exposed to the public internet.
  • Health endpoints that mean it. /healthz/ validates database, cache, and any critical external dependency — not just "is the process running."

Done well, this means you can scale from 2 → 20 → 200 worker processes by changing one number in the autoscaler. No coordination, no warm-up handshake, no session migration drama.
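A health endpoint that "means it" reduces to running every dependency check and failing the probe if any of them fails. The sketch below shows that core logic without a web framework. The two check functions are hypothetical stubs, with the cache check hard-wired to fail so the degraded path is visible.

```python
from typing import Callable


def run_health_checks(checks: dict[str, Callable[[], None]]) -> tuple[int, dict[str, str]]:
    """Run each dependency check; any failure turns the whole probe into a 503."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # report the failure instead of crashing the probe
            results[name] = f"error: {type(exc).__name__}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


def check_database() -> None:
    """Hypothetical: e.g. run SELECT 1 against the primary."""


def check_cache() -> None:
    """Hypothetical: e.g. write and read back a short-lived key."""
    raise ConnectionError("cache unreachable")


print(run_health_checks({"database": check_database, "cache": check_cache}))
# (503, {'database': 'ok', 'cache': 'error: ConnectionError'})
```

Each real check should carry its own short timeout, otherwise a hung dependency makes the probe itself hang and the load balancer learns nothing.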

6. The combined reference stack

These patterns pay off most in combination. The version we land on in most production engagements looks roughly like this:

reference architecture · python @ high traffic infra
                     ┌─────────────────────────────────┐
                     │   CDN (Cloudflare / Fastly)     │
                     │   absorbs 95% of anon traffic   │
                     └────────────────┬────────────────┘
                                      │
                     ┌────────────────▼────────────────┐
                     │   Load Balancer (ALB / Caddy)   │
                     │   stateless, no sticky sessions │
                     └────────────────┬────────────────┘
                                      │
            ┌─────────────────────────┼─────────────────────────┐
            │                         │                         │
    ┌───────▼───────┐         ┌───────▼───────┐         ┌───────▼───────┐
    │  uvicorn x N  │         │  uvicorn x N  │         │  uvicorn x N  │
    │ (ASGI Django) │         │               │         │               │
    └───────┬───────┘         └───────┬───────┘         └───────┬───────┘
            │                         │                         │
            └────────────┬────────────┴─────────────────────┬───┘
                         │                                  │
                 ┌───────▼───────┐                 ┌────────▼────────┐
                 │   PgBouncer   │                 │      Redis      │
                 │  transaction  │                 │ cache + broker  │
                 │     mode      │                 └────────┬────────┘
                 └───────┬───────┘                          │
                         │                                  │
         ┌───────────────┼───────────────┐                  │
         │               │               │                  │
  ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐  ┌────────▼────────┐
  │  Postgres   │ │  Replica A  │ │  Replica B  │  │   Celery x M    │
  │  PRIMARY    │ │  read-only  │ │  read-only  │  │   (own pool)    │
  │  writes     │ │             │ │             │  │   send mail,    │
  │             │ │             │ │             │  │   webhooks,     │
  │             │ │             │ │             │  │   reports, etc. │
  └─────────────┘ └─────────────┘ └─────────────┘  └─────────────────┘

The interesting numbers when this stack is running well:

  • p50 web latency: 20–60ms (most served from CDN; origin handles dynamic only)
  • p95 web latency: 120–300ms
  • Celery queue age (p95): <5s on priority queue, <60s on default
  • Database connections in use: <50% of pool at peak
  • Cache hit rate (Redis): >85% on hot paths
  • Cost per million requests: <£3 at typical UK cloud prices
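Checking a service against targets like these is a small calculation once latency samples are exported. The sketch below uses a nearest-rank percentile and the upper bounds of the p50 and p95 ranges above as default thresholds. The function names and the sample data are illustrative.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def within_targets(latencies_ms: list[float], p50_max: float = 60, p95_max: float = 300) -> bool:
    """True when both the median and the 95th percentile meet their targets."""
    return percentile(latencies_ms, 50) <= p50_max and percentile(latencies_ms, 95) <= p95_max


healthy = [20] * 90 + [250] * 10    # slow tail, but inside the p95 target
degraded = [20] * 90 + [400] * 10   # same median, tail now too slow
print(within_targets(healthy))   # True
print(within_targets(degraded))  # False
```

The second sample set is the reason to watch p95 alongside p50: the median is identical in both, and only the tail reveals the problem.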

If you take one thing away

You do not need to rewrite anything to get most of this. Each pattern can be introduced independently — caching first, then PgBouncer, then a Celery queue, then async on the one route that fans out. By the time you've applied all five, the platform looks completely different, but no single PR ever felt risky.

7. Common questions

"Should we just rewrite in Go / Rust / Node?"

Almost never. Every "Python is slow" story we've audited turned out to be N+1 queries, missing indexes, synchronous third-party calls, or a missing cache. The language change is the most expensive possible fix for a non-language problem — typically 12–18 months of engineering and a parallel feature freeze. Apply the five patterns above first; revisit the question only if you've maxed them out, which is rare.

"How do we know which pattern to apply first?"

Profile, then prescribe. A week of production traces tells you immediately. If your p95 is dominated by database query time → start with patterns 2 (cache) and 4 (database tier). If it's dominated by third-party API calls in the request path → start with patterns 3 (Celery) and 5 (async). If you're paging on weekends because of plugin CVEs or deploy contention → it's an architecture problem (pattern 1) more than a performance one.

"Does any of this change with AI features?"

The same five patterns still apply; the implementation details shift. LLM calls are slow third-party APIs (use patterns 3 + 5: Celery for chains, async for parallel fan-out). Embedding lookups are database reads (use pattern 4: pgvector on a read replica is perfectly fine). RAG retrieval is cacheable (pattern 2: cache the hybrid-retrieval result per query+context hash). Our Applied AI practice documents the AI-specific variants.
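For the RAG case, the "query+context hash" can be as simple as a digest over a normalised payload. This is a sketch under assumptions: lower-casing and whitespace-collapsing the query is our choice and may be wrong for case-sensitive corpora, and the key prefix, the version parameter, and the shape of the context dict are hypothetical.

```python
import hashlib
import json


def retrieval_cache_key(query: str, context: dict, version: int = 1) -> str:
    """Build a stable cache key from a normalised query plus its retrieval context."""
    normalised = " ".join(query.lower().split())
    payload = json.dumps(
        {"q": normalised, "ctx": context},
        sort_keys=True,            # key order in the context must not change the hash
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"rag:v{version}:{digest}"


a = retrieval_cache_key("  Refund policy? ", {"tenant": 42, "lang": "en"})
b = retrieval_cache_key("refund   policy?", {"lang": "en", "tenant": 42})
c = retrieval_cache_key("refund policy?", {"lang": "de", "tenant": 42})
print(a == b)  # True
print(a == c)  # False
```

Including the tenant or permission scope in the context is the important part: without it, one user's cached retrieval result could be served to another.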

"How big does the team need to be?"

One or two senior engineers can introduce all five patterns inside a quarter, working alongside an existing in-house team. We've seen it done. The constraint is judgment — knowing which one to apply where — not headcount.

"Where do we go wrong most often?"

Three places. Cache invalidation by TTL alone (stale data drives engineers and users mad — invalidate on write). Celery tasks without idempotency (one duplicate billing event ruins everyone's day). Reading from replicas inside the same logical transaction as the write (race conditions that only show up in production). The patterns above bake in the defaults that avoid all three.

If your platform is running Python at scale and you'd like a sanity check, the 2-week Discovery Sprint is structured around exactly these questions — fixed price, written outputs, no obligation.