03 · AI practice

Applied AI, LLMOps & enterprise RAG for Django.

For Tech Leads and Product Directors who want production-ready AI inside their existing Django platform — not strung together with a third-party SaaS, a parallel Node service, and three vendor invoices. pgvector lives next to your ORM; LangGraph runs on Celery; cost and latency are tracked in your own admin.

Sound familiar?

The "AI feature" is bolted on as a third-party widget and your team has no view into what it costs or why it's slow
Embeddings live in Pinecone, content lives in Postgres, and the two are constantly out of sync
Your RAG search returns plausible-sounding nonsense whenever the user types a brand name or a specific code
An LLM agent has been "almost shippable" for four months because no one trusts it without a human checkpoint
Your monthly OpenAI bill has tripled and nobody can explain which feature drove it
Procurement is asking for a security review and you can't even tell them which prompts are in production

AI.01 / Architecture

Domain-integrated AI, not bolt-on SaaS.

The fastest way to a fragile AI feature is to put the embeddings somewhere your application server can't reach in a transaction. We build the AI layer straight into the Django ORM using pgvector: embeddings live next to the records they describe, are updated in the same transaction, and are backed up by the same nightly dump.

| Concern | External vector DB (Pinecone, Weaviate) | pgvector inside your Postgres |
| --- | --- | --- |
| Consistency | Two systems; sync via webhooks or nightly jobs; eventually-stale embeddings. | One transaction. Update the model, embed in the same atomic() block, or it didn't happen. |
| Operational surface | Second database to back up, monitor, secure, version and pay for. | Zero extra infrastructure. Embeddings are columns. Your existing Postgres tooling already monitors them. |
| Filtering | Limited metadata filters; cross-system joins are impossible. | Native SQL. Join, filter by tenant, restrict by ACL: anything Postgres can express. |
| Disaster recovery | Two restore procedures; embeddings drift from source on partial restore. | One pg_restore. Embeddings and source data come back in lockstep. |
| Cost at typical scale | £300–£3,000/mo from ~1M vectors. | £0 marginal until you outgrow your Postgres instance, usually past 50M rows. |
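The single-transaction claim can be illustrated with a minimal sketch. To keep it runnable anywhere, the standard-library `sqlite3` module stands in for Postgres with pgvector, a JSON text column stands in for a vector column, and `fake_embed` is a hypothetical placeholder for a real embedding call. In a Django project the same shape would sit inside a `transaction.atomic()` block.

```python
import json
import sqlite3


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding model call."""
    if not text.strip():
        raise ValueError("cannot embed empty text")
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def update_doc(conn: sqlite3.Connection, doc_id: int, body: str) -> None:
    # `with conn` commits if the block succeeds and rolls back on any exception,
    # so the body and its embedding change together or not at all.
    with conn:
        conn.execute("UPDATE doc SET body = ? WHERE id = ?", (body, doc_id))
        vector = fake_embed(body)
        conn.execute(
            "UPDATE doc SET embedding = ? WHERE id = ?",
            (json.dumps(vector), doc_id),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc (id INTEGER PRIMARY KEY, body TEXT, embedding TEXT)")
conn.execute(
    "INSERT INTO doc VALUES (1, 'old text', ?)",
    (json.dumps(fake_embed("old text")),),
)
conn.commit()

update_doc(conn, 1, "new text")  # body and embedding are committed together
```

If the embedding step raises, the body update is rolled back with it, so the stored text and its vector cannot drift apart.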

AI.02 / Retrieval

Hybrid search & Reciprocal Rank Fusion.

Pure-semantic search is great for paraphrases ("how do I cancel" → "subscription termination") and terrible for exact tokens (product SKUs, brand names, error codes). Pure-lexical is the opposite. The fix is hybrid retrieval — run both and merge the rankings with Reciprocal Rank Fusion, which is robust to wildly different score distributions because it operates on rank, not raw score.

RRF_Score(d) = Σ_{m ∈ M} 1 / (k + r_m(d))

Where d is a document, M is the set of rankers (here one lexical, one semantic), r_m(d) is the rank of d in ranker m (1 for the top result), and k is a smoothing constant; 60 is the canonical choice from Cormack et al. A document that a ranker does not return contributes nothing from that ranker. A higher RRF score means a higher final rank. The method is tunable and explainable, and needs no per-corpus score normalisation.
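As a worked example of the formula, here is a small standalone sketch that fuses two toy rankings with k = 60. The document IDs and the helper name `rrf_fuse` are made up for illustration.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranking in which a document appears."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# Toy rankings (hypothetical IDs): best match first.
lexical = ["sku-123", "faq-cancel", "blog-7"]
dense = ["faq-cancel", "guide-2", "sku-123"]

scores = rrf_fuse([lexical, dense])
fused = sorted(scores, key=scores.get, reverse=True)

# faq-cancel: 1/62 + 1/61 ≈ 0.03252   (ranks 2 and 1)
# sku-123:    1/61 + 1/63 ≈ 0.03227   (ranks 1 and 3)
# guide-2:    1/62        ≈ 0.01613   (dense only)
# blog-7:     1/63        ≈ 0.01587   (lexical only)
```

Documents that appear in both lists rise above documents that appear in only one, even when neither ranker placed them first in both.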

search/hybrid.py · django + pgvector + postgres fts + rrf

# Hybrid retrieval: lexical (Postgres full-text search) + dense (pgvector cosine),
# merged with Reciprocal Rank Fusion. k=60 is the canonical smoothing constant.
# Note: SearchRank wraps Postgres ts_rank, which is not BM25.
from django.contrib.postgres.search import SearchQuery, SearchRank
from pgvector.django import CosineDistance

from .models import Doc
from .embeddings import embed_query


def hybrid_search(query: str, k: int = 60, top_n: int = 8) -> list[Doc]:
    q_vec = embed_query(query)
    q_lex = SearchQuery(query, search_type='websearch')

    # 1. lexical leg: IDs ranked by Postgres FTS (ts_rank)
    lex_ids = (Doc.objects
        .annotate(rank=SearchRank('search_vector', q_lex))
        .filter(search_vector=q_lex)
        .order_by('-rank')
        .values_list('id', flat=True)[:50])

    # 2. dense leg: IDs ranked by cosine distance against pgvector
    dense_ids = (Doc.objects
        .annotate(distance=CosineDistance('embedding', q_vec))
        .order_by('distance')
        .values_list('id', flat=True)[:50])

    # 3. Reciprocal Rank Fusion: each leg adds 1 / (k + rank) per document.
    #    Only IDs are fetched above, so candidate rows and vectors are not loaded.
    scores: dict[int, float] = {}
    for ranked_ids in (lex_ids, dense_ids):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)

    top_ids = sorted(scores, key=scores.get, reverse=True)[:top_n]
    # Preserve fused ordering when materialising
    docs = {d.id: d for d in Doc.objects.filter(id__in=top_ids)}
    return [docs[i] for i in top_ids if i in docs]

In practice we also add a re-ranker (BGE-reranker, Cohere, or a fine-tuned cross-encoder) on the top 20 fused results when the corpus is large or the latency budget allows. The RRF stage is the cheap, robust glue; the re-ranker is the optional precision boost.

AI.03 / Orchestration

Resilient agent workflows: LangGraph + Celery.

Long-running LLM chains that hold an HTTP request open are how you ship 502s. We model multi-step agents as LangGraph state machines, persist execution state to Postgres, and run them inside Celery workers. The graph can pause for a human checkpoint, wait on a webhook, retry a failed tool call with backoff — and resume from the exact node it stopped at. Hours later. Across a deploy.

G.01

State persists, not memory

Graph state is serialised to a Django JSONField after every node transition. If the worker crashes mid-run, a replacement worker picks the run back up — at the node it died on, with the inputs it died with.

crash-safe
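A minimal sketch of the checkpoint-and-resume idea, under stated assumptions: this is plain Python, not the LangGraph API; a dict of JSON strings stands in for the Django JSONField; and the three node functions and the `crash_before` switch are hypothetical, there only to simulate a worker dying mid-run.

```python
import json
from typing import Callable, Optional

State = dict
Node = Callable[[State], State]


def draft(state: State) -> State:
    return {**state, "draft": f"Reply to: {state['ticket']}"}


def review(state: State) -> State:
    return {**state, "approved": True}


def send(state: State) -> State:
    return {**state, "sent": True}


NODES: list[tuple[str, Node]] = [("draft", draft), ("review", review), ("send", send)]


def run(store: dict[str, str], run_id: str, initial: State,
        crash_before: Optional[str] = None) -> State:
    """Run nodes in order, checkpointing after each; resume from the last checkpoint."""
    saved = json.loads(store[run_id]) if run_id in store else {"next": 0, "state": initial}
    state = saved["state"]
    for index in range(saved["next"], len(NODES)):
        name, node = NODES[index]
        if name == crash_before:
            raise RuntimeError(f"worker died before node {name!r}")
        state = node(state)
        # Checkpoint after every transition (a JSONField write in the real system).
        store[run_id] = json.dumps({"next": index + 1, "state": state})
    return state


store: dict[str, str] = {}
try:
    run(store, "run-1", {"ticket": "refund request"}, crash_before="review")
except RuntimeError:
    pass  # simulated crash; the checkpoint written after "draft" survives

# A replacement worker resumes at "review"; the initial state argument is ignored.
final = run(store, "run-1", {})
```

The second call does not re-run `draft`: it starts at the node the first worker died on, with the state that node would have received.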

G.02

Human-in-the-loop nodes

A node can suspend the graph and notify a reviewer. When they approve or edit the proposed action, the same Celery task chain resumes with the human's input merged into the state.

async approval

G.03

Tool-call timeouts & retry policies

Each tool call has an explicit timeout, max retries and exponential backoff. Failures get classified (transient vs permanent) and routed accordingly — never a silent retry loop.

resilient
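One way to sketch the retry policy in plain Python. The exception classes, the function name `call_with_retry` and the default values are assumptions for illustration; the per-call timeout is assumed to be enforced by the tool's own client and is not shown.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying: rate limits, timeouts, dropped connections."""


class PermanentError(Exception):
    """Not worth retrying: bad input, revoked credentials."""


def call_with_retry(tool: Callable[[], T], max_retries: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient failures with exponential backoff; fail fast on permanent ones."""
    for attempt in range(max_retries + 1):
        try:
            return tool()
        except PermanentError:
            raise  # route to the failure path immediately
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error, never loop silently
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")


# Usage: a hypothetical tool that fails twice, then succeeds.
calls = {"n": 0}


def flaky_tool() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "ok"


delays: list[float] = []
result = call_with_retry(flaky_tool, sleep=delays.append)  # record delays instead of sleeping
```

Passing `sleep` as a parameter keeps the backoff schedule testable without waiting in real time.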

G.04

Streaming responses

The LLM token stream is fanned out through Redis Pub/Sub to the user's open SSE connection. If they reconnect, they get the tail from a persistent log, so no output is lost.

SSE · resumable
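The resume-on-reconnect behaviour can be sketched as an append-only log with numbered events. This is an illustration only: an in-memory list stands in for the persistent log, the class name `TokenLog` is hypothetical, and the numeric ID is assumed to play the role of the SSE `Last-Event-ID` value a reconnecting client sends.

```python
class TokenLog:
    """Append-only log of streamed tokens; event IDs start at 1."""

    def __init__(self) -> None:
        self._events: list[str] = []

    def append(self, token: str) -> int:
        """Store a token and return its event ID."""
        self._events.append(token)
        return len(self._events)

    def replay_after(self, last_event_id: int) -> list[tuple[int, str]]:
        """Return every (event_id, token) the client has not seen yet."""
        tail = self._events[last_event_id:]
        return list(enumerate(tail, start=last_event_id + 1))


log = TokenLog()
for token in ["Hel", "lo", " wor", "ld"]:
    log.append(token)

# The client dropped after receiving event 2 and reconnects.
missed = log.replay_after(2)
```

Replaying from the log before re-subscribing to the live stream is what lets a reconnecting client pick up where it left off.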

G.05

Replayable execution traces

Every node call is logged with inputs, outputs and elapsed time. Triage incidents by replaying the exact graph that misbehaved, in a staging environment, against the same model and prompt version.

replay

AI.04 / Observability

LLM observability & token-budget dashboards.

If you can't tell which feature, which prompt version and which user cohort drove yesterday's bill, you don't have a production AI system — you have a research project with a credit card attached. Every LLM call we ship is wrapped, versioned, and accounted for, with the data flowing straight into your Django admin.

Per-call telemetry

Model, prompt version, input tokens, output tokens, latency, cost-in-£, tool calls and final outcome — logged to a Django model on every call.
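A rough sketch of what one telemetry record and its cost calculation might look like. The field names mirror the list above; the model name and the per-token prices are hypothetical placeholders, and a plain dataclass stands in for the Django model.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical prices in GBP per 1,000 tokens: (input, output).
# Real prices depend on the provider and model.
PRICES_GBP_PER_1K = {"example-model": (Decimal("0.002"), Decimal("0.006"))}


@dataclass(frozen=True)
class LLMCall:
    feature: str
    model: str
    prompt_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: int

    @property
    def cost_gbp(self) -> Decimal:
        price_in, price_out = PRICES_GBP_PER_1K[self.model]
        return (self.input_tokens * price_in + self.output_tokens * price_out) / 1000


call = LLMCall(
    feature="support-search",
    model="example-model",
    prompt_version="v3",
    input_tokens=1200,
    output_tokens=300,
    latency_ms=840,
)
# 1200 * 0.002 / 1000 + 300 * 0.006 / 1000 = 0.0024 + 0.0018 = £0.0042
```

`Decimal` is used rather than `float` so that per-call costs sum exactly when aggregated into a daily figure.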

Prompt versioning DB

Every prompt template is content-addressed and versioned. A/B switching is a single Django admin change. Rollbacks are instant and auditable.
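Content addressing can be sketched with a hash of the template text. Everything here is an assumption for illustration: the `PromptRegistry` class, the 12-character truncated SHA-256 ID, and the in-memory dicts standing in for database tables.

```python
import hashlib


def prompt_version(template: str) -> str:
    """Content address: identical template text always yields the same ID."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


class PromptRegistry:
    def __init__(self) -> None:
        self._versions: dict[str, dict[str, str]] = {}  # name -> {version: template}
        self._active: dict[str, str] = {}               # name -> active version

    def publish(self, name: str, template: str) -> str:
        """Store a template under its content address and make it active."""
        version = prompt_version(template)
        self._versions.setdefault(name, {})[version] = template
        self._active[name] = version
        return version

    def activate(self, name: str, version: str) -> None:
        """Switch the active version (A/B switch or rollback)."""
        if version not in self._versions.get(name, {}):
            raise KeyError(f"unknown version {version!r} for prompt {name!r}")
        self._active[name] = version

    def render(self, name: str, **variables: str) -> tuple[str, str]:
        """Return (version, rendered prompt) so every call can log its version."""
        version = self._active[name]
        return version, self._versions[name][version].format(**variables)


registry = PromptRegistry()
v1 = registry.publish("summarise", "Summarise: {text}")
v2 = registry.publish("summarise", "Summarise in one sentence: {text}")
registry.activate("summarise", v1)  # instant rollback to the earlier version
version, prompt = registry.render("summarise", text="hello")
```

Because the version is derived from the text itself, two environments holding the same ID are guaranteed to be running the same prompt.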

Budget caps & alerts

Per-feature daily and monthly token budgets. Soft alert at 70%, hard stop at 100%. Alerts route to Slack, PagerDuty or your incident channel of choice.
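The threshold logic might look roughly like the sketch below. The function and enum names are hypothetical; the 70% and 100% thresholds come from the description above.

```python
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    SOFT_ALERT = "soft_alert"
    HARD_STOP = "hard_stop"


def check_budget(tokens_used: int, token_budget: int,
                 soft_percent: int = 70) -> BudgetStatus:
    """Classify usage against a budget using integer arithmetic only."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    if tokens_used >= token_budget:
        return BudgetStatus.HARD_STOP
    if tokens_used * 100 >= soft_percent * token_budget:
        return BudgetStatus.SOFT_ALERT
    return BudgetStatus.OK


# Usage with a hypothetical daily budget of one million tokens.
DAILY_BUDGET = 1_000_000
status_morning = check_budget(250_000, DAILY_BUDGET)
status_afternoon = check_budget(700_000, DAILY_BUDGET)
status_evening = check_budget(1_000_000, DAILY_BUDGET)
```

A caller would check the status before each LLM call, send an alert on `SOFT_ALERT`, and refuse the call on `HARD_STOP`.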

Sentry / OTel integration

LLM latency, error rate and token spend are first-class metrics in Sentry and OpenTelemetry. Treat them the way you'd treat any other production service.

Offline eval suite

Ground-truth dataset, judge-LLM or rules-based scoring, pass-rate tracked per prompt version. Eval runs are gated in CI — regressions block merges.
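A minimal sketch of a CI gate with rules-based scoring. The dataset, the canned answers and every function name are invented for illustration; in a real suite `fake_generate` would call the model with the prompt version under test.

```python
from typing import Callable


def pass_rate(cases: list[tuple[str, str]],
              generate: Callable[[str], str],
              score: Callable[[str, str], bool]) -> float:
    """Fraction of (question, expected) cases whose generated answer passes."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(score(generate(question), expected) for question, expected in cases)
    return passed / len(cases)


def gate(rate: float, threshold: float) -> int:
    """Exit code for CI: 0 lets the merge through, 1 blocks it."""
    return 0 if rate >= threshold else 1


# Hypothetical ground-truth set and a canned stand-in for the model.
CASES = [
    ("capital of France?", "Paris"),
    ("2 + 2?", "4"),
    ("colour of the sky?", "blue"),
    ("largest planet?", "Jupiter"),
]
CANNED = {
    "capital of France?": "Paris is the capital.",
    "2 + 2?": "The answer is 4.",
    "colour of the sky?": "It is blue.",
    "largest planet?": "Saturn",
}


def fake_generate(question: str) -> str:
    return CANNED[question]


def contains_expected(answer: str, expected: str) -> bool:
    return expected.lower() in answer.lower()


rate = pass_rate(CASES, fake_generate, contains_expected)  # 3 of 4 cases pass
exit_code = gate(rate, threshold=0.9)  # below threshold, so the merge is blocked
```

A CI job would end with `sys.exit(exit_code)`, which is what turns a drop in pass-rate into a failed check.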

PII redaction & logging policy

Configurable redaction before any prompt or response is logged. Retention policy enforced at the database level. GDPR-compatible from day one.
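Redaction before logging could be sketched as below. The two patterns are deliberately simple and hypothetical: they catch common email addresses and one UK-style phone format, and a production policy would need a much broader, reviewed rule set.

```python
import re

# Simplified illustrative patterns, not an exhaustive PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
UK_PHONE = re.compile(r"(?:\+44\s?|0)\d{4}\s?\d{6}")


def redact(text: str) -> str:
    """Replace matched emails and phone numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return UK_PHONE.sub("[PHONE]", text)


def log_prompt(sink: list[str], prompt: str) -> None:
    """Redact first, then write; the raw prompt never reaches the log."""
    sink.append(redact(prompt))


audit_log: list[str] = []
log_prompt(audit_log, "Contact jane.doe@example.com or 07700 900123 about the refund")
```

Putting the redaction inside the logging function, rather than at each call site, makes it hard to log an unredacted prompt by accident.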

AI.05 / Fit

Is this the right engagement for you?

Yes, if… (we're a fit)

You have an existing Django (or DRF) platform you want to extend with AI
Your content/data already lives in Postgres or is migrating there
You want explainable, auditable AI features — not a black box your team can't debug
Regulated industry: legal, healthcare, fintech, public sector, education
You'd rather own the stack than rent it forever

Probably not, if… (consider others)

The fastest path is dropping a vendor widget in a marketing site
You have no Django footprint and won't build one
You want a one-week prototype with no production handover
You expect frontier-model research, not applied product engineering

AI.06 / FAQ

Common questions.

Why pgvector instead of a dedicated vector database?

Three reasons: one source of truth (embeddings live in the same database as the records they describe), no synchronisation lag (an update to a Django model updates its embedding in the same transaction), and a much smaller operational surface (one database to back up, monitor and secure instead of two). pgvector handles tens of millions of vectors comfortably on a single Postgres instance, and most production RAG systems never need anything more.

What is Reciprocal Rank Fusion and why does it matter for hybrid search?

RRF is a standard algorithm for merging ranked result lists (here lexical and semantic) without needing to normalise their scores. It works because rank-based fusion is robust to wildly different score distributions: BM25 and cosine similarity don't have comparable units, but their ranks always do. The smoothing constant k (canonically 60) dampens the advantage of the very top positions, so a strong result in either list survives the merge.

How do you handle long-running LLM workflows that need to wait for human review?

LangGraph state machines, persisted to Postgres, executed inside Celery workers. The graph can pause at any node — for human approval, external webhook, or scheduled retry — and resume from exactly where it left off. We never hold an HTTP request open for an LLM chain; the graph is the unit of execution, not the request. This is also how we survive deploys mid-conversation.

How do you track LLM costs and prevent runaway token spend?

Every LLM call goes through a thin wrapper that logs model, prompt version, token usage and latency to a Django model. A dashboard in the admin shows daily spend by feature, by user cohort, and by prompt version. Soft thresholds fire alerts, and hard budget caps can stop a feature before its spend runs away. The same telemetry flows to Sentry and OpenTelemetry, so performance regressions show up alongside cost.

Can you run open-source models on our own infrastructure?

Yes: Llama, Mistral, Qwen, and their fine-tuned derivatives. We deploy them in your VPC or on-prem with vLLM, handle quantisation and batching, and produce the security paperwork (data flow diagrams, model cards, retention policy) that procurement and legal need. Suitable for regulated and public-sector clients who can't send data to a third-party LLM provider.

What about evals — how do you know the AI is actually working?

An offline eval suite with a ground-truth dataset, scored either by rules or an LLM judge depending on the task. Pass-rate is tracked per prompt version. Eval runs are gated in CI: a prompt change that drops eval pass-rate below threshold blocks the merge. We ship the eval suite to your repo on day one — you own it, not us.

Ready to ship AI you can actually own?
Let's scope it.

AI discovery sprint

2 weeks · fixed-price · £8k

RAG build

6–10 weeks · from £35k

Agent & LLMOps squad

Rolling monthly · £18–28k

Start a conversation ↗ Book a 15-min call