03 · AI practice
Applied AI, LLMOps & enterprise RAG for Django.
For Tech Leads and Product Directors who want production-ready AI inside their existing Django platform — not strung together with a third-party SaaS, a parallel Node service, and three vendor invoices. pgvector lives next to your ORM; LangGraph runs on Celery; cost and latency are tracked in your own admin.
Sound familiar?
- The "AI feature" is bolted on as a third-party widget and your team has no view into what it costs or why it's slow
- Embeddings live in Pinecone, content lives in Postgres, and the two are constantly out of sync
- Your RAG search returns plausible-sounding nonsense whenever the user types a brand name or a specific code
- An LLM agent has been "almost shippable" for four months because no one trusts it without a human checkpoint
- The monthly OpenAI bill has tripled and nobody can explain which feature drove it
- Procurement is asking for a security review and you can't even tell them which prompts are in production
Domain-integrated AI, not bolt-on SaaS.
The fastest way to a fragile AI feature is to put the embeddings somewhere your application server can't reach in a transaction. We build the AI layer straight into the Django ORM using pgvector — embeddings live next to the records they describe, get updated in the same transaction, and back up with the same nightly dump.
- The embedding update runs in the same atomic() block as the record it describes, or it didn't happen.
- One pg_restore. Embeddings and source data come back in lockstep.
Hybrid search & Reciprocal Rank Fusion.
Pure-semantic search is great for paraphrases ("how do I cancel" → "subscription termination") and terrible for exact tokens (product SKUs, brand names, error codes). Pure-lexical is the opposite. The fix is hybrid retrieval — run both and merge the rankings with Reciprocal Rank Fusion, which is robust to wildly different score distributions because it operates on rank, not raw score.
The fused score of a document d is

$$\mathrm{RRF}(d) = \sum_{m \in M} \frac{1}{k + r_m(d)}$$

where M is the set of rankers (one lexical, one semantic), r_m(d) is the rank of d in ranker m, and k is a smoothing constant; 60 is the canonical choice from Cormack et al. Higher RRF score = higher final rank. Tunable, explainable, no per-corpus normalisation needed.
```python
# Hybrid retrieval: lexical (PG full-text / BM25) + dense (pgvector cosine)
# merged with Reciprocal Rank Fusion. k=60 is the canonical smoothing.
from django.contrib.postgres.search import SearchQuery, SearchRank
from pgvector.django import CosineDistance

from .models import Doc
from .embeddings import embed_query


def hybrid_search(query: str, k: int = 60, top_n: int = 8) -> list[Doc]:
    q_vec = embed_query(query)
    q_lex = SearchQuery(query, search_type='websearch')

    # 1. lexical leg: ranked by Postgres FTS
    lex = (Doc.objects
           .annotate(rank=SearchRank('search_vector', q_lex))
           .filter(search_vector=q_lex)
           .order_by('-rank')[:50])

    # 2. dense leg: ranked by cosine distance against pgvector
    dense = (Doc.objects
             .annotate(distance=CosineDistance('embedding', q_vec))
             .order_by('distance')[:50])

    # 3. Reciprocal Rank Fusion
    scores: dict[int, float] = {}
    for rank, doc in enumerate(lex, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank)
    for rank, doc in enumerate(dense, start=1):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank)

    top_ids = sorted(scores, key=scores.get, reverse=True)[:top_n]

    # Preserve fused ordering when materialising
    docs = {d.id: d for d in Doc.objects.filter(id__in=top_ids)}
    return [docs[i] for i in top_ids if i in docs]
```
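The fusion step can also be isolated as a pure function, which makes it easy to unit-test without a database. The sketch below is illustrative only: `rrf_fuse` and the document ids are hypothetical names, and the two input lists stand in for the lexical and dense legs.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


lexical = [101, 102, 103]  # made-up ids, best lexical match first
dense = [103, 101, 104]    # made-up ids, best semantic match first

fused = rrf_fuse([lexical, dense])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

With these inputs, document 101 (ranks 1 and 2) scores 1/61 + 1/62 and finishes ahead of 103 (ranks 3 and 1), while documents that appear in only one list fall to the bottom.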
In practice we also add a re-ranker (BGE-reranker, Cohere, or a fine-tuned cross-encoder) on the top 20 fused results when the corpus is large or the recall budget allows. The RRF stage is the cheap, robust glue; the re-ranker is the optional precision boost.
Resilient agent workflows: LangGraph + Celery.
Long-running LLM chains that hold an HTTP request open are how you ship 502s. We model multi-step agents as LangGraph state machines, persist execution state to Postgres, and run them inside Celery workers. The graph can pause for a human checkpoint, wait on a webhook, retry a failed tool call with backoff — and resume from the exact node it stopped at. Hours later. Across a deploy.
State persists, not memory
Graph state is serialised to a Django JSONField after every node transition. If the worker crashes mid-run, a replacement worker picks the run back up — at the node it died on, with the inputs it died with.
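The checkpoint-and-resume idea can be sketched in plain Python. This is not the LangGraph API: `run_graph`, the `CHECKPOINTS` dict (standing in for a JSONField column) and the two nodes are all hypothetical, and the crash is simulated with an exception.

```python
import json
from typing import Callable

State = dict
Node = Callable[[State], State]

# Stand-in for a JSONField column keyed by run id (hypothetical storage)
CHECKPOINTS: dict[str, str] = {}


def run_graph(run_id: str, nodes: list[Node], initial: State) -> State:
    """Run nodes in order, writing a checkpoint after every transition."""
    saved = json.loads(CHECKPOINTS[run_id]) if run_id in CHECKPOINTS else None
    index = saved["next_index"] if saved else 0
    state = saved["state"] if saved else initial
    while index < len(nodes):
        state = nodes[index](state)  # if this raises, the last checkpoint stands
        index += 1
        CHECKPOINTS[run_id] = json.dumps({"next_index": index, "state": state})
    return state


calls = {"fetch": 0, "summarise": 0}


def fetch(state: State) -> State:
    calls["fetch"] += 1
    return {**state, "docs": ["a", "b"]}


def summarise(state: State) -> State:
    calls["summarise"] += 1
    if calls["summarise"] == 1:
        raise RuntimeError("simulated worker crash")
    return {**state, "summary": f"{len(state['docs'])} docs"}


graph = [fetch, summarise]
try:
    run_graph("run-1", graph, {"query": "refund policy"})
except RuntimeError:
    pass  # the first worker died inside summarise

# A replacement worker resumes at summarise; fetch is not executed again
final = run_graph("run-1", graph, {"query": "refund policy"})
```

Because the checkpoint is written only after a node returns, a crash mid-node leaves the previous checkpoint intact, and the retry sees the same inputs the failed attempt saw.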
Human-in-the-loop nodes
A node can suspend the graph and notify a reviewer. When they approve or edit the proposed action, the same Celery task chain resumes with the human's input merged into the state.
Tool-call timeouts & retry policies
Each tool call has an explicit timeout, max retries and exponential backoff. Failures get classified (transient vs permanent) and routed accordingly — never a silent retry loop.
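A minimal sketch of such a retry policy, under stated assumptions: the exception classes and `call_with_retry` are hypothetical names, the tool is expected to enforce its own timeout and raise `TimeoutError`, and `sleep` is injectable so the backoff schedule can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientToolError(Exception):
    """Worth retrying, e.g. an upstream 503 (hypothetical classification)."""


class PermanentToolError(Exception):
    """Retrying cannot help, e.g. a validation error (hypothetical)."""


def call_with_retry(
    tool: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    attempt = 0
    while True:
        try:
            return tool()
        except PermanentToolError:
            raise  # surface immediately, never retried
        except (TransientToolError, TimeoutError):
            if attempt >= max_retries:
                raise  # retry budget exhausted, let the caller route the failure
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
            attempt += 1


delays: list[float] = []
attempts = {"n": 0}


def flaky_tool() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientToolError("upstream 503")
    return "ok"


result = call_with_retry(flaky_tool, sleep=delays.append)
```

Here the tool fails twice and succeeds on the third attempt, so the recorded backoff delays are 0.5 and 1.0 seconds.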
Streaming responses
The LLM token stream is fanned out through Redis Pub/Sub to the user's open SSE connection. If they reconnect, they get the tail from a persistent log, so no output is lost.
Replayable execution traces
Every node call is logged with inputs, outputs and elapsed time. Triage incidents by replaying the exact graph that misbehaved, in a staging environment, against the same model and prompt version.
LLM observability & token-budget dashboards.
If you can't tell which feature, which prompt version and which user cohort drove yesterday's bill, you don't have a production AI system — you have a research project with a credit card attached. Every LLM call we ship is wrapped, versioned, and accounted for, with the data flowing straight into your Django admin.
Per-call telemetry
Model, prompt version, input tokens, output tokens, latency, cost-in-£, tool calls and final outcome — logged to a Django model on every call.
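As a rough illustration of what one telemetry row and its cost calculation might look like, here is a plain-Python sketch. The record shape, the `example-model` name and the prices are assumptions made up for the example; in a Django project the dataclass would be a model.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical prices in GBP per million tokens: (input, output)
PRICES_GBP_PER_MTOK = {"example-model": (Decimal("2.00"), Decimal("8.00"))}


@dataclass(frozen=True)
class LLMCallRecord:
    feature: str
    model: str
    prompt_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: int

    @property
    def cost_gbp(self) -> Decimal:
        in_price, out_price = PRICES_GBP_PER_MTOK[self.model]
        total = self.input_tokens * in_price + self.output_tokens * out_price
        return total / Decimal(1_000_000)


def spend_by_feature(records: list[LLMCallRecord]) -> dict[str, Decimal]:
    """Aggregate cost per feature, the figure a spend dashboard would chart."""
    totals: dict[str, Decimal] = {}
    for record in records:
        totals[record.feature] = totals.get(record.feature, Decimal(0)) + record.cost_gbp
    return totals


records = [
    LLMCallRecord("search", "example-model", "v1", 1500, 500, 820),
    LLMCallRecord("search", "example-model", "v1", 1000, 0, 300),
    LLMCallRecord("chat", "example-model", "v2", 0, 1000, 1200),
]
totals = spend_by_feature(records)
```

Using `Decimal` rather than floats keeps the per-call cost exact when thousands of small amounts are summed.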
Prompt versioning DB
Every prompt template is content-addressed and versioned. A/B switching is a single Django admin change. Rollbacks are instant and auditable.
Budget caps & alerts
Per-feature daily and monthly token budgets. Soft alert at 70%, hard stop at 100%. Alerts route to Slack, PagerDuty or your incident channel of choice.
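The threshold logic itself is small. A sketch, assuming token counts are tracked per feature elsewhere; `BudgetStatus` and `check_budget` are hypothetical names, and integer arithmetic is used so the 70% boundary is exact.

```python
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    SOFT_ALERT = "soft_alert"
    HARD_STOP = "hard_stop"


def check_budget(tokens_used: int, token_budget: int, soft_percent: int = 70) -> BudgetStatus:
    """Classify usage against a budget: alert at soft_percent, stop at 100%."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    if tokens_used >= token_budget:
        return BudgetStatus.HARD_STOP
    if tokens_used * 100 >= soft_percent * token_budget:
        return BudgetStatus.SOFT_ALERT
    return BudgetStatus.OK


daily_budget = 1_000_000
for used in (699_999, 700_000, 1_000_000):
    print(used, check_budget(used, daily_budget).value)
```

A caller would run this check before each LLM call and refuse the call on `HARD_STOP`, with the alert routing handled separately.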
Sentry / OTel integration
LLM latency, error rate and token spend are first-class metrics in Sentry and OpenTelemetry. Treat them the way you'd treat any other production service.
Offline eval suite
Ground-truth dataset, judge-LLM or rules-based scoring, pass-rate tracked per prompt version. Eval runs gate CI: regressions block merges.
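A rules-based version of such a gate might look like the sketch below. Everything in it is illustrative: the cases, the substring rule, the canned `fake_model` (standing in for a real model call) and the threshold values are assumptions.

```python
from typing import Callable

Case = tuple[str, str]  # (question, substring the answer must contain)


def pass_rate(cases: list[Case], generate: Callable[[str], str]) -> float:
    """Fraction of cases whose generated answer contains the expected text."""
    passed = sum(
        1 for question, expected in cases
        if expected.lower() in generate(question).lower()
    )
    return passed / len(cases)


def gate(rate: float, threshold: float) -> bool:
    """True when the merge may proceed."""
    return rate >= threshold


CASES: list[Case] = [
    ("How do I cancel?", "subscription"),
    ("What is SKU AB-123?", "AB-123"),
    ("Refund window?", "30 days"),
    ("Support hours?", "9am"),
]

CANNED = {
    "How do I cancel?": "Go to Settings and end your subscription.",
    "What is SKU AB-123?": "AB-123 is the starter kit.",
    "Refund window?": "Refunds are accepted within 30 days.",
    "Support hours?": "We are available on weekdays.",
}


def fake_model(question: str) -> str:
    return CANNED.get(question, "")


rate = pass_rate(CASES, fake_model)
may_merge = gate(rate, threshold=0.9)
```

In CI, a script along these lines would exit non-zero when `gate` returns False, which is what blocks the merge.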
PII redaction & logging policy
Configurable redaction before any prompt or response is logged. Retention policy enforced at the database level. GDPR-compatible from day one.
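Redaction before logging can be as simple as a pass of pattern substitutions. The sketch below is deliberately minimal and not exhaustive: the two patterns and the placeholder tokens are assumptions, and a production policy would typically cover more identifier types.

```python
import re

# Hypothetical patterns; a real deployment would make these configurable
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like digit runs before the text is logged."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


prompt = "Contact jane.doe@example.com or +44 20 7946 0958 about order 1234"
safe = redact(prompt)
print(safe)
```

Short numbers such as the order id are left alone because the phone pattern requires at least ten characters of digits, spaces or hyphens.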
Is this the right engagement for you?
Yes, if… (we're a fit)
- You have an existing Django (or DRF) platform you want to extend with AI
- Your content/data already lives in Postgres or is migrating there
- You want explainable, auditable AI features — not a black box your team can't debug
- Regulated industry: legal, healthcare, fintech, public sector, education
- You'd rather own the stack than rent it forever
Probably not, if… (consider others)
- The fastest path is dropping a vendor widget in a marketing site
- You have no Django footprint and won't build one
- You want a one-week prototype with no production handover
- You expect frontier-model research, not applied product engineering
Common questions.
Why pgvector instead of a dedicated vector database?
Three reasons: one source of truth (embeddings live next to the records they describe), no synchronisation lag (an update to a Django model updates its embedding in the same transaction), and a dramatically lower operational surface (one database to back up, monitor and secure instead of two). pgvector handles tens of millions of vectors comfortably on a single Postgres instance; most production RAG systems never need anything more.
What is Reciprocal Rank Fusion and why does it matter for hybrid search?
RRF is a widely used algorithm for merging ranked result lists (here, one lexical and one semantic) without needing to normalise their scores. It works because rank-based fusion is robust to wildly different score distributions: BM25 and cosine similarity don't have comparable units, but their ranks always do. The smoothing constant k (canonically 60) dampens the advantage of the very top positions, so a document ranked well in both lists can outrank one that tops only a single list.
How do you handle long-running LLM workflows that need to wait for human review?
LangGraph state machines, persisted to Postgres, executed inside Celery workers. The graph can pause at any node — for human approval, external webhook, or scheduled retry — and resume from exactly where it left off. We never hold an HTTP request open for an LLM chain; the graph is the unit of execution, not the request. This is also how we survive deploys mid-conversation.
How do you track LLM costs and prevent runaway token spend?
Every LLM call goes through a thin wrapper that logs model, prompt version, token usage and latency to a Django model. A dashboard in the admin shows daily spend by feature, by user cohort, and by prompt version. Soft thresholds fire alerts, and hard budget caps can stop a feature before its spend runs away. The same telemetry flows to Sentry and OpenTelemetry to catch performance regressions.
Can you run open-source models on our own infrastructure?
Yes: Llama, Mistral, Qwen and their fine-tuned derivatives. We deploy them in your VPC or on-prem with vLLM, handle quantisation and batching, and produce the security paperwork (data flow diagrams, model cards, retention policy) that procurement and legal need. Suitable for regulated and public-sector clients who can't send data to a third-party LLM provider.
What about evals — how do you know the AI is actually working?
An offline eval suite with a ground-truth dataset, scored either by rules or an LLM judge depending on the task. Pass-rate is tracked per prompt version. Eval runs gate merges in CI: a prompt change that drops the eval pass-rate below threshold blocks the merge. We ship the eval suite to your repo on day one, so you own it, not us.