03 · AI practice

Applied AI, LLMOps & enterprise RAG for Django.

For Tech Leads and Product Directors who want production-ready AI inside their existing Django platform — not strung together with a third-party SaaS, a parallel Node service, and three vendor invoices. pgvector lives next to your ORM; LangGraph runs on Celery; cost and latency are tracked in your own admin.

Sound familiar?

  • The "AI feature" is bolted on as a third-party widget and your team has no view into what it costs or why it's slow
  • Embeddings live in Pinecone, content lives in Postgres, and the two are constantly out of sync
  • Your RAG search returns plausible-sounding nonsense whenever the user types a brand name or a specific code
  • An LLM agent has been "almost shippable" for four months because no one trusts it without a human checkpoint
  • The monthly OpenAI bill has tripled and nobody can explain which feature drove it
  • Procurement is asking for a security review and you can't even tell them which prompts are in production

Domain-integrated AI, not bolt-on SaaS.

The fastest way to a fragile AI feature is to put the embeddings somewhere your application server can't reach in a transaction. We build the AI layer straight into the Django ORM using pgvector: embeddings live next to the records they describe, are updated in the same transaction, and are backed up by the same nightly dump.

External vector DB (Pinecone, Weaviate) vs pgvector inside your Postgres, by concern:

  • Consistency
      External vector DB: Two systems; sync via webhooks or nightly jobs; eventually-stale embeddings.
      pgvector: One transaction. Update the model, embed in the same atomic() block, or it didn't happen.

  • Operational surface
      External vector DB: Second database to back up, monitor, secure, version and pay for.
      pgvector: Zero extra infrastructure. Embeddings are columns. Your existing Postgres tooling already monitors them.

  • Filtering
      External vector DB: Limited metadata filters; cross-system joins are impossible.
      pgvector: Native SQL. Join, filter by tenant, restrict by ACL, or anything else Postgres can express.

  • Disaster recovery
      External vector DB: Two restore procedures; embeddings drift from source on partial restore.
      pgvector: One pg_restore. Embeddings and source data come back in lockstep.

  • Cost at typical scale
      External vector DB: £300–£3,000/mo from ~1M vectors.
      pgvector: £0 marginal until you outgrow your Postgres instance, usually past 50M rows.
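
To make the "one transaction" point concrete, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for Postgres with pgvector. In a Django project the same shape would sit inside a transaction.atomic() block. The table layout and the names fake_embed and save_article are hypothetical, chosen only for illustration.

```python
import json
import sqlite3


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding call."""
    if not text.strip():
        raise ValueError("cannot embed empty text")
    return [float(len(text)), float(text.count(" "))]


def save_article(conn: sqlite3.Connection, article_id: int, body: str) -> None:
    """Write the body and its embedding together, or write neither."""
    with conn:  # commits on success, rolls back if anything inside raises
        conn.execute("UPDATE article SET body = ? WHERE id = ?", (body, article_id))
        vector = fake_embed(body)  # if this raises, the UPDATE above is rolled back
        conn.execute(
            "UPDATE article SET embedding = ? WHERE id = ?",
            (json.dumps(vector), article_id),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, body TEXT, embedding TEXT)")
conn.execute("INSERT INTO article VALUES (1, 'old body', '[8.0, 1.0]')")
conn.commit()

save_article(conn, 1, "a new body")
try:
    save_article(conn, 1, "   ")  # embedding fails, so the body change is undone
except ValueError:
    pass

row = conn.execute("SELECT body, embedding FROM article WHERE id = 1").fetchone()
```

After the failed second call, the row still holds the body and embedding from the first call, which is the property the comparison above relies on.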

Resilient agent workflows: LangGraph + Celery.

Long-running LLM chains that hold an HTTP request open are how you ship 502s. We model multi-step agents as LangGraph state machines, persist execution state to Postgres, and run them inside Celery workers. The graph can pause for a human checkpoint, wait on a webhook, retry a failed tool call with backoff — and resume from the exact node it stopped at. Hours later. Across a deploy.

G.01

State persists, not memory

Graph state is serialised to a Django JSONField after every node transition. If the worker crashes mid-run, a replacement worker picks the run back up — at the node it died on, with the inputs it died with.

crash-safe
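
The checkpoint-and-resume idea can be sketched in plain Python. This is not LangGraph's own checkpointer API; the CHECKPOINTS dict is a hypothetical stand-in for a Django model with a JSONField, and the node names are invented for illustration.

```python
import json

# Hypothetical in-memory stand-in for a table with a JSONField per run.
CHECKPOINTS: dict[str, str] = {}
CRASH = {"armed": True}  # lets the sketch simulate one worker crash


def fetch(state):
    return {**state, "docs": ["doc-1", "doc-2"]}


def summarise(state):
    if CRASH["armed"]:
        CRASH["armed"] = False
        raise RuntimeError("simulated worker crash")
    return {**state, "summary": f"{len(state['docs'])} docs"}


NODES = [("fetch", fetch), ("summarise", summarise)]


def run(run_id, initial=None):
    """Start a run, or resume it from its last saved checkpoint."""
    if run_id in CHECKPOINTS:
        saved = json.loads(CHECKPOINTS[run_id])
    else:
        saved = {"next": 0, "state": initial or {}}
        CHECKPOINTS[run_id] = json.dumps(saved)
    index, state = saved["next"], saved["state"]
    while index < len(NODES):
        _name, node = NODES[index]
        state = node(state)
        index += 1
        # Persist after every node transition, before moving on.
        CHECKPOINTS[run_id] = json.dumps({"next": index, "state": state})
    return state


try:
    run("run-42", {"query": "refund policy"})
except RuntimeError:
    pass  # the first worker died inside "summarise"

resumed_at = json.loads(CHECKPOINTS["run-42"])["next"]
final = run("run-42")  # a replacement worker picks the run back up
```

The replacement call re-enters at node index 1 with the state that "fetch" produced, so the work already done is not repeated.
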
G.02

Human-in-the-loop nodes

A node can suspend the graph and notify a reviewer. When they approve or edit the proposed action, the same Celery task chain resumes with the human's input merged into the state.

async approval
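
A suspend-and-resume gate might look like the sketch below. It is plain Python rather than LangGraph's interrupt mechanism, and Run, review_gate and the refund fields are hypothetical names. The convention that a node returns None to request suspension is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Run:
    state: dict
    status: str = "running"  # "running", "awaiting_review" or "done"
    step: int = 0


def draft_refund(state):
    return {**state, "proposed_refund": 120}


def review_gate(state):
    # No human decision yet: returning None asks the runner to suspend here.
    return state if "approved_refund" in state else None


def issue_refund(state):
    return {**state, "issued": state["approved_refund"]}


STEPS = [draft_refund, review_gate, issue_refund]


def advance(run: Run) -> Run:
    """Run steps until the graph finishes or a step asks to suspend."""
    while run.step < len(STEPS):
        result = STEPS[run.step](run.state)
        if result is None:
            run.status = "awaiting_review"  # a real system would notify the reviewer
            return run
        run.state, run.step = result, run.step + 1
    run.status = "done"
    return run


def resume_with_review(run: Run, human_input: dict) -> Run:
    """Merge the reviewer's input into the state and continue from the gate."""
    run.state = {**run.state, **human_input}
    run.status = "running"
    return advance(run)


run = advance(Run(state={"order": "A-1001"}))
status_while_waiting = run.status
run = resume_with_review(run, {"approved_refund": 100})
```

Here the reviewer edits the proposed 120 down to 100, and the final node acts on the human's figure rather than the model's.
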
G.03

Tool-call timeouts & retry policies

Each tool call has an explicit timeout, max retries and exponential backoff. Failures get classified (transient vs permanent) and routed accordingly — never a silent retry loop.

resilient
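
One way to express such a retry policy, as a sketch: the tool is assumed to accept and honour a timeout argument (as an HTTP client would), and the split between transient and permanent failures shown here is an illustrative assumption, not a fixed taxonomy. PermanentToolError and call_with_retry are hypothetical names.

```python
import time


class PermanentToolError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


# Failure classes treated as transient in this sketch.
TRANSIENT = (TimeoutError, ConnectionError)


def call_with_retry(tool, *, timeout_s=5.0, max_retries=3, base_delay_s=0.5, sleep=time.sleep):
    """Call tool(timeout_s), retrying transient errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return tool(timeout_s)
        except TRANSIENT:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the caller
            sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
        # Any other exception, including PermanentToolError, propagates at once.


attempts = []


def flaky_search(timeout_s):
    attempts.append(timeout_s)
    if len(attempts) < 3:
        raise TimeoutError("upstream too slow")
    return "3 results"


def bad_input(timeout_s):
    raise PermanentToolError("unknown SKU")


delays = []
result = call_with_retry(flaky_search, sleep=delays.append)

permanent_delays = []
try:
    call_with_retry(bad_input, sleep=permanent_delays.append)
except PermanentToolError:
    pass  # failed fast, with no retries
```

Passing sleep as a parameter keeps the policy testable without real waiting.
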
G.04

Streaming responses

The LLM token stream is fanned out through Redis Pub/Sub to the user's open SSE connection. If they reconnect, they get the tail from a persistent log, so no output is lost.

SSE · resumable
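
The resumable part comes down to an append-only log plus a client-side cursor. In the sketch below, TokenLog is a hypothetical in-memory stand-in for the persistent log; with SSE, the cursor would typically travel in the Last-Event-ID header that browsers send when they reconnect.

```python
class TokenLog:
    """Hypothetical append-only token log with offset-based reads."""

    def __init__(self):
        self._tokens = []

    def append(self, token: str) -> None:
        self._tokens.append(token)

    def read_from(self, offset: int):
        """Return the tokens after `offset` and the new cursor position."""
        return self._tokens[offset:], len(self._tokens)


log = TokenLog()
for token in ["The", " refund", " was"]:
    log.append(token)

# The client is connected and has received everything so far.
received, cursor = log.read_from(0)

# The client drops off; the model keeps streaming into the log.
for token in [" issued", "."]:
    log.append(token)

# On reconnect the client presents its cursor and gets only the tail.
missed, cursor = log.read_from(cursor)
full_text = "".join(received + missed)
```
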
G.05

Replayable execution traces

Every node call is logged with inputs, outputs and elapsed time. Triage incidents by replaying the exact graph that misbehaved, in a staging environment, against the same model and prompt version.

replay

LLM observability & token-budget dashboards.

If you can't tell which feature, which prompt version and which user cohort drove yesterday's bill, you don't have a production AI system — you have a research project with a credit card attached. Every LLM call we ship is wrapped, versioned, and accounted for, with the data flowing straight into your Django admin.

01
Per-call telemetry

Model, prompt version, input tokens, output tokens, latency, cost-in-£, tool calls and final outcome — logged to a Django model on every call.
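
A thin wrapper of this kind could be sketched as follows. The prices, the model name and the client signature are all hypothetical; in a Django project LLMCallRecord would be a model and RECORDS a table.

```python
import time
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical prices in GBP per 1,000 tokens: (input, output).
PRICES = {"example-model": (Decimal("0.002"), Decimal("0.006"))}


@dataclass
class LLMCallRecord:
    feature: str
    model: str
    prompt_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_gbp: Decimal


RECORDS: list[LLMCallRecord] = []


def tracked_call(feature, model, prompt_version, prompt, client):
    """Call client(model, prompt) and record telemetry for that call."""
    started = time.perf_counter()
    text, tokens_in, tokens_out = client(model, prompt)
    latency_ms = (time.perf_counter() - started) * 1000
    price_in, price_out = PRICES[model]
    cost = (price_in * tokens_in + price_out * tokens_out) / 1000
    RECORDS.append(
        LLMCallRecord(feature, model, prompt_version, tokens_in, tokens_out, latency_ms, cost)
    )
    return text


def fake_client(model, prompt):
    """Hypothetical client returning (text, input tokens, output tokens)."""
    return "ok", 1500, 500


answer = tracked_call("search", "example-model", "v3", "hello", fake_client)
```

Using Decimal rather than float for cost keeps the daily totals exact when thousands of small amounts are summed.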

02
Prompt versioning DB

Every prompt template is content-addressed and versioned. A/B switching is a single Django admin change. Rollbacks are instant and auditable.
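
Content addressing here means the version id is derived from the template text itself. A minimal sketch, assuming a SHA-256 digest truncated to 12 hex characters (the truncation length and the PromptRegistry class are illustrative choices):

```python
import hashlib


class PromptRegistry:
    """Hypothetical registry: templates keyed by a hash of their content."""

    def __init__(self):
        self._templates = {}  # version id -> template text
        self._active = {}     # prompt name -> active version id
        self._history = {}    # prompt name -> activation history

    def register(self, template: str) -> str:
        version = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        self._templates[version] = template
        return version

    def activate(self, name: str, version: str) -> None:
        if version not in self._templates:
            raise KeyError(f"unknown prompt version {version}")
        self._history.setdefault(name, []).append(version)
        self._active[name] = version

    def rollback(self, name: str) -> str:
        history = self._history[name]
        if len(history) < 2:
            raise ValueError("nothing to roll back to")
        history.pop()
        self._active[name] = history[-1]
        return history[-1]

    def render(self, name: str, **values) -> str:
        return self._templates[self._active[name]].format(**values)


registry = PromptRegistry()
v1 = registry.register("Summarise: {text}")
v2 = registry.register("Summarise in one sentence: {text}")
registry.activate("summary", v1)
registry.activate("summary", v2)
rolled_back_to = registry.rollback("summary")
rendered = registry.render("summary", text="hi")
```

Because the id is a function of the text, registering an identical template twice yields the same version, and any edit yields a new one.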

03
Budget caps & alerts

Per-feature daily and monthly token budgets. Soft alert at 70%, hard stop at 100%. Alerts route to Slack, PagerDuty or your incident channel of choice.
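
The 70% and 100% thresholds can be sketched as a small budget object. TokenBudget is a hypothetical name, and refusing a call that would push usage past the limit is one possible reading of "hard stop"; alert routing is left out.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    limit: int
    used: int = 0
    soft_percent: int = 70

    def charge(self, tokens: int) -> str:
        """Return 'ok', 'soft_alert' or 'hard_stop' for a requested spend."""
        if self.used + tokens > self.limit:
            return "hard_stop"  # refuse the call; usage is left unchanged
        self.used += tokens
        # Integer arithmetic avoids float rounding at the threshold.
        if self.used * 100 >= self.limit * self.soft_percent:
            return "soft_alert"
        return "ok"


budget = TokenBudget(limit=1000)
statuses = [budget.charge(600), budget.charge(150), budget.charge(300)]
```

The third request would take usage to 1,050 tokens, so it is refused and the recorded usage stays at 750.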

04
Sentry / OTel integration

LLM latency, error rate and token spend are first-class metrics in Sentry and OpenTelemetry. Treat them the way you'd treat any other production service.

05
Offline eval suite

Ground-truth dataset, judge-LLM or rules-based scoring, pass-rate tracked per prompt version. Eval runs gate CI: regressions block merges.
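
A rules-based version of such a gate, as a minimal sketch: the dataset, the canned predictions and the 0.9 threshold are hypothetical, and predict stands in for a real call against a specific prompt version.

```python
def pass_rate(cases, predict, score) -> float:
    """Fraction of ground-truth cases the prediction function gets right."""
    passed = sum(1 for case in cases if score(predict(case["input"]), case["expected"]))
    return passed / len(cases)


def ci_exit_code(rate: float, threshold: float) -> int:
    """0 lets the merge through; 1 fails the CI job."""
    return 0 if rate >= threshold else 1


def exact_match(answer: str, expected: str) -> bool:
    return answer.strip().lower() == expected.strip().lower()


CASES = [
    {"input": "capital of France", "expected": "Paris"},
    {"input": "2 + 2", "expected": "4"},
    {"input": "largest planet", "expected": "Jupiter"},
    {"input": "H2O", "expected": "Water"},
]

# Hypothetical canned outputs standing in for a model under test.
CANNED = {
    "capital of France": "Paris",
    "2 + 2": "4",
    "largest planet": "Saturn",
    "H2O": "water",
}

rate = pass_rate(CASES, CANNED.get, exact_match)
exit_code = ci_exit_code(rate, threshold=0.9)
```

In CI the exit code would be handed to sys.exit, so a prompt change that drops the pass-rate below the threshold fails the job.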

06
PII redaction & logging policy

Configurable redaction before any prompt or response is logged. Retention policy enforced at the database level. GDPR-compatible from day one.
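
Redaction before logging can be as simple as a list of patterns applied in order. The two regular expressions below are deliberately rough illustrations, not an exhaustive PII detector, and log_exchange is a hypothetical helper.

```python
import re

# Illustrative patterns only; a real policy would be broader and configurable.
PATTERNS = [
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("[PHONE]", re.compile(r"\+?\d[\d\s-]{8,}\d")),
]


def redact(text: str) -> str:
    """Replace each matched pattern with its placeholder."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def log_exchange(log: list, prompt: str, response: str) -> None:
    """Store only the redacted forms of the prompt and the response."""
    log.append({"prompt": redact(prompt), "response": redact(response)})


audit_log = []
log_exchange(
    audit_log,
    "Email jane.doe@example.com about order 1234.",
    "I will call +44 20 7946 0958 today.",
)
```

The short order number survives because the phone pattern requires at least ten characters of digits, spaces or hyphens.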

Is this the right engagement for you?

Yes, if… (we're a fit)

  • You have an existing Django (or DRF) platform you want to extend with AI
  • Your content/data already lives in Postgres or is migrating there
  • You want explainable, auditable AI features — not a black box your team can't debug
  • Regulated industry: legal, healthcare, fintech, public sector, education
  • You'd rather own the stack than rent it forever

Probably not, if… (consider others)

  • The fastest path is dropping a vendor widget in a marketing site
  • You have no Django footprint and won't build one
  • You want a one-week prototype with no production handover
  • You expect frontier-model research, not applied product engineering

Common questions.

Why pgvector instead of a dedicated vector database?

Three reasons. One source of truth: embeddings live in the same database as the records they describe. No synchronisation lag: updating a Django model updates its embedding in the same transaction. And a dramatically lower operational surface: one database to back up, monitor and secure instead of two. pgvector handles tens of millions of vectors comfortably on a single Postgres instance, and most production RAG systems never need anything more.

What is Reciprocal Rank Fusion and why does it matter for hybrid search?

RRF is a widely used algorithm for merging two ranked result lists (lexical and semantic) without needing to normalise their scores. Each document scores the sum of 1 / (k + rank) over the lists it appears in, with ranks starting at 1. It works because rank-based fusion is robust to wildly different score distributions: BM25 and cosine similarity don't have comparable units, but their ranks always do. The smoothing constant k (canonically 60) dampens the advantage of the very top positions, so a result that ranks well in either list survives the merge.
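
As a worked sketch of the fusion step, using 1-based ranks and k = 60. The document ids are invented for illustration, and ties are broken alphabetically only to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of ids; each list adds 1 / (k + rank) per id."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical = ["sku-991", "faq-12", "blog-3"]    # e.g. a keyword (BM25-style) ranking
semantic = ["faq-12", "blog-3", "guide-7"]   # e.g. a vector-similarity ranking
fused = reciprocal_rank_fusion([lexical, semantic])
```

Documents that appear in both lists rise to the top, while "sku-991", an exact lexical hit the semantic list missed entirely, still outranks a result that only scraped into third place semantically.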

How do you handle long-running LLM workflows that need to wait for human review?

LangGraph state machines, persisted to Postgres, executed inside Celery workers. The graph can pause at any node — for human approval, external webhook, or scheduled retry — and resume from exactly where it left off. We never hold an HTTP request open for an LLM chain; the graph is the unit of execution, not the request. This is also how we survive deploys mid-conversation.

How do you track LLM costs and prevent runaway token spend?

Every LLM call goes through a thin wrapper that logs model, prompt version, token usage and latency to a Django model. A dashboard in the admin shows daily spend by feature, by user cohort, and by prompt version. Soft thresholds fire alerts, and hard budget caps can stop a feature before it overruns its budget. The same telemetry flows to Sentry and OpenTelemetry, where it helps catch performance regressions.

Can you run open-source models on our own infrastructure?

Yes: Llama, Mistral, Qwen, and their fine-tuned derivatives. We deploy them in your VPC or on-prem with vLLM, handle quantisation and batching, and produce the security paperwork (data flow diagrams, model cards, retention policy) that procurement and legal need. Suitable for regulated and public-sector clients who can't send data to a third-party LLM provider.

What about evals — how do you know the AI is actually working?

An offline eval suite with a ground-truth dataset, scored either by rules or an LLM judge depending on the task. Pass-rate is tracked per prompt version. Eval runs gate CI: a prompt change that drops the eval pass-rate below threshold blocks the merge. We ship the eval suite to your repo on day one, so you own it, not us.

Ready to ship AI you can actually own?
Let's scope it.

  • AI discovery sprint: 2 weeks · fixed-price · £8k
  • RAG build: 6–10 weeks · from £35k
  • Agent & LLMOps squad: Rolling monthly · £18–28k