Running LLMs in production
LLMOps — cost, latency, evals, and RAG pipelines at scale
Ananya had built a beautiful AI assistant for her company's customer support. Powered by GPT-4, it answered questions accurately. The product team loved it. They put it in front of 10,000 users.
The first monthly API bill arrived. ₹12 lakh. For one feature. Her CEO did not love it.
"We need to make this 10x cheaper without making it 10x worse," he said.
Welcome to LLMOps — the engineering of running Large Language Models in production without going broke.
The LLMOps problems that don't exist in regular MLOps
Regular ML models: usually small, fast, cheap to run, and deterministic at inference time.
LLMs: massive, slow, expensive to run, and probabilistic, so the same prompt can produce different outputs.
This creates unique challenges:
1. Cost: GPT-4 is billed per token, for both the prompt you send and the response you get back. At scale, this adds up fast.
2. Latency: LLMs can take 2-10 seconds to respond. Users hate waiting.
3. Evaluation: How do you know if the output is "good"? There's no simple accuracy score.
4. Hallucinations: LLMs confidently make things up. In customer support, this is catastrophic.
5. Prompt injection: Users craft inputs that try to override your system prompt. User text can no longer be treated as plain data, so the security model is different.
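To make the cost point concrete, here is a back-of-the-envelope estimate. It is only a sketch: the request volume, token counts, and per-1,000-token prices are placeholders chosen for illustration, not real rates, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(
    requests_per_month: int,
    input_tokens: int,
    output_tokens: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> float:
    """Estimate monthly spend for one model, in the currency of the prices."""
    per_request = (
        input_tokens / 1000 * price_in_per_1k
        + output_tokens / 1000 * price_out_per_1k
    )
    return requests_per_month * per_request


# Hypothetical workload and placeholder prices; check your provider's price list.
cost = monthly_cost(
    requests_per_month=300_000,
    input_tokens=1_500,   # system prompt + retrieved context + question
    output_tokens=500,
    price_in_per_1k=0.03,
    price_out_per_1k=0.06,
)
print(f"Estimated monthly cost: {cost:,.2f}")
```

Because the prompt is billed too, long system prompts and large retrieved contexts raise the cost of every single request, not just the long answers.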
Solving cost: the model routing strategy
Not every query needs GPT-4. Most questions are simple — "What are your business hours?" — and can be answered by a smaller, cheaper model.
```python
def route_query(query: str) -> str:
    """Route to the right model based on complexity."""
    # is_simple_query, is_medium_query and the call_* helpers are
    # application-specific and not shown here.
    # Simple queries → small, cheap model
    if is_simple_query(query):
        return call_gpt35_turbo(query)  # 10x cheaper
    # Medium complexity → medium model
    elif is_medium_query(query):
        return call_claude_haiku(query)  # 3x cheaper than GPT-4
    # Complex reasoning → expensive model
    else:
        return call_gpt4(query)  # Full power when needed
```
Ananya implemented this. 70% of queries went to cheaper models. Monthly cost: ₹3.2 lakh. Same quality for users.
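The routing function leaves open how a query gets classified. One possible starting point is a set of cheap heuristics, sketched below. The patterns, word limits, and the `pick_tier` helper are all assumptions made for illustration; the article does not say how Ananya's classifier worked, and a real router would likely be tuned on logged queries or replaced by a small classifier model.

```python
import re

# Hypothetical heuristics; tune these on your own logged queries.
SIMPLE_PATTERNS = [
    r"\b(business|opening|working) hours\b",
    r"\b(reset|forgot).*password\b",
    r"\bwhere is my order\b",
]
REASONING_HINTS = ("why", "compare", "explain", "difference", "troubleshoot")


def is_simple_query(query: str) -> bool:
    """Short queries that match a known FAQ pattern."""
    q = query.lower()
    return len(q.split()) <= 12 and any(re.search(p, q) for p in SIMPLE_PATTERNS)


def is_medium_query(query: str) -> bool:
    """Moderately short queries with no sign of multi-step reasoning."""
    words = re.findall(r"[a-z']+", query.lower())
    return len(words) <= 40 and not any(h in words for h in REASONING_HINTS)


def pick_tier(query: str) -> str:
    """Return the model tier a query would be routed to."""
    if is_simple_query(query):
        return "small"
    if is_medium_query(query):
        return "medium"
    return "large"


print(pick_tier("What are your business hours?"))
print(pick_tier("Can I change the delivery address on my order?"))
print(pick_tier("Explain why my invoice differs from the quote."))
```

Heuristics like these fail in both directions, so it is worth sampling routed queries regularly and checking whether the cheaper tiers actually answered them well.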
Solving hallucinations: RAG
LLMs make up answers when they don't know something. For customer support, you need answers grounded in your actual documentation.
RAG (Retrieval Augmented Generation):
1. Index all your support docs, FAQs, policies into a vector database
2. When a user asks a question, retrieve the 3-5 most relevant chunks
3. Include those chunks in the prompt: "Based only on this context: [chunks], answer: [question]"
4. The LLM answers from the context, not from memory
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Index your documents once
documents = SimpleDirectoryReader("support_docs/").load_data()
index = VectorStoreIndex.from_documents(documents)


# At query time
def answer_with_rag(question: str) -> str:
    # Retrieve the 5 most relevant chunks, then answer from them
    query_engine = index.as_query_engine(similarity_top_k=5)
    response = query_engine.query(question)
    return str(response)
```
Hallucination rate dropped from 23% to 4%. Users trusted the answers.
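To see what the four RAG steps listed earlier do without any framework, here is a toy version in plain Python. It is a sketch under loud assumptions: the three chunks are made up, and a bag-of-words count stands in for a real embedding model and vector database. `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical names.

```python
import math
import re
from collections import Counter

# Stand-in corpus; in practice these are chunks of your support docs.
CHUNKS = [
    "Our support team is available Monday to Friday, 9 am to 6 pm IST.",
    "Refunds are processed within 7 working days of approval.",
    "You can reset your password from the login page using 'Forgot password'.",
]


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, k: int = 2) -> list[str]:
    """Step 2: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(CHUNKS, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, k: int = 2) -> str:
    """Step 3: put the retrieved chunks into the prompt."""
    context = "\n".join(f"- {c}" for c in retrieve(question, k))
    return (
        "Based only on this context:\n"
        f"{context}\n"
        "If the context does not contain the answer, say you don't know.\n"
        f"Question: {question}"
    )


print(build_prompt("How long do refunds take?"))
```

The extra instruction to admit when the context has no answer is an addition of this sketch; it gives the model an alternative to guessing when retrieval comes back empty-handed.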
Evaluation: how do you measure "good"?
For traditional ML: compare prediction vs known answer. Accuracy = 87%.
For LLMs: the output is text. How do you score "Is this a helpful response?"
A practical approach is to score the RAG pipeline with RAGAS metrics:
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# Evaluate a RAG pipeline.
# test_questions is the evaluation dataset (questions, generated answers,
# retrieved contexts, reference answers), prepared beforehand.
results = evaluate(
    dataset=test_questions,
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(results)
# faithfulness: 0.91      (answer is supported by the retrieved context)
# answer_relevancy: 0.87  (answer addresses the question)
# context_recall: 0.79    (retrieved context covers the reference answer)
```
Ananya set up a weekly eval pipeline: 200 test questions with known good answers, run against the production system, track scores over time. Any drop triggers a review.
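The "any drop triggers a review" rule can be as simple as comparing this week's scores with a stored baseline. The sketch below assumes a fixed tolerance of 0.03 and reuses the example scores as the baseline; both choices, and the `find_regressions` helper, are illustrative rather than taken from Ananya's setup.

```python
# Hypothetical baseline and tolerance; metric names follow the RAGAS example.
BASELINE = {"faithfulness": 0.91, "answer_relevancy": 0.87, "context_recall": 0.79}
MAX_DROP = 0.03  # flag a metric if it falls more than this below baseline


def find_regressions(
    current: dict[str, float],
    baseline: dict[str, float] = BASELINE,
    max_drop: float = MAX_DROP,
) -> dict[str, float]:
    """Return {metric: drop} for every metric that fell more than max_drop."""
    regressions = {}
    for name, base in baseline.items():
        # A metric missing from the current run counts as a score of 0.
        drop = base - current.get(name, 0.0)
        if drop > max_drop:
            regressions[name] = round(drop, 4)
    return regressions


this_week = {"faithfulness": 0.90, "answer_relevancy": 0.81, "context_recall": 0.80}
flagged = find_regressions(this_week)
if flagged:
    print("Review needed:", flagged)
```

A small tolerance matters because LLM-judged metrics vary a little from run to run even when nothing has changed.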
The full LLMOps stack
```
User query
    ↓
Input guardrails (block prompt injections, PII)
    ↓
Query routing (simple/medium/complex → right model)
    ↓
Cache check (identical queries return cached response)
    ↓
RAG retrieval (find relevant context)
    ↓
LLM call (with context + system prompt)
    ↓
Output validation (check for hallucinations, PII leaks)
    ↓
Response to user
    ↓
Log everything to LangSmith / Langfuse for tracing
    ↓
Weekly eval pipeline (faithfulness, relevancy scores)
    ↓
Alert if scores drop → human review → prompt update
```
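Two of the cheaper stages in this stack, input guardrails and the cache check, can be sketched in a few lines. This is a minimal illustration, not a production guardrail: the regex patterns, the in-memory dictionary cache, and the `handle` function are assumptions, and routing and RAG are folded into a stand-in `llm_call` for brevity.

```python
import hashlib
import re

# Hypothetical patterns; real guardrails need broader coverage and testing.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"reveal (your|the) system prompt",
]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

_cache: dict[str, str] = {}  # in-memory only; a shared store would be used in production


def check_input(query: str) -> str:
    """Reject likely injections and mask email addresses before the LLM sees them."""
    lowered = query.lower()
    if any(re.search(p, lowered) for p in INJECTION_PATTERNS):
        raise ValueError("query blocked by input guardrail")
    return EMAIL.sub("[EMAIL]", query)


def cache_key(query: str) -> str:
    """Normalise case and whitespace so trivially different queries share a key."""
    normalised = " ".join(query.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def handle(query: str, llm_call) -> str:
    """Guardrail → cache check → LLM call. `llm_call` is any str -> str function."""
    safe_query = check_input(query)
    key = cache_key(safe_query)
    if key not in _cache:
        _cache[key] = llm_call(safe_query)
    return _cache[key]


calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for routing + RAG + the real model call."""
    calls.append(prompt)
    return f"answer to: {prompt}"


handle("What are your business hours?", fake_llm)
handle("what are  your business hours?", fake_llm)  # served from the cache
print(len(calls))
```

Masking PII before computing the cache key also keeps one user's email address from ending up inside a response that is later served to someone else.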
Ananya's monthly bill: ₹2.8 lakh. Hallucination rate: 3.1%. P95 latency: 1.2 seconds. CEO happy.
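P95 latency is the response time that 95% of requests stay at or under. A sketch of how it could be computed from logged latencies follows; the sample values are invented, and the nearest-rank method used here is one of several common percentile definitions.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in seconds, e.g. pulled from request logs.
latencies = [0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9,
             0.9, 0.9, 1.0, 1.0, 1.0, 1.1, 1.1, 1.1, 1.2, 3.5]
print("P50:", percentile(latencies, 50))
print("P95:", percentile(latencies, 95))
```

Tracking a high percentile rather than the average keeps attention on the slow requests users actually notice.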
The lesson: LLMs in production are an engineering problem, not just a prompting problem.
Route simple queries to cheaper models: in Ananya's case, about 70% of queries did not need GPT-4
RAG grounds LLM answers in your actual data, reducing hallucinations dramatically
Use RAGAS metrics (faithfulness, relevancy) to evaluate LLM output quality
Log and trace every LLM call with tools like LangSmith or Langfuse
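As a rough picture of what tracing a call involves, the sketch below wraps any LLM function and records one JSON line per invocation. It is a generic stand-in written for illustration and does not use the LangSmith or Langfuse SDKs; `traced`, the record fields, and the list used as a sink are all assumptions.

```python
import json
import time
import uuid
from typing import Callable


def traced(llm_call: Callable[[str], str], model: str, sink: list) -> Callable[[str], str]:
    """Wrap an LLM call so every invocation appends one JSON trace record to `sink`."""
    def wrapper(prompt: str) -> str:
        record = {"trace_id": str(uuid.uuid4()), "model": model, "prompt": prompt}
        start = time.perf_counter()
        try:
            record["response"] = llm_call(prompt)
            record["status"] = "ok"
            return record["response"]
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so latency is always recorded.
            record["latency_s"] = round(time.perf_counter() - start, 4)
            sink.append(json.dumps(record))
    return wrapper


traces: list[str] = []
echo = traced(lambda p: p.upper(), model="stub-model", sink=traces)
echo("hello")
print(traces[0])
```

Records like these are also the raw material for latency percentiles and for the weekly eval set.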