AI Ops Ch 9 / 9 Expert

Rohan cuts the GPU bill by 60 percent

Model quantisation, batching, caching and serving ML cheaply at scale

⏱ 12 min 5 commands 5 takeaways
In this chapter
Rohan
ML platform lead, Series C startup
The story

Rohan's team was spending 18,000 dollars per month on GPU instances to serve 4 production ML models. His CEO asked him to cut it by half without degrading user experience.

The audit revealed three surprising facts:

- Average GPU utilisation was 12 percent — GPUs sitting idle 88 percent of the time

- Preprocessing was taking 38ms, almost as long as the 45ms inference

- 34 percent of all requests were exact duplicates
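Two of these audit numbers are easy to reproduce from your own logs. The sketch below is a minimal illustration, not Rohan's actual tooling: `overprovision_factor`, `duplicate_rate` and `sample_log` are hypothetical names, and it assumes request payloads are JSON-serialisable dictionaries.

```python
import hashlib
import json

def overprovision_factor(utilisation):
    """How many times more GPU capacity is paid for than is actually used."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return 1 / utilisation

def duplicate_rate(payloads):
    """Fraction of requests whose payload exactly matches an earlier request."""
    seen = set()
    duplicates = 0
    for payload in payloads:
        # Canonical JSON so that key order does not change the fingerprint
        canonical = json.dumps(payload, sort_keys=True).encode()
        digest = hashlib.sha256(canonical).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(payloads) if payloads else 0.0

# Hypothetical request log, for illustration only
sample_log = [
    {"user": 1, "item": 7},
    {"item": 7, "user": 1},   # same request, different key order
    {"user": 2, "item": 9},
]

print(round(overprovision_factor(0.12), 1))   # 12 percent utilisation
print(duplicate_rate(sample_log))
```

At 12 percent utilisation the factor comes out at about 8.3, which is where the "paying for 8x" figure later in the chapter comes from.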

Fix 1: Quantisation. This converts model weights from 32-bit floats to 8-bit integers. The model becomes about 4x smaller and typically 2-3x faster, with minimal accuracy loss.

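import torch
from torch.quantization import quantize_dynamic
# Load the full-precision model (assumes the file holds a pickled nn.Module)
model = torch.load('recommendation_model.pt', weights_only=False)
model.eval()
# Store the weights of every nn.Linear layer as INT8; activations are quantised
# on the fly. Note: PyTorch dynamic quantisation runs on CPU backends.
quantised = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
# Before: 480 MB, 45ms per request
# After:  121 MB, 18ms per request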
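To see why 8-bit weights cost so little accuracy, here is a minimal sketch of the underlying idea: symmetric per-tensor quantisation in plain Python. It is an illustration under simplifying assumptions, not a description of what `quantize_dynamic` does internally; the function names and the example weights are made up.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]

weights = [0.82, -0.41, 0.05, -1.27, 0.33]   # made-up example values
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# The rounding error per weight is at most half a quantisation step
max_error = max(abs(a - b) for a, b in zip(weights, restored))

fp32_bytes = len(weights) * 4   # 4 bytes per 32-bit float
int8_bytes = len(weights) * 1   # 1 byte per 8-bit integer

print(q)
print(max_error <= scale / 2 + 1e-12)
print(fp32_bytes / int8_bytes)
```

Each weight is stored in one byte instead of four, which is the source of the roughly 4x size reduction. The single `scale` value per tensor is the only extra metadata in this simplified scheme.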

For LLMs specifically: GGUF quantisation (Q4_K_M, Q8_0) makes a 7B parameter model run on a cheap T4 GPU instead of an expensive A100.
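A rough memory estimate shows why this works. The sketch below counts weight storage only (no KV cache or activations). The 16 GB figure is the T4's memory size and Q8_0 stores 8.5 bits per weight; the 4.85 bits per weight used for Q4_K_M is an approximation and should be treated as an assumption.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Bits per weight for each format. Q8_0 packs 32 int8 values plus one FP16
# scale per block (8.5 bits per weight); Q4_K_M is approximated as 4.85.
formats = {"FP32": 32, "FP16": 16, "Q8_0": 8.5, "Q4_K_M": 4.85}
T4_MEMORY_GB = 16   # NVIDIA T4 GPU memory

for name, bits in formats.items():
    gb = weight_memory_gb(7, bits)
    fits = gb < T4_MEMORY_GB
    print(f"{name:7s} {gb:5.1f} GB  weights fit on a T4: {fits}")
```

At FP16 the weights of a 7B model already take 14 GB, leaving almost nothing on a 16 GB card for the KV cache. The quantised formats leave several gigabytes of headroom.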

Fix 2: Dynamic batching. GPUs are massively parallel, so sending one request at a time leaves most of that parallelism unused.

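import asyncio
class BatchServer:
    def __init__(self, model, max_batch=32, max_wait_ms=50):
        self.model = model
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self.queue = asyncio.Queue()
    async def predict(self, features):
        # Each caller gets a future that the worker resolves with its result
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((features, future))
        return await future
    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            # Wait for the first request, then open the batching window
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_ms / 1000
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            try:
                # Run the blocking model call in a thread so the event loop stays responsive
                inputs = [features for features, _ in batch]
                results = await loop.run_in_executor(None, self.model.predict_batch, inputs)
                for (_, future), result in zip(batch, results):
                    if not future.done():
                        future.set_result(result)
            except Exception as exc:
                # Pass failures to the callers instead of leaving them waiting forever
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)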

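With the chapter's figures, 100 sequential requests at 45ms each occupy the GPU for 4,500ms, while one batch of 100 takes about 55ms: roughly 80x more throughput from the same GPU. With max_batch=32 as in the code above, those 100 requests would be split across four batches.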
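The arithmetic behind the batching gain can be checked with a short sketch. It uses the chapter's 45ms and 55ms figures and the `max_batch=32` default from the code above. It also assumes every batch takes 55ms regardless of how full it is, which is a simplification; the helper names are hypothetical.

```python
import math

def sequential_ms(n_requests, per_request_ms):
    """GPU time when requests are processed one at a time."""
    return n_requests * per_request_ms

def batched_ms(n_requests, max_batch, per_batch_ms):
    """GPU time when requests are served in batches of up to max_batch."""
    return math.ceil(n_requests / max_batch) * per_batch_ms

seq = sequential_ms(100, 45)             # 4500 ms
one_batch = batched_ms(100, 100, 55)     # 55 ms, requires max_batch >= 100
four_batches = batched_ms(100, 32, 55)   # 4 batches, 220 ms

print(seq / one_batch)      # throughput gain with a single batch of 100
print(seq / four_batches)   # throughput gain with max_batch=32
```

Even with the smaller batch size the GPU finishes the same work in about a twentieth of the time. The cost is up to `max_wait_ms` of extra latency per request while the batch fills.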

Fix 3: Semantic caching. Cache not just exact duplicate requests but also semantically similar ones using embedding similarity.

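import hashlib
import numpy as np
import redis
class SemanticCache:
    def __init__(self, embed_model, threshold=0.95):
        self.redis = redis.Redis()
        self.embed_model = embed_model
        self.threshold = threshold
    def _embed(self, query):
        # Unit-normalise so the dot product below is cosine similarity
        emb = np.asarray(self.embed_model.encode(query), dtype=np.float32)
        return emb / np.linalg.norm(emb)
    def get_or_compute(self, query, compute_fn):
        q_emb = self._embed(query)
        # Linear scan over all entries: fine for a small cache, use a vector index at scale
        for key in self.redis.scan_iter(match='cache:*'):
            cached = self.redis.hgetall(key)
            if not cached:  # key expired between the scan and the read
                continue
            # dtype must match the dtype used when the embedding was stored
            c_emb = np.frombuffer(cached[b'embedding'], dtype=np.float32)
            if np.dot(q_emb, c_emb) >= self.threshold:
                return cached[b'result'].decode()
        result = compute_fn(query)
        # sha256 gives a key that is stable across processes, unlike built-in hash()
        key = f"cache:{hashlib.sha256(query.encode()).hexdigest()}"
        self.redis.hset(key, mapping={
            'embedding': q_emb.tobytes(), 'result': result
        })
        self.redis.expire(key, 3600)  # entries expire after one hour
        return result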
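The similarity test at the heart of the cache is worth seeing in isolation. This is a minimal sketch in plain Python with made-up 3-dimensional vectors standing in for real embeddings; `cosine_similarity` and `is_cache_hit` are hypothetical helper names. It shows why embeddings should be normalised (or compared with cosine similarity) before applying a threshold such as 0.95.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by both vector lengths, so magnitude does not matter."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(query_emb, cached_emb, threshold=0.95):
    return cosine_similarity(query_emb, cached_emb) >= threshold

# Toy "embeddings", made up for illustration
cached = [0.9, 0.1, 0.4]
near = [0.88, 0.12, 0.42]   # almost the same direction as cached
far = [0.1, 0.9, 0.2]       # a very different direction
scaled = [9.0, 1.0, 4.0]    # same direction as cached, 10x the length

print(is_cache_hit(near, cached))     # similar query: served from cache
print(is_cache_hit(far, cached))      # different query: run the model
print(is_cache_hit(scaled, cached))   # vector length alone does not cause a miss
```

A raw dot product would score `scaled` far higher than `near` simply because the vector is longer, so a fixed threshold only makes sense on normalised vectors. The threshold itself is a trade-off: lower values raise the hit rate but risk returning an answer to a question that was only loosely similar.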

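Fix 4: Auto-scaling, with scale-down to zero during off-peak hours.

# Kubernetes HPA: scales between 1 and 20 replicas on CPU utilisation.
# A standard HPA does not go below 1 replica; scaling to zero off-peak needs
# a separate mechanism, for example KEDA or a scheduled scale-down.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mymodel
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mymodel
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70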
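How the autoscaler turns a 70 percent target into a replica count can be sketched in a few lines. The core rule, desired = ceil(current replicas × current metric / target metric), is the one documented for the Kubernetes HPA. The sketch leaves out the tolerance band and stabilisation windows that the real controller applies, and `desired_replicas` is a hypothetical helper name.

```python
import math

def desired_replicas(current_replicas, current_utilisation, target_utilisation,
                     min_replicas=1, max_replicas=20):
    """Core HPA rule, clamped to the configured minimum and maximum."""
    raw = math.ceil(current_replicas * current_utilisation / target_utilisation)
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(4, 90, 70))     # above target: scale up
print(desired_replicas(4, 10, 70))     # far below target: scale down to the minimum
print(desired_replicas(10, 200, 70))   # heavy overload: capped at maxReplicas
```

Note that the result never drops below `min_replicas`, which is why a configuration with `minReplicas: 1` always keeps one instance running.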

Results after all four fixes:

- Quantisation: 45ms to 18ms inference, 4x smaller model

- Batching: GPU utilisation from 12% to 73%

- Caching: 34% of requests served from cache at zero GPU cost

- Auto-scaling: zero GPU cost during off-peak hours

Total cost reduction: from 18,000 to 6,800 dollars per month — 62 percent saving.
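The headline figure is simple to verify. This short sketch recomputes the saving from the chapter's two monthly totals, and also works out what share of the day a 3am to 7am off-peak window covers; the function names are hypothetical.

```python
def saving_percent(before, after):
    """Percentage reduction from the old cost to the new cost."""
    return (before - after) / before * 100

def off_peak_share(start_hour, end_hour):
    """Fraction of a 24-hour day covered by the off-peak window."""
    return (end_hour - start_hour) / 24

monthly_before, monthly_after = 18_000, 6_800

print(round(saving_percent(monthly_before, monthly_after), 1))   # percent saved
print(monthly_before - monthly_after)                            # dollars saved per month
print(round(off_peak_share(3, 7) * 100, 1))                      # percent of the day off-peak
```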

Key takeaways

Quantisation (32-bit to 8-bit) makes models 4x smaller and 2-3x faster with minimal accuracy loss

GPU utilisation of 12% means you are paying for roughly 8x the GPU capacity you actually use

Dynamic batching: collect requests for 50ms then process as a batch — massive GPU efficiency gain

Semantic caching with embeddings serves similar queries from cache without running the model at all

Auto-scaling to zero during off-peak hours (3am to 7am) eliminates payment for idle GPU capacity

Commands from this chapter
>>> torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
Quantise PyTorch model to INT8 (a Python call, not a shell command)
$ nvidia-smi dmon -s u
Monitor real-time GPU utilisation
$ pip install vllm
Install vLLM for high-throughput LLM serving with built-in batching
$ kubectl autoscale deployment mymodel --min=1 --max=20 --cpu-percent=70
Set up Kubernetes auto-scaling
$ redis-cli info stats | grep -E 'keyspace_(hits|misses)'
Check cache hits and misses in Redis (hit rate = hits / (hits + misses))
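Redis reports hit and miss counters rather than a ready-made rate. A minimal sketch for turning the text output of `redis-cli info stats` into a hit rate could look like this; `hit_rate` is a hypothetical helper and the sample numbers are made up.

```python
def hit_rate(info_text):
    """Compute hits / (hits + misses) from the output of `redis-cli info stats`."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats"
        if line and not line.startswith("#") and ":" in line:
            key, value = line.split(":", 1)
            stats[key] = value
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0

# Made-up sample in the same key:value format that Redis prints
sample = "# Stats\r\nkeyspace_hits:340\r\nkeyspace_misses:660\r\n"
print(hit_rate(sample))
```

These counters cover every key lookup on the server, so on a shared Redis instance they reflect all workloads, not just the model cache.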