Rohan cuts the GPU bill by 60 percent
Model quantisation, batching, caching and serving ML cheaply at scale
Rohan's team was spending 18,000 dollars per month on GPU instances to serve 4 production ML models. His CEO asked him to cut it by half without degrading user experience.
The audit revealed three surprising facts:
- Average GPU utilisation was 12 percent — GPUs sitting idle 88 percent of the time
- Preprocessing was taking 38ms, almost as long as the 45ms inference
- 34 percent of all requests were exact duplicates
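The audit figures translate directly into waste. A back-of-envelope sketch using only the numbers above (the variable names are illustrative, and treating idle time as proportional to spend is a simplifying assumption):

```python
# Back-of-envelope reading of the audit figures quoted above.
monthly_bill = 18_000          # dollars per month
gpu_utilisation = 0.12         # average fraction of time the GPUs do useful work
duplicate_share = 0.34         # fraction of requests that are exact duplicates
preprocess_ms, inference_ms = 38, 45

# Paying for 100% of the GPU time while using 12% of it
overprovision_factor = 1 / gpu_utilisation
idle_spend = monthly_bill * (1 - gpu_utilisation)

# Share of per-request latency spent before the model even runs
preprocess_share = preprocess_ms / (preprocess_ms + inference_ms)

# Requests that still need the GPU once exact duplicates are cached
gpu_request_share = 1 - duplicate_share

print(f"Paying for {overprovision_factor:.1f}x the GPU time actually used")
print(f"Roughly ${idle_spend:,.0f} of the monthly bill covers idle time")
print(f"Preprocessing is {preprocess_share:.0%} of request latency")
print(f"{gpu_request_share:.0%} of requests still reach the GPU after exact-match caching")
```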
Fix 1: Quantisation. Reduces model weights from 32-bit floats to 8-bit integers. The model becomes 4x smaller and 2-3x faster with minimal accuracy loss.
import torch
from torch.quantization import quantize_dynamic

# load_model is the team's own loader (not shown here)
model = load_model('recommendation_model.pt')
# Store Linear-layer weights as int8; activations are quantised on the fly
quantised = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Before: 480 MB, 45ms per request
# After: 121 MB, 18ms per request

For LLMs specifically: GGUF quantisation (Q4_K_M, Q8_0) makes a 7B parameter model run on a cheap T4 GPU instead of an expensive A100.
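To make the 32-bit to 8-bit step concrete, here is a minimal sketch of symmetric per-tensor quantisation in plain Python. It is a simplified illustration of the idea, not what PyTorch or GGUF do internally; the weight values and the helper names quantise_int8 and dequantise are hypothetical.

```python
from array import array

def quantise_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = array('b', (round(w / scale) for w in weights))
    return q, scale

def dequantise(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

# Hypothetical weights, for illustration only
weights = array('f', [0.42, -1.30, 0.07, 0.95, -0.58, 1.27, -0.01, 0.33])
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

size_ratio = (weights.itemsize * len(weights)) / (q.itemsize * len(q))
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"float32 bytes: {weights.itemsize * len(weights)}, int8 bytes: {q.itemsize * len(q)}")
print(f"size ratio: {size_ratio:.0f}x, max round-trip error: {max_error:.4f}")
```

Each weight shrinks from four bytes to one, which is where the 4x size reduction comes from. The round-trip error is bounded by half the scale, which is why accuracy loss is usually small when the weights do not contain extreme outliers.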
Fix 2: Dynamic batching. GPUs are massively parallel. Sending one request at a time wastes 99 percent of that parallelism.
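import asyncio

class BatchServer:
    def __init__(self, model, max_batch=32, max_wait_ms=50):
        self.model = model
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self.queue = asyncio.Queue()

    async def predict(self, features):
        # Each caller gets a future that the worker resolves once its batch has run
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((features, future))
        return await future

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then open the batching window
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_ms / 1000
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self.queue.get(), timeout=timeout)
                    batch.append(item)
                except asyncio.TimeoutError:
                    break
            try:
                # Run the blocking model call in a thread so the event loop keeps accepting requests
                results = await loop.run_in_executor(
                    None, self.model.predict_batch, [b[0] for b in batch])
                for (_, future), result in zip(batch, results):
                    if not future.done():
                        future.set_result(result)
            except Exception as exc:
                # Propagate failures so callers do not wait forever
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)

Batching turns 100 sequential 45ms requests (4.5 seconds of GPU time) into one batched 55ms request, roughly 80x less GPU time for the same work. Note that max_batch must be at least 100 for that; with the default of 32 above, the same requests run as four batches.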
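The trade-off behind max_batch and max_wait_ms can be estimated from the two latency figures quoted above. The sketch below assumes batch latency grows linearly between them (45ms for one request, 55ms for a batch of 100). That linear shape is an assumption made for illustration, not a measurement, and the function names are hypothetical.

```python
SINGLE_MS = 45.0      # latency of one unbatched request (from the text)
BATCH_100_MS = 55.0   # latency of one batch of 100 (from the text)

def batch_latency_ms(batch_size):
    """Assumed linear interpolation between the two quoted data points."""
    per_extra_item = (BATCH_100_MS - SINGLE_MS) / 99
    return SINGLE_MS + per_extra_item * (batch_size - 1)

def throughput_rps(batch_size):
    """Requests per second one GPU can sustain at a given batch size."""
    return batch_size / batch_latency_ms(batch_size) * 1000

def worst_case_latency_ms(batch_size, max_wait_ms=50):
    """A request may wait the full batching window before its batch runs."""
    return max_wait_ms + batch_latency_ms(batch_size)

for size in (1, 32, 100):
    print(f"batch={size:>3}: {throughput_rps(size):7.1f} req/s, "
          f"worst-case latency {worst_case_latency_ms(size):.1f} ms")
```

Under this assumption throughput rises by well over an order of magnitude, while an individual request can take up to max_wait_ms longer. The price of batching is paid in tail latency, so the window should be sized against the latency budget users will tolerate.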
Fix 3: Semantic caching. Cache not just exact duplicate requests but also semantically similar ones using embedding similarity.
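The test at the heart of a semantic cache is a cosine-similarity comparison against a threshold. A minimal sketch with hand-made vectors follows; real embeddings would come from an embedding model and have hundreds of dimensions, and the vectors and the is_cache_hit helper here are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(query_emb, cached_emb, threshold=0.95):
    """Treat two queries as equivalent when their embeddings nearly align."""
    return cosine_similarity(query_emb, cached_emb) >= threshold

# Hypothetical 4-dimensional embeddings, for illustration only
cached     = [0.90, 0.10, 0.30, 0.20]
paraphrase = [0.88, 0.12, 0.33, 0.18]   # close to the cached query
unrelated  = [0.10, 0.90, 0.20, 0.40]   # points in a different direction

print(is_cache_hit(paraphrase, cached))   # True
print(is_cache_hit(unrelated, cached))    # False
```

The threshold is the main tuning knob: lowering it raises the hit rate but increases the risk of answering a different question from the one that was asked. The cache class that follows applies this kind of comparison to embeddings stored in Redis.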
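import hashlib
import numpy as np
import redis

class SemanticCache:
    def __init__(self, embed_model, threshold=0.95):
        self.redis = redis.Redis()
        self.embed_model = embed_model
        self.threshold = threshold

    def get_or_compute(self, query, compute_fn):
        # Normalise so the dot product below is cosine similarity
        q_emb = np.asarray(self.embed_model.encode(query), dtype=np.float32)
        q_emb = q_emb / np.linalg.norm(q_emb)
        # Linear scan: fine for a small cache; larger ones need a vector index
        for key in self.redis.scan_iter(match='cache:*'):
            cached = self.redis.hgetall(key)
            if not cached:  # key expired between scan and read
                continue
            # dtype must match what was stored, or the bytes decode to garbage
            c_emb = np.frombuffer(cached[b'embedding'], dtype=np.float32)
            if np.dot(q_emb, c_emb) >= self.threshold:
                return cached[b'result'].decode()
        result = compute_fn(query)  # assumed to return a str
        # hash() is salted per process for strings, so use a stable digest as the key
        key = f"cache:{hashlib.sha256(query.encode()).hexdigest()}"
        self.redis.hset(key, mapping={'embedding': q_emb.tobytes(), 'result': result})
        self.redis.expire(key, 3600)
        return result

Fix 4: Auto-scaling to zero during off-peak hours.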
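# Kubernetes HPA
# Note: with minReplicas: 1 this HPA always keeps one replica running.
# Reaching zero replicas needs a separate mechanism, such as KEDA or a scheduled scale-down.
spec:
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Results after all four fixes: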
- Quantisation: 45ms to 18ms inference, 4x smaller model
- Batching: GPU utilisation from 12% to 73%
- Caching: 34% of requests served from cache at zero GPU cost
- Auto-scaling: zero GPU cost during off-peak hours
Total cost reduction: from 18,000 to 6,800 dollars per month — 62 percent saving.
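The headline figure can be checked with a few lines of arithmetic. The annualised number is simply the monthly saving multiplied by twelve and assumes the bill stays flat:

```python
before = 18_000   # dollars per month, before the fixes
after = 6_800     # dollars per month, after the fixes

monthly_saving = before - after
saving_pct = monthly_saving / before * 100
annual_saving = monthly_saving * 12
ceo_target = before / 2   # the CEO asked for the bill to be halved

print(f"Monthly saving: ${monthly_saving:,}")
print(f"Reduction: {saving_pct:.1f}%")
print(f"Annualised: ${annual_saving:,}")
print(f"Beats the 50% target: {after < ceo_target}")
```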
Key takeaways:
- Quantisation (32-bit to 8-bit) makes models 4x smaller and 2-3x faster with minimal accuracy loss
- GPU utilisation of 12% means you are paying for roughly 8x the GPU capacity you actually use
- Dynamic batching: collect requests for up to 50ms, then process them as one batch, for a large gain in GPU efficiency
- Semantic caching with embeddings serves similar queries from cache without running the model at all
- Auto-scaling to zero during off-peak hours (3am to 7am) eliminates payment for idle GPU capacity