DevOps Ch 8 / 10 Advanced

Priya makes the invisible visible

Prometheus, Grafana and alerting — know your system is healthy before users complain

⏱ 12 min 5 commands 5 takeaways
In this chapter
Priya
SRE at a Bengaluru fintech
The story

Priya joined the SRE team after 3 years as a backend developer. Her first week she was handed a pager. Her second week it went off at 1am. The app was slow. Users were complaining. And nobody knew why because there were no dashboards, no metrics, no logs — nothing.

She fixed the immediate issue by restarting services. But she vowed it would never happen again.

Three pillars of observability:

- Metrics: numbers over time (CPU percent, request rate, error rate)

- Logs: text records of events (errors, warnings, info messages)

- Traces: the path of a single request through your whole system

Metrics tell you something is wrong. Logs tell you what went wrong. Traces tell you where.

Setting up Prometheus. Prometheus scrapes your app's /metrics endpoint every 15 seconds:

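# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

# prometheus.yml
global:
  scrape_interval: 15s   # Prometheus defaults to 1m when this is not set
scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ['myapp:8080']   # service name and port of the app on the Compose network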

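In your FastAPI app, one pip install adds automatic metrics:

pip install prometheus-fastapi-instrumentator

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)  # serves the metrics at /metrics

That one line gives you request count, request duration, and request and response size automatically. Tracking in-progress (active) requests is opt-in: pass should_instrument_requests_inprogress=True to Instrumentator().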
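To make the /metrics endpoint less of a black box, here is a minimal stdlib-only sketch of what a scrape response looks like. It is not how the instrumentator is implemented. The record_request and render_metrics helpers are hypothetical, and the label names (method, handler, status) are an assumption about what the library emits by default.

```python
from collections import Counter

# Hypothetical in-memory counter keyed by (method, handler, status).
request_counts: Counter = Counter()


def record_request(method: str, handler: str, status: str) -> None:
    """Count one finished request under its label combination."""
    request_counts[(method, handler, status)] += 1


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total number of HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for (method, handler, status), value in sorted(request_counts.items()):
        labels = f'method="{method}",handler="{handler}",status="{status}"'
        lines.append(f"http_requests_total{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


record_request("GET", "/pay", "2xx")
record_request("GET", "/pay", "2xx")
record_request("POST", "/pay", "5xx")
print(render_metrics())
```

Each line is one time series: a metric name, a set of labels, and the current value. Prometheus stores the value it sees at every scrape, which is what lets you compute rates later.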

Setting up Grafana. Prometheus stores metrics. Grafana visualises them:

# docker-compose.yml, under services:
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]

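Open localhost:3000, add Prometheus as a data source (from inside the Compose network its URL is http://prometheus:9090), and import dashboard ID 1860. That dashboard is Node Exporter Full, so it shows host CPU, memory, and disk only once Prometheus also scrapes a node_exporter target. For the FastAPI metrics, build panels on http_requests_total and http_request_duration_seconds.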

The four golden signals (Google SRE book):

1. Latency: how long requests take (p50, p95, p99)

2. Traffic: requests per second

3. Errors: percentage of requests failing

4. Saturation: how full is the system (CPU, memory, queue depth)

If all four golden signals are healthy, your service is very likely healthy from the user's point of view.
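The four signals can be computed from raw request records. The sketch below is illustrative only: the Request dataclass, the golden_signals function, and the sample numbers are made up for this example, and saturation is reduced to a single CPU ratio. In a real setup Prometheus derives these values from scraped metrics.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests: list[Request], window_s: float,
                   cpu_used: float, cpu_total: float) -> dict[str, float]:
    """Summarise one observation window as the four golden signals."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p95_ms": percentile(durations, 95),
        "traffic_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "saturation": cpu_used / cpu_total,
    }


# 20 made-up requests over a 10 second window; every tenth one fails.
sample = [Request(duration_ms=10.0 * i, status=500 if i % 10 == 0 else 200)
          for i in range(1, 21)]
signals = golden_signals(sample, window_s=10, cpu_used=3.0, cpu_total=4.0)
print(signals)
```

Note how the p95 latency (190 ms here) sits far above the median (100 ms). That gap is why the list asks for p95 and p99 rather than an average.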

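Setting up alerts in Prometheus. Save the rules in a file (for example alerts.yml) and list that file under rule_files in prometheus.yml. Prometheus only evaluates the rule; delivering the notification is Alertmanager's job:

groups:
  - name: myapp
    rules:
      - alert: HighErrorRate
        # share of all requests that returned a 5xx over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        annotations:
          summary: "Error rate above 5 percent for 2 minutes"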
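Two parts of that rule are easy to misread: what rate() measures, and what for: 2m does. The sketch below models both in plain Python under simplifying assumptions. simple_rate ignores counter resets and the window-edge extrapolation that Prometheus performs, and alert_states is a hypothetical helper, not Prometheus code.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter between the first and last sample.

    Simplified: Prometheus's rate() also handles counter resets and
    extrapolates to the edges of the window.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def alert_states(evaluations: list[tuple[float, float]],
                 threshold: float, for_seconds: float) -> list[str]:
    """Return inactive, pending, or firing for each (timestamp, value) evaluation."""
    states = []
    breach_started = None
    for ts, value in evaluations:
        if value > threshold:
            if breach_started is None:
                breach_started = ts  # first evaluation above the threshold
            held = ts - breach_started
            states.append("firing" if held >= for_seconds else "pending")
        else:
            breach_started = None  # any healthy evaluation resets the timer
            states.append("inactive")
    return states


# (timestamp_s, counter_value) samples five minutes apart.
errors_5xx = [(0.0, 100.0), (300.0, 130.0)]
all_requests = [(0.0, 1000.0), (300.0, 1300.0)]
error_ratio = simple_rate(errors_5xx) / simple_rate(all_requests)

# Error ratio seen at one-minute evaluation intervals.
states = alert_states(
    [(0, 0.01), (60, 0.08), (120, 0.09), (180, 0.07), (240, 0.02)],
    threshold=0.05, for_seconds=120,
)
print(error_ratio, states)
```

The alert only fires at the fourth evaluation, after the ratio has stayed above the threshold for a full two minutes. A single bad scrape leaves it in pending and pages nobody.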

Uptime Kuma for smaller teams. A single Docker container that monitors URLs and sends WhatsApp or Telegram alerts when they go down:

docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1

Add your endpoints and you have a basic status page plus alerting in 5 minutes. Free, self-hosted, excellent for Indian startups.
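Under the hood, a URL monitor is a timed HTTP request repeated on a schedule. The following sketch shows a single check using only the standard library. It is not Uptime Kuma's code; check_url and its opener parameter are hypothetical names, with opener there only so the function can be exercised without a network.

```python
import urllib.error
import urllib.request


def check_url(url: str, timeout_s: float = 5.0,
              opener=urllib.request.urlopen) -> bool:
    """Return True when the URL answers with a 2xx or 3xx status within the timeout."""
    try:
        with opener(url, timeout=timeout_s) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError, OSError):
        # DNS failure, refused connection, timeout, or a 4xx/5xx response
        # (urlopen raises HTTPError, a URLError subclass, for those).
        return False
```

Calling check_url("http://localhost:8080/health") once a minute and alerting after a few consecutive False results is roughly the behaviour a tool like Uptime Kuma packages behind a UI. The URL here is a placeholder.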

Structured logging makes logs machine-readable:

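import structlog

# Emit one JSON object per line (structlog's default output is a console format)
structlog.configure(processors=[
    structlog.processors.add_log_level,
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.JSONRenderer(),
])
log = structlog.get_logger()
# user_id, amount and elapsed come from the surrounding request handler
log.info("payment_processed",
         user_id=user_id,
         amount=amount,
         currency="INR",
         duration_ms=elapsed)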

Structured logs let you search by user_id, filter by amount, or count payments by currency. Never use print() in production.
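Here is a small sketch of what that searching looks like once each log line is a JSON object. The log lines are invented for illustration, and in practice a log platform would run these queries for you. The point is that every filter becomes a field comparison instead of a regular expression.

```python
import json
from collections import Counter

# Made-up log lines in the one-JSON-object-per-line shape.
LOG_LINES = [
    '{"event": "payment_processed", "user_id": 42, "amount": 499.0, "currency": "INR", "duration_ms": 120}',
    '{"event": "payment_processed", "user_id": 7, "amount": 1500.0, "currency": "INR", "duration_ms": 95}',
    '{"event": "payment_failed", "user_id": 42, "amount": 250.0, "currency": "USD", "duration_ms": 3000}',
    '{"event": "payment_processed", "user_id": 42, "amount": 20.0, "currency": "USD", "duration_ms": 80}',
]

records = [json.loads(line) for line in LOG_LINES]
processed = [r for r in records if r["event"] == "payment_processed"]

# Search by user_id.
by_user = [r for r in records if r["user_id"] == 42]

# Filter by amount.
large_payments = [r for r in processed if r["amount"] > 1000]

# Count payments by currency.
payments_per_currency = Counter(r["currency"] for r in processed)

print(len(by_user), len(large_payments), dict(payments_per_currency))
```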

Priya never had another mystery 1am incident. The next time the pager went off she could see in 30 seconds exactly what was wrong, which service, which endpoint, when it started.

Key takeaways

Observability has three pillars: metrics (numbers), logs (events), traces (request paths)

One pip install adds automatic Prometheus metrics to FastAPI: request count, duration, and request and response size, with error rate derived from the status label

The four golden signals: latency, traffic, errors, saturation — monitor all four for any service

Uptime Kuma is the fastest path to monitoring plus WhatsApp alerts for small teams

Structured logging (JSON key-value pairs) makes logs searchable and parseable by machines

Commands from this chapter
$ pip install prometheus-fastapi-instrumentator
Add metrics endpoint to FastAPI in one line
$ docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1
Start Uptime Kuma monitoring dashboard
$ sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Prometheus: calculate 5xx error rate as a fraction of all requests
$ histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
Prometheus: calculate p95 latency
$ docker compose up prometheus grafana
Start full monitoring stack