
DevOps Ch 8 / 10 Advanced

📊

Priya makes the invisible visible

Prometheus, Grafana and alerting — know your system is healthy before users complain


In this chapter

Priya

SRE at a Bengaluru fintech

The story

Priya joined the SRE team after 3 years as a backend developer. Her first week she was handed a pager. Her second week it went off at 1am. The app was slow. Users were complaining. And nobody knew why because there were no dashboards, no metrics, no logs — nothing.

She fixed the immediate issue by restarting services. But she vowed it would never happen again.

Three pillars of observability:

- Metrics: numbers over time (CPU percent, request rate, error rate)

- Logs: text records of events (errors, warnings, info messages)

- Traces: the path of a single request through your whole system

Metrics tell you something is wrong. Logs tell you what went wrong. Traces tell you where.

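Setting up Prometheus. Prometheus pulls ("scrapes") your app's /metrics endpoint on a fixed interval. The config below sets it to 15 seconds; without that setting the default is 1 minute:

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes:
      # Mount the scrape config below over the image's built-in config
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

# prometheus.yml
global:
  scrape_interval: 15s   # how often every target is scraped
scrape_configs:
  - job_name: myapp
    static_configs:
      # Service name and port of the app on the Compose network
      - targets: ['myapp:8080']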

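In your FastAPI app, one pip install and one line of setup add automatic metrics:

pip install prometheus-fastapi-instrumentator

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
# Record the default HTTP metrics and serve them at /metrics
Instrumentator().instrument(app).expose(app)

That one line gives you request count, request duration, and request and response size automatically. Tracking of in-progress (active) requests is also available, but it is opt-in.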
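To make the scrape less of a black box, here is a rough sketch of the plain-text format a /metrics endpoint returns. It is a hand-rolled illustration, not the library's actual code; the `render_metrics` helper, the sample counts, and the exact label names are assumptions made for the example.

```python
from collections import Counter


def render_metrics(request_counts: Counter) -> str:
    """Render request counts in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total number of HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    # One line per label combination: metric_name{labels} value
    for (method, handler, status), count in sorted(request_counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",handler="{handler}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"


# Hypothetical traffic: counters only ever go up while the app runs
counts = Counter()
counts[("GET", "/health", "2xx")] += 3
counts[("POST", "/payments", "2xx")] += 2
counts[("POST", "/payments", "5xx")] += 1

print(render_metrics(counts))
```

Prometheus reads this text on every scrape and stores each line as a time series, so a counter that keeps climbing becomes a graphable request rate.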

Setting up Grafana. Prometheus stores metrics. Grafana visualises them:

grafana:
  image: grafana/grafana
  ports: ["3000:3000"]

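Open localhost:3000 and add Prometheus as a data source (from inside Docker Compose its URL is http://prometheus:9090). Then import dashboard ID 1860, "Node Exporter Full", for an instant system dashboard. That dashboard shows host metrics, so it needs node_exporter running and scraped by Prometheus.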

The four golden signals (Google SRE book):

1. Latency: how long requests take (p50, p95, p99)

2. Traffic: requests per second

3. Errors: percentage of requests failing

4. Saturation: how full is the system (CPU, memory, queue depth)

If all four golden signals are healthy, your service is almost certainly healthy from the user's point of view.
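As a rough sketch of what those four numbers mean, the snippet below computes them from a batch of request records. In practice Prometheus derives them from scraped metrics; the `Request` record, the `golden_signals` helper, and the sample data are hypothetical, and the CPU figure is simply passed in as a stand-in for saturation.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, cpu_percent):
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p95_ms": percentile(durations, 95),
        "latency_p99_ms": percentile(durations, 99),
        "traffic_rps": len(requests) / window_seconds,
        "error_percent": 100 * errors / len(requests),
        "saturation_cpu_percent": cpu_percent,
    }


# 100 requests in a 10-second window: 95 fast successes, 5 slow failures
sample = [Request(20.0, 200)] * 95 + [Request(900.0, 500)] * 5
signals = golden_signals(sample, window_seconds=10, cpu_percent=72.0)
print(signals)
```

With this sample, p50 and p95 both look fine at 20 ms and only p99 exposes the 900 ms requests, which is why it is worth watching more than one percentile.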

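Setting up alerts in Prometheus. Save the rule in a file listed under rule_files in prometheus.yml. Prometheus only evaluates the rule; sending the alert to a pager or chat is Alertmanager's job:

groups:
  - name: myapp
    rules:
      - alert: HighErrorRate
        # Share of all requests answered with a 5xx status over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        annotations:
          summary: "Error rate above 5 percent for 2 minutes"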
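The `for: 2m` clause is what keeps a single bad scrape from paging anyone: the condition has to stay true for the whole period before the alert fires. Here is a simplified sketch of that state machine. The `alert_states` function and the sample values are assumptions for illustration, and real Prometheus evaluation has more details than this.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each evaluation.

    samples: list of (timestamp_seconds, value), oldest first.
    An alert is 'pending' while the condition holds but has not yet
    held for for_seconds, and 'firing' once it has.
    """
    states = []
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            held_for = ts - pending_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            pending_since = None  # any healthy evaluation resets the clock
            states.append("inactive")
    return states


# Hypothetical error ratios, evaluated every 30 seconds
values = [0.01, 0.08, 0.09, 0.02, 0.07, 0.07, 0.08, 0.09, 0.10]
samples = list(zip(range(0, 270, 30), values))
states = alert_states(samples, threshold=0.05, for_seconds=120)

for (ts, value), state in zip(samples, states):
    print(f"t={ts:>3}s value={value:.2f} {state}")
```

The short spike at 30 to 60 seconds never gets past pending, while the sustained problem starting at 120 seconds fires once it has lasted two minutes.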

Uptime Kuma for smaller teams. A single Docker container that monitors URLs and sends WhatsApp or Telegram alerts when they go down:

docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1

Add your endpoints and you have a basic status page plus alerting in 5 minutes. Free, self-hosted, excellent for Indian startups.

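Structured logging makes logs machine-readable:

import structlog

# Render every event as one JSON object per line
structlog.configure(processors=[
    structlog.processors.add_log_level,
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.JSONRenderer(),
])
log = structlog.get_logger()

# user_id, amount and elapsed come from the surrounding payment handler
log.info("payment_processed",
         user_id=user_id,
         amount=amount,
         currency="INR",
         duration_ms=elapsed)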

Structured logs let you search by user_id, filter by amount, or count payments by currency. Never use print() in production.
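Here is a small sketch of what "searchable" means in practice, using only the standard library. The log lines are made-up sample data in the one-JSON-object-per-line shape; a real setup would usually run these queries in a log platform rather than in a script.

```python
import json
from collections import Counter

# Hypothetical JSON log lines, one event per line
raw_logs = """\
{"event": "payment_processed", "user_id": 42, "amount": 1500, "currency": "INR", "duration_ms": 120}
{"event": "payment_processed", "user_id": 7, "amount": 250, "currency": "USD", "duration_ms": 95}
{"event": "payment_failed", "user_id": 42, "amount": 9000, "currency": "INR", "duration_ms": 3010}
{"event": "payment_processed", "user_id": 42, "amount": 300, "currency": "INR", "duration_ms": 88}
"""

events = [json.loads(line) for line in raw_logs.splitlines() if line.strip()]

# Search by user_id
by_user = [e for e in events if e["user_id"] == 42]

# Filter by amount
large = [e for e in events if e["amount"] >= 1000]

# Count successful payments by currency
per_currency = Counter(
    e["currency"] for e in events if e["event"] == "payment_processed"
)

print(len(by_user), len(large), dict(per_currency))
```

None of these queries would be reliable against free-form print() output, because every field would have to be fished out with string matching.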

Priya never had another mystery 1am incident. The next time the pager went off, she could see within 30 seconds exactly what was wrong: which service, which endpoint, and when it started.

Key takeaways

Observability has three pillars: metrics (numbers), logs (events), traces (request paths)

One pip install adds automatic Prometheus metrics to FastAPI — request count, duration, error rate

The four golden signals: latency, traffic, errors, saturation — monitor all four for any service

Uptime Kuma is the fastest path to monitoring plus WhatsApp alerts for small teams

Structured logging (JSON key-value pairs) makes logs searchable and parseable by machines

Commands from this chapter

$ pip install prometheus-fastapi-instrumentator

Install the library that adds a /metrics endpoint to FastAPI in one line

$ docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1

Start Uptime Kuma monitoring dashboard

$ rate(http_requests_total{status=~"5.."}[5m])

Prometheus: per-second rate of 5xx responses over the last 5 minutes

$ histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Prometheus: calculate p95 latency

$ docker compose up prometheus grafana

Start full monitoring stack
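For the two PromQL queries in the list above, here is a simplified sketch of the arithmetic behind rate() and histogram_quantile(). Both functions are illustrative assumptions, not Prometheus's implementation: the real rate() also handles counter resets and extrapolates to the window edges, and the bucket counts are made-up sample data.

```python
def simple_rate(samples):
    """Per-second increase of a counter across the window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplified: ignores counter resets and window-edge extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def bucket_quantile(q, buckets):
    """Estimate a quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    bound and ending with float("inf"). Values are assumed to be spread
    evenly inside each bucket (linear interpolation).
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Quantile lies beyond the last finite bucket
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical counter samples: 30 extra 5xx responses over 300 seconds
error_rate = simple_rate([(0, 100), (60, 112), (300, 130)])

# Hypothetical latency buckets: 100 requests, 98 of them under 1 second
latency_buckets = [(0.1, 50), (0.25, 80), (0.5, 90), (1.0, 98), (float("inf"), 100)]
p95 = bucket_quantile(0.95, latency_buckets)

print(f"5xx rate: {error_rate:.2f}/s, p95 latency: {p95:.4f}s")
```

Because the quantile is interpolated inside a bucket, its accuracy depends on how well the bucket boundaries match your real latencies.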