Priya makes the invisible visible
Prometheus, Grafana and alerting — know your system is healthy before users complain
Priya joined the SRE team after 3 years as a backend developer. Her first week she was handed a pager. Her second week it went off at 1am. The app was slow. Users were complaining. And nobody knew why because there were no dashboards, no metrics, no logs — nothing.
She fixed the immediate issue by restarting services. But she vowed it would never happen again.
Three pillars of observability:
- Metrics: numbers over time (CPU percent, request rate, error rate)
- Logs: text records of events (errors, warnings, info messages)
- Traces: the path of a single request through your whole system
Metrics tell you something is wrong. Logs tell you what went wrong. Traces tell you where.
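To make the third pillar concrete, here is a minimal sketch of a trace as data: a handful of timed spans that share one trace ID. The `Span` class, the service names, and the timings are all made up for illustration, and the spans are treated as simple sequential steps. Real tracing systems record nested spans and much more context.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str    # shared by every span that belongs to one request
    name: str        # the service or operation that did the work
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_span(spans):
    """Return the span that took the longest: the 'where' of a slow request."""
    return max(spans, key=lambda span: span.duration_ms)


# Hypothetical spans for a single checkout request
trace = [
    Span("abc123", "api-gateway", 0, 12),
    Span("abc123", "auth-service", 12, 30),
    Span("abc123", "payments-db-query", 30, 930),
]
print(slowest_span(trace).name)  # payments-db-query
```

A metric would only show that checkout latency went up. The trace points at the database query as the place to look.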
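Setting up Prometheus. Prometheus scrapes your app's /metrics endpoint on a fixed interval, set here to every 15 seconds (the default is one minute):
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ['myapp:8080']
In your FastAPI app, one pip install adds automatic metrics:
pip install prometheus-fastapi-instrumentator

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
# Instrument every route and expose the results at /metrics
Instrumentator().instrument(app).expose(app)
That one line gives you request count, request duration, and request and response sizes automatically. Tracking of in-progress (active) requests is available too, through the should_instrument_requests_inprogress option.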
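For a sense of what Prometheus actually reads on each scrape, the sketch below renders a counter in the Prometheus text exposition format using only the standard library. The `render_counter` helper, the label set, and the sample values are hypothetical. In a real app the instrumentator produces this output for you.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Label values are
    written as-is; escaping of quotes and backslashes is left out here.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Made-up sample values for illustration
body = render_counter(
    "http_requests_total",
    "Total number of HTTP requests.",
    [
        ({"handler": "/pay", "method": "POST", "status": "2xx"}, 1027),
        ({"handler": "/pay", "method": "POST", "status": "5xx"}, 3),
    ],
)
print(body)
```

Each line is one time series: a metric name, a set of labels, and the current value. Prometheus stores a new sample for every series on every scrape.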
Setting up Grafana. Prometheus stores metrics. Grafana visualises them:
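# docker-compose.yml (add under services:)
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
Open localhost:3000, add Prometheus as a data source (from inside the Compose network its URL is http://prometheus:9090), and import dashboard ID 1860. That dashboard is Node Exporter Full, so it fills in once Prometheus also scrapes a node_exporter target. With that in place you have an instant system dashboard.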
The four golden signals (Google SRE book):
1. Latency: how long requests take (p50, p95, p99)
2. Traffic: requests per second
3. Errors: percentage of requests failing
4. Saturation: how full is the system (CPU, memory, queue depth)
If all four golden signals are healthy, your service is very likely healthy from the user's point of view.
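The four signals can be sketched as plain arithmetic over a window of requests. This is only an illustration with made-up data: the `golden_signals` and `percentile` helpers are hypothetical, and Prometheus itself estimates percentiles from histogram buckets rather than from raw samples like this.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, cpu_percent):
    """Summarise one time window.

    requests is a list of (duration_ms, status_code) pairs.
    """
    durations = [duration for duration, _ in requests]
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p95_ms": percentile(durations, 95),
        "latency_p99_ms": percentile(durations, 99),
        "traffic_rps": len(requests) / window_seconds,
        "error_percent": 100 * errors / len(requests),
        "saturation_cpu_percent": cpu_percent,
    }


# Made-up data: 100 requests in a 10-second window, 2 of them failing slowly
sample = [(20, 200)] * 90 + [(150, 200)] * 8 + [(900, 500)] * 2
signals = golden_signals(sample, window_seconds=10, cpu_percent=63.0)
print(signals)
```

In this sample the median is a comfortable 20 ms while p99 is 900 ms. That gap is why the percentiles are worth tracking separately: an average would hide the slow tail.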
Setting up alerts in Prometheus:
# alert_rules.yml (example file name; list it under rule_files in prometheus.yml)
groups:
  - name: myapp
    rules:
      - alert: HighErrorRate
        # Share of all requests that returned a 5xx status over the last 5 minutes
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        annotations:
          summary: "Error rate above 5 percent for 2 minutes"
Uptime Kuma for smaller teams. A single Docker container that monitors URLs and sends WhatsApp or Telegram alerts when they go down:
docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1
Add your endpoints and you have a basic status page plus alerting in 5 minutes. Free, self-hosted, excellent for Indian startups.
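Under the hood, an uptime monitor repeats a very simple check. The sketch below shows roughly what that check looks like with only the standard library. It is not how Uptime Kuma is implemented: `check_url` and `run_checks` are hypothetical helpers, and the `opener`, `notify`, and `check` parameters exist only so the logic can be exercised without a network.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (is_up, detail) for one URL. Any 2xx or 3xx response counts as up."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            return 200 <= status < 400, f"HTTP {status}"
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx and 5xx responses
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers DNS failures, refused connections, and timeouts
        return False, f"unreachable: {exc}"


def run_checks(urls, notify=print, check=check_url):
    """Check each URL once and call notify() for every one that is down."""
    down = []
    for url in urls:
        is_up, detail = check(url)
        if not is_up:
            notify(f"DOWN {url} ({detail})")
            down.append(url)
    return down
```

Calling `run_checks(["https://example.com/health"])` on a timer would give a bare-bones monitor (the URL is a placeholder). A real tool adds scheduling, retries, a status page, and notification channels on top of this loop.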
Structured logging makes logs machine-readable:
import structlog

# Render each event as one JSON object per line
structlog.configure(processors=[structlog.processors.JSONRenderer()])
log = structlog.get_logger()

log.info("payment_processed",
         user_id=user_id,
         amount=amount,
         currency="INR",
         duration_ms=elapsed)
Structured logs let you search by user_id, filter by amount, or count payments by currency. Never use print() in production.
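Here is a small sketch of why that matters, using only the standard library. The log lines are made up, and they assume the logger is writing one JSON object per line. Once logs look like this, each of those searches is a one-liner.

```python
import json
from collections import Counter

# Made-up log lines, one JSON object per line
raw_logs = """\
{"event": "payment_processed", "user_id": 42, "amount": 499.0, "currency": "INR", "duration_ms": 120}
{"event": "payment_processed", "user_id": 7, "amount": 25.0, "currency": "USD", "duration_ms": 95}
{"event": "payment_processed", "user_id": 42, "amount": 1500.0, "currency": "INR", "duration_ms": 310}
"""

records = [json.loads(line) for line in raw_logs.splitlines() if line.strip()]

by_user = [r for r in records if r["user_id"] == 42]      # search by user_id
large = [r for r in records if r["amount"] > 1000]        # filter by amount
per_currency = Counter(r["currency"] for r in records)    # count payments by currency

print(len(by_user), len(large), dict(per_currency))  # 2 1 {'INR': 2, 'USD': 1}
```

Doing the same against free-form print() output would mean writing a fragile regular expression for every question.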
Priya never had another mystery 1am incident. The next time the pager went off she could see in 30 seconds exactly what was wrong, which service, which endpoint, when it started.
Key takeaways:
- Observability has three pillars: metrics (numbers), logs (events), traces (request paths)
- One pip install adds automatic Prometheus metrics to FastAPI: request count, duration, error rate
- The four golden signals are latency, traffic, errors, and saturation. Monitor all four for any service
- Uptime Kuma is the fastest path to monitoring plus WhatsApp alerts for small teams
- Structured logging (JSON key-value pairs) makes logs searchable and parseable by machines