The 3am alert nobody wants
SRE, observability, Prometheus and Grafana — knowing your system is alive
Vikram's phone buzzed at 3:17am. PagerDuty. The payment service error rate had jumped from 0.1% to 23% in the last 5 minutes. 12,000 users couldn't check out.
He had 15 minutes to diagnose and fix before his manager's manager woke up.
This is the life of an SRE — Site Reliability Engineer. And the only way to survive it is to have excellent observability.
What is observability?
Observability is knowing what your system is doing at any moment. Three pillars:
Metrics — numbers over time. CPU usage, error rate, response time, active users. Tells you *what* is happening.
Logs — text records of events. "User 12345 tried to pay, got error 500 at 03:17:23". Tells you *what happened*.
Traces — the journey of one request through your system. User clicked "Pay" → hit load balancer → hit payment service → called bank API → returned. Tells you *where* the slowdown is.
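To make the three pillars concrete, here is a minimal sketch of one payment request producing all three signals. Everything in it is hypothetical: `handle_payment`, the in-memory `metrics`, `logs` and `spans` stores, and the fake bank call stand in for a real metrics library, log pipeline and tracing backend.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory stores standing in for a metrics backend,
# a log pipeline, and a tracing backend.
metrics = Counter()
logs = []
spans = []


def record_span(trace_id, name, start, end):
    """Trace: one timed step of a single request."""
    spans.append({"trace_id": trace_id, "name": name,
                  "duration_ms": round((end - start) * 1000, 3)})


def handle_payment(user_id, bank_ok):
    trace_id = uuid.uuid4().hex  # ties every step of this request together
    start = time.perf_counter()

    bank_start = time.perf_counter()
    status = 200 if bank_ok else 500  # stand-in for the real bank API call
    record_span(trace_id, "bank_api", bank_start, time.perf_counter())

    # Metric: a number aggregated over many requests.
    metrics[("http_requests_total", str(status))] += 1

    # Log: a record of this one event.
    logs.append(json.dumps({"trace_id": trace_id, "user_id": user_id,
                            "status": status, "event": "payment_attempt"}))

    record_span(trace_id, "handle_payment", start, time.perf_counter())
    return status


handle_payment(12345, bank_ok=False)
handle_payment(67890, bank_ok=True)
print(dict(metrics))
print(logs[0])
print([s["name"] for s in spans])
```

The shared `trace_id` is what lets you jump from a spike on a graph to the log lines and spans of one failing request.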
Prometheus + Grafana — the most popular observability stack
Prometheus scrapes metrics from your services at a configured interval (every 15 seconds in the config below) and stores them as time series. Your app exposes a `/metrics` endpoint; Prometheus pulls from it.
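What does a `/metrics` endpoint actually return? Plain text, one sample per line. The sketch below renders a counter in the Prometheus text exposition format using only the Python standard library. The counter values and the `render_metrics` and `MetricsHandler` names are made up for illustration; a real service would more likely use an official client library than hand-roll this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counter values; a real service would increment these per request.
REQUEST_COUNTS = {("POST", "200"): 9977, ("POST", "500"): 23}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    print(render_metrics(REQUEST_COUNTS), end="")
    # To serve it on port 8080 instead of printing:
    # HTTPServer(("", 8080), MetricsHandler).serve_forever()
```

Counters only ever go up. Prometheus stores the raw totals and works out rates at query time, which is why the app never has to compute "errors per second" itself.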
Grafana connects to Prometheus and draws beautiful dashboards. Every metric becomes a graph. You can see your entire system at a glance.
```yaml
# prometheus.yml: tells Prometheus what to scrape
scrape_configs:
  - job_name: 'payment-service'
    static_configs:
      - targets: ['payment-service:8080']
    scrape_interval: 15s
```
Setting up alerts so Prometheus wakes you (not your manager)
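```yaml
# alert_rules.yml
groups:
  - name: payment_alerts
    rules:
      - alert: HighErrorRate
        # Share of 5xx responses among all responses over the last 5 minutes.
        # A bare rate() is requests per second, not a percentage.
        expr: |
          sum(rate(http_requests_total{job="payment-service", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="payment-service"}[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Payment service error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }}"
```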
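Two ideas in that rule are worth spelling out: an error rate is the share of failed requests in a time window, and `for: 2m` means the condition must stay true for two minutes before anyone gets paged. The Python sketch below imitates both under simplifying assumptions. It is not how Prometheus is implemented: it ignores counter resets, and the function names and sample numbers are invented.

```python
def error_ratio(samples):
    """Fraction of requests that failed between the first and last sample.

    Each sample is (total_requests, error_requests), both cumulative counters.
    Counter resets (for example after a restart) are not handled here.
    """
    (total_start, errors_start), (total_end, errors_end) = samples[0], samples[-1]
    total_delta = total_end - total_start
    if total_delta <= 0:
        return 0.0  # no traffic in the window, so nothing to alert on
    return (errors_end - errors_start) / total_delta


def alert_state(evaluations, threshold=0.05, for_seconds=120):
    """Return 'inactive', 'pending' or 'firing' after a series of evaluations.

    `evaluations` is a list of (timestamp_seconds, ratio). The alert goes
    pending when the ratio first exceeds the threshold and fires only once
    it has stayed above it for `for_seconds`.
    """
    active_since = None
    state = "inactive"
    for timestamp, ratio in evaluations:
        if ratio > threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            active_since = None
            state = "inactive"
    return state


# 1,000 requests in the window, 230 of them errors.
ratio = error_ratio([(100_000, 100), (101_000, 330)])
print(f"error ratio: {ratio:.2%}")
print(alert_state([(0, ratio), (60, ratio), (120, ratio)]))
```

The pending stage is what keeps a ten-second blip from waking anyone up. The trade-off is that a real outage takes at least two extra minutes to page.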
SLOs — how reliable should you actually be?
SLO = Service Level Objective. A promise to yourself and your users about reliability.
Common SLOs:
- 99.9% uptime = allowed to be down about 8.8 hours per year
- 99.99% uptime = allowed to be down about 53 minutes per year
- p95 response time < 200ms = 95% of requests return in under 200ms
SLOs force honest conversations. "We need 99.999% uptime" sounds good until you calculate that it leaves you only about 5 minutes of downtime per year, and that getting there costs ₹50 lakh extra per year.
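The downtime figures above are plain arithmetic: take the fraction of the year you are allowed to be down and multiply by the length of a year. Here is a small sketch, assuming a 365-day year (the helper name is made up):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Each extra nine cuts the budget by a factor of ten: roughly 8.76 hours, then 52.6 minutes, then 5.3 minutes per year.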
What Vikram found at 3:17am
He opened Grafana. The dashboard showed:
- Payment service error rate: 23% (normal: 0.1%)
- Response time: 8 seconds (normal: 120ms)
- Database connection pool: 100% utilized
The new code deployed at 2am had a query that ran without an index. Under load, every call scanned the whole table, the slow queries held on to their connections, and the pool ran dry.
```sql
-- The slow query (no index on user_id)
SELECT * FROM transactions WHERE user_id = 12345 ORDER BY created_at DESC;
-- The fix: add the missing index
CREATE INDEX CONCURRENTLY idx_transactions_user_id ON transactions(user_id);
```
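You can watch an index change a query plan without touching production. The sketch below uses Python's built-in `sqlite3` module as a stand-in. That is a substitution: `CREATE INDEX CONCURRENTLY` in the fix above is PostgreSQL syntax, and SQLite's planner and plan wording differ from PostgreSQL's `EXPLAIN`. The table columns other than `user_id` and `created_at` are assumptions.

```python
import sqlite3


def query_plan(conn, sql, params):
    """Return SQLite's plan for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return "; ".join(row[3] for row in rows)  # column 3 holds the description


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL, created_at TEXT)"
)

SLOW_QUERY = (
    "SELECT * FROM transactions WHERE user_id = ? ORDER BY created_at DESC"
)

# Without an index the only option is to read every row.
plan_before = query_plan(conn, SLOW_QUERY, (12345,))

conn.execute("CREATE INDEX idx_transactions_user_id ON transactions(user_id)")

# With the index the planner can jump straight to the matching rows.
plan_after = query_plan(conn, SLOW_QUERY, (12345,))

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan reports a scan of the whole table; the second reports a search using `idx_transactions_user_id`. Running the equivalent `EXPLAIN` in a staging database before a deploy is a cheap way to catch this class of bug before 2am.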
By 3:31am, error rate was back to 0.1%. 14 minutes total.
Without Grafana, he would have been SSHing into 20 servers trying to find the problem. With observability, the problem was obvious in 4 minutes.
Vikram went back to sleep. His manager's manager never woke up.
- Observability = metrics (what) + logs (what happened) + traces (where)
- Prometheus collects metrics; Grafana displays them as dashboards
- Set alerts on metrics so systems wake you, not your manager
- SLOs are honest targets: define them before incidents, not after