DevOps Ch 7 / 10 Expert

🚨

The 3am alert nobody wants

SRE, observability, Prometheus and Grafana — knowing your system is alive

⏱ 15 min 3 commands 4 takeaways

In this chapter

Vikram

SRE at a Bengaluru unicorn

The story

Vikram's phone buzzed at 3:17am. PagerDuty. The payment service error rate had jumped from 0.1% to 23% in the last 5 minutes. 12,000 users couldn't check out.

He had 15 minutes to diagnose and fix before his manager's manager woke up.

This is the life of an SRE — Site Reliability Engineer. And the only way to survive it is to have excellent observability.

What is observability?

Observability is knowing what your system is doing at any moment. Three pillars:

Metrics — numbers over time. CPU usage, error rate, response time, active users. Tells you *what* is happening.

Logs — text records of events. "User 12345 tried to pay, got error 500 at 03:17:23". Tells you *what happened*.

Traces — the journey of one request through your system. User clicked "Pay" → hit load balancer → hit payment service → called bank API → returned. Tells you *where* the slowdown is.
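To make the three signals concrete, here is a minimal Python sketch in which one request produces all three. The `handle_payment` function, the field names and the in-memory stores are assumptions for illustration; a real service would send these to a metrics client, a logging pipeline and a tracing library instead.

```python
import json
import time
import uuid
from collections import Counter

# In-memory stand-ins for real backends (assumptions for illustration only).
metrics = Counter()   # metric: a number that changes over time
logs = []             # log: one text record per event
spans = []            # trace: timed steps that share one trace_id


def handle_payment(user_id, succeed=True):
    """Hypothetical request handler that emits a metric, a log and a span."""
    trace_id = uuid.uuid4().hex
    start = time.monotonic()
    status = "200" if succeed else "500"

    # Metric: count requests by status so an error rate can be derived later.
    metrics[("http_requests_total", status)] += 1

    # Log: a structured record of this one event.
    logs.append(json.dumps({
        "trace_id": trace_id,
        "user_id": user_id,
        "event": "payment_attempt",
        "status": status,
    }))

    # Trace span: how long this step of the request took.
    spans.append({
        "trace_id": trace_id,
        "name": "payment-service",
        "duration_s": time.monotonic() - start,
    })
    return trace_id


trace_id = handle_payment(12345, succeed=False)
print(metrics[("http_requests_total", "500")])
```

The shared `trace_id` is what lets you jump from a spike on a graph to the log lines and spans for one failing request.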

Prometheus + Grafana — the most popular observability stack

Prometheus scrapes metrics from your services at a fixed interval (every 15 seconds in the config below) and stores them as time series. Your app exposes a `/metrics` endpoint; Prometheus collects from it.
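What a `/metrics` endpoint returns is plain text. The sketch below renders it by hand using only the Python standard library, so the format is visible; in practice you would more likely use a Prometheus client library. The counter values and the `MetricsHandler` class are hypothetical.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical request counts keyed by HTTP status (illustrative values).
REQUEST_COUNTS = {"200": 1027, "500": 3}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for status, value in sorted(counts.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered text on GET /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


print(render_metrics(REQUEST_COUNTS), end="")
```

To try it locally, you could pass `MetricsHandler` to `http.server.HTTPServer(("", 8080), MetricsHandler)` and call `serve_forever()`; the sketch leaves the server unstarted so it can be run without blocking.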

Grafana connects to Prometheus and draws beautiful dashboards. Every metric becomes a graph. You can see your entire system at a glance.

```yaml
# prometheus.yml: tells Prometheus what to scrape
scrape_configs:
  - job_name: 'payment-service'
    # How often Prometheus pulls /metrics from this job's targets
    scrape_interval: 15s
    static_configs:
      - targets: ['payment-service:8080']
```

Setting up alerts so Prometheus wakes you (not your manager)

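```yaml
# alert_rules.yml
groups:
  - name: payment_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{job="payment-service", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="payment-service"}[5m])) > 0.05
        # Only fire if the condition holds for 2 minutes straight
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Payment service error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }}"
```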
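The rule's two moving parts are a threshold (error rate above 5%) and a duration (`for: 2m`). The Python sketch below imitates both on hand-made counter samples. It is a simplification under stated assumptions: the sample tuples are hypothetical, and real Prometheus works on per-second rates over a range and handles counter resets, which this does not.

```python
def error_ratio(prev, curr):
    """Share of requests that failed between two counter samples.

    Each sample is (timestamp_s, total_requests, error_requests), where the
    two counts only ever increase, as Prometheus counters do.
    """
    total_delta = curr[1] - prev[1]
    error_delta = curr[2] - prev[2]
    if total_delta <= 0:
        return 0.0  # no traffic in the window, so nothing to alert on
    return error_delta / total_delta


def should_fire(samples, threshold=0.05, for_seconds=120):
    """Return True once the ratio has stayed above threshold for for_seconds.

    Mirrors the idea behind `for: 2m`: a breach must persist before the
    alert fires, so a single bad scrape does not page anyone.
    """
    breach_started = None
    for prev, curr in zip(samples, samples[1:]):
        if error_ratio(prev, curr) > threshold:
            if breach_started is None:
                breach_started = curr[0]
            if curr[0] - breach_started >= for_seconds:
                return True
        else:
            breach_started = None  # condition cleared, reset the timer
    return False


# Hypothetical one-minute samples: healthy at first, then 23% errors.
samples = [
    (0, 1000, 1),
    (60, 2000, 2),
    (120, 3000, 232),
    (180, 4000, 462),
    (240, 5000, 692),
]
print(should_fire(samples))
```

With these samples the breach starts at the 120-second mark and the alert fires only at 240 seconds, once it has persisted for the full two minutes.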

SLOs — how reliable should you actually be?

SLO = Service Level Objective. A promise to yourself and your users about reliability.

Common SLOs:

- 99.9% uptime = allowed to be down about 8.76 hours per year

- 99.99% uptime = allowed to be down about 52.6 minutes per year

- p95 response time < 200ms = 95% of requests return in under 200ms
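For the latency SLO, a p95 can be computed from raw samples with the nearest-rank method, as in this short sketch. The function names and the sample latencies are assumptions; Prometheus itself would typically estimate the percentile from histogram buckets using `histogram_quantile` instead of storing every request.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100.

    Returns the smallest value that has at least p% of the data at or below it.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integers
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(latencies_ms, p=95, limit_ms=200):
    """True if the p-th percentile latency is under the limit."""
    return percentile(latencies_ms, p) < limit_ms


# Hypothetical samples: one slow request out of twenty still meets the SLO.
latencies_ms = [120] * 19 + [900]
print(percentile(latencies_ms, 95), meets_latency_slo(latencies_ms))
```

This is why a p95 target tolerates a few slow outliers but not a slow tail: two slow requests out of twenty would already push the p95 up to the slow value.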

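SLOs force honest conversations. "We need 99.999% uptime" sounds good until you calculate what it means: only about 5 minutes of downtime allowed per year. If moving from 99.99% to 99.999% costs ₹50 lakh extra per year, you are paying that for roughly 47 more minutes of uptime.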
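The downtime figures follow directly from the percentage. A small sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(slo_percent):
    """Minutes per year a service may be down and still meet the SLO."""
    return MINUTES_PER_YEAR * (100 - slo_percent) / 100


for slo in (99.9, 99.99, 99.999):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min/year")
```

That works out to roughly 8.76 hours, 52.6 minutes and 5.3 minutes per year respectively, so each extra nine cuts the downtime budget by a factor of ten.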

What Vikram found at 3:17am

He opened Grafana. The dashboard showed:

- Payment service error rate: 23% (normal: 0.1%)

- Response time: 8 seconds (normal: 120ms)

- Database connection pool: 100% utilized

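The new code deployed at 2am had a query that ran without an index. Under load, every call scanned the whole table and held its database connection for seconds, until the connection pool was exhausted and requests started failing.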
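Why a slow query fills the connection pool can be estimated with Little's law: average connections in use equals request rate times how long each request holds a connection. The request rate and pool size below are hypothetical, since the chapter does not state them, and the sketch assumes a request holds its connection for the whole response time.

```python
def connections_in_use(requests_per_second, hold_time_ms):
    """Little's law: average concurrency = arrival rate x time in system."""
    return requests_per_second * hold_time_ms / 1000


POOL_SIZE = 100            # hypothetical pool limit
REQUESTS_PER_SECOND = 50   # hypothetical steady load

normal = connections_in_use(REQUESTS_PER_SECOND, 120)     # 120 ms responses
degraded = connections_in_use(REQUESTS_PER_SECOND, 8000)  # 8 s responses
print(normal, degraded, degraded > POOL_SIZE)
```

Under these assumptions the same traffic needs 6 connections at 120ms but 400 at 8 seconds, far more than the pool can supply, which matches a dashboard stuck at 100% pool utilization.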

```sql

-- The slow query (no index on user_id)

SELECT * FROM transactions WHERE user_id = 12345 ORDER BY created_at DESC;

-- The fix: add the missing index

CREATE INDEX CONCURRENTLY idx_transactions_user_id ON transactions(user_id);

```
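You can see the effect of such an index by asking the database for its query plan before and after. The sketch below uses Python's built-in `sqlite3` as a stand-in, which is an assumption: the `CONCURRENTLY` option in the fix points to PostgreSQL, where you would run `EXPLAIN` on the real database. The table columns other than `user_id` and `created_at` are invented for the example.

```python
import sqlite3

QUERY = (
    "SELECT * FROM transactions WHERE user_id = 12345 "
    "ORDER BY created_at DESC"
)


def query_plan(conn, sql):
    """Return SQLite's plan for sql as one string (detail is the 4th column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL, created_at TEXT)"
)

plan_before = query_plan(conn, QUERY)  # no index yet: a full table scan

# SQLite has no CONCURRENTLY option, so this is a plain CREATE INDEX.
conn.execute("CREATE INDEX idx_transactions_user_id ON transactions(user_id)")

plan_after = query_plan(conn, QUERY)   # now the plan can use the index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording varies between SQLite versions, but the first plan reports a scan of `transactions` and the second names `idx_transactions_user_id`.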

By 3:31am, error rate was back to 0.1%. 14 minutes total.

Without Grafana, he would have been SSHing into 20 servers trying to find the problem. With observability, the problem was obvious in 4 minutes.

Vikram went back to sleep. His manager's manager never woke up.

Key takeaways

Observability = metrics (what) + logs (what happened) + traces (where)

Prometheus collects metrics; Grafana displays them as dashboards

Set alerts on metrics so systems wake you, not your manager

SLOs are honest targets — define them before incidents, not after

Commands from this chapter

$ curl localhost:9090/metrics

View the raw metrics Prometheus exposes about itself (use your service's port, such as 8080, to see app metrics)

$ kubectl top pods

See CPU/memory usage of pods (requires metrics-server in the cluster)

$ kubectl logs -f pod-name

Follow live logs from a pod