Linux for Production Support Ch 30 / 32 Advanced

Priya writes the postmortem that prevents the next incident

Timeline reconstruction, 5 whys, action items — turning incidents into improvements

In this chapter

Priya

SRE writing her first proper postmortem

The story

The payment service went down for 47 minutes on a Friday evening. 3,200 transactions failed. The company lost revenue and customer trust. The on-call engineer fixed it, wrote "restarted the service" in the incident ticket, and went to bed.

Monday morning, the same failure happened again. Different engineer, same fix, same result. The root cause was never found because nobody wrote a proper postmortem.

Priya's manager called a meeting. Going forward, every incident over 15 minutes gets a postmortem. Not to blame people. To prevent the same thing from happening again.

WHAT IS A POSTMORTEM

A postmortem is a structured document that answers:

1. What happened?

2. Why did it happen?

3. How did we respond?

4. What do we fix so it never happens again?

It is not a blame document. It is a system improvement document. The goal is to make the system more resilient, not to punish the engineer who was on-call.

RECONSTRUCTING THE TIMELINE

Before you can write anything, you need to know exactly what happened and when. Pull logs from all systems involved.

# Payment service logs around the incident time:
journalctl -u payment-service --since "2026-03-16 18:00" --until "2026-03-16 19:30" > /tmp/payment-logs.txt

# All errors across all services in the window:
journalctl --since "2026-03-16 18:00" --until "2026-03-16 19:30" -p err > /tmp/all-errors.txt

# What changed? Git commits deployed that day:
git log --since="2026-03-16 00:00" --until="2026-03-16 18:00" --oneline

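# Database slow queries during the window, slowest first
# (sorting on the extracted duration; sort -t: -k3 would sort on the seconds of the timestamp):
grep "2026-03-16 18" /var/log/postgresql/postgresql.log | grep "duration:" | sed 's/.*duration: \([0-9.]*\) ms.*/\1 &/' | sort -rn | head -20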

# Load balancer access log: when did traffic drop?
# (nginx's default log format writes timestamps as 16/Mar/2026:18:14:02, not 2026-03-16 18:14:02)
grep -c "16/Mar/2026:18:" /var/log/nginx/access.log
# Compare to a normal hour:
grep -c "16/Mar/2026:17:" /var/log/nginx/access.log
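
Hourly totals show that traffic dropped, but not the minute it started. A minimal sketch of a per-minute breakdown follows, assuming nginx's default combined log format (timestamp in the fourth field, like `[16/Mar/2026:18:14:02`). The function name `requests_per_minute` and the sample log lines are made up for illustration.

```shell
# requests_per_minute LOGFILE DAY_HOUR
# Counts requests per minute for one hour of an nginx access log.
# Assumes the default combined format, where field 4 is "[16/Mar/2026:18:14:02".
requests_per_minute() {
    awk -v prefix="[$2:" '
        index($4, prefix) == 1 {            # field 4 starts with "[DD/Mon/YYYY:HH:"
            minute = substr($4, 2, 17)      # e.g. "16/Mar/2026:18:14"
            count[minute]++
        }
        END { for (m in count) print m, count[m] }
    ' "$1" | sort
}

# Hypothetical sample data standing in for /var/log/nginx/access.log
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
10.0.0.1 - - [16/Mar/2026:18:13:05 +0000] "POST /pay HTTP/1.1" 200 512
10.0.0.2 - - [16/Mar/2026:18:13:40 +0000] "POST /pay HTTP/1.1" 200 512
10.0.0.1 - - [16/Mar/2026:18:14:02 +0000] "POST /pay HTTP/1.1" 502 0
10.0.0.3 - - [16/Mar/2026:17:59:59 +0000] "POST /pay HTTP/1.1" 200 512
EOF

per_minute=$(requests_per_minute "$sample_log" "16/Mar/2026:18")
echo "$per_minute"
rm -f "$sample_log"
```

On a real log, the minute where the count collapses (or where 5xx responses take over) is a good candidate for the "impact began" line in the timeline.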

FINDING ROOT CAUSE — THE 5 WHYS

The 5 Whys technique: keep asking why until you reach a systemic cause.

Payment service went down (surface symptom).

Why? OOM killer terminated it.

Why? It ran out of memory.

Why? A memory leak in the connection pool.

Why? Connection pool size was unlimited.

Why? Nobody set a max connection pool limit when the service was written.

Immediate cause: no connection pool limit. Immediate fix: add a max pool size and a memory alert.

Priya did not stop at "the engineer forgot to set a limit." Her root cause was systemic: "our service template does not include connection pool configuration, so new services are deployed without it."

LOOKING FOR PATTERNS IN LOGS

# Were there warning signs before the crash?
grep "warn\|WARN" /opt/payment/logs/app.log | tail -100

# Memory growing before OOM?
# If you have metrics: query Prometheus for memory trend in the hour before
# If no metrics: look for GC logs
grep "GC" /opt/payment/logs/app.log | grep "2026-03-16 1[78]:" | tail -30

# Was there unusual traffic? Top client IPs during the incident hour:
grep "16/Mar/2026:18:" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10
# Compare to previous Friday at same time
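
The last 100 warnings show what was logged, but not whether warnings were ramping up before the crash. A minimal sketch that buckets them into 10-minute windows follows, assuming application log lines begin with `YYYY-MM-DD HH:MM:SS` as the grep patterns above suggest. The function name `warnings_per_bucket` and the sample lines are hypothetical.

```shell
# warnings_per_bucket LOGFILE
# Counts WARN lines per 10-minute bucket to show whether warnings ramped up.
# Assumes each line starts with "YYYY-MM-DD HH:MM:SS" (assumed app log format).
warnings_per_bucket() {
    grep -i 'warn' "$1" |
        awk '{ bucket = $1 " " substr($2, 1, 4) "0"; count[bucket]++ }
             END { for (b in count) print b, count[b] }' |
        sort
}

# Hypothetical sample data standing in for /opt/payment/logs/app.log
sample_app_log=$(mktemp)
cat > "$sample_app_log" <<'EOF'
2026-03-16 17:41:03 INFO request ok
2026-03-16 17:53:10 WARN pool size 120
2026-03-16 18:02:44 WARN pool size 480
2026-03-16 18:05:19 WARN pool size 910
2026-03-16 18:09:57 WARN heap usage 91%
EOF

warn_buckets=$(warnings_per_bucket "$sample_app_log")
echo "$warn_buckets"
rm -f "$sample_app_log"
```

A count that climbs from bucket to bucket is a warning sign worth recording in the "Why We Missed It" section.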

THE POSTMORTEM TEMPLATE

# Incident: Payment Service Down — 2026-03-16 18:14 to 18:47

## Summary
Payment service crashed due to memory exhaustion. 3,200 transactions failed.
Duration: 47 minutes. Customer impact: all payments declined.

## Timeline
17:52 — Deployment of v2.4.1 completed (connection pool change merged last week)
18:00 — Memory usage begins rising (visible in Netdata retrospectively)
18:14 — PagerDuty alert fires: payment service down
18:19 — On-call engineer Priya begins investigating
18:23 — Identified OOM kill in kern.log
18:31 — Heap dump analysis shows connection pool growing unboundedly
18:39 — Temporary fix: restarted with JAVA_OPTS="-Xmx2g" memory cap
18:47 — Service restored. Monitoring for stability.
19:30 — Root cause confirmed: missing max pool size configuration

## Root Cause
PR #847 changed the database connection pool implementation but removed the
max_pool_size parameter that had existed in the previous version. No alert
existed for connection pool size. Memory grew unboundedly over 22 minutes.

## Why We Missed It
- PR review did not catch the missing parameter
- No monitoring for connection pool size
- No memory alert existed (only a service-down alert)

## Action Items
[ ] Add max_pool_size=50 to connection config — Vijay — 2026-03-18
[ ] Add Prometheus alert for memory > 80% — Priya — 2026-03-19
[ ] Add connection pool size to standard service template — Rajan — 2026-03-21
[ ] Add memory trend to deployment checklist — team — 2026-03-21
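
Action items written in the layout above can be checked mechanically. The sketch below lists open items that are past their deadline or are missing an owner or date. It assumes the `[ ] task — owner — YYYY-MM-DD` layout from the template. The function name `overdue_items` is hypothetical, and the sample file is a deliberately altered copy of the list above (one item ticked, one stripped of owner and date).

```shell
# overdue_items FILE TODAY
# Prints open action items ("[ ] ...") whose deadline is before TODAY,
# or that lack an owner or a YYYY-MM-DD deadline.
# Assumes the "[ ] task — owner — YYYY-MM-DD" layout used in the template.
overdue_items() {
    grep -F '[ ]' "$1" | awk -F ' — ' -v today="$2" '
        NF < 3 || $NF !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
            print "INCOMPLETE: " $0; next
        }
        ($NF "") < (today "") { print "OVERDUE: " $0 }   # ISO dates compare correctly as strings
    '
}

# Hypothetical variant of the action item list, for illustration only
items_file=$(mktemp)
cat > "$items_file" <<'EOF'
[ ] Add max_pool_size=50 to connection config — Vijay — 2026-03-18
[x] Add Prometheus alert for memory > 80% — Priya — 2026-03-19
[ ] Add connection pool size to standard service template — Rajan — 2026-03-21
[ ] Add memory trend to deployment checklist
EOF

item_report=$(overdue_items "$items_file" "2026-03-20")
echo "$item_report"
rm -f "$items_file"
```

Run against the real postmortem file with today's date (for example `overdue_items postmortem.md "$(date +%F)"`), this gives a quick list to bring to the follow-up review.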

THE COMMANDS FOR EVERY POSTMORTEM

# Exact failure time from logs:
journalctl -u myservice --since "YYYY-MM-DD HH:00" --until "YYYY-MM-DD HH:59" | grep -i "fatal\|crash\|oom\|killed"

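# What the OOM killer killed (kernel messages for the current boot, filtered by time;
# plain dmesg timestamps are seconds since boot, so grepping them for a date finds nothing):
journalctl -k --since "YYYY-MM-DD HH:00" --until "YYYY-MM-DD HH:59" | grep -i "oom\|out of memory\|killed process"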

# What changed that day (deployments):
git log --since="YYYY-MM-DD 00:00" --until="YYYY-MM-DD HH:00" --oneline

# Traffic comparison (was there a spike?). nginx timestamps look like DD/Mon/YYYY:HH:MM:SS:
for HOUR in 15 16 17 18 19; do printf '%s:00  ' "$HOUR"; grep -c "DD/Mon/YYYY:$HOUR:" /var/log/nginx/access.log; done

Priya's postmortem led to 4 action items. All 4 were completed within a week. The same failure never happened again. Six months later, when a different service had a memory leak, the alert fired at 80% memory usage — 20 minutes before it would have crashed — and the engineer fixed it during business hours with no customer impact.

The postmortem practice did not just prevent one failure. It built a culture of improving the system, not blaming the person.

Key takeaways

A postmortem is a system improvement document, not a blame document — the goal is to fix the system, not punish the person

Pull logs from all systems first: journalctl, nginx access logs, git commits, database slow queries

The 5 Whys: keep asking why until you reach a systemic cause, not a human error

Every action item needs an owner and a deadline — postmortems without assigned items change nothing

Traffic comparison (requests this hour vs same hour last week) quickly identifies whether an incident was caused by a spike

Commands from this chapter

$ journalctl --since 'YYYY-MM-DD HH:00' --until 'YYYY-MM-DD HH:59' -p err

Pull all errors in incident window across all services

$ dmesg -T | grep -i 'oom\|killed'

Find OOM killer events — shows what the kernel killed and when

$ git log --since='YYYY-MM-DD 00:00' --oneline

List all commits from that day (run it in the service's repository); recent changes often cause incidents

$ for H in 15 16 17 18 19; do printf '%s:00  ' "$H"; grep -c "DD/Mon/YYYY:$H:" /var/log/nginx/access.log; done

Compare traffic per hour to spot anomalous spikes

$ grep 'WARN\|warn' app.log | tail -100

Find warning signs that appeared before the crash

Count error types to find the most frequent failure pattern
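
For counting error types, one possible sketch is shown below. It assumes log lines of the form `YYYY-MM-DD HH:MM:SS LEVEL message`, which may not match your application. The function name `top_error_types` and the sample lines are hypothetical.

```shell
# top_error_types LOGFILE
# Groups ERROR lines by message (timestamp and level stripped), most frequent first.
# Assumes "YYYY-MM-DD HH:MM:SS LEVEL message" lines (assumed format).
top_error_types() {
    grep ' ERROR ' "$1" |
        cut -d' ' -f4- |                 # drop date, time and level
        sort | uniq -c | sort -rn | head -10
}

# Hypothetical sample data standing in for app.log
error_log=$(mktemp)
cat > "$error_log" <<'EOF'
2026-03-16 18:10:01 ERROR Connection pool exhausted
2026-03-16 18:11:15 ERROR Timeout calling card-gateway
2026-03-16 18:12:30 ERROR Connection pool exhausted
2026-03-16 18:13:59 ERROR Connection pool exhausted
2026-03-16 18:13:59 INFO Health check ok
EOF

top_errors=$(top_error_types "$error_log")
echo "$top_errors"
rm -f "$error_log"
```

Messages that embed request IDs or other unique values will not group well; stripping those with `sed` before `sort` is a common refinement.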