Priya writes the postmortem that prevents the next incident
Timeline reconstruction, 5 whys, action items — turning incidents into improvements
The payment service went down for 47 minutes on a Friday evening. 3,200 transactions failed. The company lost revenue and customer trust. The on-call engineer fixed it, wrote "restarted the service" in the incident ticket, and went to bed.
Monday morning, the same failure happened again. Different engineer, same fix, same result. The root cause was never found because nobody wrote a proper postmortem.
Priya's manager called a meeting. Going forward, every incident over 15 minutes gets a postmortem. Not to blame people. To prevent the same thing from happening again.
WHAT IS A POSTMORTEM
A postmortem is a structured document that answers:
1. What happened?
2. Why did it happen?
3. How did we respond?
4. What do we fix so it never happens again?
It is not a blame document. It is a system improvement document. The goal is to make the system more resilient, not to punish the engineer who was on-call.
RECONSTRUCTING THE TIMELINE
Before you can write anything, you need to know exactly what happened and when. Pull logs from all systems involved.
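The end product of this step is a single chronological view across systems. Below is a minimal sketch of that merge, not a standard tool: `merge_timeline` is a hypothetical helper, and it assumes every line of every saved log file begins with a 19-character `YYYY-MM-DD HH:MM:SS` timestamp. Logs in other timestamp formats would need normalizing first.

```shell
#!/usr/bin/env bash
# merge_timeline is a hypothetical helper written for this example.
# Assumption: every line of every input file starts with a 19-character
# "YYYY-MM-DD HH:MM:SS" timestamp, which sorts chronologically as plain text.
merge_timeline() {
    local file src
    for file in "$@"; do
        src=$(basename "$file")
        src=${src%.*}   # drop the extension: payment.log -> payment
        # Keep the timestamp first and tag the rest of the line with its source.
        awk -v src="$src" '{ print substr($0, 1, 19) " [" src "]" substr($0, 20) }' "$file"
    done | LC_ALL=C sort -s -k1,2
}

# Usage (file names are placeholders):
#   merge_timeline payment.log nginx.log postgres.log > timeline.txt
```

Tagging each line with its source before sorting keeps it clear which system reported what once the lines are interleaved.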
# Payment service logs around the incident time:
journalctl -u payment-service --since "2026-03-16 18:00" --until "2026-03-16 19:30" > /tmp/payment-logs.txt
# All errors across all services in the window:
journalctl --since "2026-03-16 18:00" --until "2026-03-16 19:30" -p err > /tmp/all-errors.txt
# What changed? Git commits deployed that day:
git log --since="2026-03-16 00:00" --until="2026-03-16 18:00" --oneline
# Database slow queries during the window, slowest first
# (assumes log lines contain "duration: N ms"):
grep "2026-03-16 18" /var/log/postgresql/postgresql.log | awk '{for (i = 1; i < NF; i++) if ($i == "duration:") print $(i+1), $0}' | sort -rn | head -20
# Load balancer access log: when did traffic drop?
# (nginx's default log format writes timestamps as [16/Mar/2026:18:14:02 +0000])
grep -c "16/Mar/2026:18:" /var/log/nginx/access.log
# Compare to normal hour:
grep -c "16/Mar/2026:17:" /var/log/nginx/access.log
FINDING ROOT CAUSE — THE 5 WHYS
The 5 Whys technique: keep asking why until you reach a systemic cause.
Payment service went down (surface symptom).
Why? OOM killer terminated it.
Why? It ran out of memory.
Why? A memory leak in the connection pool.
Why? Connection pool size was unlimited.
Why? Nobody set a max connection pool limit when the service was written.
Root cause: no connection pool limit. Fix: add max pool size and add a memory alert.
Priya's root cause was not "the engineer forgot to set a limit." It was "our service template does not include connection pool configuration, so new services are deployed without it."
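The fix above pairs the pool limit with a memory alert. In practice that alert would live in the monitoring system, but the arithmetic behind it is simple. Below is a minimal stand-in sketch, assuming a Linux host with `/proc/meminfo`. The helper names `mem_used_pct` and `check_memory` and the 80% default are illustrative, not taken from any tool.

```shell
#!/usr/bin/env bash
# mem_used_pct: hypothetical helper. Prints used memory as a whole percentage,
# computed as (MemTotal - MemAvailable) / MemTotal from a meminfo-format file.
mem_used_pct() {
    local meminfo=${1:-/proc/meminfo}
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' "$meminfo"
}

# check_memory: prints a status line and returns non-zero when usage
# exceeds the threshold (default 80, an assumed value).
check_memory() {
    local threshold=${1:-80} meminfo=${2:-/proc/meminfo}
    local used
    used=$(mem_used_pct "$meminfo")
    if [ "$used" -gt "$threshold" ]; then
        echo "ALERT: memory at ${used}% (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: memory at ${used}%"
}
```

This measures whole-host memory. A check scoped to a single service would need that process's or cgroup's numbers instead.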
LOOKING FOR PATTERNS IN LOGS
# Were there warning signs before the crash?
grep "warn\|WARN" /opt/payment/logs/app.log | tail -100
# Memory growing before OOM?
# If you have metrics: query Prometheus for memory trend in the hour before
# If no metrics: look for GC logs
grep "GC" /opt/payment/logs/app.log | grep "2026-03-16 1[78]:" | tail -30
# Was there unusual traffic?
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
# Compare to previous Friday at same time
THE POSTMORTEM TEMPLATE
# Incident: Payment Service Down — 2026-03-16 18:14 to 18:47
## Summary
Payment service crashed due to memory exhaustion. 3,200 transactions failed.
Duration: 47 minutes. Customer impact: all payments declined.
## Timeline
17:52 — Deployment of v2.4.1 completed (connection pool change merged last week)
18:00 — Memory usage begins rising (visible in Netdata retrospectively)
18:14 — PagerDuty alert fires: payment service down
18:19 — On-call engineer Priya begins investigating
18:23 — Identified OOM kill in kern.log
18:31 — Heap dump analysis shows connection pool growing unboundedly
18:39 — Temporary fix: restarted with JAVA_OPTS="-Xmx2g" memory cap
18:47 — Service restored. Monitoring for stability.
19:30 — Root cause confirmed: missing max pool size configuration
## Root Cause
PR #847 changed the database connection pool implementation but removed the
max_pool_size parameter that had existed in the previous version. No alert
existed for connection pool size. Memory grew unboundedly over 22 minutes.
## Why We Missed It
- PR review did not catch the missing parameter
- No monitoring for connection pool size
- No memory alert existed (only a service-down alert)
## Action Items
[ ] Add max_pool_size=50 to connection config — Vijay — 2026-03-18
[ ] Add Prometheus alert for memory > 80% — Priya — 2026-03-19
[ ] Add connection pool size to standard service template — Rajan — 2026-03-21
[ ] Add memory trend to deployment checklist — team — 2026-03-21
THE COMMANDS FOR EVERY POSTMORTEM
# Exact failure time from logs:
journalctl -u myservice --since "YYYY-MM-DD HH:00" --until "YYYY-MM-DD HH:59" | grep -i "fatal\|crash\|oom\|killed"
# What the OOM killer killed (kernel messages via the journal, because plain dmesg timestamps are seconds since boot):
journalctl -k --since "YYYY-MM-DD HH:00" --until "YYYY-MM-DD HH:59" | grep -i "oom\|killed"
# What changed that day (deployments):
git log --since="YYYY-MM-DD 00:00" --until="YYYY-MM-DD HH:00" --oneline
# Traffic comparison (was there a spike?); DD/Mon/YYYY is nginx's default date format, e.g. 16/Mar/2026:
for HOUR in 15 16 17 18 19; do printf '%s:00: ' "$HOUR"; grep -c "DD/Mon/YYYY:$HOUR:" /var/log/nginx/access.log; done
Priya's postmortem led to 4 action items. All 4 were completed within a week. The same failure never happened again. Six months later, when a different service had a memory leak, the alert fired at 80% memory usage, 20 minutes before the service would have crashed, and the engineer fixed it during business hours with no customer impact.
The postmortem practice did not just prevent one failure. It built a culture of improving the system, not blaming the person.
A postmortem is a system improvement document, not a blame document — the goal is to fix the system, not punish the person
Pull logs from all systems first: journalctl, nginx access logs, git commits, database slow queries
The 5 Whys: keep asking why until you reach a systemic cause, not a human error
Every action item needs an owner and a deadline — postmortems without assigned items change nothing
Traffic comparison (requests this hour vs same hour last week) quickly identifies whether an incident was caused by a spike
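The week-over-week comparison in the last point can be sketched as follows. This is a minimal sketch, assuming nginx's default access log timestamp format (`[16/Mar/2026:18:14:02 +0000]`) and that both days are still in the same log file. With rotated logs, the older file would have to be passed in instead. `requests_in_hour` and `traffic_change` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# requests_in_hour LOGFILE DD/Mon/YYYY HH
# Counts access log lines whose timestamp falls in the given hour.
requests_in_hour() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps that from
    # being treated as a failure while still printing 0.
    grep -c "\[$2:$3:" "$1" || true
}

# traffic_change LOGFILE INCIDENT_DAY BASELINE_DAY HH
# Prints both counts and the percentage change relative to the baseline day.
traffic_change() {
    local log=$1 incident_day=$2 baseline_day=$3 hour=$4
    local now base
    now=$(requests_in_hour "$log" "$incident_day" "$hour")
    base=$(requests_in_hour "$log" "$baseline_day" "$hour")
    if [ "$base" -eq 0 ]; then
        echo "$hour:00 incident=$now baseline=0 change=n/a"
    else
        echo "$hour:00 incident=$now baseline=$base change=$(( (now - base) * 100 / base ))%"
    fi
}

# Usage (same hour, one week apart):
#   traffic_change /var/log/nginx/access.log 16/Mar/2026 09/Mar/2026 18
```

A large positive change points toward a traffic spike. A large negative one usually means requests were already failing before they reached the log.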