3am — all incidents at once
Real war games: app down, CPU spike, disk full, cannot connect
3:17am. Three alerts fire simultaneously. Payment service down. CPU 98% on app-server-02. Database connectivity errors from cart service.
Rule 1: Triage first - get the full picture before touching anything.
Run this on ALL affected servers first:
hostname && uptime && df -h | awk '$5+0 >= 90 {print "DISK CRITICAL:", $0}' && free -h | grep Mem
Ten seconds per server. Full picture before making any changes.
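The disk check in the one-liner above can be pulled out into a small reusable filter. This is a minimal sketch, not a standard tool: the function name flag_full_disks is hypothetical, and it assumes the POSIX output format of df -P, where column 5 is the usage percentage and column 6 is the mount point.

```shell
# flag_full_disks [THRESHOLD]
# Reads `df -P` output on stdin and prints every filesystem whose usage
# is at or above THRESHOLD percent (default 90).
# Hypothetical helper name; assumes mount points contain no spaces.
flag_full_disks() {
    local threshold="${1:-90}"
    # NR > 1 skips the header row; $5+0 converts "93%" to the number 93.
    awk -v t="$threshold" 'NR > 1 && $5+0 >= t {print "DISK CRITICAL:", $6, $5}'
}

# On a live server:
#   df -P | flag_full_disks 90
# Across several servers (assumes SSH access to each host):
#   for h in app-server-02 db-server-01; do
#       echo "== $h"; ssh "$h" 'df -P' | flag_full_disks 90
#   done
```

Reading from stdin keeps the filter independent of where the df output comes from, so the same function works locally and over SSH.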
War Game A - App is down:
systemctl status payment-svc
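ps aux | grep -i payment | grep -v grep
# If running but not responding (-m 5 stops curl from hanging on a dead service):
curl -m 5 -o /dev/null -s -w "HTTP:%{http_code} Time:%{time_total}s\n" http://localhost:8080/health
# Check the logs:
tail -100 $(ls -t /opt/payment/logs/*.log | head -1) | grep -E "ERROR|Exception|FATAL"
# Did the OS kill it? (-i because the kernel writes "Killed process" with a capital K;
# kern.log is the Debian/Ubuntu location, elsewhere try: dmesg -T | grep -i "killed process")
grep -i "killed process" /var/log/kern.log | tail -5
# Restart AFTER understanding why: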
systemctl restart payment-svc
journalctl -u payment-svc -f
War Game B - CPU at 98%:
ps aux --sort=-%cpu | head -5
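# Pick the Java process using the most CPU, not just the first one listed:
JAVA_PID=$(ps aux --sort=-%cpu | grep java | grep -v grep | awk '{print $2}' | head -1)
# Thread dump - what is Java doing right now?
# (kill -3 writes the dump to the JVM's stdout, assumed here to end up in app.log)
kill -3 $JAVA_PID
tail -500 /opt/app/logs/app.log | grep "BLOCKED"
# Kill it: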
kill $JAVA_PID
sleep 15
# ps -p checks the PID directly; "ps aux | grep" would always match the grep itself
ps -p $JAVA_PID > /dev/null || echo "Process gone - good"
War Game C - Cannot connect to database:
nslookup db-server-01 # DNS resolves?
ping -c 2 db-server-01 # routing works?
nc -zv db-server-01 5432 # port reachable?
# nc: refused = PostgreSQL not running, start it
# nc: hangs = firewall blocking, open the port
# Connection pool exhausted?
ss -tnp | grep 5432 | wc -l
The 50 commands you must know without thinking:
ORIENT (first 60 seconds):
hostname && whoami && uptime
df -h && free -h
LOGS:
tail -f -n 100 app.log
grep -C 5 "ERROR" app.log | tail -50
grep "ERROR" app.log | wc -l
PROCESSES:
ps aux --sort=-%cpu | head -10
kill PID / kill -9 PID
top (P=CPU, M=mem, q=quit)
NETWORK:
ss -tlnp
nc -zv host port
curl -o /dev/null -s -w "%{http_code}" URL
SERVICES:
systemctl status/start/restart service
journalctl -u service -f -n 100
The 10 production rules:
1. hostname and whoami first - wrong server means wrong fix
2. Read logs for 2 minutes before touching anything
3. One change at a time
4. Backup before editing: cp config.xml config.xml.bak.$(date +%Y%m%d)
5. Try kill before kill -9 always
6. df -h takes 1 second - check it in every incident
7. Document every command into the incident ticket
8. Test config syntax before restarting
9. Connection refused means service down. Timed out means firewall blocking
10. After recovery: find the root cause, not just the fix
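Rule 5 (try kill before kill -9) can be wrapped in a helper so the escalation is never skipped under pressure. The following is a sketch under stated assumptions: stop_gracefully is a hypothetical name, and the 15-second default simply mirrors the sleep 15 used in War Game B.

```shell
# stop_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM, polls once per second, and only falls back to SIGKILL
# if the process is still alive after TIMEOUT seconds (default 15).
# Hypothetical helper; returns 1 if the process cannot be signalled at all.
stop_gracefully() {
    local pid="$1"
    local timeout="${2:-15}"
    local waited=0

    # Plain kill sends SIGTERM, giving the process a chance to clean up.
    if ! kill "$pid" 2>/dev/null; then
        echo "cannot signal $pid (not running, or no permission)"
        return 1
    fi

    # kill -0 sends no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "still running after ${timeout}s, sending SIGKILL"
            kill -9 "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "process $pid exited after SIGTERM"
}

# Example: stop_gracefully "$JAVA_PID" 15
```

Polling instead of a fixed sleep means a process that exits quickly does not cost the full wait, while a stuck one still gets the full grace period before SIGKILL.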
Triage all alerts first — full picture of all servers before touching anything
The incident sequence: orient, then logs, then processes, then network, then system events
Connection refused means service not running. Timed out means firewall blocking
Document every command during an incident — paste into the ticket as you go
After recovery: root cause analysis is not optional — find WHY it happened
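Documenting every command is easier when the shell does it for you. The sketch below shows one possible approach, assuming a local log file that is later pasted into the ticket: run_logged and INCIDENT_LOG are hypothetical names, and the wrapper buffers output, so it is not suited to streaming commands such as tail -f.

```shell
# Where the record is written; override by exporting INCIDENT_LOG first.
INCIDENT_LOG="${INCIDENT_LOG:-./incident-$(date +%Y%m%d).log}"

# run_logged COMMAND [ARGS...]
# Runs the command, shows its output, and appends a timestamped record
# (who, where, what, output, exit status) to $INCIDENT_LOG.
# Hypothetical helper; returns the wrapped command's exit status.
run_logged() {
    local output status
    printf '[%s] %s@%s $ %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$(whoami)" "$(hostname)" "$*" >> "$INCIDENT_LOG"

    # Capture stdout and stderr together so errors land in the record too.
    output=$("$@" 2>&1)
    status=$?

    if [ -n "$output" ]; then
        printf '%s\n' "$output" | tee -a "$INCIDENT_LOG"
    fi
    printf '[exit status: %s]\n' "$status" >> "$INCIDENT_LOG"
    return "$status"
}

# Example:
#   run_logged systemctl status payment-svc
#   run_logged df -h
```

Recording hostname and user on every line also backs up rule 1: the log shows which server each command actually ran on.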