Linux for Production Support Ch 14 / 32 Expert

3am — all incidents at once

Real war games: app down, CPU spike, disk full, cannot connect

In this chapter

The Team

Full production support team — war game scenarios

The story

3:17am. Three alerts fire simultaneously. Payment service down. CPU 98% on app-server-02. Database connectivity errors from cart service.

Rule 1: Triage first - get the full picture before touching anything.

Run this on ALL affected servers first:

hostname && uptime && df -h | awk '$5+0 >= 90 {print "DISK CRITICAL:", $0}' && free -h | grep Mem

Ten seconds per server. Full picture before making any changes.
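The one-liner above can be wrapped into small filters so the check reads the same way on every server. This is a minimal sketch, not the chapter's own tooling: the function names `triage_disk` and `triage_mem` are hypothetical, and it assumes the column layout of `df -P` and `free -m`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not from the chapter).

# triage_disk: read `df -P` output on stdin and print every filesystem
# whose usage is at or above the threshold (default 90 percent).
triage_disk() {
    local threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 && $5 + 0 >= t { print "DISK CRITICAL:", $6, $5 }'
}

# triage_mem: read `free -m` output on stdin and print used memory
# as a percentage of total (columns: Mem: total used free ...).
triage_mem() {
    awk '/^Mem:/ { printf "MEM USED: %d%%\n", $3 * 100 / $2 }'
}

# Typical use on a live server:
#   hostname; uptime; df -P | triage_disk 90; free -m | triage_mem
```

Reading from stdin keeps the filters testable against saved output, which is handy when you paste results into the incident ticket.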

War Game A - App is down:

systemctl status payment-svc
ps aux | grep -i payment | grep -v grep

# If running but not responding (-m 5 stops curl from hanging on a dead service):
curl -o /dev/null -s -m 5 -w "HTTP:%{http_code} Time:%{time_total}s\n" http://localhost:8080/health

# Check the logs:
tail -100 $(ls -t /opt/payment/logs/*.log | head -1) | grep -E "ERROR|Exception|FATAL"

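# Did the OS kill it? (the kernel logs "Killed process" with a capital K, so match case-insensitively)
grep -i "killed process" /var/log/kern.log | tail -5
# No /var/log/kern.log on this distro? Try: journalctl -k | grep -i "killed process"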

# Restart AFTER understanding why:
systemctl restart payment-svc
journalctl -u payment-svc -f
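Two of the War Game A checks are easier to read under pressure when their output is reduced to a single word or a short list. The sketch below is illustrative only: `classify_health` and `oom_victims` are hypothetical names, and the OOM pattern assumes the usual kernel wording `Killed process <pid> (<name>)`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not from the chapter).

# classify_health: map the HTTP code printed by
#   curl -o /dev/null -s -m 5 -w '%{http_code}' URL
# to a one-word verdict. curl prints 000 when it got no HTTP response at all.
classify_health() {
    case "$1" in
        000) echo "NO_RESPONSE" ;;
        2??) echo "HEALTHY" ;;
        5??) echo "SERVER_ERROR" ;;
        *)   echo "UNEXPECTED" ;;
    esac
}

# oom_victims: read kernel log lines on stdin and print "PID NAME"
# for every process the OOM killer terminated.
oom_victims() {
    grep -i 'killed process' |
        sed -E 's/.*[Kk]illed process ([0-9]+) \(([^)]*)\).*/\1 \2/'
}

# Typical use:
#   classify_health "$(curl -o /dev/null -s -m 5 -w '%{http_code}' http://localhost:8080/health)"
#   journalctl -k | oom_victims | tail -5
```

A `NO_RESPONSE` verdict with the process still running points at a hung application rather than a crashed one, which changes what you look for in the logs before restarting.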

War Game B - CPU at 98%:

ps aux --sort=-%cpu | head -5
# Pick the Java process using the most CPU, not just the first one listed:
JAVA_PID=$(ps -C java -o pid= --sort=-%cpu | head -1 | tr -d ' ')

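# Thread dump - what is Java doing right now?
# kill -3 (SIGQUIT) does not stop the JVM. The dump goes to the JVM's stdout,
# so look wherever that is captured; it may not be app.log.
kill -3 "$JAVA_PID"
tail -n 500 /opt/app/logs/app.log | grep "BLOCKED"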

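# Stop it (SIGTERM first, then give it time to shut down):
kill "$JAVA_PID"
sleep 15
# ps -p checks that exact PID; "ps aux | grep PID" would match the grep itself
ps -p "$JAVA_PID" > /dev/null || echo "Process gone - good"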
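The dump-then-stop sequence above can be sketched as two helpers: one summarizes a thread dump by thread state, the other sends SIGTERM and polls instead of sleeping a fixed 15 seconds. All function names here are hypothetical. The liveness check reads `/proc`, which is a Linux-specific assumption, and treats a zombie as already gone.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not from the chapter).

# thread_state_counts: read a JVM thread dump on stdin and print
# "COUNT STATE" per thread state, most common first.
thread_state_counts() {
    awk '/java\.lang\.Thread\.State:/ { c[$2]++ }
         END { for (s in c) print c[s], s }' | sort -rn
}

# is_running: true if the PID exists and is not a zombie (Linux /proc assumed).
is_running() {
    local state
    state=$(awk '/^State:/ { print $2 }' "/proc/$1/status" 2>/dev/null) || return 1
    [ -n "$state" ] && [ "$state" != "Z" ]
}

# stop_gracefully PID [TIMEOUT]: send SIGTERM, poll once a second, and
# report if the process is still alive after TIMEOUT seconds (default 15).
# It deliberately does not send kill -9 itself; that stays a human decision.
stop_gracefully() {
    local pid="$1" timeout="${2:-15}" waited=0
    is_running "$pid" || { echo "Process $pid not running"; return 0; }
    kill "$pid" 2>/dev/null
    while is_running "$pid"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Process $pid still alive after ${timeout}s - consider kill -9"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "Process $pid gone - good"
}
```

A dump dominated by `BLOCKED` threads suggests lock contention, while mostly `RUNNABLE` threads at 98% CPU suggests a busy loop or heavy garbage collection; either way, save the dump before stopping the process.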

War Game C - Cannot connect to database:

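nslookup db-server-01             # DNS resolves?
ping -c 2 db-server-01            # host reachable? (ICMP may be blocked even when the host is up)
nc -zv -w 3 db-server-01 5432     # port reachable? (-w 3 = give up after 3 seconds)

# nc: refused = host is up but nothing accepts connections on 5432; check that PostgreSQL is running and listening on the right interface
# nc: times out = packets are being dropped; usually a firewall, sometimes routing or a dead host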

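# Connection pool exhausted? Count established connections to port 5432.
# (A plain "grep 5432" also matches PIDs and ports such as 54321.)
ss -tn state established '( dport = :5432 )' | tail -n +2 | wc -l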
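The refused-versus-timeout reasoning above can be sketched without depending on which `nc` variant is installed. This is an illustration under assumptions: it uses bash's `/dev/tcp` redirection and coreutils `timeout` (exit status 124 on timeout), and the names `classify_connect_rc`, `check_port`, and `count_peer_port` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not from the chapter).

# classify_connect_rc: turn the exit status of a timed connection attempt
# into a verdict. 124 is what coreutils `timeout` returns when time runs out.
classify_connect_rc() {
    case "$1" in
        0)   echo "OPEN" ;;
        124) echo "TIMED_OUT" ;;          # packets dropped: firewall, routing, dead host
        *)   echo "REFUSED_OR_FAILED" ;;  # fast failure: nothing listening, or DNS error
    esac
}

# check_port HOST PORT [SECONDS]: try a TCP connect via bash's /dev/tcp.
check_port() {
    local host="$1" port="$2" wait="${3:-3}" rc=0
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null || rc=$?
    classify_connect_rc "$rc"
}

# count_peer_port PORT: read `ss -tn` output on stdin and count connections
# whose peer address ends in exactly :PORT (so 54321 does not count as 5432).
count_peer_port() {
    awk -v p=":$1" 'NR > 1 && $5 ~ (p "$") { n++ } END { print n + 0 }'
}

# Typical use:
#   check_port db-server-01 5432
#   ss -tn | count_peer_port 5432
```

Compare the connection count against the pool's configured maximum and PostgreSQL's `max_connections`; a count sitting at either limit points to pool exhaustion rather than a network fault.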

The core commands you must know without thinking:

ORIENT (first 60 seconds):

hostname && whoami && uptime
df -h && free -h

LOGS:

tail -f -n 100 app.log
grep -C 5 "ERROR" app.log | tail -50
grep "ERROR" app.log | wc -l

PROCESSES:

ps aux --sort=-%cpu | head -10
kill PID        # SIGTERM first; kill -9 PID only if the process ignores it
top             # P = sort by CPU, M = sort by memory, q = quit

NETWORK:

ss -tlnp
nc -zv host port
curl -o /dev/null -s -w "%{http_code}" URL

SERVICES:

systemctl status/start/restart service
journalctl -u service -f -n 100

The 10 production rules:

1. hostname and whoami first - wrong server means wrong fix

2. Read logs for 2 minutes before touching anything

3. One change at a time

4. Backup before editing: cp config.xml config.xml.bak.$(date +%Y%m%d)

5. Always try kill (SIGTERM) before kill -9 (SIGKILL)

6. df -h takes 1 second - check it in every incident

7. Document every command into the incident ticket

8. Test config syntax before restarting

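9. Connection refused usually means nothing is listening on that port (service down or bound to another interface). Timed out usually means packets are being dropped (firewall, routing, or host down)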

10. After recovery: find the root cause, not just the fix
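Rules 4 and 8 are the easiest to skip at 3am, so it helps to make them one command each. The sketch below is an assumption-laden illustration: `backup_file` is a hypothetical name, and it adds a time of day to the chapter's date suffix so a second edit on the same day does not overwrite the first backup.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is illustrative, not from the chapter).

# backup_file PATH: copy PATH to PATH.bak.<timestamp>, preserving mode and
# mtime, and print the backup's path so it can be pasted into the ticket.
backup_file() {
    local src="$1" dest
    dest="$src.bak.$(date +%Y%m%d-%H%M%S)"
    cp -p -- "$src" "$dest" && echo "$dest"
}

# Typical use before an edit:
#   backup_file /etc/nginx/nginx.conf
#
# Rule 8: many daemons can validate their config without restarting, e.g.
#   nginx -t
#   sshd -t
#   apachectl configtest
# Whether your own service has such a check depends on the service.
```

If the edit goes wrong, rolling back is a single `cp` from the printed path, which also keeps to rule 3 (one change at a time).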

Key takeaways

Triage all alerts first — full picture of all servers before touching anything

The incident sequence: orient, then logs, then processes, then network, then system events

Connection refused usually means nothing is listening on that port. Timed out usually means packets are being dropped (firewall, routing, or host down)

Document every command during an incident — paste into the ticket as you go

After recovery: root cause analysis is not optional — find WHY it happened

Commands from this chapter

$ hostname && uptime && df -h | awk '$5+0>=90' && free -h

Complete 10-second server health picture

$ grep 'ERROR' app.log | awk '{print $NF}' | sort | uniq -c | sort -rn | head

Most frequent last word on ERROR lines (a rough proxy for error type; adjust the awk field to match your log format)

$ watch -n 2 'ps aux --sort=-%cpu | head -5'

Watch top CPU processes live every 2s

$ df -h | awk '$5+0>=90{print "CRITICAL:", $0}'

Alert on any partition at or above 90%

$ journalctl --since '1 hour ago' -p err

All system errors from the last hour

$ for s in web-{01..03}; do echo "=== $s ==="; ssh $s uptime; done

Check multiple servers in one command
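Ranking errors by the last word on the line only works when the log format puts the error type there. For Java-style logs, one alternative is to rank by exception class name instead. This is a sketch under that assumption: `top_exceptions` is a hypothetical name, and the pattern only recognizes identifiers ending in `Exception` or `Error`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is illustrative, not from the chapter).

# top_exceptions [N]: read a log on stdin, keep ERROR lines, extract
# class names ending in Exception or Error, and print the N most
# frequent as "COUNT NAME" (default N = 10).
top_exceptions() {
    local limit="${1:-10}"
    grep 'ERROR' |
        grep -oE '[A-Za-z_][A-Za-z0-9_.]*(Exception|Error)' |
        awk '{ c[$0]++ } END { for (k in c) print c[k], k }' |
        sort -rn |
        head -n "$limit"
}

# Typical use:
#   top_exceptions 5 < app.log
```

The upper-case log level `ERROR` is not matched by the class-name pattern, so it does not pollute the ranking.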