Riya's first production incident

Navigate Linux, read logs, find the problem — the foundation of everything

⏱ 10 min · 6 commands · 5 takeaways

In this chapter

Riya

Junior support engineer, first week on the job

The story

Riya joined a Bengaluru fintech startup fresh out of college. Her first day in production support, at 3:47pm, her senior Anand ran to her desk. "Payment service is throwing errors. Go investigate."

Riya stared at the black terminal screen. She had used Windows her whole life. Anand typed ssh prod-pay-01 and handed her the keyboard.

Before you touch anything, always do these three things first:

hostname        # which server am I on?
whoami          # who am I logged in as?
uptime          # how long has it been running?

She was on prod-pay-01, logged in as riya, server running 12 days.
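The three orientation commands can be wrapped into one step that also warns when you are on the wrong machine. This is a minimal sketch: the function name orient and the idea of passing the expected hostname as an argument are assumptions for illustration, not part of the original checklist.

```shell
# Hypothetical helper: print who and where you are, and fail if the host
# is not the one you meant to be on.
orient() {
    expected_host="$1"                 # e.g. prod-pay-01, as in the story
    actual_host="$(hostname)"
    echo "host:   $actual_host"
    echo "user:   $(whoami)"
    echo "uptime: $(uptime)"
    if [ -n "$expected_host" ] && [ "$actual_host" != "$expected_host" ]; then
        echo "WARNING: expected $expected_host, but this is $actual_host" >&2
        return 1
    fi
    return 0
}

# Demo only: passing the current hostname always succeeds.
orient "$(hostname)"
```

On a real incident you would call it with the server you intended to reach, for example orient prod-pay-01, and stop if it prints the warning.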

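The Linux filesystem is like a building. / is the lobby. /var/log is the security office where system logs live. /opt is the office floors where many apps are installed (the payment app keeps its own logs under /opt/payment/logs). /etc is the filing cabinet for configs. /tmp is the bin for temporary files, which many systems clear on reboot.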
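A quick way to get a feel for the building is to walk the rooms named above and see what each one holds. The sketch below is only an illustration: the function name tour is made up, and some of these directories may be missing or unreadable on a given machine.

```shell
# Hypothetical helper: visit each directory from the analogy and report
# how many entries it contains.
tour() {
    for dir in / /var/log /opt /etc /tmp; do
        if [ -d "$dir" ]; then
            # Count entries at the top level only; hide permission errors.
            count="$(ls -A "$dir" 2>/dev/null | wc -l | tr -d ' ')"
            echo "$dir: $count entries"
        else
            echo "$dir: not present on this machine"
        fi
    done
}

tour
```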

pwd              # where am I right now?
ls -lrt          # list files, newest at bottom
cd /var/log      # go to the logs folder
cd /opt/payment  # go to the payment app
cd ..            # go one level up
cd -             # go back to where you just were

Riya navigated to /opt/payment/logs. Anand said, "Always use ls -lrt. The newest file is at the bottom."
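To see why ls -lrt is handy, here is a small sketch that builds a throwaway directory and lists it. The file names and timestamps are invented for the demo, and newest_file is a hypothetical helper, not a standard command.

```shell
# Hypothetical helper: print the most recently modified entry in a directory.
newest_file() {
    # ls -t sorts newest first, so the first line is the newest entry.
    ls -t "$1" | head -n 1
}

# Demo directory with made-up log files and fixed modification times.
demo_dir="$(mktemp -d)"
touch -t 202603140900 "$demo_dir/payment.log.2"
touch -t 202603150900 "$demo_dir/payment.log.1"
touch -t 202603161547 "$demo_dir/payment.log"

ls -lrt "$demo_dir"        # -r reverses the order, so the newest file is last
newest_file "$demo_dir"    # prints payment.log
```

The -t flag sorts by modification time and -r reverses the order, which is what puts the file you most likely care about right above your prompt.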

Reading logs with tail and grep:

tail -f payment.log          # watch live as new lines appear
tail -n 100 payment.log      # last 100 lines
grep "ERROR" payment.log     # find all error lines
grep -C 5 "ERROR" payment.log  # error plus 5 lines of context

She ran tail -f and watched. Then she saw it:

2026-03-16 15:47:23 ERROR PaymentService - Database connection timeout after 30s

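Database connection timeout. This did not look like a code bug: the payment service could not get a database connection within 30 seconds, so the database was unreachable or too slow to respond.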

She ran: grep -C 5 "timeout" payment.log | tail -30. Every recent match was the same database timeout. Answer found in 3 minutes.
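To check whether the errors really are all timeouts, one option is to count them instead of reading them. The sketch below runs on a made-up sample log modeled on the line format shown earlier; the INFO lines and the second ERROR line are invented for the demo.

```shell
# Made-up sample log, modeled on the line format shown in the story.
log="$(mktemp)"
cat > "$log" <<'EOF'
2026-03-16 15:46:58 INFO PaymentService - Processing payment request
2026-03-16 15:47:23 ERROR PaymentService - Database connection timeout after 30s
2026-03-16 15:47:41 INFO PaymentService - Retrying database connection
2026-03-16 15:48:11 ERROR PaymentService - Database connection timeout after 30s
EOF

# grep -c counts matching lines; -v inverts the match.
total_errors="$(grep -c "ERROR" "$log" || true)"
timeout_errors="$(grep "ERROR" "$log" | grep -c "timeout" || true)"
other_errors="$(grep "ERROR" "$log" | grep -vc "timeout" || true)"

echo "errors: $total_errors, timeouts: $timeout_errors, other: $other_errors"
# prints: errors: 2, timeouts: 2, other: 0
```

If the last number is zero, every ERROR line mentions a timeout, which supports looking at the database or the network before looking at the code.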

The first 60-second checklist:

hostname && whoami    # orient yourself
df -h                 # is disk full? a common cause of incidents
free -h               # is memory low?
uptime                # what is the load average?

Riya's first incident was resolved in 8 minutes. She saved the four-line checklist on a sticky note. She still uses it today.
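The disk check is the easiest part of the checklist to automate. The sketch below flags filesystems at or above a usage threshold. It is an illustration under assumptions: full_disks is a hypothetical helper, the 90 percent limit is an arbitrary choice, and it reads df -P style output (the POSIX column layout) from standard input so it can be tried on sample text.

```shell
# Hypothetical helper: read `df -P` output on stdin and print every mount
# point whose usage is at or above the given percentage.
full_disks() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5                  # fifth column looks like "95%"
            sub(/%/, "", use)         # strip the percent sign
            if (use + 0 >= limit) print $6 " " use "%"
        }'
}

# Demo on invented df -P style text, so the result does not depend on this machine.
printf '%s\n' \
    'Filesystem 1024-blocks Used Available Capacity Mounted on' \
    '/dev/sda1 100 95 5 95% /' \
    '/dev/sdb1 100 10 90 10% /data' | full_disks 90
# prints: / 95%
```

On a real server the same function can be fed live data with df -P | full_disks 90. Mount points that contain spaces would need more careful parsing than this sketch does.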

Key takeaways

Always run hostname && whoami first — confirm you are on the right server

ls -lrt puts the newest file at the bottom — always use this in log directories

tail -f watches a log live — your best tool during an active incident

grep -C 5 shows context around errors; use it instead of plain grep

Check df -h, free -h, uptime in the first 60 seconds of every incident

Commands from this chapter

$ hostname && whoami && uptime

The 3-second orient — always run first

$ df -h

Disk space: on a 100% full disk, apps that need to write logs or data can fail or crash

$ ls -lrt /opt/app/logs/

Newest log file at the bottom

$ tail -f app.log

Watch log live — Ctrl+C to stop

$ grep -C 5 "ERROR" app.log | tail -50

Last 50 lines of ERROR matches with surrounding context

$ grep -c "ERROR" app.log

Count the lines containing ERROR in the log file