Linux for Production Support · Chapter 9 of 32 · Beginner

Riya's first production incident

Navigate Linux, read logs, find the problem — the foundation of everything

10 min read · 6 commands · 5 takeaways
In this chapter: Riya, a junior support engineer in her first week on the job
The story

Riya joined a Bengaluru fintech startup fresh out of college. On her first day in production support, at 3:47pm, her senior, Anand, ran to her desk. "Payment service is throwing errors. Go investigate."

Riya stared at the black terminal screen. She had used Windows her whole life. Anand typed ssh prod-pay-01 and handed her the keyboard.

Before you touch anything, always do these three things first:

hostname        # which server am I on?
whoami          # who am I logged in as?
uptime          # how long has it been running?

She was on prod-pay-01, logged in as riya, server running 12 days.
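The three orientation checks can be folded into one line that is worth pasting into the incident channel before you change anything. A minimal sketch, assuming bash: the helper name `orient` and the `host=... user=... up=...` format are made up for this example, and `uptime` (part of procps) may be missing on stripped-down container images, hence the fallback.

```shell
# Hypothetical helper: print hostname, login user and uptime on a single line.
orient() {
  local up
  # uptime ships with procps; fall back to a note if it is not installed
  up=$(uptime 2>/dev/null || echo "uptime unavailable")
  printf 'host=%s user=%s up=%s\n' "$(hostname)" "$(whoami)" "$up"
}

orient
```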

The Linux filesystem is like a building. / is the lobby. /var/log is the security office, where most system logs live. /opt is the office floors, where add-on applications are usually installed. /etc is the filing cabinet for configs. /tmp is the scratch bin, which many systems clear on reboot.
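To see the building on a real server, the loop below checks each room from the analogy. A minimal sketch, assuming bash: the helper name `room` is hypothetical, the one-word roles simply restate the analogy, and a given distribution or company image may not have every directory (for example, /opt can be absent).

```shell
# Hypothetical helper: describe one "room" of the building and say whether it exists here.
room() {
  local role
  case "$1" in
    /var/log) role="logs" ;;
    /opt)     role="applications" ;;
    /etc)     role="configuration" ;;
    /tmp)     role="scratch space" ;;
    *)        role="unknown" ;;
  esac
  if [ -d "$1" ]; then
    echo "$1: $role"
  else
    echo "$1: $role (missing on this server)"
  fi
}

# Walk the four rooms from the analogy
for dir in /var/log /opt /etc /tmp; do
  room "$dir"
done
```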

pwd              # where am I right now?
ls -lrt          # long listing sorted by modification time, newest at bottom
cd /var/log      # go to the logs folder
cd /opt/payment  # go to the payment app
cd ..            # go one level up
cd -             # go back to where you just were
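Navigation is easiest to learn somewhere harmless. A minimal practice run, assuming bash and a scratch directory from `mktemp -d`; the opt/payment/logs and var/log folders are recreated under it only to mirror the story, not the real server paths.

```shell
# Build a throwaway copy of the directories from the story.
base=$(mktemp -d)
mkdir -p "$base/opt/payment/logs" "$base/var/log"

cd "$base/var/log"
pwd                          # .../var/log
cd "$base/opt/payment/logs"
cd ..                        # one level up: .../opt/payment
pwd
cd -                         # back to .../opt/payment/logs (cd - prints where it lands)
```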

Riya navigated to /opt/payment/logs. Anand said: "Always use ls -lrt. The newest file is at the bottom."
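Why the newest file ends up last: -t sorts by modification time with the newest first, and -r reverses that order. A minimal sketch, assuming bash; the two file names and their timestamps are invented, and `touch -t` takes the time as CCYYMMDDhhmm.

```shell
# Two dummy log files with hand-set modification times.
dir=$(mktemp -d)
touch -t 202603151200 "$dir/payment.log.1"   # older, rotated log
touch -t 202603161547 "$dir/payment.log"     # current live log

ls -lrt "$dir"                               # newest file is printed last
newest=$(ls -rt "$dir" | tail -n 1)          # same order, one name per line when piped
echo "newest: $newest"
```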

Reading logs comes down to two tools, tail and grep:

tail -f payment.log          # watch live as new lines appear
tail -n 100 payment.log      # last 100 lines
grep "ERROR" payment.log     # find all error lines
grep -C 5 "ERROR" payment.log  # each error line plus 5 lines before and after
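These commands are worth practising on a throwaway file before using them on production. A minimal sketch, assuming bash: the log file and every line except the ERROR line are invented for illustration, and the context is shrunk to 1 line so the effect is easy to see. tail -f is left out because it never exits on its own.

```shell
# A fake payment.log; only the ERROR line comes from the story.
log=$(mktemp)
cat > "$log" <<'EOF'
2026-03-16 15:47:20 INFO PaymentService - Request received
2026-03-16 15:47:21 INFO PaymentService - Validating card
2026-03-16 15:47:22 INFO PaymentService - Opening DB connection
2026-03-16 15:47:23 ERROR PaymentService - Database connection timeout after 30s
2026-03-16 15:47:24 INFO PaymentService - Returning 503 to client
EOF

tail -n 2 "$log"               # last 2 lines
grep "ERROR" "$log"            # only the error line
grep -C 1 "ERROR" "$log"       # error line plus 1 line before and 1 after
```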

She ran tail -f and watched. Then she saw it:

2026-03-16 15:47:23 ERROR PaymentService - Database connection timeout after 30s

Database connection timeout. Not a code bug: the payment service could not reach the database within 30 seconds.

She ran grep -C 5 "timeout" payment.log | tail -30 to see the most recent timeouts in context. Every error mentioned the timeout. She had the answer in 3 minutes.
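"Every error had timeout in it" can be checked with counts instead of by eye. A minimal sketch, assuming bash; the four-line log is invented, and on the server you would point `log` at payment.log instead.

```shell
# Invented sample log with three errors and one info line.
log=$(mktemp)
cat > "$log" <<'EOF'
2026-03-16 15:47:23 ERROR PaymentService - Database connection timeout after 30s
2026-03-16 15:47:58 INFO PaymentService - Retrying
2026-03-16 15:48:29 ERROR PaymentService - Database connection timeout after 30s
2026-03-16 15:49:02 ERROR PaymentService - Database connection timeout after 30s
EOF

errors=$(grep -c "ERROR" "$log")                      # all error lines
timeouts=$(grep "ERROR" "$log" | grep -c "timeout")   # error lines that mention timeout
echo "errors=$errors timeouts=$timeouts"
[ "$errors" -eq "$timeouts" ] && echo "every error is a timeout"
```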

The first-60-seconds checklist:

hostname && whoami    # orient yourself
df -h                 # is the disk full? a common cause of incidents
free -h               # is memory low?
uptime                # what is the load average?
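df -h is made for human eyes. If the disk check ends up in a script or a cron job, the portable df -P output is easier to parse because it keeps each filesystem on one line. A minimal sketch, assuming bash and awk, that checks only the root filesystem /; the 90% threshold is an example value, and the disk holding the app's logs may be a separate mount worth checking too.

```shell
# Root filesystem usage as a bare number; field 5 of df -P is the "Capacity" column, e.g. 42%.
usage=$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

# 90 is an arbitrary example threshold, not a standard
if [ "$usage" -ge 90 ]; then
  echo "WARNING: / is ${usage}% full"
else
  echo "/ is ${usage}% full"
fi
```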

Riya's first incident was resolved in 8 minutes. She wrote the 4-command checklist on a sticky note, and she still uses it today.

Key takeaways

Always run hostname && whoami first — confirm you are on the right server

ls -lrt puts the newest file at the bottom — always use this in log directories

tail -f watches a log live — your best tool during an active incident

grep -C 5 shows context around errors, so use it instead of plain grep

Check df -h, free -h and uptime in the first 60 seconds of every incident

Commands from this chapter
$ hostname && whoami && uptime
The 3-second orient — always run first
$ df -h
Disk space: a full disk makes writes fail, which can crash the app
$ ls -lrt /opt/app/logs/
Newest log file at the bottom
$ tail -f app.log
Watch log live — Ctrl+C to stop
$ grep -C 5 "ERROR" app.log | tail -50
Last 50 lines of error matches with their surrounding context (not the last 50 errors)
$ grep "ERROR" app.log | wc -l
Count the lines that contain ERROR
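Counting errors is subtler than it looks. grep -c gives the same number as grep | wc -l because both count matching lines, so a line with two ERRORs counts once; grep -o prints each match on its own line when you need every occurrence. A minimal sketch, assuming bash and a grep with -o (GNU and BSD grep both have it); the three-line log is invented.

```shell
# Invented log: the third line contains ERROR twice.
log=$(mktemp)
printf '%s\n' \
  "ERROR db timeout" \
  "INFO ok" \
  "ERROR retry failed ERROR giving up" > "$log"

piped=$(grep "ERROR" "$log" | wc -l)            # matching lines, via wc
counted=$(grep -c "ERROR" "$log")               # matching lines, via grep -c
occurrences=$(grep -o "ERROR" "$log" | wc -l)   # every individual match
echo "lines=$counted piped=$piped occurrences=$occurrences"
```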