Sneha finds the hidden bottleneck
strace, lsof, perf, tcpdump — deep diagnostics for complex incidents
The payment API was taking 8 seconds to respond. Code had not changed. Database queries were fast. Logs showed nothing unusual. CPU was at 15%. Memory was fine. Disk was fine.
This was Sneha's kind of problem.
strace shows every system call a process makes: each request the program sends to the Linux kernel, such as reading a file, opening a socket, or waiting for data.
strace -p 1234 # attach to running process
strace -p 1234 -e trace=network # only network calls
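strace -p 1234 -c # summary: count and kernel CPU time per call type (printed on detach, Ctrl-C)
strace -p 1234 -c -w # same summary by wall-clock time, so time spent blocked is counted
Sneha ran strace -p 1234 -c -w and saw: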
% time     seconds  usecs/call     calls syscall
  98.7    7.932423       15864       500 poll
   0.8    0.064231          12      5234 read
   0.3    0.024156           5      4821 write
The process was spending 98.7% of its traced time in poll, which blocks until a file descriptor is ready. It was not CPU-bound and not disk-bound. It was waiting on the network.
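
When the summary is longer than three rows, the same reading can be done mechanically. The following is a minimal sketch, not part of Sneha's toolkit: the function names, the `READINESS_WAITS` set, and the 50% threshold are all illustrative assumptions. It parses `strace -c` style text, picks the syscall with the largest share of time, and flags the process as "mostly waiting" when that syscall is one of the poll/select/epoll family.

```python
# Sample copied from the summary above. A real `strace -c` run may also print
# an "errors" column, a dashed separator line, and a "total" row.
SAMPLE = """\
% time     seconds  usecs/call     calls syscall
  98.7    7.932423       15864       500 poll
   0.8    0.064231          12      5234 read
   0.3    0.024156           5      4821 write
"""

# Syscalls that block until a file descriptor is ready.
# Illustrative, not exhaustive (an assumption of this sketch).
READINESS_WAITS = {"poll", "ppoll", "select", "pselect6", "epoll_wait", "epoll_pwait"}


def parse_strace_summary(text):
    """Return one dict per syscall row, skipping header, separator, and total lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue
        try:
            pct = float(parts[0])
            seconds = float(parts[1])
            calls = int(parts[3])
        except ValueError:
            # Header ("% time ...") and separator ("------ ...") lines land here.
            continue
        name = parts[-1]  # the syscall name is always the last column
        if name == "total":
            continue
        rows.append({"syscall": name, "pct": pct, "seconds": seconds, "calls": calls})
    return rows


def dominant_syscall(rows):
    """Return the row with the largest share of time, or None if there are no rows."""
    if not rows:
        return None
    return max(rows, key=lambda row: row["pct"])


def is_mostly_waiting(rows, threshold_pct=50.0):
    """True when the dominant syscall is a readiness wait above the threshold."""
    top = dominant_syscall(rows)
    return top is not None and top["syscall"] in READINESS_WAITS and top["pct"] >= threshold_pct
```

On the sample above, `dominant_syscall` picks the `poll` row and `is_mostly_waiting` returns `True`. The check only says the process is blocked on file descriptors; which descriptors is a question for lsof.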
lsof shows every file and connection a process has open:
lsof -p 1234 # all files/sockets for process 1234
lsof -p 1234 | grep TCP # only TCP connections
lsof -p 1234 | grep ESTABLISHED # only active connections
lsof -i :5432 # who is connected to PostgreSQL
Sneha ran lsof -p 1234 | grep TCP and saw 500 connections, all established to the same IP. It was not the database. It was an internal DNS server.
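
With hundreds of lines of lsof output, grouping connections by remote endpoint makes a pattern like this obvious at a glance. The sketch below is one possible way to do that. The sample lines are hypothetical (the command name `api`, the addresses, and the device numbers are invented for illustration), and it assumes numeric output such as `lsof -n -P` produces, so the remote side appears as `ip:port` rather than resolved names.

```python
from collections import Counter

# Hypothetical lsof output for illustration only; not captured from the incident.
SAMPLE_LSOF = """\
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
api     1234  app   10u  IPv4 900001      0t0  TCP *:8443 (LISTEN)
api     1234  app   11u  IPv4 900002      0t0  TCP 10.0.0.12:40112->10.0.0.5:5432 (ESTABLISHED)
api     1234  app   12u  IPv4 900003      0t0  TCP 10.0.0.12:51001->10.0.0.2:53 (ESTABLISHED)
api     1234  app   13u  IPv4 900004      0t0  TCP 10.0.0.12:51002->10.0.0.2:53 (ESTABLISHED)
api     1234  app   14u  IPv4 900005      0t0  TCP 10.0.0.12:51003->10.0.0.2:53 (ESTABLISHED)
"""


def count_remote_endpoints(lsof_output):
    """Count TCP connections per remote ip:port, ignoring listeners and the header."""
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        if "TCP" not in fields:
            continue
        for field in fields:
            # Connected sockets show NAME as local->remote; listeners have no arrow.
            if "->" in field:
                counts[field.split("->", 1)[1]] += 1
                break
    return counts


if __name__ == "__main__":
    for endpoint, n in count_remote_endpoints(SAMPLE_LSOF).most_common():
        print(n, endpoint)
```

A remote endpoint on port 53 at the top of that list is the kind of signal that pointed Sneha at DNS rather than the database.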
500 open connections to DNS. All waiting. The code was resolving a hostname on every single database connection instead of caching the IP. With 500 concurrent requests, 500 DNS lookups were queuing up.
Fix: one DNS lookup at startup, cache the IP, reuse it.
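
The story does not show the application code, so the following is only a sketch of what such a fix might look like in Python. `CachedResolver` is a hypothetical helper, not the team's implementation. It also differs from the fix as stated in one respect: instead of caching the IP forever, it re-resolves after a TTL (300 seconds here, an arbitrary assumption) so that a changed DNS record is eventually picked up.

```python
import socket
import threading
import time


class CachedResolver:
    """Resolve a hostname once and reuse the answer until a TTL expires.

    Hypothetical helper for illustration. The resolver and clock are
    injectable so the caching logic can be exercised without a network.
    """

    def __init__(self, ttl_seconds=300.0, resolve=socket.getaddrinfo, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolve = resolve
        self._clock = clock
        self._lock = threading.Lock()
        self._cache = {}  # (host, port) -> (expires_at, ip)

    def lookup(self, host, port):
        """Return a cached IP for host, hitting DNS only on a miss or after expiry."""
        key = (host, port)
        with self._lock:
            now = self._clock()
            hit = self._cache.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
            # Cache miss: one real lookup. Concurrent callers wait on the lock
            # instead of each issuing their own DNS query.
            infos = self._resolve(host, port, type=socket.SOCK_STREAM)
            ip = infos[0][4][0]  # sockaddr is the fifth item; its first field is the IP
            self._cache[key] = (now + self._ttl, ip)
            return ip
```

A process would create one resolver at startup and call `resolver.lookup(host, 5432)` before each database connection, where `host` is whatever hostname the application is configured with. Holding the lock during the lookup is a deliberate simplification: it serializes callers on a miss, which is the point when 500 requests arrive at once.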
tcpdump captures actual network traffic:
tcpdump -i eth0 port 53 # capture all DNS traffic
tcpdump -i eth0 host 10.0.0.5 # all traffic to/from an IP
tcpdump -i eth0 -w capture.pcap # save to file for Wireshark
Sneha ran tcpdump -i eth0 port 53 and saw hundreds of DNS queries per second from the app. The evidence was conclusive.
perf profiles CPU usage, for when CPU is high and you need to know which code is responsible:
perf top -p 1234 # live CPU usage by function
perf record -p 1234 -g -- sleep 30 # record call graphs for 30 seconds
perf report # show a per-function breakdown of the recording
vmstat 1 shows system stats every second:
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 0 4523840 ... 15 3 80 2
Key columns:
r: processes running or waiting for CPU. If it stays above the CPU core count, you are CPU-bound.
b: processes in uninterruptible sleep (usually waiting on disk I/O)
wa: percent of CPU time spent idle while waiting for I/O. If high, disk is the likely bottleneck.
Diagnostic toolkit:
High CPU, which code?      perf top, strace -c
Many open connections?     lsof -p PID
System calls blocking?     strace -p PID
Network traffic content?   tcpdump
Disk I/O bottleneck?       iostat -x 1 5
Thread count exploding?    ps -T -p PID
Sneha fixed the DNS caching issue. Response time dropped from 8 seconds to 180ms. The incident took 35 minutes to diagnose with the right tools.
strace -p PID -c shows a summary of system calls: a quick way to see whether the process is busy or blocked, and on what (add -w to count wall-clock time rather than kernel CPU time)
lsof -p PID | grep TCP shows every network connection a process has open right now
tcpdump captures actual network packets — see exactly what data is going where
Many mysterious slowdowns turn out to be waits on the network (DNS, external APIs), not CPU or disk
vmstat 1: watch the r column (CPU queue) and wa column (waiting for disk I/O)
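
The vmstat rule of thumb above can be written down as a small check. This is a sketch under stated assumptions: the two sample rows are hypothetical (they are not the elided row from the incident), the 20% iowait threshold is an arbitrary illustration, and the header is the 16-column layout shown earlier. Some vmstat versions print extra columns, which `parse_vmstat` handles as long as the header and data row match.

```python
HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa"

# Hypothetical samples for illustration only.
QUIET_ROW = "1 0 0 4000000 100000 2000000 0 0 5 10 300 600 10 2 87 1"
DISK_BOUND_ROW = "1 6 0 4000000 100000 2000000 0 0 90000 40000 900 1500 5 5 40 50"


def parse_vmstat(header_line, data_line):
    """Pair each column name with its integer value."""
    names = header_line.split()
    values = [int(v) for v in data_line.split()]
    if len(names) != len(values):
        raise ValueError("header and data row have different column counts")
    return dict(zip(names, values))


def triage(sample, cpu_count, wa_threshold=20):
    """Return a list of suspected bottlenecks: 'cpu', 'io', both, or neither."""
    suspects = []
    if sample["r"] > cpu_count:
        suspects.append("cpu")  # more runnable processes than cores
    if sample["wa"] >= wa_threshold:
        suspects.append("io")  # CPUs mostly idle, waiting on I/O
    return suspects
```

For example, `triage(parse_vmstat(HEADER, DISK_BOUND_ROW), cpu_count=4)` flags `io`, while the quiet row flags nothing. An empty result on a slow service is itself a hint to look at the network, as in Sneha's incident. A single sample is noisy, so in practice the check belongs on several consecutive rows.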