Sneha finds the hidden bottleneck

strace, lsof, perf, tcpdump — deep diagnostics for complex incidents

⏱ 13 min 5 commands 5 takeaways

In this chapter

Sneha

Senior SRE, payments platform

The story

The payment API was taking 8 seconds to respond. Code had not changed. Database queries were fast. Logs showed nothing unusual. CPU was at 15%. Memory was fine. Disk was fine.

This was Sneha's kind of problem.

strace shows every system call — when a program asks the Linux kernel to do something like read a file, open a socket, or wait for data.

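strace -p 1234                    # attach to a running process (Ctrl-C to detach)
strace -p 1234 -e trace=network   # only network-related calls
# -c alone counts kernel CPU time; -w switches to wall-clock time so blocking waits show up
strace -p 1234 -c -w              # summary on detach: count and time per call type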

Sneha ran strace -p 1234 -c -w, pressed Ctrl-C after a short sample, and saw:

% time     seconds  usecs/call     calls   syscall
 98.7    7.932423       15864       500   poll
  0.8    0.064231          12      5234   read
  0.3    0.024156           5      4821   write

The process was spending 98.7% of its time in poll, which means it was blocked waiting on file descriptors, typically sockets. Not CPU-bound. Not disk-bound. Waiting on the network.
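Reading the summary by eye works for three rows, but a real summary can list dozens of syscalls. A minimal sketch of picking out the dominant one, assuming the column layout shown above; the function name top_syscall is made up for this example:

```shell
# Print the syscall with the largest share of time from an `strace -c` summary.
# Reads the summary text on stdin and prints "<syscall> <percent>".
top_syscall() {
    awk '
        BEGIN { max = -1 }
        # Data rows start with a numeric percentage. The header, the dashed
        # separator and the final "total" row are skipped.
        $1 ~ /^[0-9.]+$/ && $NF != "total" {
            if ($1 + 0 > max) { max = $1 + 0; name = $NF }
        }
        END { if (name != "") printf "%s %.1f\n", name, max }
    '
}

# Hypothetical usage. strace writes the summary to stderr when it detaches,
# so stderr is redirected into the pipe:
#   timeout -s INT 10 strace -c -w -p 1234 2>&1 | top_syscall
```

If the top entry is poll, select, epoll_wait, or futex, the process is most likely waiting rather than working.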

lsof shows every file and connection a process has open:

lsof -p 1234                    # all files/sockets for process 1234
lsof -p 1234 | grep TCP         # only TCP connections
lsof -p 1234 | grep ESTABLISHED # only active connections
lsof -i :5432                   # who is connected to PostgreSQL

Sneha ran lsof -p 1234 | grep TCP and saw 500 connections all established to the same IP — but it was an internal DNS server, not the database.

500 open connections to DNS. All waiting. The code was resolving a hostname on every single database connection instead of caching the IP. With 500 concurrent requests, 500 DNS lookups were queuing up.
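With hundreds of sockets, grouping them by remote address makes a pattern like this stand out at once. A rough sketch, assuming lsof's usual NAME column format (local:port->remote:port); count_remote_hosts is a hypothetical helper name:

```shell
# Count established TCP connections per remote address.
# Reads `lsof -nP` output on stdin and prints "<count> <remote_ip>" rows,
# busiest remote first.
count_remote_hosts() {
    awk '
        /ESTABLISHED/ {
            # Find the field that holds local:port->remote:port
            for (i = 1; i <= NF; i++) {
                if (index($i, "->")) {
                    split($i, ends, "->")
                    remote = ends[2]
                    sub(/:[0-9]+$/, "", remote)   # drop the remote port
                    count[remote]++
                }
            }
        }
        END { for (r in count) print count[r], r }
    ' | sort -rn
}

# Hypothetical usage (-n and -P stop lsof from resolving host and port names):
#   lsof -nP -p 1234 -a -i TCP | count_remote_hosts
```

The -n flag matters in an incident like this one: without it, lsof performs its own reverse DNS lookups and can stall on the same overloaded resolver.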

Fix: resolve the hostname once at startup, cache the IP, and reuse it, refreshing the cached entry when its TTL expires.

tcpdump captures actual network traffic:

tcpdump -n -i eth0 port 53         # capture DNS traffic (-n: tcpdump does no lookups of its own)
tcpdump -n -i eth0 host 10.0.0.5   # all traffic to/from an IP
tcpdump -i eth0 -w capture.pcap    # save raw packets to a file for Wireshark

Sneha ran tcpdump -n -i eth0 port 53 and saw hundreds of DNS queries per second from the app, nearly all for the same hostname. That confirmed the diagnosis.
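To turn a scrolling capture into a number, the queries can be tallied per hostname. This is a sketch that assumes tcpdump's default one-line DNS format, where a question appears as a record type followed by "?" (for example "A? db.internal."); count_dns_queries is an illustrative name:

```shell
# Count DNS questions per queried name from `tcpdump -n port 53` text output.
# Reads tcpdump lines on stdin and prints "<count> <name>", most queried first.
count_dns_queries() {
    awk '
        {
            # The queried name is the field right after the "TYPE?" token.
            # Response lines have no such token and are ignored.
            for (i = 1; i < NF; i++) {
                if ($i ~ /^[A-Za-z0-9]+\?$/) { count[$(i + 1)]++; break }
            }
        }
        END { for (n in count) print count[n], n }
    ' | sort -rn
}

# Hypothetical usage: stop after 1000 packets (-c) and line-buffer output (-l):
#   tcpdump -l -n -i eth0 -c 1000 port 53 | count_dns_queries
```

One hostname dominating the list, as in this incident, points to a missing cache rather than genuinely diverse lookups.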

perf covers the opposite case, when CPU is high and you need to know which code is responsible:

perf top -p 1234                     # live CPU usage by function
perf record -p 1234 -g -- sleep 30   # record call graphs for 30 seconds
perf report                          # per-function breakdown of the recording

vmstat 1 shows system stats every second:

 r  b  swpd     free  buff  cache  si  so  bi  bo  in  cs  us  sy  id  wa
 2  0     0  4523840  ...                                  15   3  80   2

Key columns:

r: processes running or waiting for CPU (the run queue). If it stays above the CPU count, you are CPU-bound.
b: processes in uninterruptible sleep, usually blocked on disk I/O.
wa: percent of CPU time spent idle while waiting for I/O. If high, disk is the likely bottleneck.
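These two rules of thumb can be checked mechanically. A minimal sketch: check_vmstat is a hypothetical helper, and the 20% default for wa is an arbitrary illustration, not a recommended threshold:

```shell
# Flag CPU saturation or I/O wait in `vmstat` output.
# $1 = number of CPUs, $2 = wa threshold in percent (assumed default: 20).
# Reads vmstat lines on stdin.
check_vmstat() {
    awk -v ncpu="$1" -v wa_limit="${2:-20}" '
        # The column-name row starts with "r"; remember each column position.
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Skip the banner row and anything that is not a numeric sample.
        !("r" in col) || $1 !~ /^[0-9]+$/ { next }
        $(col["r"]) + 0 > ncpu + 0 {
            printf "cpu-bound: r=%d exceeds %d CPUs\n", $(col["r"]), ncpu
        }
        $(col["wa"]) + 0 > wa_limit + 0 {
            printf "io-wait: wa=%d%%\n", $(col["wa"])
        }
    '
}

# Hypothetical usage:
#   vmstat 1 10 | check_vmstat "$(nproc)"
```

The first row vmstat prints is an average since boot, so a warning on that row alone says little about the current state.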

Diagnostic toolkit:

High CPU, which code?         perf top, strace -c
Many open connections?        lsof -p PID
System calls blocking?        strace -p PID
Network traffic content?      tcpdump
Disk IO bottleneck?           iostat -x 1 5
Thread count exploding?       ps -T -p PID

Sneha fixed the DNS caching issue. Response time dropped from 8 seconds to 180ms. The incident took 35 minutes to diagnose with the right tools.

Key takeaways

strace -p PID -c -w summarizes system calls by wall-clock time, so you can quickly tell whether the process is busy or blocked waiting on something

lsof -p PID | grep TCP shows every network connection a process has open right now

tcpdump captures actual network packets — see exactly what data is going where

Many mysterious slowdowns turn out to be waits on the network (DNS, external APIs), not CPU or disk

vmstat 1: watch the r column (CPU queue) and wa column (waiting for disk I/O)

Commands from this chapter

$ strace -p PID -c -w

Summary of system calls by wall-clock time, showing where the process spends its time

$ strace -p PID -e trace=network

Show only network-related system calls

$ lsof -p PID | grep ESTABLISHED

All active TCP connections for a process

$ tcpdump -i eth0 port 53 -n

Capture and display all DNS traffic

$ vmstat 1 10

System stats every second for 10 readings