Divya chases the invisible p99
Latency percentiles, iostat, vmstat, strace — measuring what you cannot see
Divya was brought in to fix a persistent problem: the payment service was randomly slow. Not always. Not predictably. Median response time was fine. But p99 latency was 4 seconds, and the SLA required under 1 second at p99.
Three previous engineers had looked at it and found nothing. CPU was fine. Memory was fine. Database queries were fast. The problem was invisible.
Divya decided to measure everything.
LATENCY PERCENTILES — WHAT P99 ACTUALLY MEANS
Before measuring, understand what you are measuring.
If you serve 1000 requests and sort them by response time, the median (p50) is request number 500, p95 is request number 950, and p99 is request number 990.
If your p50 is 50ms and your p99 is 4000ms, the typical request is fine, but 1 request in 100 takes 4 seconds or longer. The customers who send the most requests are the most likely to hit that slow 1%, and they are often your most important customers.
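To make the arithmetic concrete, here is a minimal sketch on made-up data: 985 requests at 50ms and 15 at 4000ms. The names percentile and sample are hypothetical helpers, not standard tools, and the sketch assumes the nearest-rank definition (the value at position ceil(N * p / 100) in sorted order) with whole-number percentiles.

```shell
# percentile P: read one number per line on stdin and print the
# nearest-rank P-th percentile (P is a whole number from 1 to 100).
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }                         # v[k] = k-th smallest value
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)    # ceil(p * NR / 100)
            print v[rank]
        }'
}

# Made-up sample: 985 fast requests (50ms) and 15 slow ones (4000ms)
sample() {
    awk 'BEGIN {
        for (i = 0; i < 985; i++) print 50
        for (i = 0; i < 15; i++)  print 4000
    }'
}

echo "p50: $(sample | percentile 50) ms"    # p50: 50 ms
echo "p95: $(sample | percentile 95) ms"    # p95: 50 ms
echo "p99: $(sample | percentile 99) ms"    # p99: 4000 ms
```

The mean of this sample is about 109ms, which also looks healthy. Only the high percentiles reveal the slow requests.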
Finding p99 from logs:
# If your logs have response times in milliseconds in column 8:
awk '{print $8}' access.log | sort -n | awk '
# Store the sorted values 1-indexed so lines[k] is the k-th fastest request
{lines[NR]=$1}
END{
if (NR==0) exit
# Nearest-rank percentile: the value at position ceil(NR*p/100)
p50=int((NR*50+99)/100); p95=int((NR*95+99)/100); p99=int((NR*99+99)/100)
print "p50:", lines[p50], "ms"
print "p95:", lines[p95], "ms"
print "p99:", lines[p99], "ms"
print "max:", lines[NR], "ms"
}'
NETWORK LATENCY: IS THE PROBLEM ON THE WIRE?
ping -c 100 db-server-01 # 100 pings, look at min/avg/max/mdev
# mdev (standard deviation) tells you about consistency
# High mdev = inconsistent latency = network jitter
# Measure actual TCP connection time:
time nc -zv db-server-01 5432
# Trace the route and measure each hop:
mtr --report --report-cycles 100 db-server-01
# Shows packet loss and latency at each network hop
MEASURING DISK I/O
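# Is the disk fast enough?
# Point "of=" at the disk you want to test: /tmp is RAM-backed (tmpfs) on some systems
dd if=/dev/zero of=/tmp/testfile bs=1M count=1000 conv=fdatasync
# Shows sequential write speed: typically hundreds of MB/s or more on SSD, roughly 100-200 MB/s on HDD
rm -f /tmp/testfile # remove the 1 GB test file afterwards
# Real-time disk I/O:
iostat -x 1 10
# %util column: near 100% = disk is saturated
# await: average wait time for I/O requests in milliseconds (newer sysstat versions report r_await and w_await instead)
# If await > 20ms on SSD, something is wrong
# Which process is using the disk?
sudo iotop -b -n 3 | head -20
# Shows I/O per process: find the disk hog
MEASURING MEMORY PRESSURE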
free -h # overall memory
vmstat 1 10 # watch memory over 10 seconds
# si (swap in) and so (swap out) should both be 0
# If they stay non-zero: system is actively swapping = serious memory pressure
# Is the kernel reclaiming memory too aggressively?
cat /proc/sys/vm/swappiness # default 60; 10 is a common choice for production servers
sudo sysctl vm.swappiness=10 # change without reboot
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.conf # persist after reboot
# OOM killer history:
sudo dmesg | grep -i "oom\|killed\|out of memory" | tail -10
FINDING SLOW DATABASE QUERIES
# PostgreSQL: slowest queries by mean execution time since statistics were last reset
# (requires the pg_stat_statements extension; the column is named mean_time before PostgreSQL 13)
sudo -u postgres psql -c "SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;"
# Check if PostgreSQL sessions are waiting on locks:
sudo -u postgres psql -c "SELECT pid, wait_event_type, wait_event, query FROM pg_stat_activity WHERE wait_event_type = 'Lock';"
# MySQL: show statements that have been running for more than 5 seconds (idle "Sleep" connections excluded):
sudo mysql -e "SHOW FULL PROCESSLIST;" | awk -F'\t' 'NR==1 || ($5 != "Sleep" && $6 > 5)'
NETWORK CONNECTIONS: TOO MANY?
ss -s
# Shows connection counts by state
# Large TIME_WAIT count = connections closing but not yet cleaned up
# Large CLOSE_WAIT = application not closing connections properly
# Count connections per state:
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn
# Connection count growing? Watch it:
watch -n 2 'ss -s | grep "TCP:"'
# Which remote IPs are connecting most? (strips the trailing :port, so IPv6 addresses survive)
ss -tn | awk 'NR>1{sub(/:[0-9]+$/, "", $5); print $5}' | sort | uniq -c | sort -rn | head -10
WHAT DIVYA FOUND
After instrumenting everything, Divya found the answer with this single command:
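sudo strace -f -p "$(pgrep -d, -f payment)" -e trace=network -c
# -f follows every thread; -c prints a per-syscall count and time summary when you detach with Ctrl-C
The payment service was making a DNS lookup on every database connection. The DNS server was located in a different data centre with 45ms latency. The requests in the slowest 1% were the ones that piled up waiting for DNS responses at the same time.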
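One way to check for this kind of delay directly is to time name resolution on its own. The sketch below is an assumption about how one might do that, not the command Divya ran. resolve_ms is a hypothetical helper; it relies on GNU date (for %N nanoseconds) and on getent, which resolves names through the system's NSS configuration. Local caches such as nscd or systemd-resolved can hide slow lookups, so treat the numbers as a rough indication.

```shell
# resolve_ms HOST [COUNT]: print the time in milliseconds taken by each of
# COUNT lookups of HOST through the system resolver (default: 20 lookups).
# The body runs in a subshell (note the parentheses) so variables stay local.
resolve_ms() (
    host=$1
    count=${2:-20}
    i=0
    while [ "$i" -lt "$count" ]; do
        start=$(date +%s%N)                          # nanoseconds; needs GNU date
        getent hosts "$host" > /dev/null || true     # time it even if the lookup fails
        end=$(date +%s%N)
        echo $(( (end - start) / 1000000 ))          # convert ns to ms
        i=$((i + 1))
    done
)

# HOST defaults to localhost so the sketch runs anywhere;
# set HOST=db-server-01 to test the database name from this story.
HOST=${HOST:-localhost}
echo "slowest of 20 lookups: $(resolve_ms "$HOST" 20 | sort -n | tail -n 1) ms"
```

Lookups that take tens of milliseconds each, repeated for every new database connection, would be consistent with what the strace summary pointed to.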
Fix: add the database IP to /etc/hosts on the payment server.
echo "10.0.0.6 db-server-01" | sudo tee -a /etc/hosts
p99 latency dropped from 4000ms to 180ms. The invisible problem was a missing /etc/hosts entry.
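An /etc/hosts entry only helps if the resolver consults that file before DNS, which on glibc systems is controlled by the hosts: line in /etc/nsswitch.conf. The following is a minimal sketch for checking that. files_before_dns is a hypothetical helper name, and the check assumes the application resolves names through the system resolver and not through its own DNS client.

```shell
# files_before_dns FILE: succeed if the "hosts:" line in an nsswitch.conf-style
# FILE lists "files" (/etc/hosts) and lists it ahead of "dns" (or has no "dns").
files_before_dns() {
    awk '
        $1 == "hosts:" {
            for (i = 2; i <= NF; i++) {
                if ($i == "files" && !f) f = i
                if ($i == "dns" && !d)   d = i
            }
            exit                        # jump to END after the first hosts: line
        }
        END { exit !(f && (!d || f < d)) }
    ' "$1"
}

if files_before_dns /etc/nsswitch.conf; then
    echo "/etc/hosts is consulted before DNS"
else
    echo "/etc/hosts is not consulted before DNS (or nsswitch.conf is missing)"
fi

# Show what the system resolver returns for the name. HOST is an assumption:
# set HOST=db-server-01 on the payment server.
getent hosts "${HOST:-localhost}" || echo "no entry found"
```

If getent prints 10.0.0.6 for db-server-01, lookups through the system resolver are answered from /etc/hosts without a round trip to the DNS server.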
The lesson: always measure before assuming. Healthy CPU and memory graphs do not prove the system is healthy. The problem could be DNS, network jitter, disk wait, or a lock. Measure first, conclude second.
p99 latency is the response time that 99 in 100 requests beat: a fine median with a bad p99 means 1% of requests are slow
mtr --report gives you per-hop latency and packet loss — better than traceroute for diagnosing network issues
iostat -x: %util near 100% means disk is saturated, await > 20ms on SSD indicates a problem
vm.swappiness is commonly lowered to 10 on production servers (the default is 60) to reduce swap usage
The invisible problem is often DNS — a single missing /etc/hosts entry caused a 4-second p99 latency