Linux for Production Support Ch 21 / 32 Expert

Divya chases the invisible p99

Latency percentiles, iostat, vmstat, strace — measuring what you cannot see

⏱ 14 min 6 commands 5 takeaways

In this chapter

Divya

Senior SRE, chasing a p99 problem

The story

Divya was brought in to fix a persistent problem: the payment service was randomly slow. Not always. Not predictably. Median response time was fine. But p99 latency was 4 seconds, and the SLA required under 1 second at p99.

Three previous engineers had looked at it and found nothing. CPU was fine. Memory was fine. Database queries were fast. The problem was invisible.

Divya decided to measure everything.

LATENCY PERCENTILES — WHAT P99 ACTUALLY MEANS

Before measuring, understand what you are measuring.

If you serve 1000 requests and sort them by response time, the median (p50) is request number 500, p95 is request number 950, and p99 is request number 990. Put another way, p99 is the response time that 99% of requests meet or beat.

If your p50 is 50ms and your p99 is 4000ms, your median user is fine. Your 1-in-100 user waits 4 seconds. That 1-in-100 is often your most important customer: the users who send the most requests are the most likely to hit the slow tail.

Finding p99 from logs:

# If your logs have response times in milliseconds in column 8:
awk '$8 ~ /^[0-9.]+$/ {print $8}' access.log | sort -n | awk '
    function pct(p,   i) {              # nearest rank: smallest i with i >= NR*p
        i = int(NR*p); if (i < NR*p) i++
        return v[i]
    }
    {v[NR]=$1}                          # input is sorted, so v[1] is the fastest request
    END{
        if (NR == 0) { print "no numeric samples found"; exit 1 }
        print "p50:", pct(0.50), "ms"
        print "p95:", pct(0.95), "ms"
        print "p99:", pct(0.99), "ms"
        print "max:", v[NR], "ms"
    }'
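Percentiles tell you how slow the tail is; against an SLA you usually also want to know how many requests actually broke it. A minimal sketch, assuming one latency in milliseconds per line on stdin (the function name sla_report is made up for this example):

```shell
# sla_report THRESHOLD_MS: read one latency (ms) per line on stdin and
# report how many requests were slower than the threshold.
sla_report() {
    awk -v limit="$1" '
        /^[0-9]+(\.[0-9]+)?$/ {          # ignore anything that is not a number
            total++
            if ($1 + 0 > limit + 0) slow++
        }
        END {
            if (total == 0) { print "no samples"; exit 1 }
            printf "total=%d over_%sms=%d (%.2f%%)\n", total, limit, slow, 100 * slow / total
        }'
}

# Usage, assuming the response time is column 8 as in the command above:
#   awk '{print $8}' access.log | sla_report 1000
```

With a 1-second SLA at p99, anything above 1.00% in the last field means the SLA is being missed.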

NETWORK LATENCY — IS THE PROBLEM ON THE WIRE?

ping -c 100 db-server-01                # 100 pings, look at min/avg/max/mdev
# mdev (standard deviation) tells you about consistency
# High mdev = inconsistent latency = network jitter

# Measure actual TCP connection time (the total includes resolving the hostname):
time nc -zv db-server-01 5432

# Trace the route and measure each hop:
mtr --report --report-cycles 100 db-server-01
# Shows packet loss and latency at each network hop
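To compare jitter across hosts it helps to express mdev relative to the average round trip. A rough sketch, assuming the Linux iputils summary line ("rtt min/avg/max/mdev = ..."); BSD and macOS ping label the line differently and would need a different pattern. The name ping_jitter is hypothetical:

```shell
# ping_jitter: read ping output on stdin and print the average RTT, mdev,
# and mdev as a percentage of the average.
ping_jitter() {
    awk -F'[ =/]+' '
        /^rtt min\/avg\/max\/mdev/ {
            # Split on spaces, "=" and "/", the fields are:
            # rtt min avg max mdev <min> <avg> <max> <mdev> ms
            found = 1
            pct = ($7 + 0 > 0) ? 100 * $9 / $7 : 0
            printf "avg=%sms mdev=%sms jitter=%.0f%%\n", $7, $9, pct
        }
        END { if (!found) exit 1 }'      # non-zero exit if no summary line was seen
}

# Usage:
#   ping -c 100 db-server-01 | ping_jitter
```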

MEASURING DISK I/O

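# Is the disk fast enough? (Point of= at the disk under test; on many systems /tmp is RAM-backed tmpfs)
dd if=/dev/zero of=/tmp/testfile bs=1M count=1000 conv=fdatasync
# Shows sequential write speed: typically hundreds of MB/s or more on SSD, roughly 100-200 MB/s on a modern HDD
rm -f /tmp/testfile                     # remove the 1 GB test file afterwards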

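# Real-time disk I/O:
iostat -x 1 10
# The first report is the average since boot; the ones after it are live
# %util column: near 100% = device busy for the whole interval (saturated on a single HDD; SSD, NVMe and RAID serve requests in parallel, so check await as well)
# r_await / w_await (a single await column on older sysstat): average time per read/write request in milliseconds, queueing included
# If await stays > 20ms on SSD, something is wrong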

# Which process is using the disk? (iotop needs root; -o lists only processes doing I/O)
sudo iotop -b -o -n 3 | head -20
# Shows I/O per process — find the disk hog
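Column order in iostat -x differs between sysstat versions, so picking columns by position is fragile. A sketch that looks the columns up by name in the header row instead; the function name iostat_flag and the thresholds in the usage line are illustrative, and very old sysstat releases that print only a combined await column would be judged on %util alone:

```shell
# iostat_flag AWAIT_MS UTIL_PCT: read `iostat -x` output on stdin and print
# devices whose r_await, w_await or %util exceeds the given thresholds.
iostat_flag() {
    awk -v max_await="$1" -v max_util="$2" '
        /^Device/ {                       # header row: remember column positions
            for (i = 1; i <= NF; i++) col[$i] = i
            in_table = 1
            next
        }
        NF == 0 { in_table = 0; next }    # a blank line ends the device table
        in_table && ("%util" in col) {
            r = ("r_await" in col) ? $(col["r_await"]) : 0
            w = ("w_await" in col) ? $(col["w_await"]) : 0
            u = $(col["%util"])
            if (r + 0 > max_await + 0 || w + 0 > max_await + 0 || u + 0 > max_util + 0)
                printf "%s r_await=%s w_await=%s util=%s\n", $1, r, w, u
        }'
}

# Usage (-y skips the since-boot report):
#   iostat -xy 1 10 | iostat_flag 20 90
```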

MEASURING MEMORY PRESSURE

free -h                     # overall memory
vmstat 1 10                 # watch memory over 10 seconds
# si (swap in) and so (swap out) should both be 0 (ignore the first row: it is the average since boot)
# If they stay non-zero: the system is actively swapping = serious memory pressure

# How readily does the kernel swap out application memory instead of dropping page cache?
cat /proc/sys/vm/swappiness   # default 60; 10 is a common choice for latency-sensitive production servers
sudo sysctl vm.swappiness=10  # change without reboot
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.conf  # persist after reboot

# OOM killer history (prefix with sudo if dmesg is restricted to root on your system):
dmesg | grep -i "oom\|killed\|out of memory" | tail -10
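Watching si and so scroll past is easy to get wrong, so the check can be scripted. A sketch that finds the two columns by name and counts the live samples with any swap traffic; the first data row is skipped because vmstat reports averages since boot there. The function name swap_samples is made up for this example:

```shell
# swap_samples: read `vmstat 1 N` output on stdin and count the samples in
# which pages were swapped in (si) or out (so).
swap_samples() {
    awk '
        $1 == "r" && $2 == "b" {                      # column-name row
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        !("si" in col) || $1 !~ /^[0-9]+$/ { next }   # skip banner rows
        ++rows == 1 { next }                          # first row: since-boot average
        {
            total++
            if ($(col["si"]) + $(col["so"]) > 0) swapping++
        }
        END { printf "samples=%d swapping=%d\n", total, swapping }'
}

# Usage:
#   vmstat 1 10 | swap_samples
```

A non-zero swapping count on a host that is supposed to be idle on swap is worth a closer look with free -h.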

FINDING SLOW DATABASE QUERIES

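# PostgreSQL: slowest queries by mean execution time
# (needs the pg_stat_statements extension; stats are cumulative since the last reset, not a 24-hour window;
#  on PostgreSQL 12 and older the column is mean_time)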
sudo -u postgres psql -c "SELECT query, mean_exec_time, calls
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 20;"

# Check which PostgreSQL sessions are waiting on a lock:
sudo -u postgres psql -c "SELECT pid, wait_event_type, wait_event, query FROM pg_stat_activity WHERE wait_event_type = 'Lock';"

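# MySQL: show currently running slow queries (output is tab-separated; column 5 is Command, column 6 is Time):
sudo mysql -e "SHOW PROCESSLIST;" | awk -F'\t' 'NR==1 || ($5 != "Sleep" && $6 > 5)'  # non-idle queries running more than 5 seconds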

NETWORK CONNECTIONS — TOO MANY?

ss -s
# Shows connection counts by state
# Large TIME_WAIT count = connections closing but not yet cleaned up
# Large CLOSE_WAIT = application not closing connections properly

# Count connections per state:
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn

# Connection count growing? Watch it:
watch -n 2 'ss -s | grep "TCP:"'

# Which remote IPs are connecting most? (sed strips the trailing :port, which also works for IPv6)
ss -tn | awk 'NR>1{print $5}' | sed 's/:[^:]*$//' | sort | uniq -c | sort -rn | head -10
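When one state is piling up, the next question is which peer it is piling up against. A small sketch, assuming the usual ss -tan column layout (state first, peer address fifth); note that ss spells the states with a hyphen, as in CLOSE-WAIT and TIME-WAIT. The function name state_by_peer is hypothetical:

```shell
# state_by_peer STATE: read `ss -tan` output on stdin and count sockets in
# STATE (for example CLOSE-WAIT) per remote address, busiest peer first.
state_by_peer() {
    awk -v want="$1" '
        NR > 1 && $1 == want {
            peer = $5
            sub(/:[^:]*$/, "", peer)      # drop the trailing :port
            count[peer]++
        }
        END { for (p in count) print count[p], p }' | sort -rn
}

# Usage:
#   ss -tan | state_by_peer CLOSE-WAIT
```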

WHAT DIVYA FOUND

After instrumenting everything, Divya found the answer with this single command:

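sudo strace -f -p "$(pgrep -of payment)" -e trace=network -c
# -f follows every thread; pgrep -o picks the oldest matching process so -p gets exactly one PID
# -c prints a per-syscall summary on Ctrl-C; drop it to see each call, including connects to port 53 (DNS)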

The payment service was making a DNS lookup on every database connection. The DNS server was located in a different data centre with 45ms latency. At p99, several requests piled up waiting for DNS responses simultaneously.
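One way to confirm a per-connection DNS lookup is to count the traced network calls that target port 53. A sketch only: it assumes the process talks to the resolver itself (so the socket address shows up in strace output), that no timestamp flags were used, and that the trace was saved to a file; the function name dns_calls and the capture path are hypothetical:

```shell
# dns_calls: read strace output on stdin and count network syscalls whose
# socket address targets port 53 (DNS), grouped by syscall name.
dns_calls() {
    grep -E 'sin6?_port=htons\(53\)' |
        # strip the "[pid N] " (terminal) or "N " (-o file) prefix, then the arguments
        sed -E 's/^(\[pid +[0-9]+\] |[0-9]+ +)//; s/\(.*//' |
        sort | uniq -c | awk '{print $2 "=" $1}'
}

# Usage (PID is the payment service process):
#   sudo timeout 30 strace -f -p "$PID" -e trace=network -o /tmp/payment.strace
#   dns_calls < /tmp/payment.strace
```

If the count of port-53 calls tracks the number of new database connections, the service is resolving the name every time it connects.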

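Fix: add the database IP to /etc/hosts on the payment server, so the name resolves locally without a DNS round trip. (The entry has to be updated by hand if the database IP ever changes.)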

echo "10.0.0.6 db-server-01" | sudo tee -a /etc/hosts

p99 latency dropped from 4000ms to 180ms. The invisible problem was a missing /etc/hosts entry.

The lesson: always measure before assuming. CPU and memory looking fine does not rule out a problem elsewhere: DNS, network jitter, disk wait, or a lock. Measure first, conclude second.

Key takeaways

p99 is the latency that 99 in 100 requests beat: a fine median with a bad p99 means 1% of requests, and the users behind them, suffer

mtr --report gives you per-hop latency and packet loss — better than traceroute for diagnosing network issues

iostat -x: %util near 100% means the device is busy all the time (saturated on a single HDD, not necessarily on SSD or RAID); await > 20ms on SSD indicates a problem

vm.swappiness=10 is a common setting on latency-sensitive production servers (the default is 60): it makes the kernel less willing to swap out application memory

The invisible problem is often DNS: a lookup on every database connection, fixed with a single /etc/hosts entry, was behind a 4-second p99 latency

Commands from this chapter

$ mtr --report --report-cycles 100 hostname

Per-hop latency and packet loss — better than ping for diagnosis

$ iostat -x 1 5

Disk I/O stats — watch %util and await columns

$ vmstat 1 10

System stats every second — si/so non-zero means swap pressure

$ dd if=/dev/zero of=/tmp/test bs=1M count=1000 conv=fdatasync

Measure raw disk write speed

$ sudo sysctl vm.swappiness=10

Reduce swap aggressiveness — important for production Java apps

$ ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn

Count connections by state — find TIME_WAIT and CLOSE_WAIT accumulation