
Linux for Production Support Ch 32 / 32 Expert

🏗️

Dhruv builds the system that never goes down

nginx load balancer, keepalived, PostgreSQL replication — real high availability

⏱ 15 min 6 commands 5 takeaways


In this chapter

Dhruv

Infrastructure engineer, eliminating single points of failure

The story

Dhruv's company had one database server. One app server. One load balancer. Every Friday evening the CTO asked the same question: what happens if any of these goes down?

Dhruv's answer for 2 years was: we restore from backup. Takes about 2 hours.

The CTO stopped accepting that answer. He gave Dhruv 3 months to build a high availability setup. No more single points of failure.

WHAT HIGH AVAILABILITY MEANS

High availability (HA) means the system keeps working even when individual components fail. The target is usually expressed as uptime:

99%    uptime = 3.65 days downtime per year
99.9%  uptime = 8.76 hours downtime per year
99.99% uptime = 52.6 minutes downtime per year

For most Indian startups, 99.9% is the right target. Getting from 99% to 99.9% is achievable with basic HA. Getting to 99.99% requires significant investment.
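
The figures above follow from simple arithmetic: allowed downtime is the unavailable fraction multiplied by the hours in a year. A minimal sketch, assuming a 365-day year (a 365.25-day year shifts the results slightly); the function name downtime_hours is made up for this example:

```shell
#!/usr/bin/env bash
# Allowed downtime per year for a given availability percentage.
# Assumes a 365-day year; prints hours with two decimals.
downtime_hours() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", (100 - pct) / 100 * 365 * 24 }'
}

for target in 99 99.9 99.99; do
    echo "$target% uptime -> $(downtime_hours "$target") hours of downtime per year"
done
```

For 99.99% this prints 0.88 hours, which is the same budget as the 52.6 minutes listed above.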

NGINX AS A LOAD BALANCER

nginx can distribute traffic across multiple app servers:

# /etc/nginx/nginx.conf (both blocks go inside the http { } block)

upstream app_servers {
    least_conn;                 # route to server with fewest active connections
    server 10.0.0.10:8080;     # app server 1
    server 10.0.0.11:8080;     # app server 2
    server 10.0.0.12:8080;     # app server 3

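    # Keep up to 32 idle connections per worker open to the app servers.
    # This is connection reuse, not a health check. Active health checks
    # need NGINX Plus; open-source nginx marks a server as failed passively,
    # after requests to it fail (see max_fails / fail_timeout).
    keepalive 32;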
}

server {
    listen 80;
    location / {
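        proxy_pass http://app_servers;
        proxy_http_version 1.1;              # these two lines are required for the
        proxy_set_header Connection "";      # upstream keepalive connections to be reused
        proxy_next_upstream error timeout;   # on connection error or timeout, retry on the next server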
        proxy_connect_timeout 5s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}

# Test config:
sudo nginx -t

# After updating config:
sudo systemctl reload nginx     # zero-downtime reload
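
To make the least_conn directive concrete, here is a simplified sketch of the selection rule: given the current number of active connections per backend, pick the backend with the fewest. This is an illustration only, not nginx's implementation (nginx also takes server weights into account); the function name pick_least_conn and the connection counts are invented for the example.

```shell
#!/usr/bin/env bash
# Pick the backend with the fewest active connections.
# Input on stdin: one "address active_connections" pair per line.
pick_least_conn() {
    sort -k2,2n | head -n 1 | cut -d' ' -f1
}

# Hypothetical snapshot of active connections per app server:
printf '%s\n' \
    "10.0.0.10:8080 12" \
    "10.0.0.11:8080 4" \
    "10.0.0.12:8080 9" | pick_least_conn
```

With this snapshot the next request would go to 10.0.0.11:8080, the server with 4 active connections.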

KEEPALIVED — FLOATING IP FAILOVER

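Keepalived manages a virtual (floating) IP that moves between servers automatically, using the VRRP protocol. If the primary nginx server stops sending VRRP advertisements, the secondary takes over the IP within a few seconds. As configured below, failover triggers when the primary host or keepalived itself goes down; detecting a crashed nginx process on a healthy host additionally requires a vrrp_script health check.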

sudo apt install keepalived

# /etc/keepalived/keepalived.conf on PRIMARY nginx server:
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100            # higher priority = preferred master
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass secretpassword   # placeholder; keepalived uses only the first 8 characters
    }

    virtual_ipaddress {
        10.0.0.100/24       # the floating IP — point DNS here
    }
}

# /etc/keepalived/keepalived.conf on SECONDARY nginx server:
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90             # lower priority = stays backup
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass secretpassword   # must match the primary; only the first 8 characters are used
    }

    virtual_ipaddress {
        10.0.0.100/24
    }
}

sudo systemctl enable keepalived && sudo systemctl start keepalived

# Check which server holds the floating IP:
ip addr show eth0 | grep 10.0.0.100
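
How quickly the backup takes over follows from the VRRP timers. Under VRRP version 2 (RFC 3768, which keepalived uses by default), a backup declares the master dead after 3 × advert_int plus a skew time of (256 − priority) / 256 seconds. A small sketch of that calculation; the function name master_down_interval is an assumption for this example:

```shell
#!/usr/bin/env bash
# VRRPv2 (RFC 3768) failover timing on the BACKUP node:
#   skew_time            = (256 - priority) / 256
#   master_down_interval = 3 * advert_int + skew_time
master_down_interval() {
    local advert_int="$1" priority="$2"
    awk -v a="$advert_int" -v p="$priority" \
        'BEGIN { printf "%.2f\n", 3 * a + (256 - p) / 256 }'
}

# Values from the SECONDARY config above: advert_int 1, priority 90
master_down_interval 1 90
```

With advert_int 1 and priority 90 this gives 3.65 seconds before the backup claims the floating IP. Lowering advert_int shortens failover at the cost of more VRRP traffic.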

POSTGRESQL STREAMING REPLICATION

PostgreSQL streaming replication keeps a standby server in sync with the primary. If the primary fails, you can promote the standby.

# On PRIMARY server — postgresql.conf:
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1GB

# On PRIMARY — pg_hba.conf (allow standby to connect for replication):
host replication replicator 10.0.0.21/32 md5

# On PRIMARY — create replication user:
sudo -u postgres createuser --replication -P replicator

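# On STANDBY: stop PostgreSQL and empty the data directory (pg_basebackup needs an empty target).
# This deletes whatever data the standby currently holds.
sudo systemctl stop postgresql
sudo -u postgres find /var/lib/postgresql/14/main -mindepth 1 -delete

# On STANDBY: initial sync from the primary:
sudo -u postgres pg_basebackup -h 10.0.0.20 -U replicator -D /var/lib/postgresql/14/main -P -Xs -R
# -R creates standby.signal and writes primary_conninfo to postgresql.auto.conf

# On STANDBY: postgresql.conf, only needed if you did not use -R.
# postgresql.auto.conf takes precedence over this file, and a ~/.pgpass entry
# is safer than a password written inline:
primary_conninfo = 'host=10.0.0.20 port=5432 user=replicator password=secret'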

sudo systemctl start postgresql

# Check replication lag on PRIMARY:
sudo -u postgres psql -c "SELECT client_addr, state, sent_lsn, write_lsn, replay_lsn FROM pg_stat_replication;"

# Promote standby to primary (when primary fails).
# On Debian/Ubuntu pg_ctl is not on PATH, so call it by its full path:
sudo -u postgres /usr/lib/postgresql/14/bin/pg_ctl promote -D /var/lib/postgresql/14/main
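
The pg_stat_replication query above returns LSNs such as 0/3000148, which are hard to compare by eye. An LSN is a 64-bit byte position in the WAL written as two hexadecimal halves, so the lag in bytes is the difference between two positions. A minimal sketch of that conversion in bash; the helper names and sample LSNs are invented for illustration, and on a live server the built-in SQL function pg_wal_lsn_diff(sent_lsn, replay_lsn) does the same job.

```shell
#!/usr/bin/env bash
# Convert an LSN such as "0/3000148" to an absolute byte position.
# The part before the slash is the high 32 bits, the part after is the low 32 bits.
lsn_to_bytes() {
    local hi="${1%%/*}" lo="${1##*/}"
    echo $(( (16#$hi << 32) + 16#$lo ))
}

# Bytes the standby still has to replay: sent position minus replayed position.
lsn_lag_bytes() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Hypothetical values of sent_lsn and replay_lsn:
lsn_lag_bytes "0/3000148" "0/3000000"
```

Here the standby is 328 bytes behind. A lag that keeps growing, rather than hovering near zero, suggests the standby cannot keep up or has lost its connection.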

HEALTH CHECKS AND FAILOVER ALERTS

#!/bin/bash
# /usr/local/bin/health-check-primary.sh
# Run every 30 seconds via a systemd timer (cron cannot schedule below one minute).
# Expects TOKEN (Telegram bot token) and CHAT_ID in the environment.
# This script only alerts; promoting the standby remains a manual step.

set -u   # abort if TOKEN or CHAT_ID is missing

PRIMARY="10.0.0.20"
STANDBY="10.0.0.21"
ALERT_WEBHOOK="https://api.telegram.org/bot${TOKEN}/sendMessage"

check_primary() {
    pg_isready -h "$PRIMARY" -U appuser -d appdb -t 5 > /dev/null 2>&1
}

if ! check_primary; then
    curl -s --max-time 10 -X POST "$ALERT_WEBHOOK" -d "chat_id=${CHAT_ID}" \
        --data-urlencode "text=CRITICAL: Primary database $PRIMARY is DOWN. Manual failover may be needed."
    logger "Primary database DOWN - standby is at $STANDBY"
fi
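
A single failed probe can be a network blip rather than a dead database, and an alert that asks for manual failover should not fire on blips. One common refinement is to alert only after several consecutive failures. A minimal sketch of such a wrapper; check_with_retries and the RETRY_DELAY variable are hypothetical names, not part of the script above:

```shell
#!/usr/bin/env bash
# Run a check command up to N times and succeed as soon as one attempt passes.
# Returns non-zero only if every attempt fails.
# RETRY_DELAY (seconds between attempts, default 2) is a made-up knob for this sketch.
check_with_retries() {
    local attempts="$1"; shift
    local i
    for (( i = 1; i <= attempts; i++ )); do
        "$@" && return 0
        (( i < attempts )) && sleep "${RETRY_DELAY:-2}"
    done
    return 1
}

# Demo with stand-in commands: "true" always passes, "false" always fails.
RETRY_DELAY=0 check_with_retries 3 true  && echo "check passed"
RETRY_DELAY=0 check_with_retries 3 false || echo "check failed 3 times in a row"
```

In the health check script, the condition could then read: if ! check_with_retries 3 pg_isready -h "$PRIMARY" -U appuser -d appdb -t 5; then ... which delays the alert by a few seconds but filters out one-off timeouts.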

WHAT DHRUV BUILT

Internet
   |
DNS (Route 53 / Cloudflare)
   |
Floating IP 10.0.0.100 (keepalived)
   /                 \
nginx-01 (MASTER)    nginx-02 (BACKUP)
   \                 /
app-01       app-02       app-03
10.0.0.10    10.0.0.11    10.0.0.12
                |
postgres-primary  -->  postgres-standby
10.0.0.20              10.0.0.21
(streaming replication, 30s lag)

What goes down and what happens:

One app server dies  →  nginx retries on another server and routes around it
nginx-01 dies        →  keepalived moves the floating IP to nginx-02 in about 3 to 4 seconds
Primary DB dies      →  alert fires, manual promotion of standby in 5 minutes

Recovery time went from 2 hours to 5 minutes for the worst case. For the common case (one app server failing) it became effectively zero: nginx retries the failed request on another server, so users see a slower response at worst.

Key takeaways

proxy_next_upstream error timeout in nginx retries a failed request on the next server, so users rarely notice when one app server dies (by default, non-idempotent requests such as POST are not retried once they have been sent)

Keepalived moves a floating IP between servers in about 3 to 4 seconds with advert_int 1. Point your DNS to the floating IP, not individual servers

PostgreSQL streaming replication: the standby stays live and answers read-only queries, and promotion takes one command if the primary fails

pg_basebackup -R on the standby creates the standby.signal and recovery config automatically

Target 99.9% uptime for startups — that is under 9 hours downtime per year and achievable with two app servers and one standby database

Commands from this chapter

$ sudo nginx -t && sudo systemctl reload nginx

Test nginx config and reload with zero downtime

$ ip addr show eth0 | grep 10.0.0.100

Check which server currently holds the floating IP

$ sudo -u postgres psql -c "SELECT client_addr, state, replay_lsn FROM pg_stat_replication;"

Check PostgreSQL replication status and lag

$ sudo -u postgres pg_basebackup -h primary -U replicator -D /var/lib/postgresql/14/main -P -Xs -R

Initial standby sync from primary

$ sudo -u postgres /usr/lib/postgresql/14/bin/pg_ctl promote -D /var/lib/postgresql/14/main

Promote standby to primary when primary fails

$ pg_isready -h 10.0.0.20 -U appuser -t 5

Health check primary database — returns 0 if accepting connections