Dhruv builds the system that never goes down
nginx load balancer, keepalived, PostgreSQL replication — real high availability
Dhruv's company had one database server. One app server. One load balancer. Every Friday evening the CTO asked the same question: what happens if any of these goes down?
Dhruv's answer for 2 years was: we restore from backup. Takes about 2 hours.
The CTO stopped accepting that answer. He gave Dhruv 3 months to build a high availability setup. No more single points of failure.
WHAT HIGH AVAILABILITY MEANS
High availability (HA) means the system keeps working even when individual components fail. The target is usually expressed as uptime:
99% uptime = 3.65 days downtime per year
99.9% uptime = 8.77 hours downtime per year
99.99% uptime = 52.6 minutes downtime per year
For most Indian startups, 99.9% is the right target. Getting from 99% to 99.9% is achievable with basic HA. Getting to 99.99% requires significant investment.
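The downtime figures are plain arithmetic: the share of the year the system is allowed to be unavailable. Here is a minimal shell sketch of that calculation, assuming a 365.25-day year. The function name downtime_minutes_per_year is made up for illustration.

```shell
#!/bin/bash
# Sketch: yearly downtime budget for a given uptime target.
# Assumption: a 365.25-day year. downtime_minutes_per_year is an illustrative name.
downtime_minutes_per_year() {
    # $1 = uptime target in percent, e.g. 99.9
    awk -v uptime="$1" 'BEGIN {
        minutes_per_year = 365.25 * 24 * 60
        printf "%.1f\n", (100 - uptime) / 100 * minutes_per_year
    }'
}

for target in 99 99.9 99.99; do
    echo "$target% uptime allows $(downtime_minutes_per_year "$target") minutes of downtime per year"
done
```

The loop prints 5259.6, 526.0 and 52.6 minutes, which match the 3.65 days, 8.77 hours and 52.6 minutes listed above.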
NGINX AS A LOAD BALANCER
nginx can distribute traffic across multiple app servers:
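# /etc/nginx/nginx.conf (these blocks go inside http { })
upstream app_servers {
    least_conn;                  # route to the server with the fewest active connections
    server 10.0.0.10:8080;       # app server 1
    server 10.0.0.11:8080;       # app server 2
    server 10.0.0.12:8080;       # app server 3

    # Active health checks require NGINX Plus. Open-source nginx only marks a
    # server as failed passively, after real requests to it fail (max_fails, fail_timeout).
    # keepalive is not a health check: it caches up to 32 idle upstream
    # connections per worker process so they can be reused.
    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://app_servers;
        proxy_http_version 1.1;              # required for upstream keepalive
        proxy_set_header Connection "";      # required for upstream keepalive
        proxy_next_upstream error timeout;   # retry on the next server on failure
        proxy_connect_timeout 5s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}

# Test config:
sudo nginx -t

# After updating config:
sudo systemctl reload nginx    # zero-downtime reload
KEEPALIVED: FLOATING IP FAILOVER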
Keepalived manages a virtual/floating IP that moves between servers automatically. If the primary nginx fails, the secondary takes over the IP in seconds.
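How long "in seconds" is follows from the VRRP timers. Under VRRP version 2 (RFC 3768), a backup declares the master dead after 3 times advert_int plus a skew of (256 - priority) / 256 seconds. Here is a rough sketch of that arithmetic, assuming keepalived runs VRRPv2 with the advert_int and priority values used in this section's configs. The name master_down_interval is illustrative, not a keepalived command.

```shell
#!/bin/bash
# Sketch: VRRPv2 master-down interval as seen by a backup node.
# Assumption: VRRP version 2 timers; master_down_interval is an illustrative name.
master_down_interval() {
    # $1 = advert_int in seconds, $2 = priority of the backup node
    awk -v advert="$1" -v prio="$2" 'BEGIN {
        skew = (256 - prio) / 256
        printf "%.2f\n", 3 * advert + skew
    }'
}

# Backup node with advert_int 1 and priority 90:
master_down_interval 1 90
```

With advert_int 1 and priority 90 this prints 3.65, so the backup claims the floating IP a little under four seconds after the master stops advertising. A smaller advert_int shortens the gap but produces more VRRP traffic.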
sudo apt install keepalived

# /etc/keepalived/keepalived.conf on PRIMARY nginx server:
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100                     # higher priority = preferred master
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secretpassword     # only the first 8 characters are used
    }
    virtual_ipaddress {
        10.0.0.100/24                # the floating IP: point DNS here
    }
}

# /etc/keepalived/keepalived.conf on SECONDARY nginx server:
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90                      # lower priority = stays backup
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secretpassword
    }
    virtual_ipaddress {
        10.0.0.100/24
    }
}

sudo systemctl enable keepalived && sudo systemctl start keepalived

# Check which server holds the floating IP:
ip addr show eth0 | grep -F 10.0.0.100
POSTGRESQL STREAMING REPLICATION
PostgreSQL streaming replication keeps a standby server in sync with the primary. If the primary fails, you can promote the standby.
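How far the standby trails the primary can be measured as the byte distance between two WAL positions (LSNs), which PostgreSQL prints as two hexadecimal numbers separated by a slash. Inside SQL, the built-in pg_wal_lsn_diff() does this subtraction. The sketch below only illustrates the arithmetic in shell: lsn_to_bytes and lsn_lag_bytes are hypothetical helper names, and the LSN values are made up.

```shell
#!/bin/bash
# Sketch: byte distance between two PostgreSQL LSNs.
# lsn_to_bytes and lsn_lag_bytes are illustrative helpers, not PostgreSQL tools.
lsn_to_bytes() {
    # An LSN such as 16/B374D848 is <high 32 bits>/<low 32 bits>, both in hex.
    local hi="${1%%/*}"
    local lo="${1#*/}"
    echo $(( (0x$hi << 32) + 0x$lo ))
}

lsn_lag_bytes() {
    # $1 = position on the primary (e.g. sent_lsn)
    # $2 = position reached by the standby (e.g. replay_lsn)
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Made-up example values: the standby is 0x60 = 96 bytes behind.
lsn_lag_bytes 0/3000060 0/3000000
```

A lag that stays near zero means the standby is keeping up. A steadily growing number points to a slow network link or a standby that cannot replay WAL fast enough.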
# On PRIMARY server, postgresql.conf (restart PostgreSQL after changing wal_level):
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1GB

# On PRIMARY, pg_hba.conf (allow standby to connect for replication):
host replication replicator 10.0.0.21/32 md5

# On PRIMARY, create replication user:
sudo -u postgres createuser --replication -P replicator

# On STANDBY, initial sync (PostgreSQL must be stopped and the data directory empty):
sudo -u postgres pg_basebackup -h 10.0.0.20 -U replicator -D /var/lib/postgresql/14/main -P -Xs -R
# -R creates standby.signal and writes primary_conninfo to postgresql.auto.conf automatically

# On STANDBY, postgresql.conf (only needed to set or override the connection string by hand):
primary_conninfo = 'host=10.0.0.20 port=5432 user=replicator password=secret'

sudo systemctl start postgresql

# Check replication lag on PRIMARY (compare sent_lsn with replay_lsn):
sudo -u postgres psql -c "SELECT client_addr, state, sent_lsn, write_lsn, replay_lsn FROM pg_stat_replication;"

# Promote standby to primary (when primary fails):
# On Debian/Ubuntu pg_ctl is usually not on PATH; pg_ctlcluster 14 main promote does the same job.
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/14/main
HEALTH CHECKS AND AUTOMATIC FAILOVER
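The alert script in this section fires on the first failed pg_isready probe, so a single dropped packet can page someone at night. One common mitigation is to report failure only after several consecutive failed probes. Here is a minimal sketch of that idea. probe_with_retries is a hypothetical helper, and the attempt count and delay are arbitrary assumptions to tune.

```shell
#!/bin/bash
# Sketch: treat a target as down only after N consecutive failed probes.
# probe_with_retries is an illustrative helper, not part of PostgreSQL or nginx.
probe_with_retries() {
    # $1 = number of attempts, $2 = seconds to wait between attempts,
    # remaining arguments = the probe command to run
    local attempts="$1"
    local delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" > /dev/null 2>&1; then
            return 0    # one successful probe means the target is healthy
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1            # every attempt failed
}

# Example use inside the health-check script (3 attempts, 5 seconds apart):
#   probe_with_retries 3 5 pg_isready -h 10.0.0.20 -t 5 || echo "primary looks down"
```

The trade-off is detection time: three attempts five seconds apart add roughly ten seconds before the alert fires, in exchange for fewer false alarms.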
#!/bin/bash
# /usr/local/bin/health-check-primary.sh
# Run every 30 seconds via a systemd timer (cron cannot schedule below one minute)
# TOKEN and CHAT_ID must be set in the environment before this script runs

PRIMARY="10.0.0.20"
STANDBY="10.0.0.21"
ALERT_WEBHOOK="https://api.telegram.org/bot${TOKEN}/sendMessage"

check_primary() {
    pg_isready -h "$PRIMARY" -U appuser -d appdb -t 5 > /dev/null 2>&1
}

if ! check_primary; then
    curl -s -X POST "$ALERT_WEBHOOK" -d "chat_id=$CHAT_ID" \
        --data-urlencode "text=CRITICAL: Primary database $PRIMARY is DOWN. Manual failover may be needed."
    logger "Primary database DOWN - standby is at $STANDBY"
fi
WHAT DHRUV BUILT
                  Internet
                     |
         DNS (Route 53 / Cloudflare)
                     |
     Floating IP 10.0.0.100 (keepalived)
              /              \
  nginx-01 (MASTER)      nginx-02 (BACKUP)
         |           |           |
      app-01      app-02      app-03
    10.0.0.10   10.0.0.11   10.0.0.12
                     |
  postgres-primary        postgres-standby
      10.0.0.20              10.0.0.21
      (streaming replication, 30s lag)
What goes down and what happens:
One app server dies → nginx routes around it automatically
nginx-01 dies → keepalived moves floating IP to nginx-02 in 3 seconds
Primary DB dies → alert fires, manual promotion of standby in 5 minutes
Recovery time went from 2 hours to 5 minutes for the worst case. For the common case (one app server failing) it became zero: traffic routes around it automatically.
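To put the drop from 2 hours to 5 minutes in terms of an uptime target, it helps to count how many worst-case incidents fit into a year's downtime budget. Here is a small sketch, assuming a 99.9% target, a 365.25-day year, and that every incident costs the full recovery time. incidents_within_budget is an illustrative name.

```shell
#!/bin/bash
# Sketch: how many worst-case incidents fit into the yearly downtime budget.
# Assumptions: 365.25-day year, each incident costs the full recovery time.
incidents_within_budget() {
    # $1 = uptime target in percent, $2 = minutes of downtime per incident
    awk -v uptime="$1" -v per_incident="$2" 'BEGIN {
        budget = (100 - uptime) / 100 * 365.25 * 24 * 60
        print int(budget / per_incident)
    }'
}

incidents_within_budget 99.9 120    # 2-hour restore from backup: prints 4
incidents_within_budget 99.9 5      # 5-minute manual promotion: prints 105
```

At 2 hours per incident, the fifth database failure in a year breaks a 99.9% target. At 5 minutes per incident, the budget absorbs roughly a hundred.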
proxy_next_upstream error timeout in nginx automatically retries a failed request on the next server, so users rarely notice when one app server dies (by default nginx does not retry non-idempotent requests such as POST once they have been sent upstream)
Keepalived moves a floating IP between servers in about 3 seconds: point your DNS at the floating IP, not at individual servers
PostgreSQL streaming replication: the standby stays live and answers read-only queries, and promotion takes one command if the primary fails
pg_basebackup -R on the standby creates the standby.signal and recovery config automatically
Target 99.9% uptime for startups — that is under 9 hours downtime per year and achievable with two app servers and one standby database