Load Balancing

How it works

  1. Client resolves DNS/edge
  2. LB chooses backend via policy
  3. Health checks remove bad nodes
  4. Draining enables safe deploys
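The steps above can be sketched in a few lines. This is a hypothetical, in-memory model (server names and the pool structure are invented for illustration): the pool maps backends to a health flag, unhealthy nodes are filtered out, and a simple rotating policy picks among the survivors.

```python
# Hypothetical pool: backend name -> healthy? (step 3 removes bad nodes)
servers = {"s1": True, "s2": True, "s3": False}

_counter = 0  # rotation state for the selection policy

def pick_backend(pool: dict) -> str:
    """Step 2: choose a backend via policy, among healthy nodes only."""
    global _counter
    healthy = sorted(name for name, ok in pool.items() if ok)
    if not healthy:
        raise RuntimeError("no healthy backends")
    choice = healthy[_counter % len(healthy)]  # simple rotation policy
    _counter += 1
    return choice
```

Real load balancers keep this state per listener/pool and update the health flags asynchronously; the shape of the decision is the same.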

🎯 What is Load Balancing?

Load balancing works like a smart traffic controller at a busy intersection: it directs each incoming request to the least busy server, ensuring no single server gets overwhelmed while others sit idle.
👥 Clients
⬇️
⚖️ Load Balancer
⬇️ ⬇️ ⬇️
🖥️ Server 1 | 🖥️ Server 2 | 🖥️ Server 3

Overview

  • Distributes requests across instances to improve throughput and resilience.
  • Decouples clients from instance topology and scales horizontally.
  • Works at L4 (faster, simpler) or L7 (smarter, content-aware).

When to use

  • You have multiple replicas of a stateless service.
  • Traffic patterns are bursty or diurnal and require elastic scaling.
  • You need controlled rollouts (canary/blue-green) and resilience to instance failures.

Trade-offs

  • L7 features add latency and operational complexity.
  • Sticky sessions simplify state but reduce fault tolerance and balance.
  • Health-check sensitivity vs. churn: aggressive checks can flap; slow checks delay recovery.

Patterns

  • Anycast/Geo DNS → Regional LBs → Service Mesh.
  • Connection draining for zero-downtime deploys.
  • Weighted routing for gradual rollouts or hotspot mitigation.

Anti-patterns

  • Single LB appliance without redundancy.
  • Global regex routing rules that become the de facto monolith.
  • Inconsistent timeout/retry budgets across client/LB/upstream.

🏗️ Load Balancer Types

🔌 Layer 4 (Transport Layer)

Works at TCP/UDP level

✨ Characteristics:

  • Routes based on IP address and port
  • Fast and simple
  • Protocol agnostic
  • Lower latency

🛠️ Examples:

AWS NLB, HAProxy, F5 BIG-IP

🌐 Layer 7 (Application Layer)

Works at HTTP/HTTPS level

✨ Characteristics:

  • Routes based on content (headers, URLs)
  • Intelligent routing decisions
  • SSL termination
  • Content-based rules

🛠️ Examples:

AWS ALB, Nginx, Cloudflare

🎲 Load Balancing Algorithms

🔄 Round Robin

Distribute requests sequentially across servers

Server 1 → Server 2 → Server 3 → Server 1...
Best for: Equal server capacity
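A minimal round-robin sketch (server names are hypothetical) using Python's `itertools.cycle`, which yields the endless Server 1 → 2 → 3 → 1… sequence shown above:

```python
from itertools import cycle

servers = ["server1", "server2", "server3"]  # hypothetical backend names
rotation = cycle(servers)  # endless 1 -> 2 -> 3 -> 1 ... iterator

def next_server() -> str:
    """Return the next backend in strict sequential order."""
    return next(rotation)
```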

⚖️ Weighted Round Robin

Servers receive requests proportional to their weights

Server 1 (2x) → Server 2 (1x) → Server 3 (3x)
Best for: Different server capacities
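A simple way to sketch weighted round robin (weights are hypothetical) is to expand each server into the cycle `weight` times. Production implementations such as nginx use a smoothed variant that interleaves picks instead of clustering repeats, but the proportions per cycle are the same:

```python
# Hypothetical weights: server3 gets 3x the traffic of server2
weights = {"server1": 2, "server2": 1, "server3": 3}

def build_schedule(weights: dict) -> list:
    """One full cycle: each server appears `weight` times."""
    return [name for name, w in weights.items() for _ in range(w)]
```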

🔗 Least Connections

Route to server with fewest active connections

Choose server with min(active_connections)
Best for: Varying request processing times
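The `min(active_connections)` rule above is a one-liner in Python; the only bookkeeping is incrementing the counter when a request is routed (connection counts below are hypothetical):

```python
# Hypothetical snapshot: backend -> current in-flight requests
active = {"server1": 12, "server2": 3, "server3": 7}

def least_connections(active: dict) -> str:
    """Pick the backend with the fewest active connections."""
    return min(active, key=active.get)

def route(active: dict) -> str:
    """Choose a backend and record the new in-flight request."""
    server = least_connections(active)
    active[server] += 1  # decrement again when the request completes
    return server
```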

⚡ Least Response Time

Route to server with fastest average response

Choose server with min(response_time)
Best for: Performance optimization

🔐 IP Hash

Route based on client IP hash for session persistence

hash(client_ip) % server_count
Best for: Session stickiness without cookies
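The `hash(client_ip) % server_count` rule can be sketched with a stable hash (Python's built-in `hash()` is salted per process, so a cryptographic digest keeps the mapping consistent across restarts). Server names are hypothetical:

```python
import hashlib

servers = ["server1", "server2", "server3"]  # hypothetical pool

def ip_hash(client_ip: str, servers: list) -> str:
    """Map a client IP to a backend: hash(client_ip) % server_count."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]
```

Note that changing `len(servers)` remaps most clients, which breaks stickiness during scaling events.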

🌍 Geographic

Route based on client geographical location

US clients → US servers, EU clients → EU servers
Best for: Latency optimization

🏥 Health Checks

💓 Ensuring Server Health

🌐 HTTP Health Checks

Regular HTTP requests to health endpoints

GET /health → 200 OK {"status": "healthy"}

🔌 TCP Health Checks

Check if server can accept connections

TCP connect to port 80 → Success/Failure

🔧 Custom Health Checks

Application-specific health validation

Check database connectivity, cache status, etc.

🎯 Actions on Health Check Failure:

  • Remove from rotation: Stop sending new requests
  • Drain connections: Let existing requests complete
  • Alert operations: Notify on-call engineers
  • Auto-recovery: Re-add when health checks pass
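The remove/auto-recover actions above are usually implemented with consecutive-pass/fail thresholds, so a single blip does not eject a server and a single good probe does not re-add a still-struggling one. A hypothetical sketch (class name and thresholds are invented):

```python
class HealthTracker:
    """Remove a backend after N consecutive failures; re-add it after
    M consecutive passes. The hysteresis damps flapping."""

    def __init__(self, fail_after: int = 3, recover_after: int = 2):
        self.fail_after = fail_after
        self.recover_after = recover_after
        self.in_rotation = True
        self._fails = 0
        self._passes = 0

    def observe(self, check_ok: bool) -> bool:
        """Record one health-check result; return rotation status."""
        if check_ok:
            self._fails = 0
            self._passes += 1
            if not self.in_rotation and self._passes >= self.recover_after:
                self.in_rotation = True   # auto-recovery
        else:
            self._passes = 0
            self._fails += 1
            if self.in_rotation and self._fails >= self.fail_after:
                self.in_rotation = False  # remove from rotation
        return self.in_rotation
```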

🔐 Session Persistence

🍪 Sticky Sessions

Route user to the same server consistently

Pros: Simple, maintains server-side state
Cons: Uneven load, server failure issues

🗄️ Session Sharing

Store sessions in shared external storage

Pros: Any server can handle requests
Cons: Additional complexity, network calls

🔄 Stateless Design

No server-side session state (JWT tokens)

Pros: Perfect scalability, no session complexity
Cons: Token size, security considerations
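The stateless approach can be sketched with an HMAC-signed token using only the standard library. This is a simplified JWT-like token, not the real JWT format (which adds a header segment and base64-encoded signature); the secret and claims below are hypothetical, and a real deployment would use a managed key:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # hypothetical key; use a KMS-managed secret in practice

def sign(claims: dict) -> str:
    """Serialize claims and append an HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    tag = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + tag

def verify(token: str) -> dict:
    """Check the signature; any server with the key can do this."""
    body, tag = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad signature")
    return json.loads(base64.urlsafe_b64decode(body.encode()))
```

Because verification needs only the shared key, any backend can handle any request: no sticky sessions, no shared session store.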

🔧 Advanced Features

🛡️ SSL Termination

Load balancer handles SSL encryption/decryption, reducing server load

📊 Traffic Shaping

Rate limiting and traffic control to prevent abuse

🔍 Request Routing

Route requests based on URL patterns, headers, or content

📈 Auto Scaling Integration

Automatically add/remove servers based on load metrics

🎯 Load Balancing Best Practices

🔄 Multiple Load Balancers: Avoid single point of failure
📊 Monitor Metrics: Track latency, error rates, server health
🔧 Regular Health Checks: Frequent but lightweight checks
📈 Capacity Planning: Plan for peak load scenarios

🧩 Real-World Scenarios

  • Blue/Green or Canary releases: Split 1–5% to the new version via L7 rules; watch p95/p99 before ramping traffic.
  • Multi-region active-active: Geo/DNS routing to nearest region with local LBs; fail over using health-checked records.
  • WebSockets/Realtime chat: Use L4 or L7 with sticky sessions or a shared pub/sub so messages reach the correct node.

⚠️ Pitfalls and Anti-patterns

  • Single load balancer as SPOF; always run at least 2 across zones.
  • Overly expensive L7 regex rules increase CPU and latency.
  • Sticky sessions hide imbalance and complicate failover; prefer stateless or shared session stores.
  • Missing connection draining causes dropped in-flight requests during deploys.
  • Health checks mis-tuned: too slow/infrequent delays failure detection; too aggressive causes flapping and false removals.
  • SSL misconfig: mixed ciphers, expired certs, or no OCSP stapling degrade UX.
  • Ignoring queue/backlog: SYN backlog/full connection pool looks like “random” 5xx.

🖼️ Quick Architecture Sketches


      # Anycast + Geo + Regional LBs
      Clients ──DNS/Anycast──▶ Edge/Geo ──▶ [ Region A LB ] ─▶ App Pods
                                       └──▶ [ Region B LB ] ─▶ App Pods
      

      # Multi-tier
      Client ▶ CDN/Edge ▶ WAF ▶ L7 LB ▶ (Service Mesh) ▶ Services
      

      # Zero-downtime deploy with drain
      LB ▶ S1,S2,S3
           ├─ mark S1 draining
           ├─ wait active=0
           └─ replace S1
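The drain sequence in that sketch can be written as a small loop (the `Backend` class and `poll` hook are hypothetical stand-ins for LB API calls):

```python
import time

class Backend:
    """Hypothetical backend record tracked by the LB."""
    def __init__(self, name: str, active: int = 0):
        self.name = name
        self.active = active      # in-flight requests
        self.draining = False

def drain_and_replace(backend: Backend, poll, interval: float = 0.01) -> str:
    """Mark draining, wait until in-flight requests hit zero, then replace."""
    backend.draining = True       # LB stops sending new requests
    while backend.active > 0:     # wait active=0; poll() must let
        poll(backend)             # existing requests complete
        time.sleep(interval)
    return f"replace {backend.name}"
```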
      

🧪 Troubleshooting Checklist

  • Observe: 5xx by upstream, p95/p99, open connections, CPU on LB and backends.
  • Verify health endpoints are fast and isolated from heavy dependencies.
  • Timeout budgets consistent: each outer layer slightly larger than the one below it (client > LB > upstream) so inner layers fail first with a clean error; enable retries only for idempotent methods.
  • Ensure cross-zone balancing and sufficient connection pools.
  • Check container readiness/liveness gates before adding to rotation.
  • Warm-up new instances (preload JIT, caches) to avoid cold-start spikes.

❓ Interview Q&A (concise)

  • Q: L4 vs L7 trade-off? A: L4 is faster, fewer features; L7 enables content-based routing, TLS termination, cookies, but adds latency/CPU.
  • Q: Least-connections vs weighted RR? A: Least-connections adapts to variable request time; weighted RR suits heterogeneous capacity.
  • Q: When to use IP hash? A: Simple stickiness without cookies; beware of uneven distribution and NAT gateways.
  • Q: How to do zero-downtime deploys? A: Drain connections, health-check gate, staged rollout, and automated rollback.
  • Q: Hotspot mitigation? A: Weighted routing, autoscale, cache, shard by key, and rate limit abusive paths.