Availability

How it works

  1. Eliminate single points of failure (SPOFs)
  2. Add redundancy and load balancing (LB)
  3. Design failover
  4. Measure uptime and mean time to recovery (MTTR)

Overview

  • Availability is the percentage of time a system is up and able to serve requests.
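  • Achieved by eliminating single points of failure and adding redundancy, health checks, and automated failover.
  • Measured with an uptime SLI (service level indicator) and MTTR/MTBF (mean time to recovery / mean time between failures); governed by SLOs (service level objectives).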
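
The measurement bullet above can be made concrete with a small sketch. It assumes the common steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is read as mean operating time between failures; the function names and the request-based SLI are illustrative choices, not taken from any particular monitoring tool.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR).

    Assumption: MTBF is the mean operating time between failures.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def request_sli(good_requests: int, total_requests: int) -> float:
    """Request-based SLI: share of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo


if __name__ == "__main__":
    availability = steady_state_availability(mtbf_hours=999, mttr_hours=1)
    print(f"availability: {availability:.4%}")
    print("meets 99.9% SLO:", meets_slo(request_sli(99_950, 100_000), 0.999))
```

Under this formula, halving MTTR improves availability as much as doubling MTBF does, which is why automation that shortens recovery is usually the cheaper lever.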

When to use

  • Customer-facing services with strict uptime commitments (e.g., 99.9%+).
  • Systems with spiky/variable load where autoscaling and failover are required.

Trade-offs

  • Cost of redundancy and multi-region deployments; complexity of stateful failover.
  • Consistency trade-offs under network partitions (CAP): staying available can mean serving stale data.

Patterns

  • Active-active across zones/regions with global load balancing.
  • Health probes + automated instance replacement; rolling deploys; blue/green.
  • Data replication and read replicas; graceful degradation and load shedding under stress.

Anti-patterns

  • Single AZ deployment; manual-only failovers; no capacity headroom.
  • Long timeouts causing resource exhaustion; lack of backpressure during incidents.
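
As a counterpoint to the missing-backpressure anti-pattern, here is a minimal load-shedding sketch. It caps the number of in-flight requests and rejects extra work immediately instead of letting it queue behind long timeouts. `LoadShedder` and `Overloaded` are hypothetical names made up for this illustration.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is rejected to protect the service."""


class LoadShedder:
    """Caps concurrent work; excess requests fail fast instead of queuing."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: if no slot is free, shed the request now.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("in-flight limit reached")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


if __name__ == "__main__":
    shedder = LoadShedder(max_in_flight=2)
    print(shedder.run(lambda: "handled"))
```

A caller would typically translate `Overloaded` into a fast error response (for example HTTP 503) so that clients can back off and retry later.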

📐 Quick Diagram


      Client ▶ Anycast DNS ▶ GSLB ▶ Region A (active) │ Region B (active)
                                     LB ▶ App ▶ DB     LB ▶ App ▶ DB
      

❓ Interview Q&A (concise)

  • Q: How much downtime does 99.9% availability allow? A: About 8.76 hours per year; 99.99% allows about 52.6 minutes per year.
  • Q: Improve availability fast? A: Remove SPOFs, add health checks + autoscaling, shorten MTTR with automation.
  • Q: CAP implications? A: During partitions you often choose AP for read availability or CP for strict correctness.

🎯 What is Availability?

Availability is like a restaurant's opening hours: it is the percentage of time the restaurant is open and serving customers. For a system, it is the percentage of time the system is operational and accessible.
Example: 99.9% operational, 0.1% downtime.

📊 Availability Levels

| Availability | Downtime per Year | Downtime per Month (30 days) |
|--------------|-------------------|------------------------------|
| 90%          | 36.5 days         | 3 days                       |
| 99%          | 3.65 days         | 7.2 hours                    |
| 99.9%        | 8.76 hours        | 43.2 minutes                 |
| 99.99%       | 52.56 minutes     | 4.32 minutes                 |
| 99.999%      | 5.26 minutes      | 25.9 seconds                 |
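
The downtime figures follow from a single multiplication: allowed downtime = period × (1 − availability). A small sketch, assuming a 365-day year and a 30-day month; the function name `downtime_budget` is made up for this example.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Maximum downtime allowed in `period` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period * ((100 - availability_pct) / 100)


if __name__ == "__main__":
    year, month = timedelta(days=365), timedelta(days=30)
    for pct in (90, 99, 99.9, 99.99, 99.999):
        print(pct, downtime_budget(pct, year), downtime_budget(pct, month))
```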

🛠️ Techniques for High Availability

🔄 Redundancy

Eliminate single points of failure

Active-Passive: Standby components only activate on failure
Active-Active: All components actively handle traffic
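
To see why redundancy helps, the sketch below combines component availabilities. It assumes failures are independent, which real systems often violate (shared power, shared deploys), so the results are best read as upper bounds. The function names are illustrative.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components are required: the chain is up only if every link is up."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant components: the group is down only if every replica is down."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Two redundant 99% instances behind one 99.9% load balancer.
    print(series(0.999, parallel(0.99, 0.99)))
```

The combined figure is capped by the load balancer's own availability, which is why the load balancer itself also needs redundancy.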

⚖️ Load Balancing

Distribute load across instances

Round Robin: Sequential distribution of requests
Least Connections: Route to server with fewest active connections
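
The two strategies can be sketched in a few lines. This is an in-memory illustration with hypothetical class names, not the API of any real load balancer; server identifiers are plain strings.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed repeating order."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Routes each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self._active = {server: 0 for server in servers}
        if not self._active:
            raise ValueError("at least one server is required")

    def acquire(self) -> str:
        # Ties go to the server listed first.
        server = min(self._active, key=self._active.get)
        self._active[server] += 1
        return server

    def release(self, server: str) -> None:
        self._active[server] -= 1


if __name__ == "__main__":
    rr = RoundRobinBalancer(["a", "b", "c"])
    print([rr.pick() for _ in range(4)])
```

Round robin suits requests of similar cost; least connections adapts better when some requests run much longer than others.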

📈 Auto-scaling

Automatically adjust capacity based on demand

Target Tracking: Maintain metric (e.g., CPU) at target value
Step Scaling: Increase/decrease capacity in steps based on load
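
Both policies reduce to simple arithmetic. The sketch below uses a simplified proportional rule for target tracking and a threshold table for step scaling (scale-out only, for brevity). It is an illustration under those assumptions, not the exact algorithm of any cloud provider, and all names and defaults are hypothetical.

```python
import math


def _clamp(value: int, low: int, high: int) -> int:
    return max(low, min(high, value))


def target_tracking_capacity(current_instances: int, current_metric: float,
                             target_metric: float, min_instances: int = 1,
                             max_instances: int = 100) -> int:
    """Scale proportionally so the per-instance metric returns to the target."""
    desired = math.ceil(current_instances * current_metric / target_metric)
    return _clamp(desired, min_instances, max_instances)


def step_scaling_capacity(current_instances: int, current_metric: float,
                          steps, min_instances: int = 1,
                          max_instances: int = 100) -> int:
    """Add the increment of the highest threshold the metric has reached.

    `steps` is a list of (threshold, instances_to_add) pairs.
    """
    delta = 0
    for threshold, change in sorted(steps):
        if current_metric >= threshold:
            delta = change
    return _clamp(current_instances + delta, min_instances, max_instances)


if __name__ == "__main__":
    print(target_tracking_capacity(4, current_metric=80, target_metric=50))
    print(step_scaling_capacity(4, 95, steps=[(70, 1), (90, 3)]))
```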

🩺 Health Monitoring

Detect and replace failed components

HTTP Checks: Monitor HTTP response status
TCP Checks: Monitor TCP connection ability
Custom Checks: Application-specific health checks
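
A rough sketch of TCP and HTTP probes using only the Python standard library, plus a small tracker that requires several consecutive failures before marking an instance unhealthy, so a single dropped probe does not trigger a replacement. The thresholds and the `HealthTracker` name are assumptions chosen for illustration.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Healthy if a TCP connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url: str, timeout: float = 1.0) -> bool:
    """Healthy if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


class HealthTracker:
    """Flips state only after N consecutive failures or M consecutive successes."""

    def __init__(self, unhealthy_after: int = 3, healthy_after: int = 2):
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy = True
        self._failures = 0
        self._successes = 0

    def record(self, probe_ok: bool) -> bool:
        if probe_ok:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.healthy_after:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.unhealthy_after:
                self.healthy = False
        return self.healthy
```

A custom check fits the same shape: any function returning `True` or `False` (for example, one that verifies a database query succeeds) can feed `HealthTracker.record`.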

🌍 Geographic Distribution

Deploy resources in multiple locations

Active-Active: All regions active, traffic routed based on proximity
Active-Passive: Standby regions activated on failure
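
The two routing modes can be sketched as follows. Measured latency stands in for proximity, which is an assumption made for this example; the `Region` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    latency_ms: float  # stand-in for proximity to the client (assumption)
    healthy: bool = True


def route_active_active(regions):
    """Send the client to the closest region that is currently healthy."""
    healthy = [region for region in regions if region.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda region: region.latency_ms)


def route_active_passive(primary: Region, standby: Region) -> Region:
    """Use the primary; fall back to the standby only when the primary is down."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy region available")


if __name__ == "__main__":
    a, b = Region("region-a", 20), Region("region-b", 90)
    a.healthy = False  # simulate a regional outage
    print(route_active_active([a, b]).name)
```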

⚖️ Availability vs Consistency Trade-off

According to the CAP theorem, in the presence of a network partition:
  • Choose Availability: System remains operational but may serve stale data
  • Choose Consistency: System may become unavailable but serves correct data
Diagram (Node 1, Node 2, Node 3), legend:
  • Available: Responds to requests, may return stale data
  • Consistent: Returns correct data, may not respond to all requests
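
The choice can be illustrated with a toy three-replica model. It assumes writes are acknowledged by a majority of replicas, so any majority read sees the latest committed version; `Replica`, `read_ap`, and `read_cp` are names invented for this sketch, and `reachable` describes what the client can reach from its side of the partition.

```python
from dataclasses import dataclass


class Unavailable(Exception):
    """Raised when a consistent answer cannot be given."""


@dataclass
class Replica:
    value: str
    version: int
    reachable: bool = True  # from the client's side of the partition


def read_ap(local: Replica) -> str:
    """Availability first: always answer from the local replica, even if stale."""
    return local.value


def read_cp(replicas) -> str:
    """Consistency first: answer only if a majority of replicas is reachable."""
    reachable = [replica for replica in replicas if replica.reachable]
    if len(reachable) <= len(replicas) // 2:
        raise Unavailable("cannot reach a majority of replicas")
    # Assumes majority-acknowledged writes, so the newest version is in this set.
    return max(reachable, key=lambda replica: replica.version).value


if __name__ == "__main__":
    n1, n2, n3 = Replica("v2", 2), Replica("v2", 2), Replica("v1", 1)
    n1.reachable = n2.reachable = False  # client is stuck on n3's side
    print(read_ap(n3))  # answers, but with the older value
```

On the minority side of the partition, `read_ap` keeps answering with the older value while `read_cp` raises `Unavailable`, which mirrors the two options listed above.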