Availability
How it works
- Eliminate single points of failure (SPOFs)
- Add redundancy and load balancing (LB)
- Design failover
- Measure uptime and MTTR
Overview
- Availability is the percentage of time a system is up and able to serve requests.
- Achieved by eliminating single points of failure and adding redundancy, health checks, and automated failover.
- Measured with an uptime SLI plus MTTR (mean time to recovery) and MTBF (mean time between failures); targets are set by SLOs.
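The relationship between these metrics can be sketched with the common steady-state formula availability = MTBF / (MTBF + MTTR). This assumes MTBF is the average uptime between failures and MTTR is the average time to restore service; the function name is illustrative.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1.

    Assumes mtbf_hours is mean uptime between failures and
    mttr_hours is mean time to restore service.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # A failure roughly every 999 hours, each taking 1 hour to repair.
    print(f"{availability(999.0, 1.0):.3%}")
```

The formula shows why shortening MTTR helps: halving repair time roughly halves the unavailable fraction even if failures are just as frequent.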
When to use
- Customer-facing services with strict uptime commitments (e.g., 99.9%+).
- Systems with spiky/variable load where autoscaling and failover are required.
Trade-offs
- Cost of redundancy and multi-region deployments; complexity of stateful failover.
- Consistency trade-offs under network partitions (CAP): favoring availability can mean serving stale data.
Patterns
- Active-active across zones/regions with global load balancing.
- Health probes + automated instance replacement; rolling deploys; blue/green.
- Data replication and read replicas; graceful degradation and load shedding under stress.
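Load shedding, mentioned in the last pattern, can be sketched as a bounded queue that rejects new work once it is full instead of letting requests pile up. The class and method names below are hypothetical, and a real service would typically return an HTTP 429 or 503 to the shed caller.

```python
import queue
from typing import Any


class LoadShedder:
    """Accept requests until max_pending are waiting, then reject new ones."""

    def __init__(self, max_pending: int) -> None:
        self._pending: "queue.Queue[Any]" = queue.Queue(maxsize=max_pending)

    def submit(self, request: Any) -> bool:
        """Return True if the request was queued, False if it was shed."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            return False

    def next_request(self) -> Any:
        """Remove and return the oldest queued request (raises queue.Empty if none)."""
        return self._pending.get_nowait()


if __name__ == "__main__":
    shedder = LoadShedder(max_pending=2)
    for name in ("r1", "r2", "r3"):
        print(name, "accepted" if shedder.submit(name) else "shed")
```

Rejecting early keeps latency bounded for the requests that are accepted, which is the point of shedding under stress.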
Anti-patterns
- Single AZ deployment; manual-only failovers; no capacity headroom.
- Long timeouts causing resource exhaustion; lack of backpressure during incidents.
📐 Quick Diagram
Client ▶ Anycast DNS ▶ GSLB
  ├▶ Region A (active): LB ▶ App ▶ DB
  └▶ Region B (active): LB ▶ App ▶ DB
❓ Interview Q&A (concise)
- Q: How much downtime does 99.9% allow? A: ~8.76 hours/year; 99.99% allows ≈ 52.6 minutes/year.
- Q: Improve availability fast? A: Remove SPOFs, add health checks + autoscaling, shorten MTTR with automation.
- Q: CAP implications? A: During partitions you often choose AP for read availability or CP for strict correctness.
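The downtime figures in the first answer follow from simple arithmetic: allowed downtime = period × (1 − availability). A minimal sketch, assuming a 365-day year; the helper name is illustrative.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Downtime budget for an availability target (in percent) over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)


if __name__ == "__main__":
    year = timedelta(days=365)
    for target in (99.9, 99.99):
        hours = allowed_downtime(target, year).total_seconds() / 3600
        print(f"{target}% allows about {hours:.2f} hours of downtime per year")
```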
🎯 What is Availability?
Availability is like the reliability of a restaurant: it is the percentage of time the restaurant is open and serving customers. For a system, it is the percentage of time the system is operational and accessible.
Example: 99.9% operational, 0.1% downtime.
📊 Availability Levels
| Availability | Downtime per Year | Downtime per Month |
|--------------|-------------------|--------------------|
| 90% | 36.5 days | 3 days |
| 99% | 3.65 days | 7.2 hours |
| 99.9% | 8.76 hours | 43.2 minutes |
| 99.99% | 52.56 minutes | 4.32 minutes |
| 99.999% | 5.26 minutes | 25.9 seconds |

🛠️ Techniques for High Availability
🔄 Redundancy
Eliminate single points of failure
Active-Passive: Standby components only activate on failure
Active-Active: All components actively handle traffic
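The effect of redundancy on availability can be estimated with basic probability. The sketch below assumes component failures are independent, which is often optimistic in practice because replicas can share a power source, network, or bad deploy. Function names are illustrative.

```python
from math import prod
from typing import Iterable


def parallel_availability(component: float, replicas: int) -> float:
    """Availability of N redundant replicas where any one can serve traffic."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    # The group is down only when every replica is down at once.
    return 1 - (1 - component) ** replicas


def serial_availability(components: Iterable[float]) -> float:
    """Availability of a chain where every component must be up."""
    return prod(components)


if __name__ == "__main__":
    print(f"two 99% replicas in parallel: {parallel_availability(0.99, 2):.4%}")
    print(f"two 99.9% tiers in series:    {serial_availability([0.999, 0.999]):.4%}")
```

Under the independence assumption, adding a replica multiplies the unavailable fraction by the replica's own failure probability, while each extra dependency in series lowers the overall figure.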
⚖️ Load Balancing
Distribute load across instances
Round Robin: Sequential distribution of requests
Least Connections: Route to server with fewest active connections
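Both strategies can be sketched in a few lines. The class names and the `acquire`/`release` interface are assumptions for illustration; production load balancers also account for health, weights, and concurrency.

```python
import itertools
from typing import Dict, Iterable


class RoundRobinBalancer:
    """Hand out servers in a fixed repeating order."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._cycle = itertools.cycle(list(servers))

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Route each request to the server with the fewest active connections."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._active: Dict[str, int] = {s: 0 for s in servers}

    def acquire(self) -> str:
        # Ties go to the server listed first.
        server = min(self._active, key=self._active.get)
        self._active[server] += 1
        return server

    def release(self, server: str) -> None:
        # Call when the connection to the server closes.
        self._active[server] -= 1


if __name__ == "__main__":
    rr = RoundRobinBalancer(["a", "b", "c"])
    print([rr.pick() for _ in range(4)])
```

Round robin suits requests of similar cost; least connections adapts better when some requests hold a server much longer than others.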
📈 Auto-scaling
Automatically adjust capacity based on demand
Target Tracking: Maintain metric (e.g., CPU) at target value
Step Scaling: Increase/decrease capacity in steps based on load
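Target tracking can be approximated with a proportional rule: scale the instance count by the ratio of the observed metric to its target. This is a simplified sketch, not any particular provider's algorithm; real autoscalers add cooldowns and smoothing, and the parameter names and bounds here are assumptions.

```python
import math


def desired_capacity(
    current_instances: int,
    current_metric: float,
    target_metric: float,
    min_instances: int = 1,
    max_instances: int = 10,
) -> int:
    """Instance count that should bring the metric back to its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be > 0")
    raw = math.ceil(current_instances * current_metric / target_metric)
    # Clamp to the configured bounds so a metric spike cannot scale without limit.
    return max(min_instances, min(max_instances, raw))


if __name__ == "__main__":
    # 4 instances at 90% CPU with a 60% target.
    print(desired_capacity(4, 90.0, 60.0))
```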
🩺 Health Monitoring
Detect and replace failed components
HTTP Checks: Monitor HTTP response status
TCP Checks: Monitor TCP connection ability
Custom Checks: Application-specific health checks
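The three kinds of check can be sketched with the Python standard library. The function names, the 2-second default timeout, and the rule that any 2xx status counts as healthy are assumptions for illustration.

```python
import socket
import urllib.error
import urllib.request
from typing import Callable, Dict


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Healthy if a TCP connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url: str, timeout: float = 2.0) -> bool:
    """Healthy if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, refused connections, timeouts, and malformed URLs.
        return False


def is_healthy(custom_checks: Dict[str, Callable[[], bool]]) -> bool:
    """Healthy only if every application-specific check passes."""
    return all(check() for check in custom_checks.values())


if __name__ == "__main__":
    # Port 9 on localhost is a placeholder target and is usually closed.
    print("tcp:", tcp_check("127.0.0.1", 9, timeout=0.5))
    print("custom:", is_healthy({"queue_not_full": lambda: True}))
```

An orchestrator would typically run such checks on an interval and replace an instance only after several consecutive failures, to avoid reacting to a single blip.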
🌍 Geographic Distribution
Deploy resources in multiple locations
Active-Active: All regions active, traffic routed based on proximity
Active-Passive: Standby regions activated on failure
⚖️ Availability vs Consistency Trade-off
According to the CAP theorem, in the presence of a network partition:
- Choose Availability: System remains operational but may serve stale data
- Choose Consistency: System may become unavailable but serves correct data
Diagram: three nodes (Node 1, Node 2, Node 3), with this legend:
- Available: Responds to requests, may return stale data
- Consistent: Returns correct data, may not respond to all requests
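The two choices can be illustrated with a toy replica that behaves differently once it is cut off from its peers. This is a deliberately simplified sketch: the `Replica` class, the `mode` flag, and the `Unavailable` exception are hypothetical, and real systems use quorums or leases rather than a boolean.

```python
class Unavailable(Exception):
    """Raised when a consistency-first replica refuses to answer."""


class Replica:
    """A single node holding one value, in either 'AP' or 'CP' mode."""

    def __init__(self, mode: str, value: str) -> None:
        if mode not in ("AP", "CP"):
            raise ValueError("mode must be 'AP' or 'CP'")
        self.mode = mode
        self.value = value
        self.partitioned = False  # True when cut off from the other nodes

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Cannot confirm this is the latest value, so refuse to answer.
            raise Unavailable("partitioned: cannot guarantee a correct read")
        # AP mode (or no partition): answer with the local, possibly stale, value.
        return self.value


if __name__ == "__main__":
    ap, cp = Replica("AP", "v1"), Replica("CP", "v1")
    ap.partitioned = cp.partitioned = True
    print("AP read:", ap.read())
    try:
        cp.read()
    except Unavailable as exc:
        print("CP read failed:", exc)
```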