Availability

How it works

Overview

Availability is the percentage of time a system is up and able to serve requests.
Achieved by eliminating single points of failure through redundancy, health checks, and automated failover.
Measured with an uptime SLI plus MTBF (mean time between failures) and MTTR (mean time to repair); targets are set by SLOs.
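
To make the MTBF/MTTR relationship concrete, here is a minimal sketch of the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The function name and the figures are illustrative assumptions, not values from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: one failure every 999 hours, one hour to recover.
baseline = availability(mtbf_hours=999, mttr_hours=1)

# Halving MTTR (for example through automated failover) raises availability
# without changing how often failures occur.
faster_recovery = availability(mtbf_hours=999, mttr_hours=0.5)

print(f"baseline:        {baseline:.4%}")
print(f"faster recovery: {faster_recovery:.4%}")
```

The sketch shows why shortening MTTR is often the quickest lever: the failure rate is unchanged, yet the fraction of time spent down shrinks.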

When to use

Customer-facing services with strict uptime commitments (e.g., 99.9%+).
Systems with spiky/variable load where autoscaling and failover are required.

Trade-offs

Cost of redundancy and multi-region deployments; complexity of stateful failover.
Consistency trade-offs under partitions (CAP) may reduce data freshness.

Patterns

Active-active across zones/regions with global load balancing.
Health probes + automated instance replacement; rolling deploys; blue/green.
Data replication and read replicas; graceful degradation and load shedding under stress.

Anti-patterns

Single AZ deployment; manual-only failovers; no capacity headroom.
Long timeouts causing resource exhaustion; lack of backpressure during incidents.
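
As a rough illustration of backpressure and load shedding, the sketch below bounds the number of pending requests and rejects new work once the limit is reached, instead of letting it queue until resources run out. `BoundedWorkQueue` and its method names are hypothetical, chosen for this example only.

```python
import queue


class BoundedWorkQueue:
    """Accepts work up to a fixed limit and sheds anything beyond it."""

    def __init__(self, max_pending: int):
        self._pending = queue.Queue(maxsize=max_pending)
        self.rejected = 0  # count of shed requests, useful as a metric

    def submit(self, request) -> bool:
        """Return True if accepted, False if shed because the queue is full."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        """Remove and return the oldest pending request."""
        return self._pending.get_nowait()


work = BoundedWorkQueue(max_pending=2)
for name in ["a", "b", "c"]:
    status = "accepted" if work.submit(name) else "shed"
    print(name, status)
```

A caller that receives `False` can fail fast (for instance with an HTTP 503), which keeps latency bounded for the requests that were accepted.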

📐 Quick Diagram


      Client ▶ Anycast DNS ▶ GSLB ▶ Region A (active) │ Region B (active)
                                     LB ▶ App ▶ DB     LB ▶ App ▶ DB

❓ Interview Q&A (concise)

Q: Calculate downtime for 99.9%? A: ~8.76 hours/year; 99.99% ≈ 52.6 minutes.
Q: Improve availability fast? A: Remove SPOFs, add health checks + autoscaling, shorten MTTR with automation.
Q: CAP implications? A: During partitions you often choose AP for read availability or CP for strict correctness.

🎯 What is Availability?

Think of availability like a restaurant's opening hours: it is the percentage of time the restaurant is open and serving customers. For systems, it means being operational and accessible.

Example: at 99.9% availability, the system is operational 99.9% of the time and down 0.1% of the time.

📊 Availability Levels

| Availability | Downtime per Year | Downtime per Month |
|--------------|-------------------|--------------------|
| 90% | 36.5 days | 3 days |
| 99% | 3.65 days | 7.2 hours |
| 99.9% | 8.76 hours | 43.2 minutes |
| 99.99% | 52.56 minutes | 4.32 minutes |
| 99.999% | 5.26 minutes | 25.9 seconds |
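
The downtime figures above follow from multiplying the period by the unavailable fraction. Here is a small sketch of that arithmetic, assuming a 365-day year and a 30-day month (the assumptions that reproduce the figures); the function and constant names are illustrative.

```python
from datetime import timedelta

YEAR = timedelta(days=365)   # assumption: non-leap year
MONTH = timedelta(days=30)   # assumption: 30-day month


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Downtime budget = period * (1 - availability)."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period * (1 - availability_pct / 100)


for pct in (90, 99, 99.9, 99.99, 99.999):
    per_year = allowed_downtime(pct, YEAR)
    per_month = allowed_downtime(pct, MONTH)
    print(f"{pct}%: {per_year} per year, {per_month} per month")
```

The same function works for any window, so it can also be used to turn an SLO into an error budget for a week or a quarter.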

🛠️ Techniques for High Availability

🔄 Redundancy

Eliminate single points of failure

Active-Passive: Standby components only activate on failure

Active-Active: All components actively handle traffic
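
The effect of redundancy can be estimated with basic probability. The sketch below assumes component failures are independent, which real deployments only approximate (shared power, networks, or bad deploys correlate failures); the function names are illustrative.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Redundant replicas: the group is down only if every replica is down."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1 - (1 - component_availability) ** replicas


def serial_availability(*availabilities: float) -> float:
    """Chained dependencies: the request fails if any one component is down."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Two redundant 99% instances behave like a single ~99.99% component...
print(f"{parallel_availability(0.99, 2):.4%}")
# ...while two 99% components in series are worse than either alone.
print(f"{serial_availability(0.99, 0.99):.4%}")
```

This is why removing a single point of failure matters so much: every non-redundant component in the request path multiplies availability downward.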

⚖️ Load Balancing

Distribute load across instances

Round Robin: Sequential distribution of requests

Least Connections: Route to server with fewest active connections
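
The two strategies above can be sketched in a few lines. This is a simplified, single-threaded illustration with hypothetical class names; production load balancers also account for health, weights, and concurrency.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed repeating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Routes each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties are broken by the order in which servers were registered.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the request finishes so the count stays accurate.
        self.active[server] -= 1


rr = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([rr.pick() for _ in range(4)])

lc = LeastConnectionsBalancer(["app-1", "app-2"])
first = lc.acquire()
second = lc.acquire()
lc.release(second)
print(first, second, lc.acquire())
```

Round robin is simplest when requests cost about the same; least connections tends to behave better when request durations vary widely.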

📈 Auto-scaling

Automatically adjust capacity based on demand

Target Tracking: Maintain metric (e.g., CPU) at target value

Step Scaling: Increase/decrease capacity in steps based on load
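
To illustrate the difference between the two policies, here is a sketch under stated assumptions. The target-tracking function uses a proportional rule (the same shape as the formula documented for the Kubernetes Horizontal Pod Autoscaler); cloud providers' target-tracking policies may compute capacity differently. The step thresholds and all names are hypothetical.

```python
import math


def target_tracking_desired(current_instances: int, current_metric: float,
                            target_metric: float, min_instances: int = 1,
                            max_instances: int = 100) -> int:
    """Scale capacity proportionally so the metric moves toward its target."""
    if target_metric <= 0:
        raise ValueError("target metric must be positive")
    raw = math.ceil(current_instances * current_metric / target_metric)
    return max(min_instances, min(max_instances, raw))


def step_scaling_adjustment(metric: float, steps) -> int:
    """Return the instance delta for the highest threshold the metric reaches.

    `steps` is a list of (lower_bound, delta) pairs sorted by lower_bound.
    """
    adjustment = 0
    for lower_bound, delta in steps:
        if metric >= lower_bound:
            adjustment = delta
    return adjustment


# 4 instances at 90% CPU with a 60% target: scale out proportionally.
print(target_tracking_desired(4, current_metric=90, target_metric=60))

# Hypothetical step policy: +1 instance at 70% CPU, +3 at 85%.
cpu_steps = [(70, 1), (85, 3)]
print(step_scaling_adjustment(90, cpu_steps))
```

Target tracking needs only one number to tune, while step scaling gives explicit control over how aggressively capacity reacts at each load level.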

🩺 Health Monitoring

Detect and replace failed components

HTTP Checks: Monitor HTTP response status

TCP Checks: Monitor TCP connection ability

Custom Checks: Application-specific health checks
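
A health check usually has two parts: a probe and a rule for when to act on its results. The sketch below pairs a simple TCP probe with a tracker that changes state only after several consecutive results, so that one dropped packet does not trigger a replacement. The thresholds and names are assumptions for illustration.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class HealthTracker:
    """Flips state only after N consecutive results that disagree with it."""

    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes contradicting the current state

    def record(self, ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if ok == self.healthy:
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if ok else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = ok
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
for result in [False, False, False, True, True]:
    print(result, "->", "healthy" if tracker.record(result) else "unhealthy")
```

In use, a scheduler would call something like `tracker.record(tcp_check(host, port))` on an interval and replace the instance once the tracker reports unhealthy.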

🌍 Geographic Distribution

Deploy resources in multiple locations

Active-Active: All regions active, traffic routed based on proximity

Active-Passive: Standby regions activated on failure
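
The routing decision behind each mode can be sketched as follows. Region names and latency figures are hypothetical, and a real global load balancer would derive health and proximity from live measurements.

```python
def pick_active_active(latency_ms: dict[str, float], healthy: set[str]) -> str:
    """Active-active: send the client to the closest healthy region."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


def pick_active_passive(primary: str, standbys: list[str], healthy: set[str]) -> str:
    """Active-passive: use the primary, falling back to standbys in order."""
    for region in [primary, *standbys]:
        if region in healthy:
            return region
    raise RuntimeError("no healthy region available")


# Hypothetical latencies measured from one client.
latency = {"us-east": 20.0, "eu-west": 90.0}

print(pick_active_active(latency, healthy={"us-east", "eu-west"}))
print(pick_active_active(latency, healthy={"eu-west"}))  # us-east is down
print(pick_active_passive("us-east", ["eu-west"], healthy={"eu-west"}))
```

In both modes the surviving region must have enough headroom to absorb the failed region's traffic, otherwise failover simply moves the outage.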

⚖️ Availability vs Consistency Trade-off

According to the CAP theorem, in the presence of a network partition:

Choose Availability: System remains operational but may serve stale data
Choose Consistency: System may become unavailable but serves correct data

Diagram: Node 1, Node 2, Node 3.

Available: Responds to requests, may return stale data

Consistent: Returns correct data, may not respond to all requests
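
The trade-off can be made concrete with a toy model. In this sketch a replica that is cut off from the leader either keeps answering with its last known value (AP) or refuses to answer (CP). The `Replica` class and `Unavailable` exception are hypothetical simplifications; real systems use quorums, leases, or consensus protocols.

```python
class Unavailable(Exception):
    """Raised when a CP replica cannot safely answer."""


class Replica:
    def __init__(self, mode: str):
        if mode not in ("AP", "CP"):
            raise ValueError("mode must be 'AP' or 'CP'")
        self.mode = mode
        self.value = None
        self.partitioned = False

    def replicate(self, value) -> None:
        """Apply an update from the leader; updates are lost while partitioned."""
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Cannot confirm this is the latest value, so refuse to answer.
            raise Unavailable("partitioned from leader")
        return self.value  # AP: may be stale during a partition


ap, cp = Replica("AP"), Replica("CP")
for replica in (ap, cp):
    replica.replicate("v1")
    replica.partitioned = True   # network partition begins
    replica.replicate("v2")      # this update never arrives

print("AP read:", ap.read())     # still answers, with the older value
try:
    cp.read()
except Unavailable as exc:
    print("CP read failed:", exc)
```

Neither choice is wrong in general: stale reads may be fine for a product catalog, while an account balance usually warrants refusing the request.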