Reliability
How it works
- Define SLOs/SLIs
- Add redundancy and timeouts
- Test failures/chaos
- Automate recovery and alerts
Overview
- Design to tolerate faults and recover quickly; prioritize user-visible correctness.
- Measure with SLIs (availability, latency, error rate) against SLOs; manage risk via error budgets.
- Techniques: redundancy, timeouts/retries/backoff, isolation (bulkheads), graceful degradation, backups/DR.
When to use
- Critical user flows where downtime or data corruption is costly (payments, auth, checkout).
- Systems with strict SLAs/compliance requirements.
- Platforms with many dependencies where partial failures are common.
Trade-offs
- Extra cost for redundancy (multi-AZ/region), added complexity and latency from safety mechanisms.
- Over-engineering can slow delivery; balance with error budgets and staged rollouts.
Patterns
- Health checks + autoscaling; circuit breakers + bulkheads.
- Timeouts with bounded retries and jitter; idempotency for safe replays.
- Read replicas and automated failover; backups with regular restore tests.
- Chaos testing/game days; runbooks and automation for common incidents.
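The "timeouts with bounded retries and jitter" pattern above can be sketched in a few lines (a minimal illustration; `op` stands for any idempotent remote call, and the delay values are arbitrary):

```python
import random
import time

def retry_with_backoff(op, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Call op(), retrying on exception with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # bounded retries: give up after the final attempt
            # full jitter: sleep a random fraction of the capped exponential delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Jitter spreads out retries from many clients so a recovering dependency is not hit by a synchronized thundering herd; the bound keeps a dead dependency from being retried forever.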
Anti-patterns
- Infinite retries without backoff or idempotency.
- Single points of failure; untested backups/DR; manual-only recovery.
- Coupled deploys creating cascading failures.
📐 Quick Diagram
Client ▶ LB ▶ App A|B ▶ DB (primary)
                  └──▶ Cache ▶ fallback reads when the DB is down
❓ Interview Q&A (concise)
- Q: Reliability vs Availability? A: Reliability is correctness over time; availability is readiness to serve. You can be available but unreliable, and vice versa.
- Q: What is an error budget? A: Allowable unreliability derived from SLO; it guides release pace and risk acceptance.
- Q: Avoid cascading failures? A: Timeouts, circuit breakers, bulkheads, load shedding, and backpressure.
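The error-budget answer above is easy to make concrete: the budget is the SLO's allowed unreliability converted into downtime over a window (a small sketch, not tied to any particular monitoring tool):

```python
def error_budget_minutes(availability_slo, window_days=30):
    """Downtime allowed by an availability SLO over a rolling window, in minutes."""
    return (1 - availability_slo) * window_days * 24 * 60

# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
```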
🎯 What is Reliability?
Reliability is like having a safety net for your system: even when things go wrong (hardware failures, software bugs, human errors), the system keeps on ticking. A reliable system minimizes downtime and data loss and maintains consistent performance.
🚨 Failure Scenario
A hardware component fails...
⬇️
🖥️ Server A | 🖥️ Server B (Standby) | 🖥️ Server C (Standby)
⬇️
...traffic is rerouted to healthy components.
🏗️ Key Concepts
⚠️ Fault vs Failure
Fault: a single component deviates from its specification
Failure: the system as a whole stops providing its required service to users
🔄 Redundancy
Having multiple instances of critical components
Active-Passive: Standby components only activate on failure
Active-Active: All components actively handle traffic
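The active-passive setup above can be sketched as a routing function that prefers the active node and fails over to a standby (a minimal illustration; the `servers` list and `healthy` flags are hypothetical stand-ins for real health state):

```python
def route(servers):
    """Active-passive routing: pick the first healthy server in priority order.

    `servers` is ordered: the active node first, then standbys."""
    for server in servers:
        if server["healthy"]:
            return server["name"]
    raise RuntimeError("no healthy servers available")
```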
🛠️ Reliability Techniques
📂 Replication
Maintain multiple copies of data/services
Synchronous: Immediate consistency, higher latency
Asynchronous: Eventual consistency, lower latency
🔙 Backup and Recovery
Regular backups and a tested disaster recovery plan
Full Backup: Complete system snapshot
Incremental Backup: Only changed data since last backup
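Full vs incremental backups can be sketched with plain dictionaries standing in for system state (illustrative only; note this naive incremental scheme records additions and changes but not deletions):

```python
def full_backup(state):
    """Snapshot the entire state."""
    return dict(state)

def incremental_backup(state, last_backup):
    """Capture only entries added or changed since the last backup."""
    return {k: v for k, v in state.items() if last_backup.get(k) != v}

def restore(full, increments):
    """Replay incrementals on top of the full snapshot, oldest first."""
    restored = dict(full)
    for inc in increments:
        restored.update(inc)
    return restored
```

The restore path is the part worth testing regularly: a backup you have never restored is, in effect, untested code.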
🩺 Health Checks
Regularly monitor system health and component status
HTTP Checks: Monitor HTTP response status
TCP Checks: Monitor TCP connection ability
Custom Checks: Application-specific health checks
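A TCP check from the list above fits in a few lines of standard-library code (a minimal sketch; the host and port are whatever endpoint you choose to probe):

```python
import socket

def tcp_health_check(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

An HTTP check is the same idea one layer up: connect, issue a request to a health endpoint, and treat anything other than a 2xx response (or a timeout) as unhealthy.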
⚡ Circuit Breakers
Prevent cascading failures by failing fast instead of repeatedly calling a dependency that keeps failing
Closed: Normal operation, requests go through
Open: Circuit is open, requests fail fast
Half-Open: Test if the issue is resolved, limited requests allowed
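The three states above can be sketched as a minimal in-process circuit breaker (the threshold and timeout values are illustrative; production libraries add per-endpoint statistics, metrics, and concurrency handling):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after N consecutive failures,
    half-open after a cooldown, closed again on a successful trial call."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return "half-open"  # cooldown elapsed: allow a trial request
        return "open"

    def call(self, op):
        state = self.state
        if state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = op()
        except Exception:
            self.failures += 1
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # (re)open the circuit
            raise
        # success: reset to closed
        self.failures = 0
        self.opened_at = None
        return result
```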
📉 Graceful Degradation
Reduce functionality instead of failing completely
Example: Serve cached pages when the database is down
Example: Disable non-essential features during high load
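The cached-page example above can be sketched as a read path that falls back to a possibly stale cache when the primary store is down (illustrative only; `db_fetch` and `cache` are hypothetical stand-ins for a real database client and cache):

```python
def get_page(page_id, db_fetch, cache):
    """Serve from the primary store; fall back to a (possibly stale) cached copy."""
    try:
        page = db_fetch(page_id)
        cache[page_id] = page  # refresh the cache on every successful read
        return page, "fresh"
    except Exception:
        if page_id in cache:
            return cache[page_id], "stale-cache"  # degraded but still available
        raise  # nothing to degrade to
```

Returning the freshness flag lets callers (or the UI) signal that the data may be stale rather than silently serving it as current.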
📊 Measuring Reliability
📈 Key Metrics
⏳ MTBF
Mean Time Between Failures
Average time between system failures
🛠️ MTTR
Mean Time To Repair
Average time to repair a failed component
🔑 Availability
Uptime percentage
Target 99.9%+, monitor downtime
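The three metrics above are related: steady-state availability can be estimated from MTBF and MTTR (a simple sketch; the hour values in the comment are illustrative):

```python
def availability(mtbf_hours, mttr_hours):
    """Estimate steady-state availability as MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# e.g. a failure roughly every 720h with ~0.72h to repair gives ~99.9% availability
```

The formula also shows the two levers: you can raise availability by failing less often (higher MTBF) or by recovering faster (lower MTTR); in practice MTTR is often the cheaper one to improve.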
🎯 Reliability Best Practices
🔄 Regular Backups: Automate and test backups
🩺 Health Monitoring: 24/7 monitoring and alerts
⚡ Circuit Breaker Pattern: Implement in critical paths
📉 Graceful Degradation: Plan for reduced functionality