Reliability

Reliability Overview
How it works
  1. Define SLOs/SLIs
  2. Add redundancy and timeouts
  3. Test failure modes (chaos experiments)
  4. Automate recovery and alerts

Overview

  • Design to tolerate faults and recover quickly; prioritize user-visible correctness.
  • Measure with SLIs (availability, latency, error rate) against SLOs; manage risk via error budgets.
  • Techniques: redundancy, timeouts/retries/backoff, isolation (bulkheads), graceful degradation, backups/DR.
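
The SLI/SLO/error-budget relationship above can be sketched in a few lines. This is a minimal illustration, not a standard API: the function name `error_budget` and the returned keys are hypothetical, and the SLI is assumed to be a simple request success ratio.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare a measured success-ratio SLI against an SLO (hypothetical helper)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        # No traffic in the window: nothing measured, nothing consumed.
        return {"sli": 1.0, "allowed_failures": 0.0, "budget_consumed": 0.0, "slo_met": True}

    sli = 1.0 - failed_requests / total_requests        # measured success ratio
    allowed_failures = (1.0 - slo_target) * total_requests  # the error budget
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,  # 1.0 means budget spent
        "slo_met": failed_requests <= allowed_failures,
    }


if __name__ == "__main__":
    # Assumed numbers: 99.9% SLO, one million requests, 250 failures in the window.
    print(error_budget(0.999, 1_000_000, 250))
```

With a 99.9% SLO over one million requests the budget is roughly 1,000 failed requests, so 250 failures use about a quarter of it. A team might slow releases as `budget_consumed` approaches 1.0.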

When to use

  • Critical user flows where downtime or data corruption is costly (payments, auth, checkout).
  • Systems with strict SLAs/compliance requirements.
  • Platforms with many dependencies where partial failures are common.

Trade-offs

  • Extra cost for redundancy (multi-AZ/region), added complexity and latency from safety mechanisms.
  • Over-engineering can slow delivery; balance with error budgets and staged rollouts.

Patterns

  • Health checks + autoscaling; circuit breakers + bulkheads.
  • Timeouts with bounded retries and jitter; idempotency for safe replays.
  • Read replicas and automated failover; backups with regular restore tests.
  • Chaos testing/game days; runbooks and automation for common incidents.
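
The "bounded retries and jitter" pattern above might look like the sketch below. It uses exponential backoff with full jitter, which is one common choice among several. `TransientError` and `call_with_retries` are hypothetical names, and the wrapped operation is assumed to enforce its own timeout and to be idempotent, so replaying it is safe.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Run `operation`, retrying transient errors a bounded number of times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # bounded: give up instead of retrying forever
            # Exponential backoff, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the cap to spread out retries.
            sleep(rng() * cap)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("temporary failure")
        return "ok"

    print(call_with_retries(flaky), "after", attempts["count"], "attempts")
```

Passing `sleep` and `rng` as parameters is only a convenience for testing; production code would normally keep the defaults.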

Anti-patterns

  • Infinite retries without backoff or idempotency.
  • Single points of failure; untested backups/DR; manual-only recovery.
  • Coupled deploys creating cascading failures.

📐 Quick Diagram


      Client ▶ LB ▶ App A|B ▶ DB(primary)
                    └──▶ Cache ▶ Fallback reads on DB fail
      

❓ Interview Q&A (concise)

  • Q: Reliability vs Availability? A: Reliability is correctness over time; availability is readiness to serve. A service that answers every request with stale or wrong data is available but unreliable; one that is always correct but often down is the reverse.
  • Q: What is an error budget? A: Allowable unreliability derived from SLO; it guides release pace and risk acceptance.
  • Q: Avoid cascading failures? A: Timeouts, circuit breakers, bulkheads, load shedding, and backpressure.

🎯 What is Reliability?

Reliability is like having a safety net for your system: when things go wrong (hardware failures, software bugs, human errors), the system keeps working correctly. A reliable system minimizes downtime and data loss and maintains consistent performance.

🚨 Failure Scenario

A hardware component fails...

      🖥️ Server A ▶ 🖥️ Server B (Standby) | 🖥️ Server C (Standby)

...traffic is rerouted to healthy components.

🏗️ Key Concepts

⚠️ Fault vs Failure

Fault: One component deviates from its specification

Failure: The system as a whole stops providing the required service

Fault tolerance aims to keep faults from turning into failures.

🔄 Redundancy

Having multiple instances of critical components

Active-Passive: Standby components only activate on failure
Active-Active: All components actively handle traffic
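
The difference between the two modes can be sketched as a routing decision. This is an illustrative toy, assuming health status is already known; `Server`, `route_active_passive`, and `route_active_active` are hypothetical names, and the round-robin choice is just one possible active-active policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    healthy: bool = True


def route_active_passive(servers: List[Server]) -> Server:
    """Send all traffic to the first healthy server in priority order.

    Standbys receive traffic only once every server ahead of them is unhealthy.
    """
    for server in servers:
        if server.healthy:
            return server
    raise RuntimeError("no healthy server available")


def route_active_active(servers: List[Server], request_id: int) -> Server:
    """Spread traffic across all healthy servers (simple round-robin)."""
    healthy = [s for s in servers if s.healthy]
    if not healthy:
        raise RuntimeError("no healthy server available")
    return healthy[request_id % len(healthy)]


if __name__ == "__main__":
    pool = [Server("A"), Server("B"), Server("C")]
    print(route_active_passive(pool).name)   # primary while it is healthy
    pool[0].healthy = False                  # simulate a hardware fault on A
    print(route_active_passive(pool).name)   # first healthy standby takes over
```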

🛠️ Reliability Techniques

📂 Replication

Maintain multiple copies of data/services

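Synchronous: Writes are acknowledged only after replicas confirm them; immediate consistency, higher write latency
Asynchronous: Replicas catch up after the acknowledgement; eventual consistency, lower latency, recent writes can be lost on failover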
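
A toy in-memory model can make the trade-off concrete. It is only a sketch: `Primary`, `Replica`, and `flush` are hypothetical names, and a real system would replicate over a network with acknowledgements, ordering, and failure handling that are omitted here.

```python
from typing import Dict, List, Tuple


class Replica:
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}


class Primary:
    def __init__(self, replicas: List[Replica], synchronous: bool) -> None:
        self.data: Dict[str, str] = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending: List[Tuple[str, str]] = []  # writes not yet replicated

    def write(self, key: str, value: str) -> str:
        self.data[key] = value
        if self.synchronous:
            # Synchronous: copy to every replica before acknowledging.
            for replica in self.replicas:
                replica.data[key] = value
        else:
            # Asynchronous: acknowledge now, replicate later.
            self.pending.append((key, value))
        return "ack"

    def flush(self) -> None:
        """Apply queued writes to replicas (stands in for background replication)."""
        while self.pending:
            key, value = self.pending.pop(0)
            for replica in self.replicas:
                replica.data[key] = value


if __name__ == "__main__":
    replica = Replica()
    primary = Primary([replica], synchronous=False)
    primary.write("user:1", "alice")
    print(replica.data)  # still empty: the replica lags until flush() runs
    primary.flush()
    print(replica.data)
```

If the primary is lost before `flush()` runs, the queued writes are lost with it, which is the data-loss window that asynchronous replication accepts in exchange for lower latency.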

🔙 Backup and Recovery

Regular backups and a tested disaster recovery plan

Full Backup: Complete system snapshot
Incremental Backup: Only changed data since last backup
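
The full versus incremental distinction comes down to which files are selected. The sketch below assumes modification times are a reliable change signal, which is not always true in practice; `select_for_backup` is a hypothetical helper and the file map is made-up sample data.

```python
from typing import Dict, List, Optional


def select_for_backup(files: Dict[str, float],
                      last_backup_time: Optional[float]) -> List[str]:
    """Pick files to copy.

    files: path -> last-modified timestamp (seconds).
    last_backup_time: None means no previous backup exists, so take a full backup.
    """
    if last_backup_time is None:
        return sorted(files)  # full backup: everything
    # Incremental backup: only files changed since the last backup.
    return sorted(path for path, mtime in files.items() if mtime > last_backup_time)


if __name__ == "__main__":
    sample = {"a.txt": 100.0, "b.txt": 200.0, "c.txt": 300.0}
    print(select_for_backup(sample, None))    # full
    print(select_for_backup(sample, 150.0))   # incremental
```

Restoring from incrementals needs the last full backup plus every incremental taken since, which is one reason regular restore tests matter.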

🩺 Health Checks

Regularly monitor system health and component status

HTTP Checks: Monitor HTTP response status
TCP Checks: Verify that a TCP connection can be established
Custom Checks: Application-specific health checks
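
The three check types above could be sketched with the Python standard library as follows. The function names (`tcp_check`, `http_check`, `run_checks`) are hypothetical, and the sketch assumes any 2xx status counts as healthy. A real setup would usually rely on the load balancer's or orchestrator's built-in probes.

```python
import http.client
import socket
import urllib.request
from typing import Callable, Dict


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Healthy if a TCP connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url: str, timeout: float = 1.0) -> bool:
    """Healthy if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run named custom checks; a check that raises counts as unhealthy."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


if __name__ == "__main__":
    # Placeholder targets; replace with real hosts and application checks.
    print(run_checks({
        "db_port": lambda: tcp_check("127.0.0.1", 5432),
        "queue_depth_ok": lambda: True,
    }))
```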

⚡ Circuit Breakers

Prevent cascading failures by stopping calls to a dependency after repeated failures

Closed: Normal operation, requests go through
Open: Requests fail fast without reaching the dependency
Half-Open: After a cool-down, a limited number of requests test whether the issue is resolved
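
The three states above form a small state machine. The sketch below is a minimal single-threaded version under stated assumptions: the class and parameter names (`CircuitBreaker`, `failure_threshold`, `reset_timeout`) are hypothetical, any exception counts as a failure, and half-open allows exactly one trial call. Production libraries add locking, per-error policies, and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, operation):
        if self.state == "open":
            if self._clock() - self._opened_at >= self.reset_timeout:
                self.state = "half_open"  # cool-down over: allow a trial call
            else:
                raise CircuitOpenError("circuit open; failing fast")
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()

    def _record_success(self):
        self._failures = 0
        self.state = "closed"


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)

    def failing_dependency():
        raise ConnectionError("dependency down")

    for _ in range(3):
        try:
            breaker.call(failing_dependency)
        except (ConnectionError, CircuitOpenError) as exc:
            print(type(exc).__name__, "- state:", breaker.state)
```

The injectable `clock` exists only so the cool-down can be tested without waiting.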

📉 Graceful Degradation

Reduce functionality instead of failing completely

Example: Serve cached pages when the database is down
Example: Disable non-essential features during high load
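
The first example, serving cached pages when the database is down, might be sketched like this. `DatabaseUnavailable`, `get_page`, and the `degraded` flag are hypothetical; the cache is a plain dict standing in for a real cache, and staleness limits are ignored.

```python
from typing import Callable, Dict


class DatabaseUnavailable(Exception):
    """Hypothetical error raised by the data layer when the database is down."""


def get_page(page_id: str,
             load_from_db: Callable[[str], str],
             cache: Dict[str, str]) -> dict:
    """Return a page, falling back to a cached copy if the database fails."""
    try:
        body = load_from_db(page_id)
    except DatabaseUnavailable:
        if page_id in cache:
            # Degraded mode: possibly stale content is better than an error page.
            return {"body": cache[page_id], "degraded": True}
        raise  # nothing cached: the failure has to surface
    cache[page_id] = body  # refresh the cache on every successful read
    return {"body": body, "degraded": False}


if __name__ == "__main__":
    cache: Dict[str, str] = {}

    def healthy_db(page_id: str) -> str:
        return "<h1>Home</h1>"

    def broken_db(page_id: str) -> str:
        raise DatabaseUnavailable("primary unreachable")

    print(get_page("home", healthy_db, cache))  # fresh read, fills the cache
    print(get_page("home", broken_db, cache))   # served from cache, degraded
```

Returning a `degraded` flag lets the caller show a banner or skip features that need fresh data.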

📊 Measuring Reliability

📈 Key Metrics

⏳ MTBF

Mean Time Between Failures

Average time between system failures

🛠️ MTTR

Mean Time To Repair

Average time to repair a failed component

🔑 Availability

Uptime percentage

Target 99.9%+ (at 99.9%, about 43 minutes of downtime per 30 days); monitor downtime against the target
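
The three metrics are linked by the common steady-state approximation availability = MTBF / (MTBF + MTTR). The sketch below applies it and converts an availability target into a downtime allowance. The function names are hypothetical and the input numbers are made up for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def allowed_downtime_minutes(target: float, period_days: float = 30.0) -> float:
    """Downtime allowed over a period for a given availability target (e.g. 0.999)."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * period_days * 24 * 60


if __name__ == "__main__":
    # Assumed example: a failure every 999 hours on average, 1 hour to repair.
    print(f"availability: {availability(999, 1):.4f}")
    print(f"downtime allowed at 99.9% over 30 days: {allowed_downtime_minutes(0.999):.1f} min")
```

The formula shows two levers: fail less often (raise MTBF) or recover faster (lower MTTR). Automated recovery mostly targets the second.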

🎯 Reliability Best Practices

🔄 Regular Backups: Automate backups and test restores
🩺 Health Monitoring: 24/7 monitoring and alerts
⚡ Circuit Breaker Pattern: Implement in critical paths
📉 Graceful Degradation: Plan for reduced functionality