Reliability
How it works
- Define SLOs/SLIs
- Add redundancy and timeouts
- Test failures/chaos
- Automate recovery and alerts
Overview
- Design to tolerate faults and recover quickly; prioritize user-visible correctness.
- Measure with SLIs (availability, latency, error rate) against SLOs; manage risk via error budgets.
- Techniques: redundancy, timeouts/retries/backoff, isolation (bulkheads), graceful degradation, backups/DR.
When to use
- Critical user flows where downtime or data corruption is costly (payments, auth, checkout).
- Systems with strict SLAs/compliance requirements.
- Platforms with many dependencies where partial failures are common.
Trade-offs
- Extra cost for redundancy (multi-AZ/region), added complexity and latency from safety mechanisms.
- Over-engineering can slow delivery; balance with error budgets and staged rollouts.
Patterns
- Health checks + autoscaling; circuit breakers + bulkheads.
- Timeouts with bounded retries and jitter; idempotency for safe replays.
- Read replicas and automated failover; backups with regular restore tests.
- Chaos testing/game days; runbooks and automation for common incidents.
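The "timeouts with bounded retries and jitter" pattern above can be sketched in a few lines (a minimal illustration; `op` stands for any idempotent remote call, and the delay values are arbitrary):

```python
import random
import time

def retry_with_backoff(op, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Call op(), retrying on exception with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # bounded retries: give up after the final attempt
            # full jitter: sleep a random fraction of the capped exponential delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Jitter spreads out retries from many clients so a recovering dependency is not hit by a synchronized thundering herd; the bound keeps a dead dependency from being retried forever.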
Anti-patterns
- Infinite retries without backoff or idempotency.
- Single points of failure; untested backups/DR; manual-only recovery.
- Coupled deploys creating cascading failures.
📐 Quick Diagram
Client ▶ LB ▶ App A|B ▶ DB (primary)
                  └──▶ Cache ▶ fallback reads when the DB is down
❓ Interview Q&A (concise)
- Q: Reliability vs Availability? A: Reliability is correctness over time; availability is readiness to serve. You can be available but unreliable, and vice versa.
- Q: What is an error budget? A: Allowable unreliability derived from SLO; it guides release pace and risk acceptance.
- Q: Avoid cascading failures? A: Timeouts, circuit breakers, bulkheads, load shedding, and backpressure.
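The error-budget answer above is easy to make concrete: the budget is the SLO's allowed unreliability converted into downtime over a window (a small sketch, not tied to any particular monitoring tool):

```python
def error_budget_minutes(availability_slo, window_days=30):
    """Downtime allowed by an availability SLO over a rolling window, in minutes."""
    return (1 - availability_slo) * window_days * 24 * 60

# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
```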
🎯 What is Reliability?
Reliability is like having a safety net for your system: even when things go wrong (hardware failures, software bugs, human errors), the system keeps on ticking. A reliable system minimizes downtime and data loss and maintains consistent performance.
🚨 Failure Scenario
A hardware component fails...
⬇️
🖥️ Server A | 🖥️ Server B (Standby) | 🖥️ Server C (Standby)
⬇️
...traffic is rerouted to healthy components.
🏗️ Key Concepts
⚠️ Fault vs Failure
Fault: a single component deviates from its specification
Failure: the system as a whole stops providing its required service to users
🔄 Redundancy
Having multiple instances of critical components
Active-Passive: Standby components only activate on failure
Active-Active: All components actively handle traffic
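The active-passive setup above can be sketched as a routing function that prefers the active node and fails over to a standby (a minimal illustration; the `servers` list and `healthy` flags are hypothetical stand-ins for real health state):

```python
def route(servers):
    """Active-passive routing: pick the first healthy server in priority order.

    `servers` is ordered: the active node first, then standbys."""
    for server in servers:
        if server["healthy"]:
            return server["name"]
    raise RuntimeError("no healthy servers available")
```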
🛠️ Reliability Techniques
📂 Replication
Maintain multiple copies of data/services
Synchronous: Immediate consistency, higher latency
Asynchronous: Eventual consistency, lower latency
🔙 Backup and Recovery
Regular backups and a tested disaster recovery plan
Full Backup: Complete system snapshot
Incremental Backup: Only changed data since last backup
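Full vs incremental backups can be sketched with plain dictionaries standing in for system state (illustrative only; note this naive incremental scheme records additions and changes but not deletions):

```python
def full_backup(state):
    """Snapshot the entire state."""
    return dict(state)

def incremental_backup(state, last_backup):
    """Capture only entries added or changed since the last backup."""
    return {k: v for k, v in state.items() if last_backup.get(k) != v}

def restore(full, increments):
    """Replay incrementals on top of the full snapshot, oldest first."""
    restored = dict(full)
    for inc in increments:
        restored.update(inc)
    return restored
```

The restore path is the part worth testing regularly: a backup you have never restored is, in effect, untested code.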
🩺 Health Checks
Regularly monitor system health and component status
HTTP Checks: Monitor HTTP response status
TCP Checks: Monitor TCP connection ability
Custom Checks: Application-specific health checks
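A TCP check from the list above fits in a few lines of standard-library code (a minimal sketch; the host and port are whatever endpoint you choose to probe):

```python
import socket

def tcp_health_check(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

An HTTP check is the same idea one layer up: connect, issue a request to a health endpoint, and treat anything other than a 2xx response (or a timeout) as unhealthy.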
⚡ Circuit Breakers
Prevent cascading failures by failing fast instead of repeatedly calling a dependency that keeps failing
Closed: Normal operation, requests go through
Open: Circuit is open, requests fail fast
Half-Open: Test if the issue is resolved, limited requests allowed
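The three states above can be sketched as a minimal in-process circuit breaker (the threshold and timeout values are illustrative; production libraries add per-endpoint statistics, metrics, and concurrency handling):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after N consecutive failures,
    half-open after a cooldown, closed again on a successful trial call."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return "half-open"  # cooldown elapsed: allow a trial request
        return "open"

    def call(self, op):
        state = self.state
        if state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = op()
        except Exception:
            self.failures += 1
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # (re)open the circuit
            raise
        # success: reset to closed
        self.failures = 0
        self.opened_at = None
        return result
```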
📉 Graceful Degradation
Reduce functionality instead of failing completely
Example: Serve cached pages when the database is down
Example: Disable non-essential features during high load
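The cached-page example above can be sketched as a read path that falls back to a possibly stale cache when the primary store is down (illustrative only; `db_fetch` and `cache` are hypothetical stand-ins for a real database client and cache):

```python
def get_page(page_id, db_fetch, cache):
    """Serve from the primary store; fall back to a (possibly stale) cached copy."""
    try:
        page = db_fetch(page_id)
        cache[page_id] = page  # refresh the cache on every successful read
        return page, "fresh"
    except Exception:
        if page_id in cache:
            return cache[page_id], "stale-cache"  # degraded but still available
        raise  # nothing to degrade to
```

Returning the freshness flag lets callers (or the UI) signal that the data may be stale rather than silently serving it as current.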
📊 Measuring Reliability
📈 Key Metrics
⏳ MTBF
Mean Time Between Failures
Average time between system failures
🛠️ MTTR
Mean Time To Repair
Average time to repair a failed component
🔑 Availability
Uptime percentage
Target 99.9%+, monitor downtime
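The three metrics above are related: steady-state availability can be estimated from MTBF and MTTR (a simple sketch; the hour values in the comment are illustrative):

```python
def availability(mtbf_hours, mttr_hours):
    """Estimate steady-state availability as MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# e.g. a failure roughly every 720h with ~0.72h to repair gives ~99.9% availability
```

The formula also shows the two levers: you can raise availability by failing less often (higher MTBF) or by recovering faster (lower MTTR); in practice MTTR is often the cheaper one to improve.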
🎯 Reliability Best Practices
🔄 Regular Backups: Automate and test backups
🩺 Health Monitoring: 24/7 monitoring and alerts
⚡ Circuit Breaker Pattern: Implement in critical paths
📉 Graceful Degradation: Plan for reduced functionality