Object Storage Service

S3-like object storage with high durability, replication, consistency, and lifecycle management.

Learning Objectives

By the end of this case study, you will be able to:

  • Design durable storage systems with 99.999999999% (11 9s) durability
  • Implement multi-AZ replication and erasure coding strategies
  • Build scalable metadata management for billions of objects
  • Design efficient multipart upload and range read mechanisms
  • Implement lifecycle management and tiered storage cost optimization

Real-World Examples

Amazon S3: stores 100+ trillion objects and serves millions of requests per second

Google Cloud Storage: Powers YouTube video storage and Gmail attachments

Azure Blob Storage: Handles Microsoft Office 365 document storage

Cloudflare R2: Zero egress fees, compatible with S3 API

Requirements

Functional Requirements

  • Create, read, update (overwrite), delete objects
  • Multipart uploads, range reads, pre-signed URLs
  • Bucket-level ACLs/policies and lifecycle rules

Non-functional Requirements

  • Durability of 11 nines (99.999999999%) or better via multi-AZ replication
  • Low read latency at scale; high throughput writes
  • Cost efficiency with tiered storage

High-Level Design

  • API layer for authN/Z and request handling
  • Metadata DB for bucket/object metadata
  • Object storage with erasure coding and replication
  • Background replicators and repair jobs

Capacity & Sizing

  • Estimate object count, average object size, and requests/sec for PUT and GET
  • Plan for replication factor or erasure coding overhead
  • Throughput of storage nodes and network egress
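The sizing pass above can be sketched as a back-of-envelope calculation; every constant below (object count, average size, peak GET rate, NIC speed) is an illustrative assumption, not a measured figure:

```python
import math

TOTAL_OBJECTS   = 10_000_000_000        # 10 billion objects (assumption)
AVG_OBJECT_B    = 1_000_000             # 1 MB average object (assumption)
GET_RPS         = 500_000               # peak GETs/sec (assumption)
NODE_EGRESS_BPS = 10 * 10**9 // 8       # one 10 Gbps NIC per node, in bytes/sec

raw_pb = TOTAL_OBJECTS * AVG_OBJECT_B / 10**15        # raw data before redundancy
egress_Bps = GET_RPS * AVG_OBJECT_B                   # bytes/sec leaving the fleet
nodes_for_egress = math.ceil(egress_Bps / NODE_EGRESS_BPS)

print(f"raw data: {raw_pb:.0f} PB")                          # 10 PB
print(f"peak egress: {egress_Bps * 8 / 10**12:.0f} Tbps")    # 4 Tbps
print(f"nodes needed for egress alone: {nodes_for_egress}")  # 400
```

Note that redundancy (replication factor or erasure-coding overhead) multiplies the raw figure, and disk/CPU limits may bind before network does.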

Key Components

  • Auth/IAM, API Gateway
  • Metadata Store (SQL/NoSQL)
  • Chunk store (object data) with EC/replication
  • Replicators, lifecycle workers

Architecture

High-level components and data flow

Data Model

Core entities and relationships

  • buckets (bucket PK, owner_id, region, policy_json)
  • objects ((bucket, key, version) PK, size, etag, storage_class)
  • parts ((bucket, key, upload_id, part_no) PK, etag, size)
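The entities above can be sketched as Python records; the field names follow the bullets, while the types and the storage_class values are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bucket:
    bucket: str          # PK
    owner_id: str
    region: str
    policy_json: str

@dataclass(frozen=True)
class ObjectRecord:
    bucket: str          # composite PK: (bucket, key, version)
    key: str
    version: str
    size: int
    etag: str
    storage_class: str   # e.g. "hot" / "warm" / "cold" (illustrative values)

@dataclass(frozen=True)
class Part:
    bucket: str          # composite PK: (bucket, key, upload_id, part_no)
    key: str
    upload_id: str
    part_no: int
    etag: str
    size: int
```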

APIs

  • PUT /:bucket/:key (multipart supported)
  • GET /:bucket/:key?version=... (range, conditional)
  • DELETE /:bucket/:key?version=...
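A pre-signed URL can be sketched as an HMAC over the method, bucket, key, and expiry; the host name, secret handling, and query format below are illustrative, not S3's actual signing scheme (which is SigV4):

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"demo-secret"  # per-tenant signing key (assumption; real keys come from IAM)

def presign(bucket: str, key: str, method: str = "GET", ttl_s: int = 900) -> str:
    """Return a time-limited signed URL for one method/bucket/key."""
    expires = int(time.time()) + ttl_s
    payload = f"{method}\n{bucket}\n{key}\n{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    qs = urlencode({"expires": expires, "signature": sig})
    return f"https://storage.example.com/{bucket}/{key}?{qs}"

def verify(bucket: str, key: str, method: str, expires: str, signature: str) -> bool:
    """Recompute the signature server-side; reject expired or tampered links."""
    if int(expires) < time.time():
        return False
    payload = f"{method}\n{bucket}\n{key}\n{expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Because the signature covers the exact bucket/key/method, a leaked URL cannot be replayed against a different object or verb, and it dies at expiry.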

Hot Path

  1. Upload path: init → upload parts → commit manifest
  2. Download path: authorize → resolve version → stream/range
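The upload path above (init, upload parts, commit manifest) can be sketched in memory; a real service would stream parts to chunk storage and persist the manifest in the metadata DB rather than hold everything in dicts:

```python
import hashlib
import uuid

uploads: dict = {}   # upload_id -> in-progress multipart state
objects: dict = {}   # (bucket, key) -> committed object bytes

def init_upload(bucket: str, key: str) -> str:
    """Step 1: allocate an upload session and return its id."""
    upload_id = uuid.uuid4().hex
    uploads[upload_id] = {"bucket": bucket, "key": key, "parts": {}}
    return upload_id

def upload_part(upload_id: str, part_no: int, data: bytes) -> str:
    """Step 2: store one part; the returned etag lets the client build the manifest."""
    etag = hashlib.md5(data).hexdigest()
    uploads[upload_id]["parts"][part_no] = {"etag": etag, "data": data}
    return etag

def complete_upload(upload_id: str, manifest: dict) -> str:
    """Step 3: validate the client manifest {part_no: etag} and commit atomically."""
    up = uploads.pop(upload_id)
    body = b""
    for part_no, etag in sorted(manifest.items()):
        part = up["parts"][part_no]
        assert part["etag"] == etag, "manifest etag mismatch"
        body += part["data"]
    objects[(up["bucket"], up["key"])] = body
    return hashlib.md5(body).hexdigest()
```

The commit is the only step that mutates object metadata, so a crashed upload leaves orphaned parts (garbage-collectable) but never a half-visible object.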

Caching & TTL

  • Edge-cache GET responses for public objects with appropriate Cache-Control headers
  • Conditional requests with ETag/If-None-Match to reduce egress
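The ETag/If-None-Match bullet can be sketched as a handler that returns 304 with an empty body when the client's cached copy still matches (the hash choice here is an assumption; content hash vs version id both work):

```python
import hashlib

def handle_get(body: bytes, if_none_match: str = None):
    """Return (status, etag, payload); 304 skips the payload and saves egress."""
    etag = '"' + hashlib.sha256(body).hexdigest() + '"'
    if if_none_match == etag:
        return 304, etag, b""   # client copy is fresh: headers only, no body
    return 200, etag, body
```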

Scaling

  • Shard metadata DB by bucket; object data on erasure-coded chunks
  • Tiered storage (hot/warm/cold) with lifecycle transitions
  • Edge acceleration with signed URLs

Trade-offs

  • Strong vs eventual consistency on overwrite
  • Erasure coding vs replication cost/latency
  • Small object overhead vs large object throughput
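The erasure coding vs replication trade-off can be quantified with a small sketch: 3x replication pays more storage but serves a read from a single copy, while 8+4 erasure coding halves the overhead at the cost of gathering k fragments per read, which worsens tail latency (all parameters are illustrative):

```python
def replication(copies: int) -> dict:
    # N full copies: N x storage, survives N-1 losses, one disk read per GET.
    return {"overhead": float(copies), "loss_tolerance": copies - 1, "read_fanout": 1}

def erasure(k: int, m: int) -> dict:
    # k data + m parity fragments: (k+m)/k storage, survives m fragment losses,
    # but a GET must gather k fragments before it can respond.
    return {"overhead": (k + m) / k, "loss_tolerance": m, "read_fanout": k}

print(replication(3))   # {'overhead': 3.0, 'loss_tolerance': 2, 'read_fanout': 1}
print(erasure(8, 4))    # {'overhead': 1.5, 'loss_tolerance': 4, 'read_fanout': 8}
```

A common compromise is replicating hot/small objects and erasure-coding cold/large ones, which also addresses the small-object-overhead bullet.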

Failure Modes & Mitigations

  • Chunk loss → background repair
  • Hot buckets → adaptive throttling and partitioning
  • Cross-AZ latency spikes → queue and backpressure
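The hot-bucket mitigation above can be sketched as a per-bucket token bucket; the rate and burst values would be tuned adaptively per tenant in practice:

```python
import time

class TokenBucket:
    """Per-bucket rate limiter: refills at `rate` tokens/sec up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        """Admit one request if a token is available, else signal throttling."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Rejected requests should surface as a retryable status (e.g. HTTP 503 with Retry-After) so clients back off rather than hammer the hot partition.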

Implementation Notes

  • Use consistent hashing for object placement across storage nodes
  • Implement bloom filters to reduce disk seeks for non-existent objects
  • Design efficient manifest format for multipart uploads
  • Use content-addressable storage to eliminate duplicate data
  • Implement background compaction to merge small objects
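The consistent-hashing note can be sketched as a ring with virtual nodes, so that adding a storage node moves only roughly 1/N of the keys; the hash function and vnode count below are arbitrary choices for illustration:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: each node owns many virtual points."""

    def __init__(self, nodes, vnodes: int = 100):
        self.ring = sorted(
            (self._h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, obj_key: str) -> str:
        """Walk clockwise from the key's hash to the next virtual node."""
        i = bisect.bisect(self.keys, self._h(obj_key)) % len(self.keys)
        return self.ring[i][1]
```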

Best Practices

  • Design for failure: assume disks and nodes will fail regularly
  • Implement gradual rollout for schema changes and new features
  • Use circuit breakers to prevent cascade failures
  • Monitor storage efficiency and implement compression where beneficial
  • Design APIs to be backward compatible and versioned
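The circuit-breaker practice can be sketched as a counter that opens after N consecutive failures and half-opens after a cooldown; the threshold and cooldown values are assumptions:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; retry after `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                   # closed: pass through
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at, self.failures = None, 0       # half-open: probe again
            return True
        return False                                      # open: fail fast

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

Wrapping calls to a degraded AZ or metadata shard this way converts slow timeouts into fast failures, which is what stops the cascade.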

Common Pitfalls

  • Not handling small object overhead - billions of tiny files hurt performance
  • Insufficient monitoring of cross-AZ replication lag during outages
  • Poor hot bucket partitioning leading to storage node hotspots
  • Not implementing proper backpressure during high load periods
  • Inadequate testing of disaster recovery and cross-region failover

Observability

  • SLIs: PUT/GET error rate, p95 latency, durability repair backlog
  • Audit logs for access; storage utilization dashboards
  • Replicator lag and retry metrics