Object Storage Service

S3-like object storage with high durability, replication, consistency, and lifecycle management.

Learning Objectives

By the end of this case study, you will be able to:

  • Design durable storage systems with 99.999999999% (11 9s) durability
  • Implement multi-AZ replication and erasure coding strategies
  • Build scalable metadata management for billions of objects
  • Design efficient multipart upload and range read mechanisms
  • Implement lifecycle management and tiered storage cost optimization

Real-World Examples

Amazon S3: stores 100+ trillion objects and serves millions of requests per second

Google Cloud Storage: Powers YouTube video storage and Gmail attachments

Azure Blob Storage: Handles Microsoft Office 365 document storage

Cloudflare R2: Zero egress fees, compatible with S3 API

Requirements

Functional Requirements

  • Create, read, update (overwrite), delete objects
  • Multipart uploads, range reads, pre-signed URLs
  • Bucket-level ACLs/policies and lifecycle rules

Non-functional Requirements

  • Durability of 11 nines (99.999999999%) or better via multi-AZ replication
  • Low read latency at scale; high throughput writes
  • Cost efficiency with tiered storage

High-Level Design

  • API layer for authN/Z and request handling
  • Metadata DB for bucket/object metadata
  • Object storage with erasure coding and replication
  • Background replicators and repair jobs

Capacity & Sizing

  • Estimate object count, average object size, and requests/sec for PUT and GET
  • Plan for replication factor or erasure coding overhead
  • Throughput of storage nodes and network egress
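The sizing pass above can be sketched as a back-of-envelope calculation; every constant below (object count, average size, peak GET rate, NIC speed) is an illustrative assumption, not a measured figure:

```python
import math

TOTAL_OBJECTS   = 10_000_000_000        # 10 billion objects (assumption)
AVG_OBJECT_B    = 1_000_000             # 1 MB average object (assumption)
GET_RPS         = 500_000               # peak GETs/sec (assumption)
NODE_EGRESS_BPS = 10 * 10**9 // 8       # one 10 Gbps NIC per node, in bytes/sec

raw_pb = TOTAL_OBJECTS * AVG_OBJECT_B / 10**15        # raw data before redundancy
egress_Bps = GET_RPS * AVG_OBJECT_B                   # bytes/sec leaving the fleet
nodes_for_egress = math.ceil(egress_Bps / NODE_EGRESS_BPS)

print(f"raw data: {raw_pb:.0f} PB")                          # 10 PB
print(f"peak egress: {egress_Bps * 8 / 10**12:.0f} Tbps")    # 4 Tbps
print(f"nodes needed for egress alone: {nodes_for_egress}")  # 400
```

Note that redundancy (replication factor or erasure-coding overhead) multiplies the raw figure, and disk/CPU limits may bind before network does.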

Key Components

  • Auth/IAM, API Gateway
  • Metadata Store (SQL/NoSQL)
  • Chunk store (object data) with EC/replication
  • Replicators, lifecycle workers

Architecture

High-level components and data flow

Data Model

Core entities and relationships

  • buckets (bucket PK, owner_id, region, policy_json)
  • objects ((bucket, key, version) PK, size, etag, storage_class)
  • parts ((bucket, key, upload_id, part_no) PK, etag, size)
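The entities above can be sketched as Python records; the field names follow the bullets, while the types and the storage_class values are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bucket:
    bucket: str          # PK
    owner_id: str
    region: str
    policy_json: str

@dataclass(frozen=True)
class ObjectRecord:
    bucket: str          # composite PK: (bucket, key, version)
    key: str
    version: str
    size: int
    etag: str
    storage_class: str   # e.g. "hot" / "warm" / "cold" (illustrative values)

@dataclass(frozen=True)
class Part:
    bucket: str          # composite PK: (bucket, key, upload_id, part_no)
    key: str
    upload_id: str
    part_no: int
    etag: str
    size: int
```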

APIs

  • PUT /:bucket/:key (multipart supported)
  • GET /:bucket/:key?version=... (range, conditional)
  • DELETE /:bucket/:key?version=...
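A pre-signed URL can be sketched as an HMAC over the method, bucket, key, and expiry; the host name, secret handling, and query format below are illustrative, not S3's actual signing scheme (which is SigV4):

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"demo-secret"  # per-tenant signing key (assumption; real keys come from IAM)

def presign(bucket: str, key: str, method: str = "GET", ttl_s: int = 900) -> str:
    """Return a time-limited signed URL for one method/bucket/key."""
    expires = int(time.time()) + ttl_s
    payload = f"{method}\n{bucket}\n{key}\n{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    qs = urlencode({"expires": expires, "signature": sig})
    return f"https://storage.example.com/{bucket}/{key}?{qs}"

def verify(bucket: str, key: str, method: str, expires: str, signature: str) -> bool:
    """Recompute the signature server-side; reject expired or tampered links."""
    if int(expires) < time.time():
        return False
    payload = f"{method}\n{bucket}\n{key}\n{expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Because the signature covers the exact bucket/key/method, a leaked URL cannot be replayed against a different object or verb, and it dies at expiry.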

Hot Path

  1. Upload path: init → upload parts → commit manifest
  2. Download path: authorize → resolve version → stream/range
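The upload path above (init, upload parts, commit manifest) can be sketched in memory; a real service would stream parts to chunk storage and persist the manifest in the metadata DB rather than hold everything in dicts:

```python
import hashlib
import uuid

uploads: dict = {}   # upload_id -> in-progress multipart state
objects: dict = {}   # (bucket, key) -> committed object bytes

def init_upload(bucket: str, key: str) -> str:
    """Step 1: allocate an upload session and return its id."""
    upload_id = uuid.uuid4().hex
    uploads[upload_id] = {"bucket": bucket, "key": key, "parts": {}}
    return upload_id

def upload_part(upload_id: str, part_no: int, data: bytes) -> str:
    """Step 2: store one part; the returned etag lets the client build the manifest."""
    etag = hashlib.md5(data).hexdigest()
    uploads[upload_id]["parts"][part_no] = {"etag": etag, "data": data}
    return etag

def complete_upload(upload_id: str, manifest: dict) -> str:
    """Step 3: validate the client manifest {part_no: etag} and commit atomically."""
    up = uploads.pop(upload_id)
    body = b""
    for part_no, etag in sorted(manifest.items()):
        part = up["parts"][part_no]
        assert part["etag"] == etag, "manifest etag mismatch"
        body += part["data"]
    objects[(up["bucket"], up["key"])] = body
    return hashlib.md5(body).hexdigest()
```

The commit is the only step that mutates object metadata, so a crashed upload leaves orphaned parts (garbage-collectable) but never a half-visible object.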

Caching & TTL

  • Edge-cache GET responses for public objects with appropriate Cache-Control headers
  • Conditional requests with ETag/If-None-Match to reduce egress
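The ETag/If-None-Match bullet can be sketched as a handler that returns 304 with an empty body when the client's cached copy still matches (the hash choice here is an assumption; content hash vs version id both work):

```python
import hashlib

def handle_get(body: bytes, if_none_match: str = None):
    """Return (status, etag, payload); 304 skips the payload and saves egress."""
    etag = '"' + hashlib.sha256(body).hexdigest() + '"'
    if if_none_match == etag:
        return 304, etag, b""   # client copy is fresh: headers only, no body
    return 200, etag, body
```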

Scaling

  • Shard metadata DB by bucket; object data on erasure-coded chunks
  • Tiered storage (hot/warm/cold) with lifecycle transitions
  • Edge acceleration with signed URLs

Trade-offs

  • Strong vs eventual consistency on overwrite
  • Erasure coding vs replication cost/latency
  • Small object overhead vs large object throughput
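The erasure coding vs replication trade-off can be quantified with a small sketch: 3x replication pays more storage but serves a read from a single copy, while 8+4 erasure coding halves the overhead at the cost of gathering k fragments per read, which worsens tail latency (all parameters are illustrative):

```python
def replication(copies: int) -> dict:
    # N full copies: N x storage, survives N-1 losses, one disk read per GET.
    return {"overhead": float(copies), "loss_tolerance": copies - 1, "read_fanout": 1}

def erasure(k: int, m: int) -> dict:
    # k data + m parity fragments: (k+m)/k storage, survives m fragment losses,
    # but a GET must gather k fragments before it can respond.
    return {"overhead": (k + m) / k, "loss_tolerance": m, "read_fanout": k}

print(replication(3))   # {'overhead': 3.0, 'loss_tolerance': 2, 'read_fanout': 1}
print(erasure(8, 4))    # {'overhead': 1.5, 'loss_tolerance': 4, 'read_fanout': 8}
```

A common compromise is replicating hot/small objects and erasure-coding cold/large ones, which also addresses the small-object-overhead bullet.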

Failure Modes & Mitigations

  • Chunk loss → background repair
  • Hot buckets → adaptive throttling and partitioning
  • Cross-AZ latency spikes → queue and backpressure
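The hot-bucket mitigation above can be sketched as a per-bucket token bucket; the rate and burst values would be tuned adaptively per tenant in practice:

```python
import time

class TokenBucket:
    """Per-bucket rate limiter: refills at `rate` tokens/sec up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        """Admit one request if a token is available, else signal throttling."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Rejected requests should surface as a retryable status (e.g. HTTP 503 with Retry-After) so clients back off rather than hammer the hot partition.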

Implementation Notes

  • Use consistent hashing for object placement across storage nodes
  • Implement bloom filters to reduce disk seeks for non-existent objects
  • Design efficient manifest format for multipart uploads
  • Use content-addressable storage to eliminate duplicate data
  • Implement background compaction to merge small objects
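The consistent-hashing note can be sketched as a ring with virtual nodes, so that adding a storage node moves only roughly 1/N of the keys; the hash function and vnode count below are arbitrary choices for illustration:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: each node owns many virtual points."""

    def __init__(self, nodes, vnodes: int = 100):
        self.ring = sorted(
            (self._h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, obj_key: str) -> str:
        """Walk clockwise from the key's hash to the next virtual node."""
        i = bisect.bisect(self.keys, self._h(obj_key)) % len(self.keys)
        return self.ring[i][1]
```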

Best Practices

  • Design for failure: assume disks and nodes will fail regularly
  • Implement gradual rollout for schema changes and new features
  • Use circuit breakers to prevent cascade failures
  • Monitor storage efficiency and implement compression where beneficial
  • Design APIs to be backward compatible and versioned
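The circuit-breaker practice can be sketched as a counter that opens after N consecutive failures and half-opens after a cooldown; the threshold and cooldown values are assumptions:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; retry after `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                   # closed: pass through
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at, self.failures = None, 0       # half-open: probe again
            return True
        return False                                      # open: fail fast

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

Wrapping calls to a degraded AZ or metadata shard this way converts slow timeouts into fast failures, which is what stops the cascade.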

Common Pitfalls

  • Not handling small object overhead - billions of tiny files hurt performance
  • Insufficient monitoring of cross-AZ replication lag during outages
  • Poor hot bucket partitioning leading to storage node hotspots
  • Not implementing proper backpressure during high load periods
  • Inadequate testing of disaster recovery and cross-region failover

Observability

  • SLIs: PUT/GET error rate, p95 latency, durability repair backlog
  • Audit logs for access; storage utilization dashboards
  • Replicator lag and retry metrics