File Storage Service
S3-like object storage with high durability, replication, consistency, and lifecycle management.
Learning Objectives
By the end of this case study, you will understand how to:
- Design durable storage systems with 99.999999999% (11 9s) durability
- Implement multi-AZ replication and erasure coding strategies
- Build scalable metadata management for billions of objects
- Design efficient multipart upload and range read mechanisms
- Implement lifecycle management and tiered storage cost optimization
Real-World Examples
Amazon S3: stores 100+ trillion objects and processes millions of requests per second
Google Cloud Storage: Powers YouTube video storage and Gmail attachments
Azure Blob Storage: Handles Microsoft Office 365 document storage
Cloudflare R2: Zero egress fees, compatible with S3 API
Requirements
Functional Requirements
- Create, read, update (overwrite), delete objects
- Multipart uploads, range reads, pre-signed URLs
- Bucket-level ACLs/policies and lifecycle rules
Non-functional Requirements
- Durability 11+ nines with multi-AZ replication
- Low read latency at scale; high-throughput writes
- Cost efficiency with tiered storage
High-Level Design
- API layer for authN/Z and request handling
- Metadata DB for bucket/object metadata
- Object storage with erasure coding and replication
- Background replicators and repair jobs
Capacity & Sizing
- Estimate object count, average object size, and requests/sec for PUT and GET
- Plan for replication factor or erasure coding overhead
- Throughput of storage nodes and network egress
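The sizing steps above can be sketched as a back-of-envelope calculation. All inputs below are illustrative assumptions, not figures from this document:

```python
# Back-of-envelope capacity sketch. Inputs are assumed example numbers.
def capacity_estimate(objects: int, avg_size_bytes: int,
                      overhead_factor: float) -> dict:
    """Raw and physical storage given a replication/EC overhead factor."""
    raw = objects * avg_size_bytes
    physical = raw * overhead_factor
    return {"raw_bytes": raw, "physical_bytes": physical}

# Example: 10 billion objects, 1 MiB average, EC(10, 4) -> 1.4x overhead
est = capacity_estimate(10_000_000_000, 1 * 1024**2, 14 / 10)
print(est["raw_bytes"] / 1024**5)       # raw capacity in PiB
print(est["physical_bytes"] / 1024**5)  # physical capacity in PiB
```

The same function works for replication by passing the replication factor (e.g. `3.0`) as the overhead factor.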
Key Components
- Auth/IAM, API Gateway
- Metadata Store (SQL/NoSQL)
- Chunk store (object data) with EC/replication
- Replicators, lifecycle workers
Architecture
High-level components and data flow
Data Model
Core entities and relationships
- buckets(bucket PK, owner_id, region, policy_json)
- objects(bucket, key PK, version, size, etag, storage_class)
- parts(bucket, key, upload_id, part_no, etag, size)
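The entities above can be sketched as typed records; field names follow the data model, while the concrete types are assumptions:

```python
from dataclasses import dataclass

# Metadata entities from the data model as immutable records (types assumed).
@dataclass(frozen=True)
class Bucket:
    bucket: str          # PK
    owner_id: str
    region: str
    policy_json: str

@dataclass(frozen=True)
class ObjectMeta:
    bucket: str          # PK part 1
    key: str             # PK part 2
    version: str
    size: int
    etag: str
    storage_class: str   # e.g. hot / warm / cold tier name

@dataclass(frozen=True)
class Part:
    bucket: str
    key: str
    upload_id: str
    part_no: int
    etag: str
    size: int
```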
APIs
- PUT /:bucket/:key (multipart supported)
- GET /:bucket/:key?version=... (range, conditional)
- DELETE /:bucket/:key?version=...
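Pre-signed URLs (mentioned in the functional requirements) can be sketched with an HMAC over the method, path, and expiry. The secret and query-parameter names are assumptions; real S3 uses the more involved SigV4 scheme:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-signing-key"  # assumed; never embed real keys in code

def presign(method: str, bucket: str, key: str, ttl_s: int = 300) -> str:
    """Return a time-limited URL anyone can use without credentials."""
    expires = int(time.time()) + ttl_s
    msg = f"{method}\n/{bucket}/{key}\n{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"/{bucket}/{key}?" + urlencode({"expires": expires, "signature": sig})

def verify(method: str, bucket: str, key: str,
           expires: int, signature: str) -> bool:
    if time.time() > expires:
        return False  # link has expired
    msg = f"{method}\n/{bucket}/{key}\n{expires}".encode()
    want = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(want, signature)  # constant-time compare
```

Binding the HTTP method into the signature means a pre-signed GET cannot be replayed as a PUT or DELETE.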
Hot Path
- Upload path: init → upload parts → commit manifest
- Download path: authorize → resolve version → stream/range
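The upload hot path (init → upload parts → commit manifest) can be sketched in a few functions; the in-memory dicts stand in for the metadata DB and chunk store:

```python
import hashlib
import uuid

uploads: dict[str, list[dict]] = {}   # upload_id -> recorded parts (metadata DB stand-in)
chunks: dict[str, bytes] = {}         # etag -> part data (chunk store stand-in)

def init_upload() -> str:
    upload_id = uuid.uuid4().hex
    uploads[upload_id] = []
    return upload_id

def upload_part(upload_id: str, part_no: int, data: bytes) -> str:
    etag = hashlib.md5(data).hexdigest()   # per-part checksum (assumed MD5)
    chunks[etag] = data
    uploads[upload_id].append({"part_no": part_no, "etag": etag,
                               "size": len(data)})
    return etag

def commit(upload_id: str) -> dict:
    # The ordered manifest becomes the object's durable record; until
    # commit, partially uploaded parts are invisible to readers.
    parts = sorted(uploads.pop(upload_id), key=lambda p: p["part_no"])
    return {"parts": parts, "size": sum(p["size"] for p in parts)}
```

Note that parts may arrive out of order or in parallel; only the commit step imposes ordering.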
Caching & TTL
- Edge cache signed GETs for public objects with appropriate Cache-Control
- Conditional requests with ETag/If-None-Match to reduce egress
Scaling
- Shard metadata DB by bucket; object data on erasure-coded chunks
- Tiered storage (hot/warm/cold) with lifecycle transitions
- Edge acceleration with signed URLs
Trade-offs
- Strong vs eventual consistency on overwrite
- Erasure coding vs replication cost/latency
- Small object overhead vs large object throughput
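The erasure coding vs replication trade-off can be made concrete with storage-overhead arithmetic. The (k=10, m=4) code below is an assumed example: EC stores k data fragments plus m parity fragments, so overhead is (k + m) / k while tolerating the loss of any m fragments:

```python
# Physical bytes stored per logical byte under each redundancy scheme.
def replication_overhead(copies: int) -> float:
    return float(copies)

def ec_overhead(k: int, m: int) -> float:
    return (k + m) / k

print(replication_overhead(3))  # 3.0x, but any single replica serves a read
print(ec_overhead(10, 4))       # 1.4x, but a degraded read touches many nodes
```

This is the cost/latency tension: EC roughly halves storage cost versus 3x replication, at the price of reconstruction reads and repair bandwidth when fragments are lost.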
Failure Modes & Mitigations
- Chunk loss → background repair
- Hot buckets → adaptive throttling and partitioning
- Cross-AZ latency spikes → queue and backpressure
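The hot-bucket mitigation above can be sketched as a per-bucket token bucket; rate and burst values are illustrative assumptions:

```python
import time

class TokenBucket:
    """Per-bucket request throttle: refills `rate` tokens/sec up to `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller should shed load (e.g. 503 + Retry-After)

# One limiter per hot bucket; numbers are examples only.
bucket_limits = {"hot-bucket": TokenBucket(rate=100.0, burst=10.0)}
```

An adaptive variant would lower `rate` as cross-AZ replication lag grows, providing the backpressure mentioned above.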
Implementation Notes
- Use consistent hashing for object placement across storage nodes
- Implement bloom filters to reduce disk seeks for non-existent objects
- Design efficient manifest format for multipart uploads
- Use content-addressable storage to eliminate duplicate data
- Implement background compaction to merge small objects
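The consistent-hashing placement noted above can be sketched as a hash ring with virtual nodes (the vnode count of 64 is an assumption) so that adding or removing a storage node only remaps a small fraction of keys:

```python
import bisect
import hashlib

class Ring:
    """Consistent-hash ring; vnodes smooth the key distribution."""
    def __init__(self, nodes: list[str], vnodes: int = 64):
        self._ring: list[tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes):
                h = int.from_bytes(
                    hashlib.sha1(f"{node}#{i}".encode()).digest()[:8], "big")
                self._ring.append((h, node))
        self._ring.sort()
        self._points = [h for h, _ in self._ring]

    def node_for(self, obj_key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash.
        h = int.from_bytes(hashlib.sha1(obj_key.encode()).digest()[:8], "big")
        idx = bisect.bisect(self._points, h) % len(self._ring)
        return self._ring[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.node_for("bucket1/photos/cat.jpg")
```

Placement is deterministic, so any API node can resolve an object's location without a lookup service.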
Best Practices
- Design for failure: assume disks and nodes will fail regularly
- Implement gradual rollout for schema changes and new features
- Use circuit breakers to prevent cascade failures
- Monitor storage efficiency and implement compression where beneficial
- Design APIs to be backward compatible and versioned
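The circuit-breaker practice above can be sketched minimally: open after N consecutive failures, then allow a single probe after a cooldown. Thresholds are illustrative assumptions:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive failures; probe after cooldown."""
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures, self.cooldown_s = max_failures, cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None                 # half-open: let one probe in
            self.failures = self.max_failures - 1  # one more failure re-opens
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
```

Wrapping cross-AZ replication calls in a breaker like this stops a slow dependency from tying up every request thread.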
Common Pitfalls
- Not handling small-object overhead: billions of tiny files hurt performance
- Insufficient monitoring of cross-AZ replication lag during outages
- Poor hot bucket partitioning leading to storage node hotspots
- Not implementing proper backpressure during high load periods
- Inadequate testing of disaster recovery and cross-region failover
Observability
- SLIs: PUT/GET error rate, p95 latency, durability repair backlog
- Audit logs for access; storage utilization dashboards
- Replicator lag and retry metrics