URL Shortener
Designing a TinyURL-like service with billions of redirects per day.
Learning Objectives
By the end of this case study, you will understand how to:
- Design high-throughput URL encoding and decoding systems
- Implement efficient key-generation strategies (base62, hash-based, counter-based)
- Build globally distributed caching for 95%+ cache hit rates
- Handle massive read-to-write ratios (1000:1) with proper data partitioning
- Design analytics pipelines for click tracking and URL performance metrics
Real-World Examples
Bitly: Processes 600+ million links per month with 99.9% uptime, used by Nike, Disney, and BBC
TinyURL: One of the first URL shorteners (2002), handles millions of redirects daily with minimal infrastructure
t.co (Twitter): Processes billions of clicks, automatically shortens all URLs for security and analytics
goo.gl (Google): Google's now-retired shortener; youtu.be powers YouTube video sharing
Global routing, caching, app tier, keygen, KV, and analytics
Requirements
Functional Requirements
- URL Shortening: Convert long URLs to unique short codes
- URL Redirection: Redirect short URLs to original destinations
- Custom Aliases: Allow users to create custom short codes
- Expiration: Support time-based URL expiration
- Analytics: Track clicks, geographic data, referrers
- User Management: Account creation and URL management
- Bulk Operations: API for bulk URL shortening
- Link Preview: Safe preview before redirecting
Non-Functional Requirements
- Scale: 100M URLs created/month, 10B redirects/month
- Latency: < 50ms P95 redirect latency globally
- Availability: 99.99% uptime for redirect service
- Read/Write Ratio: 1000:1 (heavily read-optimized)
- Storage: 10TB for 5 years of URL data
- Cache Hit Rate: > 95% for popular URLs
- Security: Prevent malicious URLs, rate limiting
Capacity Planning & Traffic Analysis
Write Traffic
- 100M URLs/month = 38.5 URLs/second
- Peak traffic (10x avg) = 385 URLs/second
- Average URL size = 200 bytes
- With metadata = 300 bytes per record
Read Traffic
- 10B redirects/month = 3.86K redirects/second
- Peak traffic (5x avg) = 19.3K redirects/second
- Cache memory (1M hot URLs) = 300MB
- 95% cache hit rate target
Storage Growth
- Per year: 100M × 12 × 300B = 360GB
- 5 years: 1.8TB raw data
- With replication (3x): 5.4TB
- Analytics data: ~2x URL data
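The figures above follow from straightforward arithmetic; a quick sanity check (in Python, assuming a 30-day month):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.59M seconds

urls_per_month = 100_000_000
redirects_per_month = 10_000_000_000
bytes_per_record = 300  # URL plus metadata

write_qps = urls_per_month / SECONDS_PER_MONTH            # ~38.6 writes/s
read_qps = redirects_per_month / SECONDS_PER_MONTH        # ~3.86K reads/s
storage_per_year_gb = urls_per_month * 12 * bytes_per_record / 1e9  # 360 GB
```

Peak multipliers (10x writes, 5x reads) are then applied on top of these averages.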
High-Level Design
- Edge/CDN for geo routing and cache hot redirects
- App layer for create/redirect, rate limiting, auth
- Primary KV store: code → long URL (+ TTL, flags)
- ID generator for base62 codes; collision-safe
Key Components
- API/App tier, WAF/Rate limiter
- Key generator (Snowflake/KSUID or hash+retry)
- Cache (Redis/Memcached) and persistent KV (Dynamo/Cassandra)
- Analytics pipeline (Kafka → OLAP)
Data Model
Codes table + analytics events
- codes(code PK, url, created_at, expire_at, owner_id, flags, hits)
- owner_codes(owner_id, code, created_at) — for listing by user
- events(event_id PK, code, ts, ip, ua, country) — optional analytics stream
APIs
- Create shortened URL: POST /api/shorten with body { "url": "https://example.com/very/long", "custom": "mycode" }
- Redirect: GET /:code
- Owner list: GET /api/urls?owner=me
- Response (create): { "code": "AbCd12", "shortUrl": "https://x.y/AbCd12", "expireAt": null }
Hot Path
- Client hits https://x.y/AbCd12
- Edge cache lookup; on miss, forward to nearest region
- App reads cache → DB on miss; returns 301
- Async increment hit counter; stream event
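The hot path above can be sketched as follows; `cache`, `db`, and `events` are hypothetical stand-ins for Redis, the persistent KV store, and the Kafka producer (event publishing would be asynchronous in production, inline here for simplicity):

```python
class RedirectService:
    """Minimal sketch of the redirect hot path: cache -> DB -> 301."""

    def __init__(self, cache, db, events):
        self.cache, self.db, self.events = cache, db, events

    def redirect(self, code):
        url = self.cache.get(code)          # L2 cache lookup
        if url is None:
            url = self.db.get(code)         # fall back to the KV store
            if url is None:
                return 404, None            # unknown code
            self.cache[code] = url          # populate cache for next hit
        self.events.append({"code": code})  # async stream in production
        return 301, url
```

Usage with plain dicts standing in for the real stores:

```python
svc = RedirectService(cache={}, db={"AbCd12": "https://example.com/long"}, events=[])
svc.redirect("AbCd12")  # (301, "https://example.com/long"); cache now warm
```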
Caching & TTL
- Edge cache 301 responses for active codes (e.g., 10–60 minutes)
- Invalidate on update/delete; background warmup for top codes
- Local app cache (LRU) to reduce DB tail latency
- Conditional requests for previews via ETag
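The local app cache can be as simple as an LRU over an ordered dict; a minimal sketch (capacity-bounded, no TTL):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny in-process LRU cache, a sketch of the local app cache above."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict least recently used
```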
Scaling
- Partition KV by code prefix; consistent hashing to distribute
- Replicate multi-region; read local, write-through to home region
- Asynchronous analytics to decouple read path
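Consistent hashing can be sketched as a ring of virtual nodes; `ConsistentHashRing` and its vnode count are illustrative choices, not a specific library API:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of code->node partitioning with virtual nodes, so that
    adding/removing a KV node remaps only a fraction of the key space."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, code):
        idx = bisect.bisect(self._keys, self._hash(code)) % len(self._keys)
        return self._ring[idx][1]
```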
Trade-offs
- Eventual consistency acceptable for analytics
- Cache TTL vs purge on update
- Collision probability vs code length
Failure Modes & Mitigations
- DB outage → serve from cache with stale-if-error window
- Hot keys → per-key rate limiting and targeted pre-warm
- Keygen collision → retry with different salt/sequence
Observability
- SLIs: redirect success rate, p95 latency, cache hit ratio
- Error budgets and alerts for saturation and failures
- Structured logs for redirects and creation events
URL Encoding Strategies
Base62 Counter
Advantages
- Sequential, predictable length
- No collisions by design
- Compact encoding (6 chars = 62^6 ≈ 57 billion combinations)
Disadvantages
- Single point of failure (counter service)
- Difficult to scale horizontally
- URLs are predictable (security concern)
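A minimal base62 encoder/decoder for counter-based codes; the alphabet order is a convention (any fixed 62-character alphabet works):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n):
    """Encode a non-negative counter value as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def base62_decode(s):
    """Inverse of base62_encode."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Any counter value below 62^6 fits in 6 characters, which is where the ~57 billion figure comes from.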
Hash-Based (MD5/SHA)
Advantages
- Stateless generation
- Distributed-friendly
- Same URL produces same hash
Disadvantages
- Potential collisions
- Fixed length (may be longer than needed)
- Need collision detection logic
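A sketch of hash-based generation with collision handling; it truncates a hex SHA-256 digest for simplicity (a real system might re-encode to base62 for shorter codes), and `store` is a hypothetical dict-like KV:

```python
import hashlib

def hash_code(url, store, length=7, max_attempts=5):
    """Derive a short code from the URL's hash. On collision with a
    *different* URL, salt the input and retry; the same URL always
    resolves to the same code (idempotent)."""
    for attempt in range(max_attempts):
        salted = f"{url}#{attempt}" if attempt else url
        code = hashlib.sha256(salted.encode()).hexdigest()[:length]
        existing = store.get(code)
        if existing is None:
            store[code] = url
            return code
        if existing == url:     # same URL -> reuse the existing code
            return code
    raise RuntimeError("could not find a collision-free code")
```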
UUID/Snowflake (Recommended)
Advantages
- Guaranteed uniqueness across nodes
- Timestamp ordering capability
- High throughput generation
Disadvantages
- Slightly longer codes
- Requires node coordination
- More complex implementation
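A Snowflake-style generator can be sketched as follows; the bit layout (41-bit millisecond timestamp, 10-bit node id, 12-bit sequence) matches Twitter's original scheme, while the custom epoch is an assumption. Unique node ids must be assigned out of band, which is the node coordination noted above:

```python
import threading
import time

class SnowflakeLite:
    """Sketch of a Snowflake-style 64-bit ID generator."""

    EPOCH = 1_600_000_000_000  # custom epoch in ms (an assumption)

    def __init__(self, node_id):
        assert 0 <= node_id < 1024          # 10-bit node id
        self.node_id = node_id
        self.last_ms = -1
        self.seq = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF   # 12-bit sequence
                if self.seq == 0:                   # exhausted this ms
                    while now <= self.last_ms:      # spin to next ms
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.node_id << 12) | self.seq
```

The resulting integer can then be base62-encoded to produce the short code.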
Caching Strategy Deep Dive
L1: CDN/Edge Cache
- TTL: 24 hours for hot URLs
- Coverage: Top 10% of URLs (80% traffic)
- Invalidation: API-triggered purge
- Size: 100K URLs per edge location
L2: Application Cache (Redis)
- TTL: 6 hours with LRU eviction
- Coverage: Top 50% of URLs (95% traffic)
- Size: 10M URLs (~3GB memory)
- Replication: Redis Cluster with 3 replicas
L3: Database Read Replicas
- Purpose: Cache misses and analytics queries
- Replication Lag: < 1 second
- Read Distribution: Round-robin load balancing
- Fallback: Primary DB for consistency
Analytics Pipeline Architecture
1. Event Collection
- Async event publishing to Kafka
- Click events with: timestamp, IP, user-agent, referrer
- Batch processing for high throughput
- Guaranteed delivery with at-least-once semantics
2. Stream Processing
- Real-time aggregation using Apache Flink/Kafka Streams
- Geographical IP resolution for location analytics
- Bot detection and filtering based on patterns
- Windowed aggregations (1min, 1hr, 1day)
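The windowed aggregation step can be illustrated with a tumbling window in plain Python; real pipelines would run this in Flink or Kafka Streams, keyed by code and checkpointed:

```python
from collections import Counter

def tumbling_window_counts(events, window_s=60):
    """Sketch of a 1-minute tumbling-window click count per code.
    `events` are (timestamp_seconds, code) pairs."""
    windows = {}
    for ts, code in events:
        bucket = ts - ts % window_s                  # window start time
        windows.setdefault(bucket, Counter())[code] += 1
    return windows
```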
3. Data Storage
- ClickHouse for fast analytical queries
- Partitioning by date for efficient time-range queries
- Materialized views for common aggregations
- Data retention: 2 years with compression
Security Considerations
🛡️ Rate Limiting
- IP-based: 100 requests/minute for creation
- User-based: 1000 URLs/day for authenticated users
- Sliding window with Redis counters
- Progressive penalties for repeat offenders
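A sliding-window limiter can be sketched in-process as below; production would back this with Redis (e.g., a sorted set per key) so counters are shared across app instances:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """In-memory sketch of a sliding-window rate limiter: at most
    `limit` allowed requests per key within any trailing window."""

    def __init__(self, limit, window_s):
        self.limit, self.window_s = limit, window_s
        self._hits = {}  # key -> deque of request timestamps

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        q = self._hits.setdefault(key, deque())
        while q and q[0] <= now - self.window_s:
            q.popleft()                      # drop hits outside the window
        if len(q) >= self.limit:
            return False                     # over the limit: reject
        q.append(now)
        return True
```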
🔍 URL Validation
- Malware scanning integration (VirusTotal API)
- Phishing domain blacklist checking
- URL format validation and sanitization
- Recursive shortener detection
🔐 Access Control
- JWT tokens for authenticated operations
- API key management for enterprise clients
- URL ownership validation for modifications
- HTTPS enforcement for all endpoints
Best Practices
- Design for cache-first architecture: 95%+ cache hit rate is critical for performance
- Implement proper URL validation and sanitization to prevent malicious redirects
- Use CDN with geographic distribution for global low-latency redirects
- Design analytics as a separate service to avoid impacting redirect performance
- Implement gradual key expiration and cleanup to manage storage costs
Common Pitfalls
- Not handling key collisions properly - can lead to data corruption or infinite loops
- Poor cache warming strategy leading to cache misses during traffic spikes
- Insufficient URL validation allowing redirect to malicious sites
- Not implementing proper rate limiting - vulnerable to abuse and DDoS
- Storing analytics data synchronously - impacts redirect latency significantly