URL Shortener

Designing a TinyURL-like service that handles billions of redirects per month.

Learning Objectives

By the end of this case study, you will be able to:

  • Design high-throughput URL encoding and decoding systems
  • Implement efficient key generation strategies (base62, hash-based, counter-based)
  • Build globally distributed caching for 95%+ cache hit rates
  • Handle massive read-to-write ratios (1000:1) with proper data partitioning
  • Design analytics pipelines for click tracking and URL performance metrics

Real-World Examples

Bitly: Processes 600+ million links per month with 99.9% uptime, used by Nike, Disney, and BBC

TinyURL: One of the first URL shorteners (2002), handles millions of redirects daily with minimal infrastructure

t.co (Twitter): Processes billions of clicks, automatically shortens all URLs for security and analytics

youtu.be (Google): Short domain used for YouTube video sharing links

[Figure: architecture overview showing global routing, caching, app tier, key generation, KV store, and analytics]

Requirements

Functional Requirements

  • URL Shortening: Convert long URLs to unique short codes
  • URL Redirection: Redirect short URLs to original destinations
  • Custom Aliases: Allow users to create custom short codes
  • Expiration: Support time-based URL expiration
  • Analytics: Track clicks, geographic data, referrers
  • User Management: Account creation and URL management
  • Bulk Operations: API for bulk URL shortening
  • Link Preview: Safe preview before redirecting

Non-Functional Requirements

  • Scale: 100M URLs created/month, 10B redirects/month
  • Latency: < 50ms P95 redirect latency globally
  • Availability: 99.99% uptime for redirect service
  • Read/Write Ratio: heavily read-optimized; about 100:1 at the volumes above, with the design sized to tolerate up to 1000:1
  • Storage: 10TB for 5 years of URL data
  • Cache Hit Rate: > 95% for popular URLs
  • Security: Prevent malicious URLs, rate limiting

Capacity Planning & Traffic Analysis

Write Traffic

  • 100M URLs/month ≈ 38.6 URLs/second (30-day month)
  • Peak traffic (10x avg) ≈ 386 URLs/second
  • Average URL size = 200 bytes
  • With metadata = 300 bytes per record

Read Traffic

  • 10B redirects/month = 3.86K redirects/second
  • Peak traffic (5x avg) = 19.3K redirects/second
  • Cache memory (1M hot URLs) = 300MB
  • 95% cache hit rate target

Storage Growth

  • Per year: 100M × 12 × 300B = 360GB
  • 5 years: 1.8TB raw data
  • With replication (3x): 5.4TB
  • Analytics data: ~2x URL data
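The figures in this section can be reproduced with a few lines of arithmetic. The sketch below assumes a 30-day month and decimal units (1 GB = 10^9 bytes); the variable names are illustrative only.

```python
# Back-of-the-envelope capacity numbers, assuming a 30-day month.
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

URLS_PER_MONTH = 100_000_000
REDIRECTS_PER_MONTH = 10_000_000_000
RECORD_BYTES = 300  # 200-byte URL plus metadata

writes_per_sec = URLS_PER_MONTH / SECONDS_PER_MONTH
reads_per_sec = REDIRECTS_PER_MONTH / SECONDS_PER_MONTH
peak_writes = writes_per_sec * 10  # 10x average
peak_reads = reads_per_sec * 5     # 5x average

yearly_bytes = URLS_PER_MONTH * 12 * RECORD_BYTES
five_year_bytes = yearly_bytes * 5
replicated_bytes = five_year_bytes * 3       # 3x replication
hot_cache_bytes = 1_000_000 * RECORD_BYTES   # 1M hot URLs

print(f"writes/s avg={writes_per_sec:.1f} peak={peak_writes:.0f}")
print(f"reads/s  avg={reads_per_sec:.0f} peak={peak_reads:.0f}")
print(f"storage  1y={yearly_bytes / 1e9:.0f} GB, "
      f"5y={five_year_bytes / 1e12:.1f} TB, "
      f"replicated={replicated_bytes / 1e12:.1f} TB")
print(f"hot cache={hot_cache_bytes / 1e6:.0f} MB")
```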

High-Level Design

  • Edge/CDN for geo routing and cache hot redirects
  • App layer for create/redirect, rate limiting, auth
  • Primary KV store: code → long URL (+ TTL, flags)
  • ID generator for base62 codes; collision-safe

Key Components

  • API/App tier behind a WAF and rate limiter
  • Key generator: Snowflake/KSUID IDs, or hash + retry on collision
  • Storage: cache (Redis/Memcached) in front of a persistent KV store (DynamoDB/Cassandra)
  • Analytics pipeline (optional): Kafka → OLAP store (BigQuery/ClickHouse)

Capacity & Sizing

  • Heavier alternative scenario: 10M new URLs/day → ~116 writes/second on average (~1.2k at 10× peak)
  • Reads at 1000× writes → ~116k reads/second on average; cache target >95% hit ratio
  • Average URL 200 bytes; with metadata ~300 bytes/record
  • Storage per year ≈ 10M × 300B × 365 ≈ 1.1 TB (before replication)

Data Model

Codes table + analytics events

  • codes (code PK, url, created_at, expire_at, owner_id, flags, hits)
  • owner_codes (owner_id, code, created_at) — for listing by user
  • events (event_id PK, code, ts, ip, ua, country) — optional analytics stream
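As a rough illustration of the two main record shapes above, the following sketch models them as Python dataclasses. The field types and the `is_expired` helper are assumptions made for the example; the source only lists the column names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CodeRecord:
    """One row of the codes table, keyed by `code`."""
    code: str
    url: str
    created_at: datetime
    expire_at: Optional[datetime] = None  # None means the link never expires
    owner_id: Optional[str] = None        # None for anonymous links
    flags: int = 0                        # assumed bit field (e.g. disabled, preview-only)
    hits: int = 0                         # updated asynchronously, off the redirect path

    def is_expired(self, now: datetime) -> bool:
        # Hypothetical helper: a record is expired once `now` reaches expire_at.
        return self.expire_at is not None and now >= self.expire_at


@dataclass
class ClickEvent:
    """One analytics event emitted per redirect."""
    event_id: str
    code: str
    ts: datetime
    ip: str
    ua: str
    country: str


record = CodeRecord(
    code="AbCd12",
    url="https://example.com/very/long",
    created_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```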

APIs

  • Create shortened URL: POST /api/shorten with body { "url": "https://example.com/very/long", "custom": "mycode" }
  • Redirect: GET /:code
  • Owner list: GET /api/urls?owner=me

Response (create): { "code": "AbCd12", "shortUrl": "https://x.y/AbCd12", "expireAt": null }
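A minimal sketch of the create operation, assuming an in-memory dict in place of the KV store and random 6-character codes in place of a real key generator. The `shorten` function and `AliasTaken` exception are hypothetical names; only the request and response shapes come from the text above.

```python
import secrets
import string
from typing import Dict, Optional

ALPHABET = string.digits + string.ascii_letters  # 62 characters
BASE_URL = "https://x.y"


class AliasTaken(Exception):
    """Raised when a requested custom alias already exists."""


def shorten(store: Dict[str, str], url: str, custom: Optional[str] = None) -> dict:
    if custom is not None:
        # Custom aliases must never silently overwrite an existing code.
        if custom in store:
            raise AliasTaken(custom)
        code = custom
    else:
        # Stand-in generator: draw random codes until an unused one is found.
        while True:
            code = "".join(secrets.choice(ALPHABET) for _ in range(6))
            if code not in store:
                break
    store[code] = url
    return {"code": code, "shortUrl": f"{BASE_URL}/{code}", "expireAt": None}


store: Dict[str, str] = {}
response = shorten(store, "https://example.com/very/long", custom="mycode")
```

In a real deployment the existence check and the write would need to be a single conditional put in the KV store, so that two concurrent requests cannot claim the same alias.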

Hot Path

  1. Client hits https://x.y/AbCd12
  2. Edge cache lookup; on miss, forward to nearest region
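  3. App reads cache, falling back to the DB on a miss; returns 301 (or 302 if browsers must not cache the redirect, so that every click is counted)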
  4. Async increment hit counter; stream event
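The application-side part of these steps can be sketched as a read-through lookup. Dicts stand in for the cache and the database, and a deque stands in for the asynchronous event stream; all three are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


def redirect(
    code: str,
    cache: Dict[str, str],
    db: Dict[str, str],
    events: Deque[dict],
) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, Location) for a short code."""
    url = cache.get(code)
    if url is None:
        url = db.get(code)       # cache miss: fall back to the database
        if url is None:
            return 404, None     # unknown code
        cache[code] = url        # populate the cache for later requests
    # Stand-in for an async publish: the redirect never waits on analytics.
    events.append({"code": code})
    return 301, url


cache: Dict[str, str] = {}
db = {"AbCd12": "https://example.com/very/long"}
events: Deque[dict] = deque()
status, location = redirect("AbCd12", cache, db, events)
```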

Caching & TTL

  • Edge cache 301 responses for active codes (e.g., 10–60 minutes)
  • Invalidate on update/delete; background warmup for top codes
  • Local app cache (LRU) to reduce DB tail latency
  • Conditional requests for previews via ETag
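A local application cache with both LRU eviction and a TTL might look like the sketch below. The `TTLCache` class is a hypothetical illustration built on `collections.OrderedDict`, with an injectable clock so expiry can be tested; it is not thread-safe.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class TTLCache:
    """Small in-process LRU cache whose entries also expire after a TTL."""

    def __init__(self, max_entries: int, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]          # expired: treat as a miss
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key: str, value: str) -> None:
        self._data[key] = (value, self._clock() + self._ttl)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict least recently used

    def invalidate(self, key: str) -> None:
        # Called when a code is updated or deleted.
        self._data.pop(key, None)
```

Calling `invalidate` on update or delete keeps the local copy from serving a stale destination for the rest of its TTL.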

Scaling

  • Partition KV by code prefix; consistent hashing to distribute
  • Replicate multi-region; read local, write-through to home region
  • Asynchronous analytics to decouple read path
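Consistent hashing, mentioned above for distributing keys, can be sketched as a ring of virtual nodes. The `HashRing` class, the node names, and the choice of SHA-256 with 64 virtual nodes per server are all assumptions for this example.

```python
import bisect
import hashlib
from typing import Iterable


class HashRing:
    """Consistent-hash ring that maps short codes to storage nodes."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 64) -> None:
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, code: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(code))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["kv-a", "kv-b", "kv-c"])
owner = ring.node_for("AbCd12")
```

The property that matters for scaling is that removing a node only remaps the keys that node owned; keys on the remaining nodes stay where they are.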

Trade-offs

  • Eventual consistency acceptable for analytics
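  • Cache TTL vs purge on update: longer TTLs raise hit rates but delay updates and deletions unless entries are purged explicitly
  • Collision probability vs code length: shorter codes are easier to share but collide sooner with hash-based generation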

Failure Modes & Mitigations

  • DB outage → serve from cache with stale-if-error window
  • Hot keys → per-key rate limiting and targeted pre-warm
  • Keygen collision → retry with different salt/sequence
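The first mitigation, serving stale cache entries while the database is down, can be sketched as follows. The function name, the `StoreUnavailable` exception, and the default windows (10 minutes fresh, 1 hour stale) are assumptions chosen for illustration.

```python
from typing import Callable, Dict, Optional, Tuple


class StoreUnavailable(Exception):
    """Raised by the database client when the store cannot be reached."""


def lookup_with_stale_fallback(
    code: str,
    cache: Dict[str, Tuple[str, float]],   # code -> (url, stored_at)
    db_lookup: Callable[[str], Optional[str]],
    now: float,
    fresh_ttl: float = 600.0,
    stale_window: float = 3600.0,
) -> Optional[str]:
    entry = cache.get(code)
    if entry is not None and now - entry[1] < fresh_ttl:
        return entry[0]                    # fresh hit
    try:
        url = db_lookup(code)
    except StoreUnavailable:
        # DB is down: serve the expired entry if it is within the stale window.
        if entry is not None and now - entry[1] < fresh_ttl + stale_window:
            return entry[0]
        raise
    if url is not None:
        cache[code] = (url, now)           # refresh the cache
    return url
```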

Observability

  • SLIs: redirect success rate, p95 latency, cache hit ratio
  • Error budgets and alerts for saturation and failures
  • Structured logs for redirects and creation events

URL Encoding Strategies

Base62 Counter

Advantages

  • Sequential, predictable length
  • No collisions by design
  • Compact encoding (6 characters = 62^6 ≈ 56.8 billion combinations)

Disadvantages

  • Single point of failure (counter service)
  • Difficult to scale horizontally
  • URLs are predictable (security concern)
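For reference, base62 encoding of a counter value can be sketched in a few lines. The alphabet ordering (digits, then uppercase, then lowercase) is an assumption; any fixed ordering of the 62 characters works as long as encoder and decoder agree.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
BASE = len(ALPHABET)  # 62


def encode(n: int) -> str:
    """Convert a non-negative counter value to a base62 string."""
    if n < 0:
        raise ValueError("counter must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))  # most significant digit first


def decode(code: str) -> int:
    """Convert a base62 string back to the counter value."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n


print(encode(62))   # 10
print(62 ** 6)      # 56800235584
```

Because consecutive counter values produce consecutive codes, anyone can enumerate links. One common mitigation is to pass the counter through a reversible scrambling step before encoding.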

Hash-Based (MD5/SHA)

Advantages

  • Stateless generation
  • Distributed-friendly
  • Same URL produces same hash

Disadvantages

  • Potential collisions
  • Fixed length (may be longer than needed)
  • Need collision detection logic
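A hash-based generator with collision handling might look like the sketch below. It uses SHA-256 truncated to 7 base62 characters and a numeric salt for retries; the digest choice, code length, retry limit, and function names are all assumptions, and a dict stands in for the KV store.

```python
import hashlib
import string
from typing import Dict

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def hash_code(url: str, salt: int = 0, length: int = 7) -> str:
    """Derive a fixed-length base62 code from the URL and a salt."""
    digest = hashlib.sha256(f"{salt}:{url}".encode("utf-8")).digest()
    n = int.from_bytes(digest[:8], "big")  # 64 bits is ample for 7 characters
    chars = []
    for _ in range(length):
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(chars)


def assign_code(url: str, store: Dict[str, str], max_attempts: int = 5) -> str:
    """Claim a code for `url`, retrying with a new salt on collision."""
    for salt in range(max_attempts):
        code = hash_code(url, salt)
        # setdefault inserts only if the code is unused, else returns the owner.
        existing = store.setdefault(code, url)
        if existing == url:
            return code  # newly claimed, or the same URL was shortened before
    raise RuntimeError("no free code found within the retry limit")
```

Bounding the number of attempts keeps a pathological collision streak from turning into an unbounded loop. With a real KV store, `setdefault` would be replaced by a conditional write.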

Caching Strategy Deep Dive

L1: CDN/Edge Cache

  • TTL: up to 24 hours for the hottest URLs (shorter, e.g. 10–60 minutes, for the rest)
  • Coverage: Top 10% of URLs (80% traffic)
  • Invalidation: API-triggered purge
  • Size: 100K URLs per edge location

L2: Application Cache (Redis)

  • TTL: 6 hours with LRU eviction
  • Coverage: Top 50% of URLs (95% traffic)
  • Size: 10M URLs (~3GB memory)
  • Replication: Redis Cluster with 3 replicas

L3: Database Read Replicas

  • Purpose: Cache misses and analytics queries
  • Replication Lag: < 1 second
  • Read Distribution: Round-robin load balancing
  • Fallback: Primary DB for consistency

Analytics Pipeline Architecture

1. Event Collection

  • Async event publishing to Kafka
  • Click events with: timestamp, IP, user-agent, referrer
  • Batch processing for high throughput
  • Guaranteed delivery with at-least-once semantics

2. Stream Processing

  • Real-time aggregation using Apache Flink/Kafka Streams
  • Geographical IP resolution for location analytics
  • Bot detection and filtering based on patterns
  • Windowed aggregations (1min, 1hr, 1day)
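Windowed aggregation can be illustrated without a streaming framework. The sketch below counts clicks per code in tumbling windows and drops duplicate deliveries by `event_id`, since at-least-once delivery can repeat events. It is plain Python, not Flink or Kafka Streams code, and the event shape (integer epoch-second `ts`) is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def aggregate_clicks(
    events: Iterable[dict], window_seconds: int
) -> Dict[Tuple[str, int], int]:
    """Count clicks per (code, window start) using tumbling windows."""
    counts: Counter = Counter()
    seen_ids = set()
    for event in events:
        if event["event_id"] in seen_ids:
            continue  # duplicate from at-least-once delivery
        seen_ids.add(event["event_id"])
        # Align the timestamp to the start of its window.
        window_start = event["ts"] - event["ts"] % window_seconds
        counts[(event["code"], window_start)] += 1
    return dict(counts)


events = [
    {"event_id": "e1", "code": "AbCd12", "ts": 5},
    {"event_id": "e2", "code": "AbCd12", "ts": 59},
    {"event_id": "e2", "code": "AbCd12", "ts": 59},  # redelivered
    {"event_id": "e3", "code": "AbCd12", "ts": 61},
]
per_minute = aggregate_clicks(events, 60)
```

A production job would bound the deduplication state, for example by keeping IDs only for the duration of the window.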

3. Data Storage

  • ClickHouse for fast analytical queries
  • Partitioning by date for efficient time-range queries
  • Materialized views for common aggregations
  • Data retention: 2 years with compression

Security Considerations

🛡️ Rate Limiting

  • IP-based: 100 requests/minute for creation
  • User-based: 1000 URLs/day for authenticated users
  • Sliding window with Redis counters
  • Progressive penalties for repeat offenders
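The sliding-window idea can be sketched in memory as follows. The text above calls for Redis counters; this example swaps in a per-key deque of timestamps so it runs standalone, and the `SlidingWindowLimiter` class is a hypothetical name.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# 100 creation requests per minute per IP, as in the list above.
limiter = SlidingWindowLimiter(limit=100, window_seconds=60)
allowed = limiter.allow("203.0.113.7", now=0.0)
```

With Redis, the same logic is commonly built on a sorted set per key, with the trim, count, and add executed atomically.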

🔍 URL Validation

  • Malware scanning integration (VirusTotal API)
  • Phishing domain blacklist checking
  • URL format validation and sanitization
  • Recursive shortener detection
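Format validation and recursive-shortener detection can be sketched with the standard library. The host list, the length limit, and the `validate_url` name are assumptions for illustration; malware and phishing checks would call external services and are not shown.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Hypothetical list: our own domain plus other known shorteners.
SHORTENER_HOSTS = {"x.y", "bit.ly", "tinyurl.com", "t.co"}
MAX_URL_LENGTH = 2048  # assumed limit


def validate_url(url: str) -> Tuple[bool, str]:
    """Return (is_valid, reason) for a URL submitted for shortening."""
    if len(url) > MAX_URL_LENGTH:
        return False, "too long"
    try:
        parts = urlsplit(url)
        host = parts.hostname  # lowercased by urllib, None if absent
    except ValueError:
        return False, "malformed"
    if parts.scheme not in ("http", "https"):
        return False, "unsupported scheme"  # blocks javascript:, data:, etc.
    if not host:
        return False, "missing host"
    if host in SHORTENER_HOSTS:
        return False, "recursive shortener"
    return True, "ok"


ok, reason = validate_url("https://example.com/very/long")
```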

🔐 Access Control

  • JWT tokens for authenticated operations
  • API key management for enterprise clients
  • URL ownership validation for modifications
  • HTTPS enforcement for all endpoints

Best Practices

  • Design for cache-first architecture: 95%+ cache hit rate is critical for performance
  • Implement proper URL validation and sanitization to prevent malicious redirects
  • Use CDN with geographic distribution for global low-latency redirects
  • Design analytics as a separate service to avoid impacting redirect performance
  • Implement gradual key expiration and cleanup to manage storage costs

Common Pitfalls

  • Not handling key collisions properly - can lead to data corruption or infinite loops
  • Poor cache warming strategy leading to cache misses during traffic spikes
  • Insufficient URL validation allowing redirect to malicious sites
  • Not implementing proper rate limiting - vulnerable to abuse and DDoS
  • Storing analytics data synchronously - impacts redirect latency significantly