Database Sharding

How it works

  1. Pick shard key
  2. Route by key/hash/range
  3. Rebalance as needed
  4. Monitor hotspots and skew
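Steps 1 and 2 can be sketched as a minimal hash-based router. This is illustrative only: the key format and `NUM_SHARDS` are assumptions, and a real router would hold connection pools rather than return an index.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size

def route(shard_key: str) -> int:
    """Map a shard key to a shard index via a stable hash (step 2)."""
    digest = hashlib.sha1(shard_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Using `hashlib` instead of Python's built-in `hash()` matters here: the built-in is salted per process, so it cannot give a stable key-to-shard mapping across restarts.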

Overview

  • Horizontal partitioning of data across nodes to scale writes and storage and to reduce hotspots.
  • Core concerns: shard-key choice, routing, rebalancing, and cross-shard queries.

When to use

  • Dataset no longer fits on a single node or single-node write throughput is the bottleneck.
  • Clear partition key with limited cross-shard operations.

Trade-offs

  • Operational complexity (migrations, rebalancing); joins and transactions become harder.
  • Hot keys and skew; resharding is disruptive and must be planned for.

Patterns

  • Consistent hashing for even distribution and easier scaling.
  • Directory/lookup service for flexible mapping; range vs hash hybrids.
  • Avoid cross-shard transactions; use async aggregation and per-shard materialized views.

Anti-patterns

  • Using user_id when most traffic concentrates on a few IDs (hot key).
  • Cross-shard JOINs in the hot path; schema that requires frequent fan-out.

📐 Quick Diagram


      App ▶ Router(hash/range) ▶ Shard A │ Shard B │ Shard C
                        └▶ Rebalance → move ranges/virtual nodes
      

❓ Interview Q&A (concise)

  • Q: How do you pick a shard key? A: Aim for even distribution, locality for common queries, and stability over time.
  • Q: How do you add shards later? A: Use consistent hashing/virtual nodes; plan for online rebalancing.
  • Q: How do you handle cross-shard queries? A: Precompute, scatter-gather with limits, or keep them off the hot path.
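The scatter-gather answer can be sketched as a top-k merge. The per-shard data here is a stand-in for per-shard query results; the point is that each shard returns at most k rows, bounding the fan-out cost.

```python
import heapq

def scatter_gather_top_k(shards, k: int):
    """Ask each shard for its local top-k, then merge the partial results."""
    partials = [sorted(shard, reverse=True)[:k] for shard in shards]
    return heapq.nlargest(k, (x for part in partials for x in part))

shards = [[5, 1, 9], [7, 3], [8, 2, 6]]  # stand-in per-shard results
print(scatter_gather_top_k(shards, 3))   # -> [9, 8, 7]
```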

🎯 What is Sharding?

Sharding is like dividing a large pizza into smaller, manageable slices: each slice (shard) can be served independently, making the whole system more scalable. Sharding distributes load across machines and improves database performance.
🍕 Shard 1
🍕 Shard 2
🍕 Shard 3

🏗️ Sharding Strategies

📊 Range-Based Sharding

Partition data based on key ranges

  • Example: Users A-M on Shard 1, N-Z on Shard 2
  • Pros: Simple to implement
  • Cons: Uneven distribution, hot spots
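A minimal sketch of range routing matching the A-M / N-Z example above (the boundaries are illustrative; real systems route on full key ranges, not first letters):

```python
import bisect

# Inclusive upper bound of each shard's key range.
RANGE_BOUNDS = ["M", "Z"]  # Shard 0: A-M, Shard 1: N-Z

def route_by_range(username: str) -> int:
    """Find the first range whose upper bound covers the key."""
    return bisect.bisect_left(RANGE_BOUNDS, username[0].upper())

print(route_by_range("alice"))  # -> 0 (A-M)
print(route_by_range("zoe"))    # -> 1 (N-Z)
```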

🔢 Hash-Based Sharding

Use hash function on shard key

  • Example: hash(user_id) % num_shards
  • Pros: Even distribution
  • Cons: Difficult to rebalance
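The rebalancing difficulty is easy to demonstrate: with modulo hashing, changing `num_shards` remaps most keys. This sketch (key format is an assumption) counts how many of 10,000 keys move when growing from 4 to 5 shards; roughly 4 out of 5 do.

```python
import hashlib

def hash_mod(key: str, num_shards: int) -> int:
    """hash(key) % num_shards, using a stable hash."""
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    return h % num_shards

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if hash_mod(k, 4) != hash_mod(k, 5))
print(f"{moved / len(keys):.0%} of keys move when going 4 -> 5 shards")
```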

📚 Directory-Based Sharding

Lookup service maps keys to shards

  • Pros: Flexible, easy to rebalance
  • Cons: Additional complexity, single point of failure
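A directory mapping can be sketched as a plain dict. In practice the mapping lives in a replicated metadata service (e.g. ZooKeeper or etcd) to avoid the single point of failure; the tenant names here are hypothetical.

```python
# Directory: key (here, a tenant) -> shard, maintained out of band.
directory = {"tenant_a": "shard-1", "tenant_b": "shard-2"}

def lookup(tenant: str) -> str:
    """Resolve a tenant to its shard via the directory."""
    return directory[tenant]

# Rebalancing is just a directory update; no hash math to redo.
directory["tenant_a"] = "shard-3"
```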

🌍 Geographic Sharding

Partition by location

  • Pros: Reduces latency
  • Cons: Uneven load distribution

🚧 Challenges with Sharding

  1. Cross-Shard Queries: Complex joins across shards
  2. Rebalancing: Moving data when adding/removing shards
  3. Hot Spots: Uneven distribution of load
  4. Operational Complexity: Multiple databases to manage
  5. Shard Key Selection: Critical decision affecting performance

🛠️ Solutions and Best Practices

🔄 Consistent Hashing

Minimize data movement during rebalancing
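A minimal consistent-hash ring with virtual nodes, as a sketch (node names and the vnode count are illustrative). Keys map to the first ring point at or after their hash; adding a node steals only the key ranges adjacent to its vnodes, so only about 1/N of keys move.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (sketch)."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each node owns many points on the ring for smoother balance.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def route(self, key: str) -> str:
        """Walk clockwise to the next virtual node; wrap at the end."""
        idx = bisect.bisect(self._points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["A", "B", "C"])
before = {k: ring.route(k) for k in map(str, range(1000))}
ring4 = ConsistentHashRing(["A", "B", "C", "D"])
moved = sum(before[k] != ring4.route(k) for k in before)
print(f"{moved} of 1000 keys moved after adding a node")  # roughly 1/4
```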

📊 Shard Key Design

Choose keys that distribute load evenly

🚫 Avoid Cross-Shard Transactions

Design for single-shard operations

📈 Monitoring

Track shard performance and distribution
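One simple metric worth tracking is a skew factor over per-shard key counts (the counts below are made-up example stats): a value near 1 means balanced shards, while a much larger value flags a hot shard.

```python
# Assumed per-shard key counts pulled from monitoring.
counts = {"shard-1": 1200, "shard-2": 950, "shard-3": 4100}

mean = sum(counts.values()) / len(counts)
skew = max(counts.values()) / mean  # >> 1 indicates a hot shard
print(f"skew factor: {skew:.2f}")
```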

⚙️ Automation

Tools for shard management and rebalancing

Alternatives to Sharding

  • Read Replicas: Scale read operations
  • Caching: Reduce database load
  • NoSQL: Databases designed for horizontal scaling
  • Federation: Functional partitioning