Real-time Chat Service

Designing a WhatsApp/Slack style messaging backend.

Learning Objectives

By the end of this case study, you will be able to:

  • Design WebSocket connections and real-time message delivery at scale
  • Implement message ordering and delivery guarantees in distributed systems
  • Build presence systems and typing indicators with efficient state management
  • Design group chat scaling with fanout strategies and message routing
  • Handle offline message delivery and push notification systems

Real-World Examples

WhatsApp: Handles 100+ billion messages daily across 2 billion users with 99.9% uptime

Slack: Processes millions of messages for 20+ million daily active users in team workspaces

Discord: Powers real-time voice and text chat for 150+ million monthly active users

Telegram: Delivers messages to 700+ million users globally, with end-to-end encryption available in secret chats

Architecture at a glance: Gateway + Message Service + Pub/Sub + Presence

Requirements

Functional Requirements

  • User Management: Registration, authentication, profile management
  • Real-time Messaging: Send/receive text messages instantly
  • Group Chats: Create groups, add/remove members, admin controls
  • Message Status: Sent, delivered, read receipts
  • Media Sharing: Images, videos, documents, voice messages
  • User Presence: Online, offline, last seen status
  • Message History: Persistent storage and retrieval
  • Notifications: Push notifications for offline users
  • Typing Indicators: Real-time typing status in conversations

Non-Functional Requirements

  • Scale: 500M users, 100M DAU, 40B messages/day
  • Latency: < 100ms message delivery in same region
  • Availability: 99.9% uptime (8.7 hours downtime/year)
  • Consistency: Eventually consistent for message ordering
  • Storage: 1PB+ message storage, 10PB+ media storage
  • Bandwidth: 10GB/s peak message throughput
  • Security: End-to-end encryption, user privacy

Capacity Planning & Calculations

Message Volume

  • 40B messages/day = 463K messages/second
  • Peak traffic (3x avg) = 1.4M messages/second
  • Average message size = 100 bytes
  • Storage per day = 40B × 100 bytes = 4TB/day

Connection Management

  • 100M DAU, 30% concurrent = 30M connections
  • WebSocket overhead = 2KB per connection
  • Total memory for connections = 60GB
  • Connections per server = 10K (memory efficient)

Storage Requirements

  • Messages: 4TB/day × 365 days = 1.5PB/year
  • Media: 10x text volume = 15PB/year
  • Indexes: 20% of message data = 300TB/year
  • Replication factor: 3x for durability
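The figures above can be sanity-checked with a short back-of-envelope script; the constants mirror the stated assumptions (40B messages/day, 100-byte average message, 3x peak factor, 30% concurrency, 2KB per connection):

```python
# Back-of-envelope capacity check for the numbers in this section.
MESSAGES_PER_DAY = 40e9
AVG_MSG_BYTES = 100
PEAK_FACTOR = 3

msgs_per_sec = MESSAGES_PER_DAY / 86_400               # ~463K msg/s average
peak_msgs_per_sec = msgs_per_sec * PEAK_FACTOR         # ~1.4M msg/s peak
storage_per_day_tb = MESSAGES_PER_DAY * AVG_MSG_BYTES / 1e12  # 4 TB/day

DAU = 100e6
CONCURRENCY = 0.30
CONN_OVERHEAD_KB = 2
concurrent_conns = DAU * CONCURRENCY                   # 30M connections
conn_memory_gb = concurrent_conns * CONN_OVERHEAD_KB / 1e6    # 60 GB total

print(f"{msgs_per_sec:,.0f} msg/s avg, {peak_msgs_per_sec:,.0f} msg/s peak")
print(f"{storage_per_day_tb:.1f} TB/day, {conn_memory_gb:.0f} GB connection memory")
```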

High-Level Design

  • Gateway: WebSocket/HTTP long-poll for real-time connections
  • Message service: persist messages; sequence per convo
  • Pub/Sub bus (Kafka/Pulsar) for fanout and notifications
  • Presence service; state in Redis with TTL

Capacity & Sizing

  • Concurrent sessions: target millions globally; shard gateways by session hash
  • Message rate: peak TPS, per-conversation ordering constraints
  • Storage: retention policy (e.g., 6–12 months), media offloaded to object storage

Key Components

  • Gateway (WS/HTTP), Auth, Session resumption
  • Message Service (idempotent writes, sequencing)
  • Pub/Sub (topics per conversation/user inbox)
  • Presence service with TTL-based keys
  • Notification fanout (push/email/SMS) for offline users

Service-Level Targets

  • Availability: 99.99% for message send/receive paths
  • Latency: P95 deliver < 150ms intra-region
  • Durability: messages stored exactly once, via at-least-once delivery plus idempotent writes
  • Scale: millions of concurrent sessions

Storage

  • Messages: partition by conversation, order by sequence
  • Index by user for inbox/unread counts
  • Object storage for media; signed URLs

Data Model: conversations, messages, receipts, and presence

  • conversations (convo_id PK, type, created_at)
  • participants (convo_id, user_id, role)
  • messages (convo_id, seq, sender_id, ts, body, media_url)
  • receipts (convo_id, seq, user_id, delivered_at, seen_at)
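The tables above can be sketched as plain records; this is an illustrative sketch only (field types and the "direct"/"group" and role values are assumptions, not part of the schema above), with comments marking the partitioning keys:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Conversation:
    convo_id: str              # primary key
    type: str                  # assumed values: "direct" | "group"
    created_at: float

@dataclass
class Participant:
    convo_id: str              # composite key (convo_id, user_id)
    user_id: str
    role: str                  # assumed values: "member" | "admin"

@dataclass
class Message:
    convo_id: str              # partition key: a conversation's messages colocate
    seq: int                   # clustering key: per-conversation order
    sender_id: str
    ts: float
    body: str
    media_url: Optional[str] = None   # media itself lives in object storage

@dataclass
class Receipt:
    convo_id: str              # keyed by (convo_id, seq, user_id)
    seq: int
    user_id: str
    delivered_at: Optional[float] = None
    seen_at: Optional[float] = None
```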

APIs

  • REST:
      • POST /api/conversations
      • POST /api/conversations/:id/messages
      • GET /api/conversations/:id/messages?after=seq
  • WebSocket events:
      • send_message { convoId, clientSeq, body }
      • message_ack { convoId, clientSeq, serverSeq }
      • presence_update { userId, status }
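The send_message/message_ack pair can be exercised as plain JSON frames; a minimal sketch of the round trip, with field names taken from the event list above (`ack_for` is a hypothetical server-side helper):

```python
import json

# Client-side frame for the send_message event listed above.
send_frame = json.dumps({
    "event": "send_message",
    "data": {"convoId": "c42", "clientSeq": 7, "body": "hello"},
})

def ack_for(frame: str, server_seq: int) -> str:
    """Build the message_ack frame the server returns once the write is
    durable; serverSeq is the authoritative per-conversation sequence."""
    msg = json.loads(frame)
    return json.dumps({
        "event": "message_ack",
        "data": {
            "convoId": msg["data"]["convoId"],
            "clientSeq": msg["data"]["clientSeq"],
            "serverSeq": server_seq,
        },
    })
```

Echoing clientSeq in the ack lets the client correlate the ack with its pending send and release the retry timer for that message.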

Hot Path

  1. Client → Gateway (auth session, assign convo sequence)
  2. Message Service write (idempotent by clientSeq), publish to bus
  3. Consumers push to online recipients; enqueue notifications for offline
  4. Ack round-trip, update receipts
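Step 2 above (idempotent write keyed by clientSeq, per-conversation sequencing, publish to the bus) can be sketched in memory; the bus is a stand-in list rather than Kafka/Pulsar, and the class name is illustrative:

```python
import itertools
from collections import defaultdict

class MessageService:
    """Sketch of the hot-path write: dedup by (sender, convo, clientSeq),
    assign a per-conversation sequence, publish once to the bus."""

    def __init__(self):
        self._seq = defaultdict(itertools.count)   # one counter per conversation
        self._seen = {}                            # idempotency key -> serverSeq
        self.bus = []                              # stand-in for a Kafka/Pulsar topic

    def write(self, convo_id, sender_id, client_seq, body):
        key = (sender_id, convo_id, client_seq)
        if key in self._seen:                      # retry: re-ack, don't re-publish
            return self._seen[key]
        server_seq = next(self._seq[convo_id])
        self._seen[key] = server_seq
        self.bus.append({"convoId": convo_id, "seq": server_seq,
                         "sender": sender_id, "body": body})
        return server_seq
```

Because a retry reuses the same clientSeq, a duplicate send returns the original serverSeq without producing a second bus event, which is what makes at-least-once delivery safe downstream.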

Scaling

  • Shard gateways by consistent hashing of session IDs
  • Partition topics by conversation; compact logs for retention
  • Backpressure with bounded queues and drop policies for presence
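The first bullet, sharding gateways by consistent hashing of session IDs, can be sketched as a hash ring with virtual nodes so that adding or removing a gateway only remaps a small fraction of sessions (the gateway names are placeholders):

```python
import bisect
import hashlib

class GatewayRing:
    """Consistent-hash ring mapping session IDs to gateway shards."""

    def __init__(self, gateways, vnodes=100):
        # Each gateway appears `vnodes` times on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{gw}#{i}"), gw)
            for gw in gateways for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def gateway_for(self, session_id: str) -> str:
        # First ring position clockwise of the session's hash owns it.
        i = bisect.bisect(self._keys, self._hash(session_id)) % len(self._keys)
        return self._ring[i][1]
```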

Caching & TTL

  • Short TTL presence entries (e.g., 30–120s) in Redis
  • Cache recent message indices per conversation for fast pagination
  • Edge cache media via signed URLs with appropriate TTL
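The short-TTL presence entries above behave like Redis keys written with an expiry: each heartbeat refreshes the TTL, and an expired key means offline. A minimal in-memory stand-in (the `now` parameter exists only to make the sketch testable):

```python
import time

class PresenceStore:
    """In-memory stand-in for TTL-based presence keys in Redis."""

    def __init__(self, ttl_seconds=60):
        self._ttl = ttl_seconds
        self._expires = {}          # user_id -> expiry timestamp

    def heartbeat(self, user_id, now=None):
        # Each heartbeat pushes the expiry forward by one TTL.
        now = time.time() if now is None else now
        self._expires[user_id] = now + self._ttl

    def is_online(self, user_id, now=None):
        now = time.time() if now is None else now
        exp = self._expires.get(user_id)
        if exp is None or exp <= now:
            self._expires.pop(user_id, None)   # lazy eviction on read
            return False
        return True
```

Letting entries expire rather than requiring an explicit "went offline" write is what prevents "ghost" online users when a client disconnects uncleanly.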

Trade-offs

  • Ordering per conversation vs global ordering
  • At-least-once delivery with idempotent writes
  • Backpressure on gateway during spikes

Failure Modes & Mitigations

  • Gateway crash → client reconnect with session resumption tokens
  • Consumer lag → scale consumers, enable catch-up reads
  • Out-of-order delivery → per-conversation sequencing + de-dup

Observability

  • SLIs: send→ack latency, drop rate, WS reconnect rate
  • Tracing from gateway to message persistence to fanout
  • Dashboards: lag, partitions, consumer health

Implementation Notes

  • Use WebSocket multiplexing to handle multiple conversations per connection
  • Implement message sequencing with conversation-specific counters or timestamps
  • Design heartbeat mechanisms to detect connection drops and handle reconnection
  • Use message deduplication with idempotency keys to prevent duplicate sends
  • Implement exponential backoff for message retry logic and delivery failures
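The last note, exponential backoff for retries, is commonly paired with full jitter so that clients retrying after a shared failure don't stampede in sync; a small sketch with assumed base and cap values:

```python
import random

def backoff_schedule(attempt, base=0.25, cap=30.0, rng=random.random):
    """Delay in seconds before retry `attempt` (0-based): exponential
    growth with full jitter, capped so long outages don't yield huge waits."""
    return rng() * min(cap, base * (2 ** attempt))
```

With the defaults, the maximum delay doubles each attempt (0.25s, 0.5s, 1s, ...) until it hits the 30-second cap, and the actual delay is drawn uniformly below that maximum.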

Best Practices

  • Design for eventual consistency: prioritize availability over strict ordering across groups
  • Implement proper connection pooling and load balancing for WebSocket gateways
  • Use pub/sub patterns for efficient message fanout to multiple recipients
  • Design offline message storage with appropriate retention policies
  • Implement comprehensive monitoring for message delivery SLAs and connection health

Common Pitfalls

  • Not handling WebSocket connection drops gracefully - leads to message loss
  • Poor message ordering implementation causing out-of-sequence delivery
  • Insufficient rate limiting allowing spam and DoS attacks
  • Not implementing proper presence management leading to "ghost" online users
  • Synchronous processing of group message fanout causing latency spikes