Real-time Chat Service
Designing a WhatsApp/Slack style messaging backend.
Learning Objectives
By the end of this case study, you will understand how to:
- Design WebSocket connections and real-time message delivery at scale
- Implement message ordering and delivery guarantees in distributed systems
- Build presence systems and typing indicators with efficient state management
- Design group chat scaling with fanout strategies and message routing
- Handle offline message delivery and push notification systems
Real-World Examples
WhatsApp: Handles 100+ billion messages daily across 2 billion users with 99.9% uptime
Slack: Processes millions of messages for 20+ million daily active users in team workspaces
Discord: Powers real-time voice and text chat for 150+ million monthly active users
Telegram: Delivers encrypted messages to 700+ million users globally
Core architecture: Gateway + Message Service + Pub/Sub + Presence
Requirements
Functional Requirements
- User Management: Registration, authentication, profile management
- Real-time Messaging: Send/receive text messages instantly
- Group Chats: Create groups, add/remove members, admin controls
- Message Status: Sent, delivered, read receipts
- Media Sharing: Images, videos, documents, voice messages
- User Presence: Online, offline, last seen status
- Message History: Persistent storage and retrieval
- Notifications: Push notifications for offline users
- Typing Indicators: Real-time typing status in conversations
Non-Functional Requirements
- Scale: 500M users, 100M DAU, 40B messages/day
- Latency: < 100ms message delivery in same region
- Availability: 99.9% uptime (8.7 hours downtime/year)
- Consistency: Strict ordering within a conversation; eventual consistency across conversations
- Storage: 1PB+ message storage, 10PB+ media storage
- Bandwidth: 10GB/s peak message throughput
- Security: End-to-end encryption, user privacy
Capacity Planning & Calculations
Message Volume
- 40B messages/day = 463K messages/second
- Peak traffic (3x avg) = 1.4M messages/second
- Average message size = 100 bytes
- Storage per day = 40B × 100 bytes = 4TB/day
Connection Management
- 100M DAU, 30% concurrent = 30M connections
- WebSocket overhead = 2KB per connection
- Total memory for connections = 60GB
- Connections per server = 10K (memory efficient)
Storage Requirements
- Messages: 4TB/day × 365 days = 1.5PB/year
- Media: 10x text volume = 15PB/year
- Indexes: 20% of message data = 300TB/year
- Replication factor: 3x for durability
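The estimates above are easy to sanity-check in code. A quick back-of-the-envelope script, with all constants taken from the figures in this section ("2KB" is treated as 2,000 bytes to match the 60GB total):

```python
MESSAGES_PER_DAY = 40e9
AVG_MSG_BYTES = 100
DAU = 100e6
CONCURRENT_RATIO = 0.30
CONN_OVERHEAD_BYTES = 2_000      # "2KB" read as 2,000 bytes, matching the 60GB figure
CONNS_PER_SERVER = 10_000

msgs_per_sec = MESSAGES_PER_DAY / 86_400                       # ~463K/s average
peak_msgs_per_sec = 3 * msgs_per_sec                           # ~1.4M/s at 3x peak
storage_per_day_tb = MESSAGES_PER_DAY * AVG_MSG_BYTES / 1e12   # 4 TB/day
yearly_storage_pb = storage_per_day_tb * 365 / 1000            # ~1.5 PB/year

concurrent_conns = DAU * CONCURRENT_RATIO                      # 30M connections
conn_memory_gb = concurrent_conns * CONN_OVERHEAD_BYTES / 1e9  # 60 GB
gateway_servers = concurrent_conns / CONNS_PER_SERVER          # 3,000 gateway servers

print(f"{msgs_per_sec:,.0f} msg/s avg, {peak_msgs_per_sec:,.0f} msg/s peak")
print(f"{storage_per_day_tb:.0f} TB/day, {yearly_storage_pb:.2f} PB/year")
print(f"{conn_memory_gb:.0f} GB connection memory, {gateway_servers:,.0f} servers")
```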
High-Level Design
- Gateway: WebSocket/HTTP long-poll for real-time connections
- Message service: persists messages; assigns a per-conversation sequence
- Pub/Sub bus (Kafka/Pulsar) for fanout and notifications
- Presence service; state in Redis with TTL
Capacity & Sizing
- Concurrent sessions: target millions globally; shard gateways by session hash
- Message rate: peak TPS, per-conversation ordering constraints
- Storage: retention policy (e.g., 6–12 months), media offloaded to object storage
Key Components
- Gateway (WS/HTTP), Auth, Session resumption
- Message Service (idempotent writes, sequencing)
- Pub/Sub (topics per conversation/user inbox)
- Presence service with TTL-based keys
- Notification fanout (push/email/SMS) for offline users
Service-Level Targets
- Availability: 99.99% for message send/receive paths
- Latency: P95 delivery < 150ms intra-region
- Durability: at-least-once delivery with idempotent writes (no lost or duplicated messages)
- Scale: millions of concurrent sessions
Storage
- Messages: partition by conversation, order by sequence
- Index by user for inbox/unread counts
- Object storage for media; signed URLs
Conversations, messages, receipts, and presence
- conversations (convo_id PK, type, created_at)
- participants (convo_id, user_id, role)
- messages (convo_id, seq, sender_id, ts, body, media_url)
- receipts (convo_id, seq, user_id, delivered_at, seen_at)
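The tables above can be sketched as DDL. A minimal runnable sqlite version (column types, constraints, and the `?after=seq` pagination query are our assumptions; table and column names follow the data model):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversations (
  convo_id   INTEGER PRIMARY KEY,
  type       TEXT NOT NULL,          -- e.g. 'direct' or 'group' (assumed values)
  created_at TEXT NOT NULL
);
CREATE TABLE participants (
  convo_id INTEGER NOT NULL REFERENCES conversations(convo_id),
  user_id  INTEGER NOT NULL,
  role     TEXT NOT NULL,
  PRIMARY KEY (convo_id, user_id)
);
CREATE TABLE messages (
  convo_id  INTEGER NOT NULL REFERENCES conversations(convo_id),
  seq       INTEGER NOT NULL,        -- per-conversation sequence number
  sender_id INTEGER NOT NULL,
  ts        TEXT NOT NULL,
  body      TEXT,
  media_url TEXT,
  PRIMARY KEY (convo_id, seq)        -- clusters a conversation's messages together
);
CREATE TABLE receipts (
  convo_id     INTEGER NOT NULL,
  seq          INTEGER NOT NULL,
  user_id      INTEGER NOT NULL,
  delivered_at TEXT,
  seen_at      TEXT,
  PRIMARY KEY (convo_id, seq, user_id)
);
""")

# Paginate a conversation after a known sequence number (the ?after=seq pattern).
conn.execute("INSERT INTO conversations VALUES (1, 'direct', '2024-01-01')")
conn.executemany(
    "INSERT INTO messages (convo_id, seq, sender_id, ts, body, media_url) "
    "VALUES (1, ?, 7, '2024-01-01', ?, NULL)",
    [(1, "hi"), (2, "hello"), (3, "bye")],
)
rows = conn.execute(
    "SELECT seq, body FROM messages WHERE convo_id = 1 AND seq > ? ORDER BY seq",
    (1,),
).fetchall()
print(rows)  # [(2, 'hello'), (3, 'bye')]
```

The composite `(convo_id, seq)` primary key makes the pagination query a range scan over one conversation.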
APIs
- REST:
  - POST /api/conversations
  - POST /api/conversations/:id/messages
  - GET /api/conversations/:id/messages?after=seq
- WebSocket events:
  - send_message { convoId, clientSeq, body }
  - message_ack { convoId, clientSeq, serverSeq }
  - presence_update { userId, status }
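The send_message/message_ack exchange can be sketched as an idempotent handler: the server assigns the per-conversation serverSeq and dedups client retries by (convoId, senderId, clientSeq). The class and field names here are illustrative, not part of the API:

```python
class MessageService:
    """Sketch: assigns a per-conversation serverSeq; dedups retries by clientSeq."""
    def __init__(self):
        self.next_seq = {}   # convoId -> next server sequence number
        self.seen = {}       # (convoId, senderId, clientSeq) -> serverSeq
        self.log = []        # stand-in for durable message storage

    def send_message(self, convo_id, sender_id, client_seq, body):
        key = (convo_id, sender_id, client_seq)
        if key in self.seen:                  # retry of an already-applied send:
            server_seq = self.seen[key]       # re-ack without writing a duplicate
        else:
            server_seq = self.next_seq.get(convo_id, 1)
            self.next_seq[convo_id] = server_seq + 1
            self.seen[key] = server_seq
            self.log.append((convo_id, server_seq, sender_id, body))
        return {"convoId": convo_id, "clientSeq": client_seq, "serverSeq": server_seq}

svc = MessageService()
ack1 = svc.send_message("c1", "alice", 1, "hi")
ack2 = svc.send_message("c1", "alice", 1, "hi")  # client retry after a lost ack
assert ack1 == ack2 and len(svc.log) == 1        # exactly one persisted copy
```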
Hot Path
- Client → Gateway (auth session, assign convo sequence)
- Message Service write (idempotent by clientSeq), publish to bus
- Consumers push to online recipients; enqueue notifications for offline
- Ack round-trip, update receipts
Scaling
- Shard gateways by consistent hashing of session IDs
- Partition topics by conversation; compact logs for retention
- Backpressure with bounded queues and drop policies for presence
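Consistent hashing of session IDs can be sketched with a ring of virtual nodes per gateway; draining a gateway then remaps only its own sessions, while everyone else stays connected to the same shard. Gateway names and the vnode count below are illustrative:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class GatewayRing:
    """Consistent-hash ring mapping session IDs to gateway shards."""
    def __init__(self, gateways, vnodes=100):
        # Each gateway owns `vnodes` points on the ring to smooth the distribution.
        self._ring = sorted(
            (_hash(f"{gw}#{i}"), gw) for gw in gateways for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def gateway_for(self, session_id: str) -> str:
        # First ring point clockwise of the session's hash (wrapping at the end).
        idx = bisect.bisect(self._keys, _hash(session_id)) % len(self._ring)
        return self._ring[idx][1]

ring = GatewayRing(["gw-1", "gw-2", "gw-3"])
before = {s: ring.gateway_for(s) for s in (f"sess-{i}" for i in range(1000))}

smaller = GatewayRing(["gw-1", "gw-2"])           # gw-3 is drained
moved = sum(1 for s, gw in before.items()
            if gw != "gw-3" and smaller.gateway_for(s) != gw)
print(moved)  # 0: only gw-3's own sessions are remapped
```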
Caching & TTL
- Short TTL presence entries (e.g., 30–120s) in Redis
- Cache recent message indices per conversation for fast pagination
- Edge cache media via signed URLs with appropriate TTL
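TTL-based presence can be sketched with an in-memory stand-in for Redis `SET key ... EX ttl`: a heartbeat refreshes the expiry, and a user with no heartbeat inside the TTL window reads as offline. The injectable clock here is only for deterministic testing:

```python
import time

class PresenceStore:
    """Redis-like presence keys with TTL; a heartbeat refreshes the expiry."""
    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry = {}                  # user_id -> expiry timestamp

    def heartbeat(self, user_id):
        self._expiry[user_id] = self._clock() + self._ttl   # like SET ... EX ttl

    def is_online(self, user_id) -> bool:
        exp = self._expiry.get(user_id)
        return exp is not None and exp > self._clock()

# Deterministic fake clock to show expiry without sleeping.
now = [0.0]
store = PresenceStore(ttl_seconds=60, clock=lambda: now[0])
store.heartbeat("alice")
assert store.is_online("alice")
now[0] = 61.0                              # no heartbeat for longer than the TTL
assert not store.is_online("alice")        # "ghost" online entry expires on its own
```

Letting the key expire, rather than requiring an explicit "went offline" write, is what prevents ghost users after a crashed gateway.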
Trade-offs
- Ordering per conversation vs global ordering
- At-least-once delivery with idempotent writes
- Backpressure on gateway during spikes
Failure Modes & Mitigations
- Gateway crash → client reconnect with session resumption tokens
- Consumer lag → scale consumers, enable catch-up reads
- Out-of-order delivery → per-conversation sequencing + de-dup
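Per-conversation sequencing plus de-dup on the receive side reduces to a small reorder buffer: deliver in serverSeq order, hold gaps, and drop at-least-once redeliveries. The names below are ours:

```python
class InOrderReceiver:
    """Delivers messages in server-sequence order; buffers gaps, drops duplicates."""
    def __init__(self):
        self.next_seq = 1
        self.pending = {}       # seq -> body, held until the gap before it fills
        self.delivered = []

    def on_message(self, seq, body):
        if seq < self.next_seq or seq in self.pending:
            return              # duplicate from an at-least-once redelivery
        self.pending[seq] = body
        while self.next_seq in self.pending:       # drain contiguous run
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

rx = InOrderReceiver()
for seq, body in [(1, "a"), (3, "c"), (3, "c"), (2, "b")]:  # out of order + dup
    rx.on_message(seq, body)
print(rx.delivered)  # ['a', 'b', 'c']
```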
Observability
- SLIs: send→ack latency, drop rate, WS reconnect rate
- Tracing from gateway to message persistence to fanout
- Dashboards: lag, partitions, consumer health
Implementation Notes
- Use WebSocket multiplexing to handle multiple conversations per connection
- Implement message sequencing with conversation-specific counters or timestamps
- Design heartbeat mechanisms to detect connection drops and handle reconnection
- Use message deduplication with idempotency keys to prevent duplicate sends
- Implement exponential backoff for message retry logic and delivery failures
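Exponential backoff for retries might look like the sketch below, using full jitter so reconnecting clients do not retry in lockstep. The `flaky_send` stub and the parameter defaults are illustrative:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter for send retries."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))   # 0.5, 1, 2, 4, 8, ... capped
        yield rng() * ceiling                       # full jitter avoids retry storms

def send_with_retry(send, message, sleep=lambda s: None):
    last_exc = None
    for delay in backoff_delays():
        try:
            return send(message)
        except ConnectionError as exc:              # transient delivery failure
            last_exc = exc
            sleep(delay)
    raise last_exc

attempts = []
def flaky_send(msg):
    attempts.append(msg)
    if len(attempts) < 3:
        raise ConnectionError("gateway unavailable")
    return "ack"

print(send_with_retry(flaky_send, "hello"))  # 'ack' on the third attempt
```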
Best Practices
- Design for eventual consistency: prioritize availability over strict ordering across groups
- Implement proper connection pooling and load balancing for WebSocket gateways
- Use pub/sub patterns for efficient message fanout to multiple recipients
- Design offline message storage with appropriate retention policies
- Implement comprehensive monitoring for message delivery SLAs and connection health
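The pub/sub fanout pattern above reduces to: one publish per message, one delivery per subscribed recipient. An in-process sketch (the topic naming and callback shape are assumptions; a real system would use Kafka/Pulsar topics as described earlier):

```python
from collections import defaultdict

class Bus:
    """Minimal in-process pub/sub: one topic per conversation, fanout on publish."""
    def __init__(self):
        self._subs = defaultdict(list)   # topic -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, event):
        for cb in self._subs[topic]:     # one write, N deliveries
            cb(event)

bus = Bus()
inboxes = {"bob": [], "carol": []}
for user in inboxes:
    # Each online recipient's gateway subscribes to the conversation topic.
    bus.subscribe("convo:42", lambda ev, u=user: inboxes[u].append(ev))

bus.publish("convo:42", {"from": "alice", "body": "hi all"})
print(inboxes)  # both online recipients got one copy
```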
Common Pitfalls
- Not handling WebSocket connection drops gracefully - leads to message loss
- Poor message ordering implementation causing out-of-sequence delivery
- Insufficient rate limiting allowing spam and DoS attacks
- Not implementing proper presence management leading to "ghost" online users
- Synchronous processing of group message fanout causing latency spikes
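The rate-limiting pitfall above is commonly addressed with a per-user token bucket: a burst budget plus a steady refill rate. A sketch with an injected clock so the refill is deterministic (rate and burst values are illustrative):

```python
class TokenBucket:
    """Per-user token bucket: refuses sends once the burst budget is spent."""
    def __init__(self, rate=5.0, burst=10, clock=None):
        self.rate, self.burst = rate, burst     # tokens/second, max bucket size
        self.tokens = float(burst)
        self.clock = clock or (lambda: 0.0)
        self.last = self.clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

now = [0.0]
bucket = TokenBucket(rate=5.0, burst=10, clock=lambda: now[0])
burst_allowed = sum(bucket.allow() for _ in range(20))   # spam burst at t=0
now[0] = 1.0                                             # one second later: 5 tokens refill
refilled = sum(bucket.allow() for _ in range(20))
print(burst_allowed, refilled)  # 10 5
```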