Real-time Chat Service
Designing a WhatsApp/Slack style messaging backend.
Learning Objectives
By the end of this case study, you will understand how to:
- Design WebSocket connections and real-time message delivery at scale
- Implement message ordering and delivery guarantees in distributed systems
- Build presence systems and typing indicators with efficient state management
- Design group chat scaling with fanout strategies and message routing
- Handle offline message delivery and push notification systems
Real-World Examples
WhatsApp: Handles 100+ billion messages daily across 2 billion users with 99.9% uptime
Slack: Processes millions of messages for 20+ million daily active users in team workspaces
Discord: Powers real-time voice and text chat for 150+ million monthly active users
Telegram: Delivers encrypted messages to 700+ million users globally
Core architecture: Gateway + Message Service + Pub/Sub + Presence
Requirements
Functional Requirements
- User Management: Registration, authentication, profile management
- Real-time Messaging: Send/receive text messages instantly
- Group Chats: Create groups, add/remove members, admin controls
- Message Status: Sent, delivered, read receipts
- Media Sharing: Images, videos, documents, voice messages
- User Presence: Online, offline, last seen status
- Message History: Persistent storage and retrieval
- Notifications: Push notifications for offline users
- Typing Indicators: Real-time typing status in conversations
Non-Functional Requirements
- Scale: 500M users, 100M DAU, 40B messages/day
- Latency: < 100ms message delivery in same region
- Availability: 99.9% uptime (8.7 hours downtime/year)
- Consistency: Strict ordering within a conversation; eventual consistency across conversations
- Storage: 1PB+ message storage, 10PB+ media storage
- Bandwidth: 10GB/s peak message throughput
- Security: End-to-end encryption, user privacy
Capacity Planning & Calculations
Message Volume
- 40B messages/day = 463K messages/second
- Peak traffic (3x avg) = 1.4M messages/second
- Average message size = 100 bytes
- Storage per day = 40B × 100 bytes = 4TB/day
Connection Management
- 100M DAU, 30% concurrent = 30M connections
- WebSocket overhead = 2KB per connection
- Total memory for connections = 60GB
- Connections per server = 10K (memory efficient)
Storage Requirements
- Messages: 4TB/day × 365 days = 1.5PB/year
- Media: 10x text volume = 15PB/year
- Indexes: 20% of message data = 300TB/year
- Replication factor: 3x for durability
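The estimates above are easy to sanity-check in code. A quick back-of-the-envelope script, with all constants taken from the figures in this section ("2KB" is treated as 2,000 bytes to match the 60GB total):

```python
MESSAGES_PER_DAY = 40e9
AVG_MSG_BYTES = 100
DAU = 100e6
CONCURRENT_RATIO = 0.30
CONN_OVERHEAD_BYTES = 2_000      # "2KB" read as 2,000 bytes, matching the 60GB figure
CONNS_PER_SERVER = 10_000

msgs_per_sec = MESSAGES_PER_DAY / 86_400                       # ~463K/s average
peak_msgs_per_sec = 3 * msgs_per_sec                           # ~1.4M/s at 3x peak
storage_per_day_tb = MESSAGES_PER_DAY * AVG_MSG_BYTES / 1e12   # 4 TB/day
yearly_storage_pb = storage_per_day_tb * 365 / 1000            # ~1.5 PB/year

concurrent_conns = DAU * CONCURRENT_RATIO                      # 30M connections
conn_memory_gb = concurrent_conns * CONN_OVERHEAD_BYTES / 1e9  # 60 GB
gateway_servers = concurrent_conns / CONNS_PER_SERVER          # 3,000 gateway servers

print(f"{msgs_per_sec:,.0f} msg/s avg, {peak_msgs_per_sec:,.0f} msg/s peak")
print(f"{storage_per_day_tb:.0f} TB/day, {yearly_storage_pb:.2f} PB/year")
print(f"{conn_memory_gb:.0f} GB connection memory, {gateway_servers:,.0f} servers")
```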
High-Level Design
- Gateway: WebSocket/HTTP long-poll for real-time connections
- Message service: persists messages; assigns a per-conversation sequence
- Pub/Sub bus (Kafka/Pulsar) for fanout and notifications
- Presence service; state in Redis with TTL
Capacity & Sizing
- Concurrent sessions: target millions globally; shard gateways by session hash
- Message rate: peak TPS, per-conversation ordering constraints
- Storage: retention policy (e.g., 6–12 months), media offloaded to object storage
Key Components
- Gateway (WS/HTTP), Auth, Session resumption
- Message Service (idempotent writes, sequencing)
- Pub/Sub (topics per conversation/user inbox)
- Presence service with TTL-based keys
- Notification fanout (push/email/SMS) for offline users
Service-Level Targets
- Availability: 99.99% for message send/receive paths
- Latency: P95 delivery < 150ms intra-region
- Durability: at-least-once delivery with idempotent writes (no lost or duplicated messages)
- Scale: millions of concurrent sessions
Storage
- Messages: partition by conversation, order by sequence
- Index by user for inbox/unread counts
- Object storage for media; signed URLs
Conversations, messages, receipts, and presence
- conversations (convo_id PK, type, created_at)
- participants (convo_id, user_id, role)
- messages (convo_id, seq, sender_id, ts, body, media_url)
- receipts (convo_id, seq, user_id, delivered_at, seen_at)
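The tables above can be sketched as DDL. A minimal runnable sqlite version (column types, constraints, and the `?after=seq` pagination query are our assumptions; table and column names follow the data model):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversations (
  convo_id   INTEGER PRIMARY KEY,
  type       TEXT NOT NULL,          -- e.g. 'direct' or 'group' (assumed values)
  created_at TEXT NOT NULL
);
CREATE TABLE participants (
  convo_id INTEGER NOT NULL REFERENCES conversations(convo_id),
  user_id  INTEGER NOT NULL,
  role     TEXT NOT NULL,
  PRIMARY KEY (convo_id, user_id)
);
CREATE TABLE messages (
  convo_id  INTEGER NOT NULL REFERENCES conversations(convo_id),
  seq       INTEGER NOT NULL,        -- per-conversation sequence number
  sender_id INTEGER NOT NULL,
  ts        TEXT NOT NULL,
  body      TEXT,
  media_url TEXT,
  PRIMARY KEY (convo_id, seq)        -- clusters a conversation's messages together
);
CREATE TABLE receipts (
  convo_id     INTEGER NOT NULL,
  seq          INTEGER NOT NULL,
  user_id      INTEGER NOT NULL,
  delivered_at TEXT,
  seen_at      TEXT,
  PRIMARY KEY (convo_id, seq, user_id)
);
""")

# Paginate a conversation after a known sequence number (the ?after=seq pattern).
conn.execute("INSERT INTO conversations VALUES (1, 'direct', '2024-01-01')")
conn.executemany(
    "INSERT INTO messages (convo_id, seq, sender_id, ts, body, media_url) "
    "VALUES (1, ?, 7, '2024-01-01', ?, NULL)",
    [(1, "hi"), (2, "hello"), (3, "bye")],
)
rows = conn.execute(
    "SELECT seq, body FROM messages WHERE convo_id = 1 AND seq > ? ORDER BY seq",
    (1,),
).fetchall()
print(rows)  # [(2, 'hello'), (3, 'bye')]
```

The composite `(convo_id, seq)` primary key makes the pagination query a range scan over one conversation.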
APIs
- REST:
  - POST /api/conversations
  - POST /api/conversations/:id/messages
  - GET /api/conversations/:id/messages?after=seq
- WebSocket events:
  - send_message { convoId, clientSeq, body }
  - message_ack { convoId, clientSeq, serverSeq }
  - presence_update { userId, status }
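The send_message/message_ack exchange can be sketched as an idempotent handler: the server assigns the per-conversation serverSeq and dedups client retries by (convoId, senderId, clientSeq). The class and field names here are illustrative, not part of the API:

```python
class MessageService:
    """Sketch: assigns a per-conversation serverSeq; dedups retries by clientSeq."""
    def __init__(self):
        self.next_seq = {}   # convoId -> next server sequence number
        self.seen = {}       # (convoId, senderId, clientSeq) -> serverSeq
        self.log = []        # stand-in for durable message storage

    def send_message(self, convo_id, sender_id, client_seq, body):
        key = (convo_id, sender_id, client_seq)
        if key in self.seen:                  # retry of an already-applied send:
            server_seq = self.seen[key]       # re-ack without writing a duplicate
        else:
            server_seq = self.next_seq.get(convo_id, 1)
            self.next_seq[convo_id] = server_seq + 1
            self.seen[key] = server_seq
            self.log.append((convo_id, server_seq, sender_id, body))
        return {"convoId": convo_id, "clientSeq": client_seq, "serverSeq": server_seq}

svc = MessageService()
ack1 = svc.send_message("c1", "alice", 1, "hi")
ack2 = svc.send_message("c1", "alice", 1, "hi")  # client retry after a lost ack
assert ack1 == ack2 and len(svc.log) == 1        # exactly one persisted copy
```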
Hot Path
- Client → Gateway (auth session, assign convo sequence)
- Message Service write (idempotent by clientSeq), publish to bus
- Consumers push to online recipients; enqueue notifications for offline
- Ack round-trip, update receipts
Scaling
- Shard gateways by consistent hashing of session IDs
- Partition topics by conversation; compact logs for retention
- Backpressure with bounded queues and drop policies for presence
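Consistent hashing of session IDs can be sketched with a ring of virtual nodes per gateway; draining a gateway then remaps only its own sessions, while everyone else stays connected to the same shard. Gateway names and the vnode count below are illustrative:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class GatewayRing:
    """Consistent-hash ring mapping session IDs to gateway shards."""
    def __init__(self, gateways, vnodes=100):
        # Each gateway owns `vnodes` points on the ring to smooth the distribution.
        self._ring = sorted(
            (_hash(f"{gw}#{i}"), gw) for gw in gateways for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def gateway_for(self, session_id: str) -> str:
        # First ring point clockwise of the session's hash (wrapping at the end).
        idx = bisect.bisect(self._keys, _hash(session_id)) % len(self._ring)
        return self._ring[idx][1]

ring = GatewayRing(["gw-1", "gw-2", "gw-3"])
before = {s: ring.gateway_for(s) for s in (f"sess-{i}" for i in range(1000))}

smaller = GatewayRing(["gw-1", "gw-2"])           # gw-3 is drained
moved = sum(1 for s, gw in before.items()
            if gw != "gw-3" and smaller.gateway_for(s) != gw)
print(moved)  # 0: only gw-3's own sessions are remapped
```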
Caching & TTL
- Short TTL presence entries (e.g., 30–120s) in Redis
- Cache recent message indices per conversation for fast pagination
- Edge cache media via signed URLs with appropriate TTL
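TTL-based presence can be sketched with an in-memory stand-in for Redis `SET key ... EX ttl`: a heartbeat refreshes the expiry, and a user with no heartbeat inside the TTL window reads as offline. The injectable clock here is only for deterministic testing:

```python
import time

class PresenceStore:
    """Redis-like presence keys with TTL; a heartbeat refreshes the expiry."""
    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry = {}                  # user_id -> expiry timestamp

    def heartbeat(self, user_id):
        self._expiry[user_id] = self._clock() + self._ttl   # like SET ... EX ttl

    def is_online(self, user_id) -> bool:
        exp = self._expiry.get(user_id)
        return exp is not None and exp > self._clock()

# Deterministic fake clock to show expiry without sleeping.
now = [0.0]
store = PresenceStore(ttl_seconds=60, clock=lambda: now[0])
store.heartbeat("alice")
assert store.is_online("alice")
now[0] = 61.0                              # no heartbeat for longer than the TTL
assert not store.is_online("alice")        # "ghost" online entry expires on its own
```

Letting the key expire, rather than requiring an explicit "went offline" write, is what prevents ghost users after a crashed gateway.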
Trade-offs
- Ordering per conversation vs global ordering
- At-least-once delivery with idempotent writes
- Backpressure on gateway during spikes
Failure Modes & Mitigations
- Gateway crash → client reconnect with session resumption tokens
- Consumer lag → scale consumers, enable catch-up reads
- Out-of-order delivery → per-conversation sequencing + de-dup
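Per-conversation sequencing plus de-dup on the receive side reduces to a small reorder buffer: deliver in serverSeq order, hold gaps, and drop at-least-once redeliveries. The names below are ours:

```python
class InOrderReceiver:
    """Delivers messages in server-sequence order; buffers gaps, drops duplicates."""
    def __init__(self):
        self.next_seq = 1
        self.pending = {}       # seq -> body, held until the gap before it fills
        self.delivered = []

    def on_message(self, seq, body):
        if seq < self.next_seq or seq in self.pending:
            return              # duplicate from an at-least-once redelivery
        self.pending[seq] = body
        while self.next_seq in self.pending:       # drain contiguous run
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

rx = InOrderReceiver()
for seq, body in [(1, "a"), (3, "c"), (3, "c"), (2, "b")]:  # out of order + dup
    rx.on_message(seq, body)
print(rx.delivered)  # ['a', 'b', 'c']
```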
Observability
- SLIs: send→ack latency, drop rate, WS reconnect rate
- Tracing from gateway to message persistence to fanout
- Dashboards: lag, partitions, consumer health
Implementation Notes
- Use WebSocket multiplexing to handle multiple conversations per connection
- Implement message sequencing with conversation-specific counters or timestamps
- Design heartbeat mechanisms to detect connection drops and handle reconnection
- Use message deduplication with idempotency keys to prevent duplicate sends
- Implement exponential backoff for message retry logic and delivery failures
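Exponential backoff for retries might look like the sketch below, using full jitter so reconnecting clients do not retry in lockstep. The `flaky_send` stub and the parameter defaults are illustrative:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter for send retries."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))   # 0.5, 1, 2, 4, 8, ... capped
        yield rng() * ceiling                       # full jitter avoids retry storms

def send_with_retry(send, message, sleep=lambda s: None):
    last_exc = None
    for delay in backoff_delays():
        try:
            return send(message)
        except ConnectionError as exc:              # transient delivery failure
            last_exc = exc
            sleep(delay)
    raise last_exc

attempts = []
def flaky_send(msg):
    attempts.append(msg)
    if len(attempts) < 3:
        raise ConnectionError("gateway unavailable")
    return "ack"

print(send_with_retry(flaky_send, "hello"))  # 'ack' on the third attempt
```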
Best Practices
- Design for eventual consistency: prioritize availability over strict ordering across groups
- Implement proper connection pooling and load balancing for WebSocket gateways
- Use pub/sub patterns for efficient message fanout to multiple recipients
- Design offline message storage with appropriate retention policies
- Implement comprehensive monitoring for message delivery SLAs and connection health
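The pub/sub fanout pattern above reduces to: one publish per message, one delivery per subscribed recipient. An in-process sketch (the topic naming and callback shape are assumptions; a real system would use Kafka/Pulsar topics as described earlier):

```python
from collections import defaultdict

class Bus:
    """Minimal in-process pub/sub: one topic per conversation, fanout on publish."""
    def __init__(self):
        self._subs = defaultdict(list)   # topic -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, event):
        for cb in self._subs[topic]:     # one write, N deliveries
            cb(event)

bus = Bus()
inboxes = {"bob": [], "carol": []}
for user in inboxes:
    # Each online recipient's gateway subscribes to the conversation topic.
    bus.subscribe("convo:42", lambda ev, u=user: inboxes[u].append(ev))

bus.publish("convo:42", {"from": "alice", "body": "hi all"})
print(inboxes)  # both online recipients got one copy
```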
Common Pitfalls
- Not handling WebSocket connection drops gracefully - leads to message loss
- Poor message ordering implementation causing out-of-sequence delivery
- Insufficient rate limiting allowing spam and DoS attacks
- Not implementing proper presence management leading to "ghost" online users
- Synchronous processing of group message fanout causing latency spikes
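The rate-limiting pitfall above is commonly addressed with a per-user token bucket: a burst budget plus a steady refill rate. A sketch with an injected clock so the refill is deterministic (rate and burst values are illustrative):

```python
class TokenBucket:
    """Per-user token bucket: refuses sends once the burst budget is spent."""
    def __init__(self, rate=5.0, burst=10, clock=None):
        self.rate, self.burst = rate, burst     # tokens/second, max bucket size
        self.tokens = float(burst)
        self.clock = clock or (lambda: 0.0)
        self.last = self.clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

now = [0.0]
bucket = TokenBucket(rate=5.0, burst=10, clock=lambda: now[0])
burst_allowed = sum(bucket.allow() for _ in range(20))   # spam burst at t=0
now[0] = 1.0                                             # one second later: 5 tokens refill
refilled = sum(bucket.allow() for _ in range(20))
print(burst_allowed, refilled)  # 10 5
```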