Notification Service
Designing multi-channel notifications (email, SMS, push, and in-app) with templates and user preferences.
API + Router + Channel workers + Providers
Requirements
- Templates with variables; localization
- User preferences; channel fallbacks and quiet hours
- Deduplication and idempotency
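As a rough illustration of the template requirement, the sketch below renders a template with variables and falls back across locales. The in-memory TEMPLATES store, the render function, and the choice of "en" as the default locale are all assumptions made for the example, not part of the design above.

```python
from string import Template

# Hypothetical in-memory template store keyed by (template name, locale).
TEMPLATES = {
    ("welcome", "en"): "Hello $name, welcome to $product!",
    ("welcome", "fr"): "Bonjour $name, bienvenue sur $product !",
}

DEFAULT_LOCALE = "en"  # assumption: English is the fallback locale


def render(name: str, locale: str, data: dict) -> str:
    """Render a template, falling back from 'fr-CA' to 'fr' to the default locale."""
    candidates = [locale, locale.split("-")[0], DEFAULT_LOCALE]
    for candidate in candidates:
        body = TEMPLATES.get((name, candidate))
        if body is not None:
            # substitute() raises KeyError when a variable is missing,
            # which surfaces a bad payload before anything is sent.
            return Template(body).substitute(data)
    raise LookupError(f"no template {name!r} for locale {locale!r}")
```

Calling render("welcome", "fr-CA", {"name": "Ana", "product": "Acme"}) resolves to the "fr" entry, while an unknown locale such as "de" falls back to the English body.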
High-Level Design
- API to enqueue notifications; outbox pattern from producers
- Router selects channels based on preferences and urgency
- Per-channel workers with provider retries and DLQs
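One possible shape for the outbox pattern is sketched below with an in-memory SQLite database. The outbox table layout and the enqueue and relay functions are hypothetical. The point is only that the notification row and its event are written in one transaction, and a separate relay publishes the event afterwards.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE notifications (notif_id TEXT PRIMARY KEY, user_id TEXT, template_id TEXT, status TEXT);
CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL,
                     published INTEGER NOT NULL DEFAULT 0);
""")


def enqueue(notif_id, user_id, template_id):
    # One transaction: the notification row and its outbox event commit together or not at all.
    with conn:
        conn.execute("INSERT INTO notifications VALUES (?, ?, ?, 'queued')",
                     (notif_id, user_id, template_id))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps({"notif_id": notif_id}),))


def relay(publish):
    """Publish pending outbox rows in order; mark each one only after publish() returns."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because a row is marked only after publish() succeeds, a crash between the two steps republishes the event on the next run. That gives at-least-once delivery, which is why downstream consumers need deduplication.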
Capacity & Sizing
- Notifications/sec with burst tolerance; channel throughput limits
- Provider quotas and rate limits per account/region
- DLQ sizing and reprocessing policies
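A back-of-the-envelope sizing can be sketched with Little's law (in-flight sends = arrival rate × time per send). The function names and every number in the usage note are illustrative assumptions, not measurements.

```python
import math


def workers_needed(peak_per_sec: float, avg_send_seconds: float, concurrency_per_worker: int) -> int:
    """Little's law: in-flight sends = arrival rate x time per send."""
    in_flight = peak_per_sec * avg_send_seconds
    return math.ceil(in_flight / concurrency_per_worker)


def drain_seconds(backlog: int, provider_limit_per_sec: float, steady_per_sec: float) -> float:
    """Time to clear a backlog using only the headroom left under the provider limit."""
    headroom = provider_limit_per_sec - steady_per_sec
    if headroom <= 0:
        raise ValueError("no headroom: the backlog never drains")
    return backlog / headroom
```

With a hypothetical peak of 2000 notifications/sec, 0.25 s per provider call, and 50 concurrent sends per worker, workers_needed returns 10. A backlog of 60000 messages against a 500/sec provider limit and 300/sec steady traffic drains in 300 seconds.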
Key Components
- Enqueue API and outbox processors
- Router with preference evaluation and channel selection
- Channel workers with provider adapters
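The router's preference evaluation might look roughly like the sketch below. It makes several assumptions: quiet hours are stored as a local start/end pair, urgent notifications bypass quiet hours, only email is sent during quiet hours, and channels are tried in the order push, SMS, email. The Preferences class and select_channels function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Preferences:
    email: bool
    sms: bool
    push: bool
    quiet_start: time  # assumption: quiet hours stored as a local start/end pair
    quiet_end: time


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 22:00 to 07:00.
    return now >= start or now < end


def select_channels(prefs: Preferences, urgent: bool, now: time) -> list[str]:
    enabled = [c for c in ("push", "sms", "email") if getattr(prefs, c)]
    if urgent:
        return enabled  # assumption: urgent notifications bypass quiet hours
    if in_quiet_hours(now, prefs.quiet_start, prefs.quiet_end):
        # Assumption: during quiet hours only the non-interruptive channel is used.
        return [c for c in enabled if c == "email"]
    return enabled
```

For a user with email and SMS enabled and quiet hours from 22:00 to 07:00, a non-urgent notification at 23:30 is routed to email only, while an urgent one still goes to SMS and email.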
Data Model
Templates, notifications, preferences, and deliveries
- templates (template_id PK, name, locale, body)
- notifications (notif_id PK, user_id, template_id, status, created_at)
- preferences (user_id PK, email, sms, push, quiet_hours)
- deliveries (delivery_id PK, notif_id, channel, provider, status, attempts, last_error)
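The four tables could be expressed as the following schema, shown with SQLite for convenience. The column types, NOT NULL constraints, and foreign keys are assumptions. The source only names the columns and primary keys.

```python
import sqlite3

SCHEMA = """
CREATE TABLE templates (
    template_id TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    locale      TEXT NOT NULL,
    body        TEXT NOT NULL
);
CREATE TABLE notifications (
    notif_id    TEXT PRIMARY KEY,
    user_id     TEXT NOT NULL,
    template_id TEXT NOT NULL REFERENCES templates(template_id),  -- assumed foreign key
    status      TEXT NOT NULL,
    created_at  TEXT NOT NULL
);
CREATE TABLE preferences (
    user_id     TEXT PRIMARY KEY,
    email       INTEGER NOT NULL,  -- booleans stored as 0/1
    sms         INTEGER NOT NULL,
    push        INTEGER NOT NULL,
    quiet_hours TEXT               -- assumed encoding, e.g. '22:00-07:00'
);
CREATE TABLE deliveries (
    delivery_id TEXT PRIMARY KEY,
    notif_id    TEXT NOT NULL REFERENCES notifications(notif_id),  -- assumed foreign key
    channel     TEXT NOT NULL,
    provider    TEXT,
    status      TEXT NOT NULL,
    attempts    INTEGER NOT NULL DEFAULT 0,
    last_error  TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

One notification fans out to several deliveries rows (one per channel), so retries and errors are tracked per channel rather than per notification.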
APIs
- Enqueue: POST /api/notifications with body { userId, templateId, data }
- Preferences: PUT /api/preferences/:userId with body { email, sms, push, quietHours }
- Status: GET /api/notifications/:id
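A minimal sketch of validating the enqueue body before it is persisted is shown below. The function name parse_enqueue_body is hypothetical, and so are the rules it applies: userId and templateId must be non-empty strings, and data is an optional object.

```python
def parse_enqueue_body(body: dict) -> dict:
    """Validate the body of POST /api/notifications and normalise field names."""
    errors = []
    for field in ("userId", "templateId"):
        if not isinstance(body.get(field), str) or not body[field]:
            errors.append(f"{field} must be a non-empty string")
    data = body.get("data", {})  # assumption: data is optional and defaults to {}
    if not isinstance(data, dict):
        errors.append("data must be an object")
    if errors:
        # A real handler would map this to an HTTP 400 response.
        raise ValueError("; ".join(errors))
    return {"user_id": body["userId"], "template_id": body["templateId"], "data": data}
```

Rejecting malformed requests at the API boundary keeps bad payloads out of the queue, where they would otherwise fail later in a worker and end up in the DLQ.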
Hot Path
- Enqueue → persist → publish → route channels
- Deliver via provider → update delivery status
Delivery Flow
- Enqueue notification; persist and publish event
- Router expands into channel-specific deliveries
- Workers call providers with retries/backoff; DLQ on failure
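The worker step can be sketched as a retry loop with exponential backoff, full jitter, and a dead-letter list. The deliver_with_retries function, its default limits, and the dict-based delivery record are assumptions for illustration. The send callable stands in for a provider adapter.

```python
import random
import time


def deliver_with_retries(send, delivery, dlq, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send(delivery); retry with exponential backoff and full jitter, then dead-letter."""
    for attempt in range(1, max_attempts + 1):
        delivery["attempts"] = attempt
        try:
            send(delivery)
            delivery["status"] = "sent"
            return True
        except Exception as exc:  # a real adapter would separate retryable from permanent errors
            delivery["last_error"] = str(exc)
            if attempt < max_attempts:
                # Full jitter: wait a random time up to base_delay * 2^(attempt - 1).
                sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    delivery["status"] = "dead_lettered"
    dlq.append(delivery)
    return False
```

The attempts and last_error fields mirror the deliveries columns, so a dead-lettered record carries enough context for later reprocessing. Jitter spreads retries out so that many workers do not hit a recovering provider at the same instant.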
Scaling
- Partition queues by channel; autoscale workers
- Use provider pools and rate limits; fallback providers
- Batch sends when the provider supports it (e.g., email)
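One common way to stay under a provider's rate limit is a token bucket per provider account. The sketch below is a single-process version. The TokenBucket class is hypothetical, and a multi-worker deployment would need the token state in a shared store.

```python
import time


class TokenBucket:
    """Allow up to `rate` sends per second with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so tests can control time
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self, n: float = 1) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A worker that gets False from try_acquire can requeue the delivery with a short delay, or route it to a fallback provider that still has headroom.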
Caching & TTL
- Cache compiled templates per locale with TTL
- Short TTL cache for user preferences; invalidate on update
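A small read-through cache with a TTL and explicit invalidation covers both bullets above. TTLCache and its load callback are hypothetical names. The sketch is in-process only and not thread-safe.

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, load):
        """Return the cached value, calling load(key) on a miss or after expiry."""
        entry = self._entries.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]
        value = load(key)
        self._entries[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        """Drop an entry, e.g. right after a preferences update is written."""
        self._entries.pop(key, None)
```

The handler for a preferences update would call invalidate(user_id) after the write. The short TTL then only bounds staleness on other processes that did not see the update.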
Failure Modes & Mitigations
- Provider outage → fallback route and DLQ
- Spam/blocks → feedback loops and suppression lists
- Duplicate sends → idempotency keys per notification and user
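Duplicate suppression can be sketched as a "claim" on an idempotency key, backed by a unique constraint. The sent_keys table and both function names are hypothetical. The key also includes the channel, which is an assumption added here so that one notification can still fan out to several channels.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sent_keys (idem_key TEXT PRIMARY KEY)")


def idempotency_key(notif_id: str, user_id: str, channel: str) -> str:
    # Assumption: channel is part of the key so email and SMS copies are tracked separately.
    raw = f"{notif_id}:{user_id}:{channel}".encode()
    return hashlib.sha256(raw).hexdigest()


def claim(key: str) -> bool:
    """Return True only for the first caller to claim this key."""
    try:
        with conn:
            conn.execute("INSERT INTO sent_keys (idem_key) VALUES (?)", (key,))
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists, so this send was claimed earlier.
        return False
```

A worker would call claim() before contacting the provider and skip the send when it returns False, so a redelivered queue message does not produce a second notification.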
Observability
- Delivery rates, failure reasons, and time-to-send
- End-to-end tracing across router and workers