Distributed Search

Distributed search with indexing pipeline, shard/replica management, and low-latency queries.

Learning Objectives

By the end of this case study, you will be able to:

  • Design horizontally scalable search architecture with sharding
  • Implement efficient indexing pipeline with near real-time updates
  • Build query routing and result merging for distributed search
  • Design relevance scoring and ranking algorithms (BM25, neural)
  • Implement high availability through replica management

Real-World Examples

Elasticsearch: Powers search for Netflix, Uber, and thousands of companies

Google Search: Handles 8.5 billion searches per day across trillions of pages

Amazon Product Search: Powers search across 600+ million products

Shopify Search: Enables product discovery for 4+ million businesses

Requirements

Functional Requirements

  • Index documents with fields; update/delete
  • Search with filters, sorting, pagination
  • Reindex segments and support schema evolution

Non-functional Requirements

  • Low p95 query latency under load
  • High availability via replicas
  • Near real-time index freshness

High-Level Design

  • Write pipeline: parse/analyze → index
  • Query router: route → fanout → merge
  • Shard/replica management for HA and scale
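The router's route → fanout → merge step can be sketched as follows. This is a minimal in-process sketch: the `SHARDS` list and `search_shard` function are illustrative stand-ins for real shard RPCs.

```python
import heapq

# Hypothetical in-memory shards: each holds (score, doc_id) hits for a query.
SHARDS = [
    [(0.91, "d1"), (0.40, "d7")],
    [(0.75, "d3"), (0.33, "d9")],
    [(0.88, "d5")],
]

def search_shard(shard, query):
    # Stand-in for a network call to one shard's search executor.
    return shard

def route_and_merge(query, top_k):
    # Fan the query out to every shard, then merge the partial results
    # (each sorted by descending score) into a single global top-k.
    partials = [search_shard(s, query) for s in SHARDS]
    merged = heapq.merge(*[sorted(p, reverse=True) for p in partials],
                         reverse=True)
    return [doc for _, doc in list(merged)[:top_k]]
```

In a real cluster the fanout would be concurrent and each shard would return only its local top-k, bounding the merge to `shard_count * k` candidates.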

Capacity & Sizing

  • Document count, average size, and terms per doc
  • Posting lists size and memory for caches
  • QPS for search and indexing throughput
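A back-of-envelope pass over these inputs might look like this. Every figure below is an assumed, illustrative number, not a sizing recommendation:

```python
# Back-of-envelope sizing (all inputs are illustrative assumptions).
docs = 100_000_000          # total documents
avg_doc_bytes = 2_000       # average document size
terms_per_doc = 200         # unique terms per document
posting_entry_bytes = 12    # doc_id + tf + position offsets, rough estimate

raw_corpus_gb = docs * avg_doc_bytes / 1e9           # raw stored text
postings_gb = docs * terms_per_doc * posting_entry_bytes / 1e9  # index size
```

With these assumptions the corpus is ~200 GB and the postings ~240 GB, which immediately tells you how many shards you need and how much of the index can live in cache.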

Key Components

  • Indexer, Query Router
  • Shard nodes (storage + search executors)
  • Coordinator for cluster state

Architecture

High-level components and data flow: clients send queries to the query router, which fans out to shard nodes and merges results; the indexer writes documents into shards; a coordinator tracks cluster state and replica placement.

Data Model

Core entities and relationships

  • documents (doc_id PK, title, body, ts)
  • terms (term PK, df)
  • postings (term, doc_id, tf, positions)
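A minimal in-memory version of these three entities, assuming simple whitespace tokenization (`index_doc` and the sample documents are illustrative):

```python
from collections import defaultdict

# postings: term -> list of (doc_id, tf, positions); df: term -> doc frequency.
postings = defaultdict(list)
df = defaultdict(int)

def index_doc(doc_id, body):
    # Collect term positions within the document.
    positions = defaultdict(list)
    for pos, term in enumerate(body.lower().split()):
        positions[term].append(pos)
    # Append one posting per term and bump its document frequency.
    for term, pos_list in positions.items():
        postings[term].append((doc_id, len(pos_list), pos_list))
        df[term] += 1

index_doc("d1", "search engines index documents")
index_doc("d2", "documents link to documents")
```

Positions make phrase and proximity queries possible; `df` feeds directly into relevance scoring.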

APIs

  • POST /api/index { doc }
  • GET /api/search?q=...&topk=...
  • POST /api/reindex/segment { segmentId }

Hot Path

  1. Search: route → fanout → score → merge top-k

Caching & TTL

  • Cache hot queries and partial results with short TTLs
  • Warm caches for frequent filters and facets
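A toy TTL cache along these lines (no size bound or eviction policy; the explicit `now` parameter exists only to make expiry deterministic in tests):

```python
import time

class TTLCache:
    """Short-TTL cache for hot query results; staleness bounded by ttl."""

    def __init__(self, ttl_seconds=5.0):
        self.ttl = ttl_seconds
        self.store = {}  # query -> (expires_at, results)

    def get(self, query, now=None):
        now = time.monotonic() if now is None else now
        entry = self.store.get(query)
        if entry and entry[0] > now:
            return entry[1]
        self.store.pop(query, None)  # expired or absent
        return None

    def put(self, query, results, now=None):
        now = time.monotonic() if now is None else now
        self.store[query] = (now + self.ttl, results)

cache = TTLCache(ttl_seconds=5.0)
cache.put("top stories", ["d1", "d2"], now=0.0)
```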

Scaling

  • Shard by term range or hash; replicas for HA
  • Columnar/segment storage for compression and speed
  • Cache hot queries and postings lists
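Hash-based sharding with simple replica placement might be sketched as follows (shard, replica, and node counts are arbitrary illustrative values):

```python
import hashlib

NUM_SHARDS = 4
REPLICAS = 2  # copies of each shard for HA

def shard_for(doc_id):
    # Stable hash so the same doc always routes to the same shard.
    digest = hashlib.sha1(doc_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def replica_nodes(shard, num_nodes=8):
    # Place each replica on a distinct node (simple modular placement).
    return [(shard + r * NUM_SHARDS) % num_nodes for r in range(REPLICAS)]
```

Hashing by doc_id spreads load evenly but forces every query to fan out to all shards; term-range sharding can prune the fanout at the cost of skew risk.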

Trade-offs

  • Freshness vs query performance (near-real-time)
  • BM25 vs neural rerank compute cost
  • Strict consistency vs availability under failures

Failure Modes & Mitigations

  • Shard outage → route to replicas; degrade features
  • Index corruption → restore from checkpoints
  • Skewed terms → adaptive query planning
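The "route to replicas; degrade features" mitigation can be sketched as below (`ShardUnavailable` and `flaky` are hypothetical stand-ins for real RPC errors):

```python
class ShardUnavailable(Exception):
    pass

def query_with_failover(replicas, run_query):
    # Try each replica in order; if the whole replica group is down,
    # degrade to an empty partial result instead of failing the query.
    for replica in replicas:
        try:
            return run_query(replica)
        except ShardUnavailable:
            continue
    return []

def flaky(replica):
    # Illustrative shard call: the primary is down, the replica is healthy.
    if replica == "primary":
        raise ShardUnavailable()
    return ["d1"]
```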

Implementation Notes

  • Use inverted indexes with posting lists for efficient term lookup
  • Implement segment-based architecture for concurrent read/write operations
  • Design query parser with support for complex boolean expressions
  • Use skip lists or compressed indexes for memory efficiency
  • Implement real-time search with incremental index updates
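BM25, mentioned earlier as a scoring option, reduces to one formula per (term, document) pair; this sketch uses the widely used `k1`/`b` defaults:

```python
import math

def bm25(tf, df, N, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 contribution of one term to one document's score.

    tf: term frequency in the doc; df: term document frequency;
    N: corpus size; doc_len / avg_doc_len drive length normalization.
    """
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

Scores rise with term frequency (saturating via `k1`) and fall for common terms and long documents, which is exactly the behavior the shard executors need when producing local top-k lists.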

Best Practices

  • Design schema carefully - field types affect query performance significantly
  • Implement proper text analysis pipeline (tokenization, stemming, synonyms)
  • Use appropriate data structures: tries for autocomplete, bloom filters for existence
  • Implement query optimization and caching at multiple levels
  • Design for gradual index rebuilds without service interruption
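The text analysis pipeline from the second bullet, as a toy sketch (the stopword and synonym tables are made up for illustration):

```python
import re

STOPWORDS = {"the", "a", "of"}        # illustrative stopword list
SYNONYMS = {"tv": "television"}       # illustrative synonym map

def analyze(text):
    # tokenize -> lowercase -> stopword removal -> synonym normalization
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [SYNONYMS.get(t, t) for t in tokens]
```

The same pipeline must run at both index and query time, otherwise query terms will not match indexed terms.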

Common Pitfalls

  • Not considering term frequency distribution - hot terms can kill performance
  • Insufficient memory allocation for posting lists and caches
  • Poor shard key selection leading to uneven data distribution
  • Not implementing proper circuit breakers for slow queries
  • Inadequate testing with production-scale data and query patterns
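A minimal circuit breaker for the slow-query pitfall above (threshold and cooldown values are illustrative; a production version would also track latency, not just failures):

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; rejects calls
    until `cooldown` seconds pass, then allows a single probe."""

    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            self.opened_at = None  # half-open: let one probe through
            self.failures = 0
            return True
        return False

    def record(self, success, now):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now
```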

Observability

  • SLIs: p95 query latency, error rate, tail latency by shard
  • Indexing lag and segment merge metrics
  • Cache hit ratios and hot term dashboards
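A p95 SLI can be computed from raw latency samples with the nearest-rank method (the sample latencies below are made up; note how a single slow shard call dominates the tail):

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile: value at position ceil(p% * n), 1-indexed.
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = [12, 15, 14, 13, 250, 16, 18, 11, 17, 19]
p95 = percentile(latencies_ms, 95)
```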