Designing a Scalable Search Engine: From Inverted Index to Hybrid Vector Search
Search is one of those problems that looks simple from the outside — "just query a database" — and reveals extraordinary depth when you need sub-100ms latency at 10,000 queries per second with relevance that users actually trust. Whether you're preparing for a system design interview or building the real thing, here's how a production search system is structured.
What We're Building
A search system for an e-commerce platform: ~50 million product documents, ~10,000 queries per second at peak, p99 latency under 100ms, results ranked by a blend of text relevance and user engagement signals.
The fundamental challenge: a naive full-table scan across 50 million documents per query takes seconds. We need a data structure designed for search from the ground up.
The Inverted Index
The core data structure of every search engine — from Lucene to Elasticsearch to Postgres full-text search — is the inverted index. Instead of mapping documents → words, it maps words → documents.
For three documents:
Doc 1: "fast wireless keyboard"
Doc 2: "wireless mechanical keyboard"
Doc 3: "fast gaming mouse"
The inverted index is:
"fast" → [Doc1, Doc3]
"wireless" → [Doc1, Doc2]
"keyboard" → [Doc1, Doc2]
"mechanical" → [Doc2]
"gaming" → [Doc3]
"mouse" → [Doc3]
A query for "wireless keyboard" becomes: postings("wireless") ∩ postings("keyboard") = [Doc1, Doc2]. With sorted posting lists, this intersection costs time proportional to the posting-list lengths (roughly O(k) for combined length k), not the corpus size — far faster than a full scan.
Production posting lists also store positions (for phrase queries like "mechanical keyboard"), term frequencies (for BM25 ranking), and skip pointers (to jump ahead in large lists during AND operations).
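To make the structure concrete, here is a minimal Python sketch of a toy inverted index with AND-style query intersection. It uses naive whitespace tokenisation and stores no positions, frequencies, or skip pointers; it only illustrates the lookup-then-intersect shape of the real thing.

```python
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to a sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_and(index: dict[str, list[int]], query: str) -> list[int]:
    """Intersect the posting lists of every query term (AND semantics)."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

docs = {
    1: "fast wireless keyboard",
    2: "wireless mechanical keyboard",
    3: "fast gaming mouse",
}
index = build_index(docs)
print(search_and(index, "wireless keyboard"))  # [1, 2]
```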
Text Processing Pipeline
Before indexing, text passes through a processing pipeline:
- Tokenisation — split on whitespace/punctuation: ["fast", "wireless", "keyboard"]
- Lowercasing — Keyboard → keyboard
- Stop word removal — "the", "a", "of" are discarded (low information value)
- Stemming / Lemmatisation — keyboards → keyboard, running → run
- Synonym expansion (at query time) — "laptop" → also matches "notebook"
The same pipeline must run on both indexed documents and search queries. If you stem at index time, you must stem at query time — otherwise "keyboards" in a query won't match "keyboard" in the index.
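A toy version of that pipeline, with a crude suffix stripper standing in for a real stemmer such as Porter/Snowball (illustration only):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "for", "and", "or"}

def stem(token: str) -> str:
    # Crude suffix stripping for illustration; production systems use Porter/Snowball.
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def analyze(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenise + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [stem(t) for t in tokens]                      # stemming

# The same analyzer runs at index time and at query time.
print(analyze("Fast wireless keyboards for the office"))
# ['fast', 'wireless', 'keyboard', 'office']
```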
Ranking: BM25
BM25 (Best Match 25) is the standard baseline relevance function. For a query term t in document d:
BM25(t, d) = IDF(t) × TF(t,d) × (k1 + 1) / (TF(t,d) + k1 × (1 - b + b × |d| / avgdl))
Where:
- IDF — inverse document frequency. Rare terms score higher than common ones.
- TF — term frequency in the document.
- |d| — document length. Long documents are penalised so they don't dominate just by having more words.
- k1 (default 1.2) and b (default 0.75) are tunable parameters.
In practice you tune k1 and b against a labelled relevance dataset. Elasticsearch has used BM25 as its default similarity since version 5.0.
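As a sketch, here is BM25 for a single term scored against a small corpus of pre-tokenised documents, using the Lucene-style smoothed IDF and the default k1 = 1.2, b = 0.75:

```python
import math

def bm25(term: str, doc: list[str], corpus: list[list[str]],
         k1: float = 1.2, b: float = 0.75) -> float:
    """Score one query term against one document (both already tokenised)."""
    n = len(corpus)
    df = sum(1 for d in corpus if term in d)          # documents containing the term
    idf = math.log(1 + (n - df + 0.5) / (df + 0.5))   # Lucene-style smoothed IDF
    tf = doc.count(term)                              # term frequency in this document
    avgdl = sum(len(d) for d in corpus) / n           # average document length
    norm = k1 * (1 - b + b * len(doc) / avgdl)        # length normalisation
    return idf * tf * (k1 + 1) / (tf + norm)

corpus = [["fast", "wireless", "keyboard"],
          ["wireless", "mechanical", "keyboard"],
          ["fast", "gaming", "mouse"]]
# Whole-query score is the sum over query terms.
score = sum(bm25(t, corpus[0], corpus) for t in ["wireless", "keyboard"])
print(round(score, 3))
```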
BM25 is text relevance only. Production systems blend it with:
- Popularity signals — click-through rate, purchase rate, view count
- Freshness — newer listings ranked higher in trending searches
- Personalisation — user history, location, A/B segment assignment
The blending is usually a linear combination or a learning-to-rank model (LambdaMART is common) trained on logged search sessions.
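The linear version is just a weighted sum over normalised signals. The weights below are made-up placeholders; in practice you fit them offline against logged sessions, or replace the whole thing with a learning-to-rank model.

```python
def blended_score(bm25: float, ctr: float, freshness: float,
                  w_text: float = 0.6, w_ctr: float = 0.3, w_fresh: float = 0.1) -> float:
    # Inputs assumed pre-normalised to [0, 1]; weights are illustrative only.
    return w_text * bm25 + w_ctr * ctr + w_fresh * freshness

candidates = [
    {"id": "sku-1", "bm25": 0.92, "ctr": 0.10, "freshness": 0.4},
    {"id": "sku-2", "bm25": 0.80, "ctr": 0.35, "freshness": 0.9},
]
ranked = sorted(candidates,
                key=lambda c: blended_score(c["bm25"], c["ctr"], c["freshness"]),
                reverse=True)
print([c["id"] for c in ranked])
```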
System Architecture
Write Path
Product Service
↓ (writes)
Change Data Capture → Kafka (topic: product-changes)
↓ (consumers)
Indexing Workers (stateless, horizontally scalable)
↓
Text processing pipeline
↓
Elasticsearch Cluster (primary index)
The indexing workers consume product change events, run the text processing pipeline, and bulk-index into Elasticsearch. Bulk indexing (batches of 100–500 documents) is 10–20× faster than individual document calls.
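With the official Elasticsearch Python client, a bulk-indexing worker looks roughly like this. The index name, document shape, and connection details are illustrative; the analyzer configured on the index runs the text processing server-side.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # illustrative connection

def actions(products):
    # One bulk action per product change event.
    for p in products:
        yield {
            "_index": "products",
            "_id": p["id"],
            "_source": {"title": p["title"], "description": p["description"]},
        }

products = [{"id": 1, "title": "fast wireless keyboard", "description": "low-latency 2.4 GHz"}]
helpers.bulk(es, actions(products), chunk_size=500)  # batches instead of per-document calls
```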
Read Path
API Gateway
↓
Search Service
├── Query parsing & expansion
├── Redis cache lookup (query → top-10 results)
│ ↓ cache miss
├── Elasticsearch query (BM25 + filters)
│ ↓
├── Re-ranking (blend BM25 + engagement signals)
└── Response serialisation → client
Caching is critical. The top 1% of queries account for roughly 50% of traffic. These popular queries — "iphone case", "wireless headphones" — can be served entirely from Redis. Cache keys are the normalised query string (lowercased, stop words removed). TTL: 5 minutes for trending queries, 1 hour for navigational queries.
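A sketch of the cache check with redis-py, following the key scheme and TTLs above. The stop-word list, key prefix, and run_elasticsearch_query helper are hypothetical stand-ins.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)
STOP_WORDS = {"the", "a", "an", "for"}

def cache_key(query: str) -> str:
    # Normalised query string: lowercased, stop words removed.
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    return "search:" + " ".join(terms)

def run_elasticsearch_query(query: str) -> list[dict]:
    # Placeholder for the actual Elasticsearch call (BM25 + filters + re-ranking).
    return [{"id": "sku-1", "title": "wireless headphones"}]

def search(query: str, ttl_seconds: int = 300) -> list[dict]:
    key = cache_key(query)
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)              # cache hit: Elasticsearch never sees the query
    results = run_elasticsearch_query(query)
    r.setex(key, ttl_seconds, json.dumps(results))
    return results
```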
Elasticsearch Cluster Sizing
For 50M documents at ~1KB each:
Data nodes: 6 × (16 vCPU, 64 GB RAM, 4 TB NVMe SSD)
Coordinator nodes: 3 × (8 vCPU, 16 GB RAM) — route queries, hold no data
Index: 12 primary shards, 1 replica → 24 shards total
Rule of thumb: each shard should hold 20–40 GB of index data. Here, 50 million documents × ~1 KB ≈ 50 GB of raw data, × 3 for index structures ≈ 150 GB, / 12 shards ≈ 12.5 GB per shard — comfortably under the ceiling, with room for the catalogue to grow.
Adding Vector Search
BM25 fails for semantic queries. "lightweight laptop for travel" won't match a document that says "ultrabook 1.2 kg with all-day battery" — different words, same intent. Vector search addresses this.
Every product document is embedded into a dense vector — common dimensions in production today are 768 (sentence-BERT family), 1,536 (OpenAI text-embedding-3-small / ada-002), and 3,072 (OpenAI text-embedding-3-large's default). Pick the model first, then the dimension follows. At query time, the query is embedded with the same model and the index returns the nearest neighbours by cosine similarity.
HNSW (Hierarchical Navigable Small World) is the standard approximate nearest-neighbour (ANN) index. Elasticsearch, Weaviate, Pinecone, and pgvector all implement it.
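A minimal ANN example with the hnswlib library (one HNSW implementation among many). The random vectors stand in for real embeddings, and the dimension and build parameters are illustrative.

```python
import hnswlib
import numpy as np

dim, num_docs = 768, 50_000
doc_vectors = np.random.rand(num_docs, dim).astype(np.float32)  # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_docs, ef_construction=200, M=16)  # HNSW graph parameters
index.add_items(doc_vectors, np.arange(num_docs))
index.set_ef(100)  # search-time recall/latency trade-off

query_vector = np.random.rand(1, dim).astype(np.float32)  # embed the query with the same model
labels, distances = index.knn_query(query_vector, k=10)   # approximate nearest neighbours
```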
The catch: pure vector search has poor recall for exact matches. "Sony WH-1000XM5" should return that exact product, not semantically similar headphones. The solution is hybrid search:
final_score = α × BM25_score + (1 - α) × vector_score
Where α is tuned per query type — navigational queries (product names, SKUs) weight BM25 heavily; exploratory queries ("gifts for dad") weight vector search.
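A sketch of the α blend with min-max normalisation, since raw BM25 and cosine scores live on different scales. The α value is illustrative.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}

def hybrid_scores(bm25: dict[str, float], vector: dict[str, float],
                  alpha: float = 0.7) -> dict[str, float]:
    # alpha closer to 1.0 for navigational queries, lower for exploratory ones.
    b, v = min_max(bm25), min_max(vector)
    docs = set(b) | set(v)
    return {d: alpha * b.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0) for d in docs}

print(hybrid_scores({"doc1": 12.4, "doc2": 8.1}, {"doc1": 0.81, "doc3": 0.92}))
```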
Elasticsearch supports hybrid search natively via the rrf (Reciprocal Rank Fusion) retriever, which merges ranked lists from BM25 and kNN without requiring score normalisation.
Handling Scale: Scatter-Gather
When your corpus is too large for one cluster, shard across clusters by category:
Search Coordinator
├── Cluster A (Electronics)
├── Cluster B (Apparel)
└── Cluster C (Home & Garden)
The coordinator fans out the query to all clusters in parallel (scatter), collects the top-K results from each, merges and re-ranks the combined set (gather), and returns the global top-K.
Merging is non-trivial: BM25 scores from different clusters aren't directly comparable if their IDF statistics differ. Solutions:
- Normalise scores per cluster before merging (divide each score by the cluster's max score)
- Use Reciprocal Rank Fusion — merge by rank position rather than raw score. RRF is immune to score scale differences, which makes it the safer default.
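RRF itself fits in a few lines: merge any number of ranked lists by rank position, with the conventional smoothing constant k = 60. The result IDs below are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists by rank position; raw scores never enter the formula."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

cluster_a = ["p42", "p17", "p99"]   # top results from Cluster A (Electronics)
cluster_b = ["p17", "p03", "p42"]   # top results from Cluster B (Apparel)
print(reciprocal_rank_fusion([cluster_a, cluster_b]))  # p17 and p42 rise to the top
```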
Autocomplete
Autocomplete requires a separate data structure from the main index. The standard approach: store the top ~1 million historical query strings with their result counts, indexed by prefix. Serve from Redis (Sorted Sets) or a custom trie for sub-5ms response times.
Elasticsearch's completion field uses an FST (Finite State Transducer) for memory-efficient prefix lookup. For 1 million query strings, the FST fits in ~100 MB of RAM.
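In-process, the same idea fits in a few lines: a sorted list plus binary search standing in for the trie or FST, with historical queries ranked by popularity. The query strings and counts are illustrative.

```python
import bisect

class Autocomplete:
    """Prefix lookup over historical queries, ranked by popularity."""

    def __init__(self, query_counts: dict[str, int]):
        self.counts = query_counts
        self.sorted_queries = sorted(query_counts)  # lexicographic order enables prefix scans

    def suggest(self, prefix: str, limit: int = 5) -> list[str]:
        lo = bisect.bisect_left(self.sorted_queries, prefix)
        hi = bisect.bisect_left(self.sorted_queries, prefix + "\uffff")
        matches = self.sorted_queries[lo:hi]
        return sorted(matches, key=lambda q: self.counts[q], reverse=True)[:limit]

ac = Autocomplete({"wireless keyboard": 9100, "wireless headphones": 18200, "wired mouse": 2400})
print(ac.suggest("wire"))  # ['wireless headphones', 'wireless keyboard', 'wired mouse']
```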
The Operational Reality
Index lag matters. CDC → Kafka → indexer introduces 2–10 second lag. For flash sales or price drops, this is acceptable. For inventory (showing "in stock" when it's not), you may need a write-through cache that reflects updates before the search index catches up.
Query understanding is where the leverage is. Fixing ranking algorithms gets you 5–10% relevance improvement. Understanding user intent — spell correction, synonym expansion, query reformulation — gets you 30–50%. Invest in query understanding before advanced ranking models.
Monitor search quality with behaviour signals. Track click-through rate, zero-result rate, result abandonment rate, and reformulation rate (user modifies their query after seeing results). These tell you whether your search is actually working — no labelled dataset required.
The inverted index is 50 years old. Combine it with modern vector embeddings and a fast coordinator layer, and you have the architecture running most of the search systems people use every day.