Vector search & embeddings

How semantic search finds the right chunks

Traditional keyword search finds exact word matches. Vector search finds chunks that mean the same thing as your question, even if they use completely different words. That is the core of why I can answer "what happens if I run out of cards" when the rulebook says "when the draw pile is exhausted".

The HNSW index is what makes this fast. Without it, a full-table cosine similarity scan over millions of vectors would take seconds. With HNSW, it takes 10–50 milliseconds.
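To make the cost of the unindexed case concrete, here is a minimal sketch of a full cosine similarity scan. The three-dimensional vectors and chunk names are invented for illustration; real embeddings have hundreds of dimensions and the real scan would run inside the database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_top_k(query, chunks, k=2):
    """Score every stored vector against the query: the O(n) scan an index avoids."""
    scored = [(cosine_similarity(query, vec), chunk_id)
              for chunk_id, vec in chunks.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy "embeddings" (hypothetical values, not real model output).
chunks = {
    "draw_pile_exhausted": [0.9, 0.1, 0.0],
    "scoring_round_end": [0.0, 0.2, 0.9],
    "reshuffle_discards": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for "what happens if I run out of cards"
top = brute_force_top_k(query, chunks)
```

Every query touches every stored vector, so the work grows linearly with the number of chunks. That linear growth is what the index removes.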

HNSW: Hierarchical Navigable Small World

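HNSW is an approximate nearest-neighbor algorithm. It builds a multi-layer graph in which each node is a vector: the upper layers are sparse and hold long-range links, while the bottom layer contains every vector. At query time the search starts from an entry point in the top layer and follows edges to progressively closer neighbors, dropping down a layer each time it can get no closer. The "approximate" part means it can occasionally miss the true nearest neighbor (on the order of 1–2% of the time) in exchange for being roughly 100x faster than exact search.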

For rules questions, this trade-off is excellent. Occasionally missing the single nearest chunk does not meaningfully affect answer quality, because the search returns a set of top candidates rather than a single result.
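The navigation step can be sketched as a greedy walk over a neighbor graph. This is a deliberately simplified, single-layer illustration with a hand-built graph. Real HNSW repeats a walk like this on each layer from top to bottom and keeps a list of candidates on the bottom layer rather than a single best node. The graph, vectors, and function name are all made up for the example.

```python
import math

def greedy_search(graph, vectors, entry, query):
    """Move to whichever neighbor is closest to the query until none improves."""
    current = entry
    current_dist = math.dist(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = math.dist(vectors[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            # No neighbor is closer: a local minimum, which is usually
            # (but not always) the true nearest neighbor.
            return current
        current, current_dist = best, best_dist

# Five points on a line, linked to their immediate neighbors.
vectors = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (3, 0), "e": (4, 0)}
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}

nearest = greedy_search(graph, vectors, entry="a", query=(3.2, 0))
```

The walk visits only the nodes along its path instead of all of them. The same property is the source of the approximation: if the graph lacks a link toward the true nearest neighbor, the walk stops early at a nearby node.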

Hybrid retrieval: vector + BM25

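Vector cosine similarity alone can miss passages that are lexically close to the question but sit in an unexpected part of embedding space. To guard against this, retrieval combines vector similarity with BM25 keyword matching. The two ranked lists are merged using reciprocal rank fusion, which combines each passage's rank positions rather than its raw scores, so the different scales of cosine similarity and BM25 do not matter.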
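As a rough illustration of reciprocal rank fusion, each passage earns 1 / (k + rank) from every list it appears in, and the totals decide the merged order. The constant k = 60 is the value commonly used in the literature and is an assumption here, as are the passage IDs; the source does not state which constant this system uses.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one, using rank positions only."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_ranking = ["c1", "c2", "c3"]  # best-first by cosine similarity
bm25_ranking = ["c3", "c1", "c4"]    # best-first by BM25
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
```

Passages that rank well in both lists (c1 and c3 here) rise above passages that appear in only one, which is the behavior hybrid retrieval relies on.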

A per-PDF BM25-floor reservation means a strongly-matched source PDF cannot be dropped by score alone — even if its cosine similarity is slightly below the dynamic cutoff. This is important for games where the relevant rule appears in only one short section: that section won't be edged out by higher-volume but less relevant content from other PDFs.
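One way such a reservation could work is sketched below. This is a guess at the general shape, not the actual implementation: the field names, the cutoff, and the floor value are all hypothetical.

```python
def select_with_bm25_floor(candidates, cosine_cutoff, bm25_floor):
    """Keep passages above the cosine cutoff, then reserve one slot for any
    PDF that would otherwise vanish despite a strong BM25 match."""
    kept = [c for c in candidates if c["cosine"] >= cosine_cutoff]
    kept_pdfs = {c["pdf"] for c in kept}

    # Best BM25 passage for each PDF that has nothing above the cutoff.
    best_by_pdf = {}
    for c in candidates:
        if c["pdf"] in kept_pdfs:
            continue
        best = best_by_pdf.get(c["pdf"])
        if best is None or c["bm25"] > best["bm25"]:
            best_by_pdf[c["pdf"]] = c

    reserved = [c for c in best_by_pdf.values() if c["bm25"] >= bm25_floor]
    return kept + reserved

candidates = [  # hypothetical scores
    {"id": "exp-1", "pdf": "expansion.pdf", "cosine": 0.82, "bm25": 3.1},
    {"id": "exp-2", "pdf": "expansion.pdf", "cosine": 0.80, "bm25": 2.4},
    {"id": "core-7", "pdf": "core_rules.pdf", "cosine": 0.74, "bm25": 9.5},
    {"id": "faq-2", "pdf": "faq.pdf", "cosine": 0.60, "bm25": 1.0},
]
selected = select_with_bm25_floor(candidates, cosine_cutoff=0.78, bm25_floor=8.0)
selected_ids = [c["id"] for c in selected]
```

In this toy run the core rulebook passage falls just below the cosine cutoff but survives on its BM25 score, while the weakly matched FAQ passage is still dropped.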

LLM reranker

After the hybrid search produces a candidate set, a deterministic LLM reranker (Claude Haiku 4.5) scores each passage against the original question. The reranker catches passages that share vocabulary with the question but discuss a different mechanic, and it lifts passages with correct content that cosine similarity scored too low. Reranker latency is typically 80–150ms.

Parent-section expansion

Vector search matches on small child passages — precise targets. But answers are written from the full surrounding parent section. After reranking, matched child passages are expanded to their parent sections (small-to-big retrieval), so the language model sees each rule in context rather than as an isolated fragment. Up to 8 prose parent sections and 4 table sections are assembled.
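The expansion step can be sketched as a lookup from child passage to parent section, with duplicates removed and the two caps applied. The mapping structures and IDs below are hypothetical; only the limits of 8 prose sections and 4 table sections come from the description above.

```python
def expand_to_parents(ranked_children, parent_of, parent_kind,
                      max_prose=8, max_tables=4):
    """Map reranked child passages to unique parent sections, best-first."""
    prose, tables, seen = [], [], set()
    for child in ranked_children:
        parent = parent_of[child]
        if parent in seen:
            continue  # several children can share one parent section
        seen.add(parent)
        if parent_kind[parent] == "table":
            if len(tables) < max_tables:
                tables.append(parent)
        elif len(prose) < max_prose:
            prose.append(parent)
    return prose, tables

ranked_children = ["c12", "c13", "c40", "c07"]  # reranker order, best first
parent_of = {"c12": "setup", "c13": "setup",
             "c40": "scoring_table", "c07": "draw_phase"}
parent_kind = {"setup": "prose", "draw_phase": "prose", "scoring_table": "table"}

prose_sections, table_sections = expand_to_parents(
    ranked_children, parent_of, parent_kind)
```

Two matched children from the same section yield that section only once, so the context budget is spent on distinct rules rather than on repeated text.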

Why 768 dimensions?

jina-v2-small-en produces 768-dimensional vectors. The choice reflects:

Semantic resolution: 768 dimensions capture fine-grained meaning differences that smaller models miss.
Index size: manageable at the scale of millions of chunks.
Speed: still fast enough for sub-100ms queries.
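A back-of-the-envelope calculation shows why the index size stays manageable. It assumes each dimension is stored as a 4-byte single-precision float and counts only the raw vectors, not the HNSW graph links or per-row overhead, so real usage will be somewhat higher.

```python
DIMENSIONS = 768
BYTES_PER_FLOAT32 = 4  # assumption: single-precision storage

def raw_vector_bytes(num_vectors, dimensions=DIMENSIONS):
    """Raw storage for the vectors alone, excluding index and row overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32

per_vector = raw_vector_bytes(1)                       # bytes for one chunk
one_million_gib = raw_vector_bytes(1_000_000) / 2**30  # GiB for a million chunks

print(f"{per_vector} bytes per vector, "
      f"{one_million_gib:.2f} GiB per million vectors")
```

At roughly 3 KB per vector, a few million chunks fit in the memory of a single ordinary database server.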

Embedding service

All embeddings are produced by a local embedding service, a Python wrapper around the sentence-transformers library. Because the model runs on the same infrastructure, no external API call is needed at query time.

Performance characteristics

Operation	Typical latency
Embed a question (100 tokens)	80–150ms
HNSW search (top-20 candidates)	10–50ms
BM25 merge + floor reservation	<5ms
LLM reranker	80–150ms
Filter by game + threshold	<5ms
Total vector retrieval	~200–400ms

The vector database

Embeddings are stored in PostgreSQL using the pgvector extension. Forum thread embeddings and PDF chunk embeddings share the same storage but are distinguished by type.
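A minimal sketch of what this storage and lookup might look like with pgvector follows. The table name, column names, and type values are invented for illustration, and the `%s` placeholders assume a psycopg-style driver. The `vector(768)` type, the `hnsw` index method with `vector_cosine_ops`, and the `<=>` cosine distance operator are standard pgvector features.

```python
# Hypothetical schema: one table for both kinds of embedding,
# distinguished by a source_type column.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS embeddings (
    id          bigserial PRIMARY KEY,
    source_type text NOT NULL,          -- e.g. 'pdf_chunk' or 'forum_thread'
    content     text NOT NULL,
    embedding   vector(768) NOT NULL
);
CREATE INDEX IF NOT EXISTS embeddings_hnsw
    ON embeddings USING hnsw (embedding vector_cosine_ops);
"""

# <=> is cosine distance, so similarity is 1 minus the distance.
TOP_K_SQL = """
SELECT id, content, 1 - (embedding <=> %s::vector) AS cosine_similarity
FROM embeddings
WHERE source_type = %s
ORDER BY embedding <=> %s::vector
LIMIT 20;
"""

def to_pgvector_literal(values):
    """Format a Python sequence as pgvector's text form, e.g. '[0.1,0.25,1.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

literal = to_pgvector_literal([0.1, 0.25, 1])
# Parameters for TOP_K_SQL, in placeholder order.
params = (literal, "pdf_chunk", literal)
```

Ordering by the distance expression itself, in ascending order, is what allows PostgreSQL to use the HNSW index for the query.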