The RAG Pipeline in Detail

Last updated: March 24, 2026

What RAG actually means

RAG -- Retrieval-Augmented Generation -- is the pattern that makes modern AI assistants factually grounded. Instead of asking the language model to recall rules from training data (it was never trained on your game rulebook), I retrieve the relevant text at query time and hand it to the model as context.

The critical benefit is grounding: the model answers from the passages I give it, and every citation must point to a retrieved chunk. That leaves far less room to hallucinate rules that are not in the context.
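The retrieve-then-generate loop can be sketched in a few lines. This is a toy, self-contained version: a naive word-overlap score stands in for real embeddings, and the "generation" step just assembles the prompt instead of calling a model -- both are placeholders for the real services.

```python
# Toy sketch of the RAG pattern. Word overlap stands in for embedding
# similarity; returning the prompt stands in for the LLM call.

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer(question: str, chunks: list[str]) -> str:
    # The model sees ONLY the retrieved passages, not its training data.
    context = "\n\n".join(retrieve(question, chunks))
    return (
        "Answer using ONLY these passages:\n\n"
        f"{context}\n\nQ: {question}"
    )

rulebook = [
    "A player may take back a move only before the next player acts.",
    "Setup: shuffle the deck and deal five cards to each player.",
]
print(answer("Can I take back a move?", rulebook))
```

Because the answer prompt is built entirely from retrieved text, anything the model cites is traceable back to a specific chunk.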

[Diagram: the RAG pipeline, from query expansion through retrieval and context assembly to synthesis]

Each node in the diagram is a real function call. There is no magic -- just a well-engineered sequence.

Query expansion

Before embedding the question, I run it through the corpus-analyzer service. This uses BERTopic topic models built on the actual content of BGG forums to expand the query with related terms. A question like "can I take back a move" gets expanded with related phrases from the game domain vocabulary, improving recall.
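The effect of the expansion step looks roughly like this. Here a hardcoded synonym map stands in for the BERTopic-derived domain vocabulary that the corpus-analyzer actually serves; the entries are illustrative assumptions, not the real model output.

```python
# Sketch of query expansion. DOMAIN_SYNONYMS is a hypothetical stand-in
# for the vocabulary the corpus-analyzer builds from BGG forum content.

DOMAIN_SYNONYMS = {
    "take back": ["undo", "retract", "rewind"],
    "move": ["action", "turn"],
}

def expand_query(question: str) -> str:
    q = question.lower()
    extra: list[str] = []
    for term, synonyms in DOMAIN_SYNONYMS.items():
        if term in q:
            extra.extend(synonyms)
    # The expanded string is what gets embedded, widening recall
    # without changing the user's intent.
    return question if not extra else f"{question} ({' '.join(extra)})"

print(expand_query("can I take back a move"))
```

The expanded query matches rulebook passages that say "undo" or "retract" even though the user never typed those words.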

Context assembly

After retrieval, I rank the chunks by semantic similarity score and assemble them into a context window. I cap at roughly 3,000 tokens -- enough to include 5-8 substantial rulebook passages. If the question explicitly spans multiple rules sections, I use the multi-question synthesis template which handles longer context differently.
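The assembly step is a greedy pack against a token budget. A minimal sketch, using a whitespace word count as a crude stand-in for the real tokenizer:

```python
# Sketch of context assembly: take chunks in descending similarity
# order and pack them until the token budget is spent.

def assemble_context(scored_chunks: list[tuple[float, str]],
                     max_tokens: int = 3000) -> str:
    selected: list[str] = []
    used = 0
    for _score, text in sorted(scored_chunks, reverse=True):
        cost = len(text.split())  # crude token estimate
        if used + cost > max_tokens:
            continue              # skip chunks that would overflow
        selected.append(text)
        used += cost
    return "\n\n---\n\n".join(selected)
```

With the default 3,000-token cap and typical rulebook passages, this yields the 5-8 chunks mentioned above; the multi-question path would call it with a larger budget.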

Tier 1 vs Tier 2 comparison

| Dimension | Tier 1 | Tier 2 |
|---|---|---|
| Source material | Official rulebook only | Rulebook + BGG community threads |
| Context size | ~3,000 tokens | ~6,000-10,000 tokens |
| Synthesis model | GPT-4o | GPT-4o (deeper reasoning prompt) |
| Latency | 5-7s | 7-35s |
| Use case | Clear rules questions | Edge cases, interpretation, exceptions |
| Citation types | [PDF] | [PDF] + [T] community threads |
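Routing between the tiers can be sketched from the table. The escalation rule below is an illustrative assumption based on the use-case row, not the exact production logic:

```python
# Sketch of tier routing. The category label comes from the
# corpus-analyzer; the escalation rule here is assumed for illustration.

def choose_tier(category: str) -> int:
    # Edge cases and multi-part questions benefit from community
    # threads and the larger Tier 2 context window.
    return 2 if category in {"EDGE_CASE", "MULTI_QUESTION"} else 1

print(choose_tier("YES_NO"))     # clear rules question stays in Tier 1
print(choose_tier("EDGE_CASE"))  # escalates to Tier 2
```

Keeping clear questions in Tier 1 preserves the 5-7s latency for the common case.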

The corpus analyzer

The corpus-analyzer service (port 3481) runs BERTopic and CADA to:

  1. Classify the question into one of 6 categories (YES_NO, RULE_EXPLANATION, PROCEDURAL, OVERVIEW, EDGE_CASE, MULTI_QUESTION)
  2. Expand the query with domain-specific synonyms
  3. Detect ambiguous terms that need disambiguation

The category drives which synthesis YAML template I use -- edge-case questions get a different prompt than yes/no questions.
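Template selection then reduces to a lookup over the six categories. The YAML file names below are hypothetical placeholders; only the category labels come from the corpus-analyzer:

```python
# Sketch of category-driven template selection. File names are
# hypothetical; the six categories are the corpus-analyzer's labels.

SYNTHESIS_TEMPLATES = {
    "YES_NO": "yes_no.yaml",
    "RULE_EXPLANATION": "rule_explanation.yaml",
    "PROCEDURAL": "procedural.yaml",
    "OVERVIEW": "overview.yaml",
    "EDGE_CASE": "edge_case.yaml",
    "MULTI_QUESTION": "multi_question.yaml",
}

def template_for(category: str) -> str:
    # Fall back to the general explanation prompt for unknown labels.
    return SYNTHESIS_TEMPLATES.get(category, "rule_explanation.yaml")
```

A dictionary keeps the prompt-selection logic declarative, so adding a seventh category means adding one YAML file and one entry.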