Inside the prompt architecture
Last updated: March 31, 2026
The pipeline in brief
Answering a board game rules question isn't a single model call. It's a pipeline: classify, retrieve, route, synthesise. Each stage has a specific job, and they're designed to be independent — swap any stage and the others don't break.
The four stages:
- Extraction — parse the question and classify it
- Retrieval — fetch relevant rulebook chunks via vector search
- Routing — select a synthesis template based on question type
- Synthesis — call the language model with the retrieved chunks and produce an answer
This pipeline runs inside the rules-orchestrator service. Every question — whether from Telegram, the web chat, or a partner widget — passes through the same stages. The differences between channels are in presentation, not in the core prompt logic.
Step 1: Extraction
Before any retrieval happens, the orchestrator runs a dedicated extraction step. This is its own model call, using a lighter prompt whose sole job is to answer three questions: what game, what language, and what type of question is this?
Two extraction templates exist:
- extraction-question-only.yml — used when no game context is already known. Extracts game name, question intent, and language from the raw user message.
- extraction.yml — used when a game is already selected. Skips game name detection and focuses on question categorisation.
The extraction model doesn't answer the question. It produces structured output that the orchestrator parses — think of it as a preprocessing classifier that makes downstream template selection tractable.
The detected language is what tells the synthesis model what language to respond in. It's always derived from the question, never from the rulebook or the widget's configured locale.
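As a rough sketch, the orchestrator's handling of that structured output might look like this — the field names and JSON shape are assumptions for illustration, not the service's actual schema:

```javascript
// Hypothetical parser for the extraction model's structured output.
// The field names (language, category) are illustrative only.
const CATEGORIES = new Set([
  "YES_NO", "RULE_EXPLANATION", "PROCEDURAL",
  "OVERVIEW", "EDGE_CASE", "MULTI_QUESTION",
]);

function parseExtraction(rawModelOutput) {
  const parsed = JSON.parse(rawModelOutput);
  if (!CATEGORIES.has(parsed.category)) {
    // Unknown category: downstream routing falls back to the generic template.
    parsed.category = null;
  }
  return parsed;
}

// Example: an Italian question, game context already known.
const extraction = parseExtraction(JSON.stringify({
  language: "it",        // always detected from the question itself
  category: "YES_NO",
}));
```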
The six question categories
The extraction step assigns each question to one of six categories. Getting this right matters because different question types genuinely need different prompts.
YES_NO — "Can I use two action cards in the same turn?" These need a clear, direct answer up front, followed by the supporting rules. A prompt that buried the yes/no in paragraph three would be frustrating to use.
RULE_EXPLANATION — "How does trading work?" Structured explanation: define the mechanic, describe the steps, note exceptions. More room for elaboration than a yes/no.
PROCEDURAL — "What are the steps to resolve combat?" These benefit from numbered lists, and the prompt specifically requests step-by-step formatting.
OVERVIEW — "How do I set up the game?" Broad questions that draw on multiple rulebook sections. The temperature runs slightly higher here, allowing for more natural explanatory prose.
EDGE_CASE — "What happens if two players both claim the same territory on the same turn?" The hardest category. The prompt emphasises checking for explicit rules, then citing the closest applicable rule, then flagging if the rulebook is simply silent on the matter.
MULTI_QUESTION — "What's the difference between a move action and a sprint, and can I combine them?" Multiple distinct sub-questions. The prompt structures output to address each one separately.
Misclassification happens occasionally. An edge-case question might get tagged as RULE_EXPLANATION if the edge-case signal isn't strong enough. The fallback template handles this gracefully — not catastrophic, just suboptimal.
Step 2: Routing
Once the question category is known, _getSynthesisConfig() selects the synthesis template:
- YES_NO → synthesis-tier1-yes_no-normal.yml
- RULE_EXPLANATION → synthesis-tier1-rule_explanation-normal.yml
- PROCEDURAL → synthesis-tier1-procedural-normal.yml
- OVERVIEW → synthesis-tier1-overview-normal.yml
- EDGE_CASE → synthesis-tier1-edge_case-normal.yml
- MULTI_QUESTION → synthesis-tier1-multi_question-normal.yml
These aren't hardcoded string literals. They resolve through approximately 45 PROMPT_* environment variables pointing to template files. Want to swap the edge-case template? Change the env var. No code deployment required.
Each template is cached in Redis (DB 2) after first load. Disk reads happen once per process start; subsequent requests serve from in-memory cache.
If a category-specific template fails to load — file missing, YAML parse error, whatever — the system falls back to the legacy generic synthesis template. Zero synthesis failures, even if a template is misconfigured.
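A condensed sketch of that resolve → cache → fall back flow, with invented names (the real _getSynthesisConfig(), its env var naming, and the cache interface may differ):

```javascript
// Hypothetical sketch of template resolution. Env var names, the
// fallback filename, and the cache interface are all assumptions.
const FALLBACK_TEMPLATE = "synthesis-legacy-generic.yml";

function resolveTemplateFile(category, env) {
  // e.g. PROMPT_SYNTHESIS_EDGE_CASE=synthesis-tier1-edge_case-normal.yml
  return env[`PROMPT_SYNTHESIS_${category}`] || FALLBACK_TEMPLATE;
}

function loadTemplate(category, env, cache, readFile) {
  const file = resolveTemplateFile(category, env);
  if (cache.has(file)) return cache.get(file); // Redis DB 2 in production
  try {
    const tpl = readFile(file); // disk read happens once per process start
    cache.set(file, tpl);
    return tpl;
  } catch {
    // Missing file or parse error: fall back to the legacy generic
    // template so synthesis never hard-fails on a misconfigured template.
    return readFile(FALLBACK_TEMPLATE);
  }
}
```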
The template system
Templates are YAML files under the orchestrator's prompts/templates/ directory. Each template defines:
- system prompt — the role and rules for the language model
- user prompt — the question, retrieved chunks, and formatting instructions
- temperature — how much variation to allow in the output
- max_tokens — upper bound on response length
- model — which model to call (configurable via env vars, defaults to GPT-4o or Claude via OpenRouter)
The YAML structure means non-developers can iterate on prompt wording without touching code. Citation format, output language rules, response structure, and answer completeness constraints all live in these files — not scattered through JavaScript.
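Those fields might combine into a template shaped roughly like this — an illustrative sketch, not a verbatim copy of any shipped file:

```yaml
# Illustrative template sketch. Field names mirror the list above,
# but the wording and values are invented.
model: gpt-4o            # overridable via env var
temperature: 0.3
max_tokens: 1200
system_prompt: |
  You answer board game rules questions using ONLY the rulebook
  excerpts provided. If the excerpts do not contain the answer,
  say so explicitly. Never invent rules.
user_prompt: |
  Question: {question}
  Rulebook excerpts (with page references): {chunks}
  Answer in {detected_language}, citing pages as [PDF1, Pg. N].
```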
One firm instruction appears in every template: don't hallucinate. If the retrieved chunks don't contain the answer, say so. The model is constrained to what the rulebook says; it can't reason from general board game knowledge. This is intentional. Board game rules are specific, and a plausible-sounding wrong answer is worse than an honest "the rulebook doesn't cover this."
Step 3: Synthesis
Synthesis is the final model call. It receives:
- The user's question (in whatever language they asked it)
- The top 10 rulebook chunks from vector search
- Page references attached to each chunk
- The output language instruction
- The category-specific system prompt
The model reads the chunks, identifies what's relevant, and produces an answer that cites specific pages. Citations appear inline as [PDF1, Pg. 23] style references so users can verify in the physical rulebook.
Context compression kicks in if the total token count would exceed the model's context window. Long chunks get summarised before being included — a last resort, since 10 chunks from a 400-page rulebook typically fit comfortably.
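The budget check behind that compression step might be sketched like this — the chars-per-token approximation and the summarise hook are stand-ins, not the production implementation:

```javascript
// Hypothetical context-budget check. A real service would use a proper
// tokenizer and an actual summarisation call; both are stubbed here.
const approxTokens = (text) => Math.ceil(text.length / 4);

function fitChunks(chunks, tokenBudget, summarise) {
  let used = 0;
  return chunks.map((chunk) => {
    const remaining = tokenBudget - used;
    // Compression is a last resort: only chunks that would overflow
    // the remaining budget get summarised before inclusion.
    const text = approxTokens(chunk) > remaining
      ? summarise(chunk, remaining)
      : chunk;
    used += approxTokens(text);
    return text;
  });
}
```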
The synthesis model doesn't know the game's name from its training data. It knows only what's in the retrieved chunks. This prevents the model from blending in outside knowledge that might contradict the specific rulebook version you've imported.
Tier 1 vs Tier 2: different prompts, different sources
Tier 1 is the standard path: single model call, top 10 rulebook chunks, answer in 5–7 seconds.
Tier 2 goes deeper. It combines official rulebook chunks with community forum threads, and the synthesis prompt changes significantly. Instead of pure rules, the model now synthesises official text alongside community discussion, errata clarifications, and designer intent pulled from forum posts.
Tier 2 uses synthesis-tier2-normal.yml (and category variants). The prompt explicitly tells the model how to handle disagreement between official rules and community interpretation — official wins, but community context can be cited when it adds genuinely useful nuance.
Tier 2 takes 7–35 seconds because it retrieves from two separate data sources and the combined context is larger. It's available when the user explicitly wants deeper research, or when the system detects that the question likely requires more than what the raw rulebook text provides.
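The tier decision reduces to something like the following — the flag names are invented; the timings and criteria come from the text:

```javascript
// Hypothetical tier router. Flag names are illustrative.
function selectTier({ deepResearchRequested, likelyNeedsCommunityContext }) {
  // Tier 2: rulebook + forum threads, 7–35 s, larger combined context.
  if (deepResearchRequested || likelyNeedsCommunityContext) return 2;
  // Tier 1: single model call, top 10 rulebook chunks, 5–7 s.
  return 1;
}
```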
Confidence scoring
Each answered question carries a confidence level — high, medium, or low — based on how well the retrieved chunks match the question. A few signals feed into this:
- Are the top chunks clearly relevant, or is there a big score drop after chunk #2?
- Does the question reference something that appears literally in the chunks?
- Is this an edge case with only tangential chunk matches?
Confidence doesn't gate the answer. A low-confidence query still gets answered. But the confidence level surfaces in the partner admin dashboard and in the API response, letting operators spot questions the system found hard — useful for identifying gaps in rulebook coverage, or question types that consistently produce weak matches.
Worth being clear: high confidence doesn't mean the answer is correct. The model could still misread a chunk. Confidence reflects retrieval quality, not answer quality.
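Those signals could combine into a heuristic roughly like the following — the thresholds and field names are invented for illustration, not the production scoring:

```javascript
// Hypothetical confidence heuristic built from the signals above.
function scoreConfidence(chunks, question) {
  // Signal 1: a large similarity-score drop after chunk #2 suggests
  // the tail matches are thin.
  const drop = chunks.length > 2 ? chunks[1].score - chunks[2].score : 0;
  // Signal 2: does a meaningful question term appear literally in the
  // top chunks?
  const words = question.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  const literal = chunks.slice(0, 3).some((c) =>
    words.some((w) => c.text.toLowerCase().includes(w))
  );
  if (literal && drop < 0.15) return "high";
  if (literal || drop < 0.3) return "medium";
  return "low";
}
```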
Output language control
The language instruction is firm: write the answer in the language of the question.
It appears in every synthesis template's system prompt — not in the user message where it might get overlooked. The phrasing is explicit: something to the effect of "regardless of the language of the rulebook chunks, your response must be in {detected language}."
Why the reinforcement? Language models tend to drift toward the language of their input. If 10 chunks are in English and the question is in Italian, the model's default tendency leans toward English. An explicit, prominent instruction overrides this.
All 20 synthesis templates received a strengthened no-foreign-language rule in the March 2026 update. The Redis prompt cache was flushed afterwards to make sure the updated templates propagated immediately.
What the model is (and isn't) allowed to do
Allowed:
- Quote directly from retrieved chunks
- Synthesise across multiple chunks to answer a complex question
- Note when the rulebook is ambiguous or silent
- Cite specific pages
- Explain a rule in clearer language than the rulebook uses, as long as accuracy is preserved
- Acknowledge that community interpretation differs from rules-as-written (Tier 2 only)
Not allowed:
- Invent rules that aren't in the retrieved chunks
- Apply general board game knowledge if it contradicts the retrieved text
- Speculate about designer intent unless a forum thread directly quotes the designer (Tier 2 only)
- Respond in a different language than the question
- Inflate citations beyond what was actually used
The constraints aren't just about quality — they're about trust. A user consulting this system during an actual game needs to know the answer comes from their rulebook, not from the model's training data. Once that guarantee breaks down, the system's value breaks down with it.
Questions
Why does the system classify questions before searching? The extraction step does run first, but the classification it produces doesn't shape the search. Retrieval uses the question directly; classification only affects which synthesis template gets used. The two steps are independent.
Can I change what template a question category uses?
Yes. Update the relevant PROMPT_* environment variable and restart the orchestrator. No code change needed. The new template is cached in Redis on first load.
What happens if the extraction model misclassifies a question? You get a slightly suboptimal prompt — the wrong template for that question type. In practice this is rare, and the legacy fallback template handles truly ambiguous cases adequately.
Why 10 chunks and not more? Testing showed answer quality didn't improve meaningfully past 10. Beyond that, you're adding tokens (and latency) for noise. The top 10 by cosine similarity already contain the relevant information for almost every question.
Is the extraction model the same as the synthesis model? They can be configured independently via env vars. In the default configuration, extraction uses a lighter model (lower cost per call, since it's just classification) and synthesis uses the more capable model.
What's in the Redis prompt cache? The compiled template content — the fully loaded YAML, ready to be formatted with the question and chunks. It lives in DB 2. The TTL is long; the cache is flushed manually when templates are updated, so stale prompts don't keep serving after a file change.
Autonomous quality optimisation
The prompt templates don't only change when a developer edits them. A separate service — the quality-optimizer — runs a nightly cycle that evaluates recent interactions and proposes targeted improvements.
The loop:
- SQL triage selects real, completed interactions (filtering out test messages, errors, and unsupported languages).
- An LLM-as-judge scores each interaction across four dimensions: accuracy (does the answer match the rulebook?), completeness (is anything missing?), format (is the structure right for the question type?), and relevance (did the answer stay on topic?).
- The judge identifies which template file produced the answer and what specific issue, if any, occurred.
- If a pattern of issues is found — e.g. EDGE_CASE answers consistently missing conditionality checks — a proposal is generated: a diff against the current YAML file with targeted additions.
- The proposal is validated against a 40-question test battery. Scores before and after are compared. If the improvement exceeds 8% and no questions regress more than the allowed tolerance, the change is auto-applied. If the improvement is between 3% and 8%, it goes to a human for approval via /admin/quality. Below 3%, the proposal is blocked.
- On application: the YAML file is overwritten, redis-cli -n 2 FLUSHDB flushes the prompt cache, and pm2 restart rules-orchestrator applies the change. A git commit records the diff with a standardised message.
This design means the six category-specific templates can tighten over time without manual intervention after every reported edge case.
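The approval gate in step 5 can be sketched as a small pure function. The 3% and 8% thresholds come from the text above; the regression tolerance value and the function shape are assumptions:

```javascript
// Sketch of the quality-optimizer approval gate. The regression
// tolerance (2%) is a placeholder, not the configured value.
function decideProposal(scoreBefore, scoreAfter, worstRegression, tolerance = 0.02) {
  const improvement = (scoreAfter - scoreBefore) / scoreBefore;
  if (worstRegression > tolerance) return "blocked";   // any big regression kills it
  if (improvement > 0.08) return "auto-apply";         // applied without a human
  if (improvement >= 0.03) return "human-review";      // queued at /admin/quality
  return "blocked";                                    // below 3%: not worth the churn
}
```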