Inside the prompt architecture
Last updated: March 31, 2026
The pipeline in brief
Answering a board game rules question isn't a single model call. It's a pipeline: classify, retrieve, route, synthesise. Each stage has a specific job, and they're designed to be independent — swap any stage and the others don't break.
The four stages:
- Extraction — parse the question and classify it
- Retrieval — fetch relevant rulebook chunks via vector search
- Routing — select a synthesis template based on question type
- Synthesis — call the language model with the retrieved chunks and produce an answer
This pipeline runs inside the rules-orchestrator service. Every question — whether from Telegram, the web chat, or a partner widget — passes through the same stages. The differences between channels are in presentation, not in the core prompt logic.
Step 1: Extraction
Before any retrieval happens, the orchestrator runs a dedicated extraction step. This is its own model call, using a lighter prompt whose sole job is to answer three questions: what game, what language, and what type of question is this?
Two extraction templates exist:
- extraction-question-only.yml — used when no game context is already known. Extracts game name, question intent, and language from the raw user message.
- extraction.yml — used when a game is already selected. Skips game name detection and focuses on question categorisation.
The extraction model doesn't answer the question. It produces structured output that the orchestrator parses — think of it as a preprocessing classifier that makes downstream template selection tractable.
The detected language is what tells the synthesis model what language to respond in. It's always derived from the question, never from the rulebook or the widget's configured locale.
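As a rough sketch, the orchestrator's handling of that structured output might look like this — the field names and JSON shape are assumptions for illustration, not the service's actual schema:

```javascript
// Hypothetical parser for the extraction model's structured output.
// The field names (language, category) are illustrative only.
const CATEGORIES = new Set([
  "YES_NO", "RULE_EXPLANATION", "PROCEDURAL",
  "OVERVIEW", "EDGE_CASE", "MULTI_QUESTION",
]);

function parseExtraction(rawModelOutput) {
  const parsed = JSON.parse(rawModelOutput);
  if (!CATEGORIES.has(parsed.category)) {
    // Unknown category: downstream routing falls back to the generic template.
    parsed.category = null;
  }
  return parsed;
}

// Example: an Italian question, game context already known.
const extraction = parseExtraction(JSON.stringify({
  language: "it",        // always detected from the question itself
  category: "YES_NO",
}));
```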
The six question categories
The extraction step assigns each question to one of six categories. Getting this right matters because different question types genuinely need different prompts.
YES_NO — "Can I use two action cards in the same turn?" These need a clear, direct answer up front, followed by the supporting rules. A prompt that buried the yes/no in paragraph three would be frustrating to use.
RULE_EXPLANATION — "How does trading work?" Structured explanation: define the mechanic, describe the steps, note exceptions. More room for elaboration than a yes/no.
PROCEDURAL — "What are the steps to resolve combat?" These benefit from numbered lists, and the prompt specifically requests step-by-step formatting.
OVERVIEW — "How do I set up the game?" Broad questions that draw on multiple rulebook sections. The temperature runs slightly higher here, allowing for more natural explanatory prose.
EDGE_CASE — "What happens if two players both claim the same territory on the same turn?" The hardest category. The prompt emphasises checking for explicit rules, then citing the closest applicable rule, then flagging if the rulebook is simply silent on the matter.
MULTI_QUESTION — "What's the difference between a move action and a sprint, and can I combine them?" Multiple distinct sub-questions. The prompt structures output to address each one separately.
Misclassification happens occasionally. An edge-case question might get tagged as RULE_EXPLANATION if the edge-case signal isn't strong enough. The fallback template handles this gracefully — not catastrophic, just suboptimal.
Step 2: Routing
Once the question category is known, _getSynthesisConfig() selects the synthesis template:
- YES_NO → synthesis-tier1-yes_no-normal.yml
- RULE_EXPLANATION → synthesis-tier1-rule_explanation-normal.yml
- PROCEDURAL → synthesis-tier1-procedural-normal.yml
- OVERVIEW → synthesis-tier1-overview-normal.yml
- EDGE_CASE → synthesis-tier1-edge_case-normal.yml
- MULTI_QUESTION → synthesis-tier1-multi_question-normal.yml
These aren't hardcoded string literals. They resolve through approximately 45 PROMPT_* environment variables pointing to template files. Want to swap the edge-case template? Change the env var. No code deployment required.
Each template is cached in Redis (DB 2) after first load. Disk reads happen once per process start; subsequent requests serve from in-memory cache.
If a category-specific template fails to load — file missing, YAML parse error, whatever — the system falls back to the legacy generic synthesis template. Zero synthesis failures, even if a template is misconfigured.
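A condensed sketch of that resolve → cache → fall back flow, with invented names (the real _getSynthesisConfig(), its env var naming, and the cache interface may differ):

```javascript
// Hypothetical sketch of template resolution. Env var names, the
// fallback filename, and the cache interface are all assumptions.
const FALLBACK_TEMPLATE = "synthesis-legacy-generic.yml";

function resolveTemplateFile(category, env) {
  // e.g. PROMPT_SYNTHESIS_EDGE_CASE=synthesis-tier1-edge_case-normal.yml
  return env[`PROMPT_SYNTHESIS_${category}`] || FALLBACK_TEMPLATE;
}

function loadTemplate(category, env, cache, readFile) {
  const file = resolveTemplateFile(category, env);
  if (cache.has(file)) return cache.get(file); // Redis DB 2 in production
  try {
    const tpl = readFile(file); // disk read happens once per process start
    cache.set(file, tpl);
    return tpl;
  } catch {
    // Missing file or parse error: fall back to the legacy generic
    // template so synthesis never hard-fails on a misconfigured template.
    return readFile(FALLBACK_TEMPLATE);
  }
}
```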
The template system
Templates are YAML files under the orchestrator's prompts/templates/ directory. Each template defines:
- system prompt — the role and rules for the language model
- user prompt — the question, retrieved chunks, and formatting instructions
- temperature — how much variation to allow in the output
- max_tokens — upper bound on response length
- model — which model to call (configurable via env vars, defaults to GPT-4o or Claude via OpenRouter)
The YAML structure means non-developers can iterate on prompt wording without touching code. Citation format, output language rules, response structure, and answer completeness constraints all live in these files — not scattered through JavaScript.
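Those fields might combine into a template shaped roughly like this — an illustrative sketch, not a verbatim copy of any shipped file:

```yaml
# Illustrative template sketch. Field names mirror the list above,
# but the wording and values are invented.
model: gpt-4o            # overridable via env var
temperature: 0.3
max_tokens: 1200
system_prompt: |
  You answer board game rules questions using ONLY the rulebook
  excerpts provided. If the excerpts do not contain the answer,
  say so explicitly. Never invent rules.
user_prompt: |
  Question: {question}
  Rulebook excerpts (with page references): {chunks}
  Answer in {detected_language}, citing pages as [PDF1, Pg. N].
```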
One firm instruction appears in every template: don't hallucinate. If the retrieved chunks don't contain the answer, say so. The model is constrained to what the rulebook says; it can't reason from general board game knowledge. This is intentional. Board game rules are specific, and a plausible-sounding wrong answer is worse than an honest "the rulebook doesn't cover this."
Step 3: Synthesis
Synthesis is the final model call. It receives:
- The user's question (in whatever language they asked it)
- The top 10 rulebook chunks from vector search
- Page references attached to each chunk
- The output language instruction
- The category-specific system prompt
The model reads the chunks, identifies what's relevant, and produces an answer that cites specific pages. Citations appear inline as [PDF1, Pg. 23] style references so users can verify in the physical rulebook.
Context compression kicks in if the total token count would exceed the model's context window. Long chunks get summarised before being included — a last resort, since 10 chunks from a 400-page rulebook typically fit comfortably.
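The budget check behind that compression step might be sketched like this — the chars-per-token approximation and the summarise hook are stand-ins, not the production implementation:

```javascript
// Hypothetical context-budget check. A real service would use a proper
// tokenizer and an actual summarisation call; both are stubbed here.
const approxTokens = (text) => Math.ceil(text.length / 4);

function fitChunks(chunks, tokenBudget, summarise) {
  let used = 0;
  return chunks.map((chunk) => {
    const remaining = tokenBudget - used;
    // Compression is a last resort: only chunks that would overflow
    // the remaining budget get summarised before inclusion.
    const text = approxTokens(chunk) > remaining
      ? summarise(chunk, remaining)
      : chunk;
    used += approxTokens(text);
    return text;
  });
}
```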
The synthesis model doesn't know the game's name from its training data. It knows only what's in the retrieved chunks. This prevents the model from blending in outside knowledge that might contradict the specific rulebook version you've imported.
Tier 1 vs Tier 2: different prompts, different sources
Tier 1 is the standard path: single model call, top 10 rulebook chunks, answer in 5–7 seconds.
Tier 2 goes deeper. It combines official rulebook chunks with community forum threads, and the synthesis prompt changes significantly. Instead of pure rules, the model now synthesises official text alongside community discussion, errata clarifications, and designer intent pulled from forum posts.
Tier 2 uses synthesis-tier2-normal.yml (and category variants). The prompt explicitly tells the model how to handle disagreement between official rules and community interpretation — official wins, but community context can be cited when it adds genuinely useful nuance.
Tier 2 takes 7–35 seconds because it retrieves from two separate data sources and the combined context is larger. It's available when the user explicitly wants deeper research, or when the system detects that the question likely requires more than what the raw rulebook text provides.
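The tier decision reduces to something like the following — the flag names are invented; the timings and criteria come from the text:

```javascript
// Hypothetical tier router. Flag names are illustrative.
function selectTier({ deepResearchRequested, likelyNeedsCommunityContext }) {
  // Tier 2: rulebook + forum threads, 7–35 s, larger combined context.
  if (deepResearchRequested || likelyNeedsCommunityContext) return 2;
  // Tier 1: single model call, top 10 rulebook chunks, 5–7 s.
  return 1;
}
```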
Confidence scoring
Each answered question carries a confidence level — high, medium, or low — based on how well the retrieved chunks match the question. A few signals feed into this:
- Are the top chunks clearly relevant, or is there a big score drop after chunk #2?
- Does the question reference something that appears literally in the chunks?
- Is this an edge case with only tangential chunk matches?
Confidence doesn't gate the answer. A low-confidence query still gets answered. But the confidence level surfaces in the partner admin dashboard and in the API response, letting operators spot questions the system found hard — useful for identifying gaps in rulebook coverage, or question types that consistently produce weak matches.
Worth being clear: high confidence doesn't mean the answer is correct. The model could still misread a chunk. Confidence reflects retrieval quality, not answer quality.
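Those signals could combine into a heuristic roughly like the following — the thresholds and field names are invented for illustration, not the production scoring:

```javascript
// Hypothetical confidence heuristic built from the signals above.
function scoreConfidence(chunks, question) {
  // Signal 1: a large similarity-score drop after chunk #2 suggests
  // the tail matches are thin.
  const drop = chunks.length > 2 ? chunks[1].score - chunks[2].score : 0;
  // Signal 2: does a meaningful question term appear literally in the
  // top chunks?
  const words = question.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  const literal = chunks.slice(0, 3).some((c) =>
    words.some((w) => c.text.toLowerCase().includes(w))
  );
  if (literal && drop < 0.15) return "high";
  if (literal || drop < 0.3) return "medium";
  return "low";
}
```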
Output language control
The language instruction is firm: write the answer in the language of the question.
It appears in every synthesis template's system prompt — not in the user message where it might get overlooked. The phrasing is explicit: something to the effect of "regardless of the language of the rulebook chunks, your response must be in {detected language}."
Why the reinforcement? Language models tend to drift toward the language of their input. If 10 chunks are in English and the question is in Italian, the model's default tendency leans toward English. An explicit, prominent instruction overrides this.
All 20 synthesis templates received a strengthened no-foreign-language rule in the March 2026 update. The Redis prompt cache was flushed afterwards to make sure the updated templates propagated immediately.
What the model is (and isn't) allowed to do
Allowed:
- Quote directly from retrieved chunks
- Synthesise across multiple chunks to answer a complex question
- Note when the rulebook is ambiguous or silent
- Cite specific pages
- Explain a rule in clearer language than the rulebook uses, as long as accuracy is preserved
- Acknowledge that community interpretation differs from rules-as-written (Tier 2 only)
Not allowed:
- Invent rules that aren't in the retrieved chunks
- Apply general board game knowledge if it contradicts the retrieved text
- Speculate about designer intent unless a forum thread directly quotes the designer (Tier 2 only)
- Respond in a different language than the question
- Inflate citations beyond what was actually used
The constraints aren't just about quality — they're about trust. A user consulting this system during an actual game needs to know the answer comes from their rulebook, not from the model's training data. Once that guarantee breaks down, the system's value breaks down with it.
Questions
Why does the system classify questions before searching? The extraction step does run first, but the classification it produces doesn't shape the search. Retrieval uses the question directly; classification only affects which synthesis template gets used. The two steps are independent.
Can I change what template a question category uses?
Yes. Update the relevant PROMPT_* environment variable and restart the orchestrator. No code change needed. The new template is cached in Redis on first load.
What happens if the extraction model misclassifies a question? You get a slightly suboptimal prompt — the wrong template for that question type. In practice this is rare, and the legacy fallback template handles truly ambiguous cases adequately.
Why 10 chunks and not more? Testing showed answer quality didn't improve meaningfully past 10. Beyond that, you're adding tokens (and latency) for noise. The top 10 by cosine similarity already contain the relevant information for almost every question.
Is the extraction model the same as the synthesis model? They can be configured independently via env vars. In the default configuration, extraction uses a lighter model (lower cost per call, since it's just classification) and synthesis uses the more capable model.
What's in the Redis prompt cache? The compiled template content — the fully loaded YAML, ready to be formatted with the question and chunks. It lives in DB 2. The TTL is long; the cache is flushed manually when templates are updated, so stale prompts don't keep serving after a file change.
Autonomous quality optimisation
The prompt templates don't only change when a developer edits them. A separate service — the quality-optimizer — runs a nightly cycle that evaluates recent interactions and proposes targeted improvements.
The loop:
- SQL triage selects real, completed interactions (filtering out test messages, errors, and unsupported languages).
- An LLM-as-judge scores each interaction across four dimensions: accuracy (does the answer match the rulebook?), completeness (is anything missing?), format (is the structure right for the question type?), and relevance (did the answer stay on topic?).
- The judge identifies which template file produced the answer and what specific issue, if any, occurred.
- If a pattern of issues is found — e.g. EDGE_CASE answers consistently missing conditionality checks — a proposal is generated: a diff against the current YAML file with targeted additions.
- The proposal is validated against a 40-question test battery. Scores before and after are compared. If the improvement exceeds 8% and no questions regress more than the allowed tolerance, the change is auto-applied. If the improvement is between 3% and 8%, it goes to a human for approval via /admin/quality. Below 3%, the proposal is blocked.
- On application: the YAML file is overwritten, redis-cli -n 2 FLUSHDB flushes the prompt cache, and pm2 restart rules-orchestrator applies the change. A git commit records the diff with a standardised message.
This design means the six category-specific templates can tighten over time without manual intervention after every reported edge case.
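The approval gate in step 5 can be sketched as a small pure function. The 3% and 8% thresholds come from the text above; the regression tolerance value and the function shape are assumptions:

```javascript
// Sketch of the quality-optimizer approval gate. The regression
// tolerance (2%) is a placeholder, not the configured value.
function decideProposal(scoreBefore, scoreAfter, worstRegression, tolerance = 0.02) {
  const improvement = (scoreAfter - scoreBefore) / scoreBefore;
  if (worstRegression > tolerance) return "blocked";   // any big regression kills it
  if (improvement > 0.08) return "auto-apply";         // applied without a human
  if (improvement >= 0.03) return "human-review";      // queued at /admin/quality
  return "blocked";                                    // below 3%: not worth the churn
}
```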