Autonomous quality optimisation
Last updated: April 11, 2026
The self-improving pipeline
Board Game Librarian includes a background service — the quality-optimizer — that evaluates the Q&A pipeline daily, proposes improvements to the prompt templates that drive synthesis, and applies validated improvements automatically or after admin approval.
The goal is to catch systematic prompt issues before they accumulate and to fix them without requiring manual prompt engineering every time a pattern is reported.
Phase 1: Triage
The triage step classifies every interaction:
| Classification | Meaning |
|---|---|
| below_threshold | Unusable: error, empty response, system/test message, unsupported language |
| needs_attention | Real question, real answer; sent to LLM-as-judge for scoring |
| evaluated | Already scored by the judge in a previous run |
Triage is incremental: new interactions are classified on each run. Previously scored interactions are not re-evaluated.
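The classification rules above can be sketched as a single function. This is a minimal illustration, not the service's code: the `Interaction` fields, the `SUPPORTED_LANGUAGES` set, and the `triage` function name are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the set of supported languages is illustrative only.
SUPPORTED_LANGUAGES = {"en"}


@dataclass
class Interaction:
    # Hypothetical record shape; the real schema is not documented here.
    question: str
    answer: str
    language: str = "en"
    is_system_message: bool = False
    had_error: bool = False
    judge_score: Optional[float] = None  # set once the judge has scored it


def triage(interaction: Interaction) -> str:
    """Return one of the three classifications from the table above."""
    # Incremental behaviour: anything already scored is never re-evaluated.
    if interaction.judge_score is not None:
        return "evaluated"
    unusable = (
        interaction.had_error
        or interaction.is_system_message
        or not interaction.answer.strip()
        or interaction.language not in SUPPORTED_LANGUAGES
    )
    return "below_threshold" if unusable else "needs_attention"
```

Checking for an existing score first is what keeps the step incremental: a rerun over the same data only does work for interactions that are new.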
Phase 2: LLM-as-Judge (CC agent)
A Claude Code agent (the "CC agent") picks up pending quality runs and begins scoring.
For each flagged interaction, the judge evaluates:
| Dimension | What it checks |
|---|---|
| Accuracy | Does the answer correctly represent what the rulebook says? |
| Completeness | Does it cover all parts of the question without omitting key details? |
| Format | Is the structure appropriate for the question type (YES_NO vs PROCEDURAL vs EDGE_CASE)? |
| Relevance | Does the answer stay on topic and avoid tangential filler? |
Scores are stored as weighted averages (0.0–1.0).
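As an illustration of how a weighted average over the four dimensions could be computed, the sketch below combines per-dimension scores into one value between 0.0 and 1.0. The weights are hypothetical; the actual weighting used by the judge is not documented on this page.

```python
# Assumption: these weights are invented for the example.
WEIGHTS = {
    "accuracy": 0.4,
    "completeness": 0.3,
    "format": 0.15,
    "relevance": 0.15,
}


def weighted_score(scores: dict[str, float],
                   weights: dict[str, float] = WEIGHTS) -> float:
    """Combine per-dimension scores (each 0.0-1.0) into one weighted average."""
    for name in weights:
        value = scores[name]  # raises KeyError if a dimension is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score out of range: {value}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight
```

Dividing by the total weight keeps the result in the 0.0 to 1.0 range even if the configured weights do not sum exactly to 1.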
The judge also identifies:
- Which synthesis template file was used (from question category + tier)
- What specific issue, if any, occurred
- Whether a pattern exists across multiple interactions for the same template
Phase 3: Proposal generation
If a pattern is identified — e.g. 15 of the last 20 EDGE_CASE interactions scored below 0.70 on the conditionality dimension — the CC agent generates a YAML diff proposal targeting the specific template.
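The pattern check in that example can be sketched as a simple windowed count. The function name and parameters are hypothetical, and the defaults (window of 20, floor of 0.70, at least 15 low scores) are taken from the example above rather than from confirmed configuration.

```python
def has_pattern(scores: list[float],
                window: int = 20,
                score_floor: float = 0.70,
                min_low: int = 15) -> bool:
    """Return True if enough of the most recent scores fall below the floor.

    `scores` is ordered oldest to newest and holds the scores for one
    template on one dimension.
    """
    recent = scores[-window:]
    if len(recent) < window:
        # Not enough history yet to call it a pattern.
        return False
    low_count = sum(1 for s in recent if s < score_floor)
    return low_count >= min_low
```

Requiring a full window avoids proposing a template change on the strength of a handful of early interactions.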
Each proposal records:
| Field | Content |
|---|---|
| Template file | e.g. synthesis-tier1-edge_case-normal.yml |
| Question category | e.g. EDGE_CASE |
| Issue summary | Plain-language description of the problem |
| Original content | Original YAML content |
| Proposed content | Proposed YAML content |
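A proposal record with these fields might look like the following sketch, which also renders the change as a unified diff using Python's standard `difflib`. The class name, attribute names, and the `instructions` YAML key in the usage example are assumptions for illustration only.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    # Field names mirror the table above; the real storage schema may differ.
    template_file: str
    question_category: str
    issue_summary: str
    original_content: str
    proposed_content: str

    def unified_diff(self) -> str:
        """Render original vs proposed YAML as a unified diff string."""
        lines = difflib.unified_diff(
            self.original_content.splitlines(keepends=True),
            self.proposed_content.splitlines(keepends=True),
            fromfile=f"a/{self.template_file}",
            tofile=f"b/{self.template_file}",
        )
        return "".join(lines)


proposal = Proposal(
    template_file="synthesis-tier1-edge_case-normal.yml",
    question_category="EDGE_CASE",
    issue_summary="Answers omit the conditions under which a rule applies.",
    original_content="instructions: Answer briefly.\n",  # hypothetical YAML
    proposed_content="instructions: Answer briefly and state any conditions.\n",
)
print(proposal.unified_diff())
```

Storing both full versions, rather than only the diff, is what makes a later rollback a plain overwrite.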
Phase 4: Test battery
Before any proposal is applied, it is validated against a representative set of questions. An internal test endpoint processes each question using the proposed template in isolation, without affecting live traffic.
The test battery contains 40 questions sampled across question categories and games, with known expected answer characteristics. The battery is scored with the current template and again with the proposed template, and the two scores are compared.
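One way to compare the two battery runs is sketched below. It assumes each run yields a score per question ID, that the delta is the relative change in the mean score, and that a regression is any question whose score drops. None of these definitions are confirmed by this page, and `compare_battery` is a hypothetical name.

```python
def compare_battery(before: dict[str, float],
                    after: dict[str, float]) -> tuple[float, int]:
    """Return (improvement in percent, number of regressed questions).

    Both dicts map a question ID to its score (0.0-1.0) for the same battery.
    """
    if not before or before.keys() != after.keys():
        raise ValueError("both runs must cover the same non-empty question set")
    mean_before = sum(before.values()) / len(before)
    mean_after = sum(after.values()) / len(after)
    if mean_before == 0:
        raise ValueError("baseline mean is zero; relative delta is undefined")
    # Assumption: delta is relative to the baseline mean, not percentage points.
    delta_pct = (mean_after - mean_before) / mean_before * 100
    # Assumption: any per-question drop counts as one regression.
    regressions = sum(1 for q in before if after[q] < before[q])
    return delta_pct, regressions
```

Tracking regressions separately from the mean matters because an average can improve while a few individual questions get noticeably worse.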
Decision thresholds
| Delta | Decision |
|---|---|
| >= 8% improvement, <= 2 regressions | Auto-apply: applied immediately |
| >= 3% and < 8% improvement, <= 2 regressions | Pending approval: admin decides at /admin/quality |
| < 3% improvement or > 2 regressions | Blocked: proposal discarded |
The improvement and regression thresholds are configurable.
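The decision table can be expressed as a small function. This is a sketch under assumptions: the function name, parameter names, and returned strings are hypothetical, an improvement of exactly 8% is treated as auto-apply, and the defaults mirror the values in the table.

```python
def decide(delta_pct: float,
           regressions: int,
           auto_min_pct: float = 8.0,
           review_min_pct: float = 3.0,
           max_regressions: int = 2) -> str:
    """Map a battery result to a decision, following the table above."""
    # Too many regressions or too little improvement blocks the proposal.
    if regressions > max_regressions or delta_pct < review_min_pct:
        return "blocked"
    if delta_pct >= auto_min_pct:
        return "auto_applied"
    # Between the two improvement thresholds: leave it to an admin.
    return "pending_approval"
```

For example, `decide(5.0, 1)` returns `"pending_approval"`, while `decide(12.0, 3)` returns `"blocked"` because the regression limit takes precedence over the size of the improvement.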
Safe deployment
When a proposal is applied (auto or admin-approved):
- The YAML template file is overwritten with the proposed content
- The prompt cache is flushed
- The orchestrator service restarts
- A git commit records the change
If something goes wrong after deployment, the admin can roll back via the /admin/quality interface. Rollback overwrites the template with the original content and repeats the flush and restart sequence.
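The apply and rollback sequences can be sketched as follows. The cache flush, orchestrator restart, and git commit are passed in as callables because their real implementations are not described here; all function and parameter names are hypothetical.

```python
from pathlib import Path
from typing import Callable, Optional


def deploy_template(path: Path,
                    content: str,
                    flush_cache: Callable[[], None],
                    restart_orchestrator: Callable[[], None],
                    record_commit: Optional[Callable[[Path], None]] = None) -> None:
    """Overwrite the template, then flush, restart, and optionally commit."""
    path.write_text(content, encoding="utf-8")
    flush_cache()
    restart_orchestrator()
    if record_commit is not None:
        record_commit(path)


def rollback_template(path: Path,
                      original_content: str,
                      flush_cache: Callable[[], None],
                      restart_orchestrator: Callable[[], None]) -> None:
    """Restore the original content and repeat the flush and restart steps."""
    # The page does not say whether a rollback is also committed to git,
    # so this sketch leaves the commit step out.
    deploy_template(path, original_content, flush_cache, restart_orchestrator)
```

Passing the side effects in as callables keeps the ordering testable without touching a real cache or service.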
Admin interface
The admin dashboard at /admin/quality shows all runs with:
- Run status (pending, running, pending approval, auto-applied, approved, blocked, rolled back)
- Number of interactions triaged and flagged
- Test battery question count
- Score before and after (delta as a percentage)
- Decision and decision reason
- Per-run detail page showing the proposals and their diffs
Manual runs can be triggered from the dashboard. The default schedule is nightly at 02:00 UTC.
Service details
- Default sample size: 120 interactions per run
- Battery size: 40 questions per proposal
Related pages
- Inside the prompt architecture — the YAML templates the optimizer modifies
- The Question & Answer Pipeline — the pipeline the optimizer monitors
- The RAG Pipeline in Detail — retrieval layer the optimizer works above