Autonomous quality optimisation

The self-improving pipeline

Board Game Librarian includes a background service — the quality-optimizer — that evaluates the Q&A pipeline daily, proposes improvements to the prompt templates that drive synthesis, and applies validated improvements automatically or after admin approval.

The goal is to catch systematic prompt issues before they accumulate and to fix them without requiring manual prompt engineering every time a pattern is reported.


Phase 1: Triage

The triage step classifies every interaction:

Classification	Meaning
`below_threshold`	Unusable — error, empty response, system/test message, unsupported language
`needs_attention`	Real question, real answer — send to LLM-as-judge for scoring
`evaluated`	Already scored by the judge in a previous run

Triage is incremental: each run classifies the interactions that arrived since the previous run, and interactions the judge has already scored are not scored again.
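The classification rules in the table can be sketched as a small function. This is an illustration only: the `Interaction` fields, the `SUPPORTED_LANGUAGES` set, and the order of the checks are assumptions, not the service's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the set of supported languages is illustrative only.
SUPPORTED_LANGUAGES = {"en"}


@dataclass
class Interaction:
    """Hypothetical shape of one logged Q&A interaction."""
    question: str
    answer: str
    language: str = "en"
    is_error: bool = False
    is_system_message: bool = False
    judge_score: Optional[float] = None  # filled in once the judge has scored it


def triage(interaction: Interaction) -> str:
    """Return one of the three classifications from the table above."""
    # Already scored in a previous run: never re-evaluated.
    if interaction.judge_score is not None:
        return "evaluated"
    unusable = (
        interaction.is_error
        or interaction.is_system_message
        or not interaction.answer.strip()
        or interaction.language not in SUPPORTED_LANGUAGES
    )
    return "below_threshold" if unusable else "needs_attention"
```

Checking for an existing judge score first is what keeps the step incremental: a scored interaction short-circuits before any other rule is evaluated.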

Phase 2: LLM-as-Judge (CC agent)

A Claude Code agent (the "CC agent") picks up pending quality runs and begins scoring.

For each flagged interaction, the judge evaluates:

Dimension	What it checks
Accuracy	Does the answer correctly represent what the rulebook says?
Completeness	Does it cover all parts of the question without omitting key details?
Format	Is the structure appropriate for the question type (YES_NO vs PROCEDURAL vs EDGE_CASE)?
Relevance	Does the answer stay on topic and avoid tangential filler?

Scores are stored as weighted averages (0.0–1.0).
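As a rough sketch of how a weighted average over the four dimensions could be computed: the weights below are hypothetical, and it is an assumption that the average is taken across the dimensions listed above.

```python
from typing import Mapping

# Hypothetical weights; the real values are not documented here.
DIMENSION_WEIGHTS = {
    "accuracy": 0.40,
    "completeness": 0.30,
    "format": 0.15,
    "relevance": 0.15,
}


def overall_score(
    scores: Mapping[str, float],
    weights: Mapping[str, float] = DIMENSION_WEIGHTS,
) -> float:
    """Combine per-dimension scores (each 0.0 to 1.0) into one weighted average."""
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} score out of range: {scores[name]}")
    total_weight = sum(weights.values())
    # Dividing by the total weight keeps the result in 0.0 to 1.0
    # even if the weights do not sum exactly to 1.
    return sum(scores[name] * w for name, w in weights.items()) / total_weight
```

With these example weights, an answer that is perfect on every dimension except accuracy (0.5) would score 0.8 overall.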

The judge also identifies:

Which synthesis template file was used (from question category + tier)
What specific issue, if any, occurred
Whether a pattern exists across multiple interactions for the same template

Phase 3: Proposal generation

If a pattern is identified — e.g. 15 of the last 20 EDGE_CASE interactions scored below 0.70 on the conditionality dimension — the CC agent generates a YAML diff proposal targeting the specific template.
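The example above (15 of the last 20 interactions below 0.70) suggests a sliding-window check. The sketch below assumes that reading; the function name and the default values are taken from the example, not from the service's actual trigger rule.

```python
from typing import Sequence


def has_pattern(
    scores: Sequence[float],
    threshold: float = 0.70,
    window: int = 20,
    min_low: int = 15,
) -> bool:
    """Return True if enough of the most recent scores fall below the threshold.

    `scores` is assumed to be ordered oldest to newest and to belong to a
    single template and dimension.
    """
    recent = scores[-window:]
    low_count = sum(1 for s in recent if s < threshold)
    return low_count >= min_low
```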

Each proposal records:

Field	Content
Template file	e.g. `synthesis-tier1-edge_case-normal.yml`
Question category	e.g. `EDGE_CASE`
Issue summary	Plain-language description of the problem
Original content	Original YAML content
Proposed content	Proposed YAML content
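A proposal record with these fields might be modelled as below. The class and attribute names are hypothetical; the diff is rendered with Python's standard `difflib`, which may differ from how the CC agent actually produces its YAML diff.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical record mirroring the fields in the table above."""
    template_file: str
    question_category: str
    issue_summary: str
    original_content: str
    proposed_content: str

    def diff(self) -> str:
        """Render a unified diff between the original and proposed YAML."""
        lines = difflib.unified_diff(
            self.original_content.splitlines(keepends=True),
            self.proposed_content.splitlines(keepends=True),
            fromfile=f"a/{self.template_file}",
            tofile=f"b/{self.template_file}",
        )
        return "".join(lines)
```

Storing both the original and the proposed content, rather than only the diff, is what makes a later rollback a plain overwrite.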

Phase 4: Test battery

Before any proposal is applied, it is validated against a representative set of questions. An internal test endpoint processes each question using the proposed template in isolation, without affecting live traffic.

The test battery contains 40 questions sampled across question categories and games, with known expected answer characteristics. The battery is scored with the current template and again with the proposed template, and the two scores are compared.
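A minimal sketch of the before/after comparison follows. It makes two assumptions that the text does not settle: the delta is the relative change in the mean battery score, and a regression is any question whose score drops by more than a tolerance.

```python
from statistics import mean
from typing import Sequence, Tuple


def compare_battery(
    before: Sequence[float],
    after: Sequence[float],
    tolerance: float = 0.0,
) -> Tuple[float, int]:
    """Return (delta_pct, regressions) for paired per-question scores.

    Assumptions: delta is the relative change of the mean score in percent,
    and a regression is a question whose score fell by more than `tolerance`.
    """
    if not before or len(before) != len(after):
        raise ValueError("before and after must be non-empty and the same length")
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("baseline mean is zero; relative delta is undefined")
    delta_pct = (mean(after) - baseline) / baseline * 100.0
    regressions = sum(1 for b, a in zip(before, after) if a < b - tolerance)
    return delta_pct, regressions
```

Pairing scores by question, rather than comparing only the means, is what allows regressions to be counted even when the average improves.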

Decision thresholds

Delta	Decision
>= 8% improvement and <= 2 regressions	Auto-apply (applied immediately)
>= 3% and < 8% improvement, and <= 2 regressions	Pending approval (admin decides at `/admin/quality`)
< 3% improvement or > 2 regressions	Blocked (proposal discarded)

All three thresholds (the auto-apply percentage, the minimum improvement, and the maximum number of regressions) are configurable.
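The decision table can be expressed as a single function. In this sketch the `Thresholds` field names are hypothetical stand-ins for the real configuration parameters, and the boundary handling (exactly 8% auto-applies, too many regressions always blocks) is one reasonable reading of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical names for the configurable thresholds."""
    auto_apply_pct: float = 8.0
    min_improvement_pct: float = 3.0
    max_regressions: int = 2


def decide(delta_pct: float, regressions: int, t: Thresholds = Thresholds()) -> str:
    """Map a test battery result to a decision."""
    # Blocking conditions are checked first so that a large improvement
    # with too many regressions is never auto-applied.
    if regressions > t.max_regressions or delta_pct < t.min_improvement_pct:
        return "blocked"
    if delta_pct >= t.auto_apply_pct:
        return "auto-applied"
    return "pending approval"
```

For example, `decide(10.0, 3)` is blocked despite the 10% gain, because three regressions exceed the limit of two.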

Safe deployment

When a proposal is applied (auto or admin-approved):

The YAML template file is overwritten with the proposed content
The prompt cache is flushed
The orchestrator service restarts
A git commit records the change

If something goes wrong after deployment, the admin can roll back via the `/admin/quality` interface. Rollback overwrites the template with the original content and repeats the flush and restart sequence.
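The apply and rollback sequences share most of their steps, which the following sketch makes explicit. Everything here is hypothetical: the function names are invented, and the cache flush, orchestrator restart, and git commit are passed in as callables because their real implementations are not described.

```python
from pathlib import Path
from typing import Callable, Optional


def deploy_template(
    path: Path,
    content: str,
    flush_cache: Callable[[], None],
    restart_orchestrator: Callable[[], None],
    record_commit: Optional[Callable[[Path], None]] = None,
) -> None:
    """Overwrite the template, then flush, restart, and optionally commit."""
    path.write_text(content, encoding="utf-8")
    flush_cache()
    restart_orchestrator()
    if record_commit is not None:
        record_commit(path)


def apply_proposal(path, proposed_content, flush_cache, restart_orchestrator, record_commit):
    """Apply a validated proposal; a commit records the change."""
    deploy_template(path, proposed_content, flush_cache, restart_orchestrator, record_commit)


def rollback(path, original_content, flush_cache, restart_orchestrator):
    """Restore the original content and repeat the flush and restart sequence."""
    deploy_template(path, original_content, flush_cache, restart_orchestrator)
```

Routing both operations through one helper keeps the flush and restart order identical for apply and rollback, so a rollback exercises the same path that was already used to deploy.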

Admin interface

The admin dashboard at `/admin/quality` shows all runs with:

Run status (pending, running, pending approval, auto-applied, approved, blocked, rolled back)
Number of interactions triaged and flagged
Test battery question count
Score before and after (delta as a percentage)
Decision and decision reason
Per-run detail page showing the proposals and their diffs

Manual runs can be triggered from the dashboard. The default schedule is nightly at 02:00 UTC.

Service details

Default sample size: 120 interactions per run
Battery size: 40 questions per proposal

Inside the prompt architecture — the YAML templates the optimizer modifies
The Question & Answer Pipeline — the pipeline the optimizer monitors
The RAG Pipeline in Detail — retrieval layer the optimizer works above