Autonomous quality optimisation

Last updated: April 11, 2026

The self-improving pipeline

Board Game Librarian includes a background service — the quality-optimizer — that evaluates the Q&A pipeline daily, proposes improvements to the prompt templates that drive synthesis, and applies validated improvements automatically or after admin approval.

The goal is to catch systematic prompt issues before they accumulate and to fix them without requiring manual prompt engineering every time a pattern is reported.

Phase 1: Triage

The triage step classifies every interaction:

Classification     Meaning
below_threshold    Unusable: error, empty response, system/test message, unsupported language
needs_attention    Real question, real answer; sent to the LLM-as-judge for scoring
evaluated          Already scored by the judge in a previous run

Triage is incremental: new interactions are classified on each run. Previously scored interactions are not re-evaluated.
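
The classification rules can be sketched as a small function. This is a minimal illustration, not the service's actual code: the Interaction fields, the SUPPORTED_LANGUAGES set and the order of the checks are all assumptions, since the real interaction schema is not described here.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the real list of supported languages is not given in this document.
SUPPORTED_LANGUAGES = {"en"}


@dataclass
class Interaction:
    # Hypothetical record shape, for illustration only.
    question: str
    answer: str
    language: str = "en"
    is_error: bool = False
    is_system_or_test: bool = False
    judge_score: Optional[float] = None  # set once the judge has scored it


def triage(item: Interaction) -> str:
    """Return one of the three documented classifications."""
    if item.judge_score is not None:
        return "evaluated"  # scored in a previous run, so never re-evaluated
    unusable = (
        item.is_error
        or item.is_system_or_test
        or not item.answer.strip()
        or item.language not in SUPPORTED_LANGUAGES
    )
    return "below_threshold" if unusable else "needs_attention"
```

Checking for an existing judge score first is what makes the step incremental in this sketch: anything already scored is skipped before the other rules are applied.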

Phase 2: LLM-as-Judge (CC agent)

A Claude Code agent (the "CC agent") picks up pending quality runs and begins scoring.

For each interaction flagged as needs_attention, the judge evaluates:

Dimension       What it checks
Accuracy        Does the answer correctly represent what the rulebook says?
Completeness    Does it cover all parts of the question without omitting key details?
Format          Is the structure appropriate for the question type (YES_NO vs PROCEDURAL vs EDGE_CASE)?
Relevance       Does the answer stay on topic and avoid tangential filler?

Scores are stored as weighted averages on a 0.0–1.0 scale.
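
As an illustration of a weighted average over the four dimensions, the sketch below combines per-dimension scores into a single value. The weights are invented for the example; the actual weighting used by the judge is not documented here, and weighted_score is a hypothetical name.

```python
from typing import Mapping

# Assumed weights, for illustration only; the real weighting is not documented.
WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "format": 0.15, "relevance": 0.15}


def weighted_score(dimension_scores: Mapping[str, float]) -> float:
    """Combine per-dimension scores (each 0.0-1.0) into one 0.0-1.0 value."""
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    # Dividing by the total weight keeps the result in range even if the
    # weights are later changed and no longer sum to exactly 1.
    return weighted_sum / total_weight
```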

The judge also identifies:

  • Which synthesis template file was used (from question category + tier)
  • What specific issue, if any, occurred
  • Whether a pattern exists across multiple interactions for the same template

Phase 3: Proposal generation

If a pattern is identified — e.g. 15 of the last 20 EDGE_CASE interactions scored below 0.70 on the conditionality dimension — the CC agent generates a YAML diff proposal targeting the specific template.
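
One way to detect such a pattern is to group scores by template and count how many of the most recent ones fall below a cut-off. The sketch below is an assumption about how this could work, not the CC agent's actual logic. The defaults (20, 15, 0.70) are taken from the example above and may not be the real trigger values; the score passed in stands for whichever dimension is being examined.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def find_patterns(
    scored: Iterable[Tuple[str, float]],
    window: int = 20,
    min_failures: int = 15,
    threshold: float = 0.70,
) -> List[str]:
    """Return templates with a recurring low score.

    `scored` holds (template_file, score) pairs, oldest first.
    """
    by_template = defaultdict(list)
    for template, score in scored:
        by_template[template].append(score)

    flagged = []
    for template, scores in by_template.items():
        recent = scores[-window:]  # only the most recent interactions count
        failures = sum(1 for s in recent if s < threshold)
        if failures >= min_failures:
            flagged.append(template)
    return flagged
```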

Each proposal records:

Field                Content
Template file        e.g. synthesis-tier1-edge_case-normal.yml
Question category    e.g. EDGE_CASE
Issue summary        Plain-language description of the problem
Original content     Original YAML content
Proposed content     Proposed YAML content
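
A proposal record with these fields could be modelled as below. The class and attribute names are hypothetical, and the diff is produced with Python's standard difflib module as one plausible way to render the change; the format the service actually stores is not specified here.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Proposal:
    # Hypothetical field names mirroring the documented proposal fields.
    template_file: str
    question_category: str
    issue_summary: str
    original_content: str
    proposed_content: str

    def diff(self) -> str:
        """Unified diff between the original and proposed template text."""
        lines = difflib.unified_diff(
            self.original_content.splitlines(keepends=True),
            self.proposed_content.splitlines(keepends=True),
            fromfile=f"a/{self.template_file}",
            tofile=f"b/{self.template_file}",
        )
        return "".join(lines)
```

Keeping both the original and the proposed content on the record, rather than only the diff, means a rollback needs nothing beyond the record itself.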

Phase 4: Test battery

Before any proposal is applied, it is validated against a representative set of questions. An internal test endpoint processes each question using the proposed template in isolation, without affecting live traffic.

The test battery contains 40 questions sampled across question categories and games, each with known expected answer characteristics. The battery score under the current template is compared with the score under the proposed template.
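
The before/after comparison can be sketched as follows. Two definitions here are assumptions, because this page does not spell them out: improvement is taken as the change in mean battery score expressed in percentage points, and a regression is any question whose score drops under the proposed template.

```python
from statistics import mean
from typing import Mapping, Tuple


def compare_battery(
    before: Mapping[str, float], after: Mapping[str, float]
) -> Tuple[float, int]:
    """Return (improvement in percentage points, number of regressions).

    Both mappings go from a question id to a 0.0-1.0 score.
    """
    if before.keys() != after.keys():
        raise ValueError("both runs must cover the same battery questions")
    if not before:
        raise ValueError("the battery is empty")

    improvement = (mean(list(after.values())) - mean(list(before.values()))) * 100
    # Assumption: a regression is any question that scores lower than before.
    regressions = sum(1 for q in before if after[q] < before[q])
    return improvement, regressions
```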

Decision thresholds

Delta                                           Decision
>= 8% improvement and <= 2 regressions          Auto-apply: applied immediately
>= 3% and < 8% improvement, <= 2 regressions    Pending approval: admin decides at /admin/quality
< 3% improvement, or > 2 regressions            Blocked: proposal discarded

All three thresholds can be changed through the service's configuration parameters.
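
The decision rules reduce to a short function. This is a sketch under stated assumptions: the function name, parameter names and returned labels are hypothetical, and an improvement of exactly 8% is treated as auto-apply because that rule is written as ">= 8%".

```python
def decide(
    improvement_pct: float,
    regressions: int,
    auto_min: float = 8.0,
    approval_min: float = 3.0,
    max_regressions: int = 2,
) -> str:
    """Map a test battery result to one of the three documented decisions."""
    # Too many regressions blocks the proposal regardless of the improvement.
    if regressions > max_regressions or improvement_pct < approval_min:
        return "blocked"
    if improvement_pct >= auto_min:
        return "auto_apply"
    return "pending_approval"
```

Checking the blocking conditions first keeps a large improvement with too many regressions from being applied automatically.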

Safe deployment

When a proposal is applied (auto or admin-approved):

  1. The YAML template file is overwritten with the proposed content
  2. The prompt cache is flushed
  3. The orchestrator service restarts
  4. A git commit records the change

If something goes wrong after deployment, the admin can roll back via the /admin/quality interface. Rollback overwrites the template with the original content and repeats the flush and restart sequence.
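
The apply and rollback sequence could be sketched as below. The cache flush and the orchestrator restart are passed in as callables because their real mechanisms are not described here; all function names are hypothetical. The git helper uses the standard git add and git commit commands and assumes the template lives inside a git repository.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional


def git_commit(path: Path, message: str) -> None:
    """Stage and commit one file; assumes `path` is inside a git repository."""
    subprocess.run(["git", "add", path.name], cwd=path.parent, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=path.parent, check=True)


def deploy_template(
    path: Path,
    new_content: str,
    flush_cache: Callable[[], None],
    restart_orchestrator: Callable[[], None],
    commit: Optional[Callable[[Path, str], None]] = None,
    message: str = "quality-optimizer: update template",
) -> str:
    """Run the documented sequence and return the replaced content."""
    previous = path.read_text(encoding="utf-8")
    path.write_text(new_content, encoding="utf-8")  # 1. overwrite the template
    flush_cache()  # 2. flush the prompt cache
    restart_orchestrator()  # 3. restart the orchestrator
    if commit is not None:
        commit(path, message)  # 4. record the change in git
    return previous
```

In this sketch a rollback is the same call with the previously returned content and no commit callable, which matches the description of rollback as an overwrite followed by the flush and restart sequence.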

Admin interface

The admin dashboard at /admin/quality shows all runs with:

  • Run status (pending, running, pending approval, auto-applied, approved, blocked, rolled back)
  • Number of interactions triaged and flagged
  • Test battery question count
  • Score before and after (delta as a percentage)
  • Decision and decision reason
  • Per-run detail page showing the proposals and their diffs

Manual runs can be triggered from the dashboard. The default schedule is nightly at 02:00 UTC.

Service details

  • Default sample size: 120 interactions per run
  • Battery size: 40 questions per proposal
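
For reference, the defaults stated on this page can be gathered into a single configuration object. The class and field names below are hypothetical and do not correspond to documented parameter names; only the values come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizerConfig:
    # Field names are illustrative; the values are the documented defaults.
    sample_size: int = 120  # interactions per run
    battery_size: int = 40  # questions per proposal
    auto_apply_min_improvement_pct: float = 8.0
    approval_min_improvement_pct: float = 3.0
    max_regressions: int = 2
    schedule_utc: str = "02:00"  # nightly run time


# Example override: a smaller sample for a manually triggered run.
manual_run_config = OptimizerConfig(sample_size=30)
```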