Autonomous Quality Optimisation

Last updated: March 31, 2026

The self-improving pipeline

Board Game Librarian includes a background service — the quality-optimizer — that evaluates the Q&A pipeline daily, proposes improvements to the prompt templates that drive synthesis, and applies validated improvements automatically or after admin approval.

The goal is to catch systematic prompt issues before they accumulate and to fix them without requiring manual prompt engineering every time a pattern is reported.


Phase 1: Triage

The triage step runs a SQL query against unified_interactions_v2 to classify every interaction:

| Classification | Meaning |
| --- | --- |
| below_threshold | Unusable — error, empty response, system/test message, unsupported language |
| needs_attention | Real question, real answer — send to LLM-as-judge for scoring |
| evaluated | Already scored by the judge in a previous run |

Triage is incremental: new interactions are classified on each run. Previously scored interactions are not re-evaluated.
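The triage rules above can be sketched as a small classifier. The classification labels and the eval_score column come from this document; the specific heuristics (how an error, system message, or unsupported language is detected on a row) are illustrative assumptions, since the actual SQL encodes them differently.

```python
# Sketch of the triage classification applied to each unified_interactions_v2
# row. Labels match the table above; the per-field checks are assumptions.

def triage(interaction: dict) -> str:
    """Classify one interaction row for a quality run."""
    if interaction.get("eval_score") is not None:
        return "evaluated"          # already scored by the judge in a previous run
    if (
        interaction.get("error")
        or not interaction.get("response", "").strip()
        or interaction.get("is_system")
        or interaction.get("is_test")
        or not interaction.get("language_supported", True)
    ):
        return "below_threshold"    # unusable: skip entirely
    return "needs_attention"        # real Q&A: send to the LLM-as-judge


rows = [
    {"response": "Yes, per page 12 ...", "language_supported": True},
    {"response": "", "language_supported": True},
    {"response": "Roll again.", "eval_score": 0.82},
]
print([triage(r) for r in rows])
# → ['needs_attention', 'below_threshold', 'evaluated']
```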

Phase 2: LLM-as-Judge (CC agent)

A Claude Code agent (the "CC agent") picks up any quality_runs row with status = pending_cc and begins scoring.

For each flagged interaction, the judge evaluates:

| Dimension | What it checks |
| --- | --- |
| Accuracy | Does the answer correctly represent what the rulebook says? |
| Completeness | Does it cover all parts of the question without omitting key details? |
| Format | Is the structure appropriate for the question type (YES_NO vs PROCEDURAL vs EDGE_CASE)? |
| Relevance | Does the answer stay on topic and avoid tangential filler? |

Scores are stored in quality_interaction_evals.eval_dimensions (JSONB). The overall eval_score is a weighted average (0.0–1.0).
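A minimal sketch of how the overall eval_score could be derived from the per-dimension scores in eval_dimensions. The document states only that the average is weighted; the weight values here are illustrative assumptions.

```python
# Weighted average over the judge's dimension scores, producing the
# 0.0-1.0 overall eval_score. WEIGHTS values are assumed for illustration.

WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "format": 0.15, "relevance": 0.15}

def eval_score(dimensions: dict[str, float]) -> float:
    """Weighted average over whichever dimensions were scored."""
    total = sum(WEIGHTS[d] * s for d, s in dimensions.items())
    weight = sum(WEIGHTS[d] for d in dimensions)
    return round(total / weight, 3)

print(eval_score({"accuracy": 0.9, "completeness": 0.8, "format": 1.0, "relevance": 1.0}))
# → 0.9
```

Normalizing by the sum of the weights actually present keeps the score well-defined even if the judge skips a dimension for a given interaction.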

The judge also identifies:

  • Which synthesis template file was used (from question_category + tier)
  • What specific issue, if any, occurred
  • Whether a pattern exists across multiple interactions for the same template

Phase 3: Proposal generation

If a pattern is identified — e.g. 15 of the last 20 EDGE_CASE interactions scored below 0.70 on the conditionality dimension — the CC agent generates a YAML diff proposal targeting the specific template.

The proposal is stored in quality_prompt_proposals:

| Field | Content |
| --- | --- |
| template_file | e.g. synthesis-tier1-edge_case-normal.yml |
| question_category | e.g. EDGE_CASE |
| issue_summary | Plain-language description of the problem |
| diff_before | Original YAML content |
| diff_after | Proposed YAML content |

Phase 4: Test battery

Before any proposal is applied, it is validated against a representative set of questions, with per-question results stored in quality_test_results. The orchestrator's /api/internal/quality/test-synthesis endpoint processes each question using the proposed template in isolation, without affecting live traffic.

The test battery contains 40 questions sampled across question categories and games, each with known expected answer characteristics. The battery's aggregate scores before and after the proposal are compared.
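The comparison reduces to two numbers: the aggregate score delta and the count of individual questions that regressed, which is what the decision thresholds below consume. A sketch, with field names assumed for illustration:

```python
# Summarize a battery run into the delta and regression count used by the
# decision step. The per-question score_before/score_after field names are
# illustrative assumptions.

def summarize_battery(results: list[dict]) -> dict:
    before = sum(r["score_before"] for r in results) / len(results)
    after = sum(r["score_after"] for r in results) / len(results)
    return {
        "score_before": round(before, 3),
        "score_after": round(after, 3),
        "delta_pct": round((after - before) / before * 100, 1),
        "degraded": sum(1 for r in results if r["score_after"] < r["score_before"]),
    }


sample = [
    {"score_before": 0.6, "score_after": 0.7},
    {"score_before": 0.8, "score_after": 0.78},  # this question regressed
    {"score_before": 0.7, "score_after": 0.8},
]
print(summarize_battery(sample))
```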

Decision thresholds

| Delta | Decision |
| --- | --- |
| >= 8% improvement, <= 2 regressions | auto_apply — applied immediately |
| 3–8% improvement | pending_approval — admin decides at /admin/quality |
| < 3% improvement or > 2 regressions | blocked — proposal discarded |

Thresholds are configurable via environment variables: QUALITY_AUTO_APPLY_DELTA, QUALITY_PENDING_DELTA, QUALITY_AUTO_APPLY_MAX_DEGRADED.
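The decision rule with its configurable thresholds can be sketched as follows. Defaults mirror the table above; the assumption here is that the regression cap applies to every outcome, not just auto_apply (the table leaves the 3–8%-with-regressions case implicit).

```python
import os

# Decision thresholds, overridable via the documented environment variables.
QUALITY_AUTO_APPLY_DELTA = float(os.getenv("QUALITY_AUTO_APPLY_DELTA", "8"))
QUALITY_PENDING_DELTA = float(os.getenv("QUALITY_PENDING_DELTA", "3"))
QUALITY_AUTO_APPLY_MAX_DEGRADED = int(os.getenv("QUALITY_AUTO_APPLY_MAX_DEGRADED", "2"))

def decide(delta_pct: float, degraded: int) -> str:
    """Map a battery delta and regression count to a proposal decision."""
    if degraded > QUALITY_AUTO_APPLY_MAX_DEGRADED or delta_pct < QUALITY_PENDING_DELTA:
        return "blocked"          # too little improvement, or too many regressions
    if delta_pct >= QUALITY_AUTO_APPLY_DELTA:
        return "auto_apply"       # applied immediately
    return "pending_approval"     # admin decides at /admin/quality
```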

Safe deployment

When a proposal is applied (auto or admin-approved):

  1. The YAML template file is overwritten with diff_after
  2. Redis prompt cache (DB 2) is flushed: redis-cli -n 2 FLUSHDB
  3. The orchestrator restarts: pm2 restart rules-orchestrator --update-env
  4. A git commit records the change

If something goes wrong after deployment, the admin can roll back via the /admin/quality interface. Rollback overwrites the template with diff_before and repeats the flush+restart sequence.
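Apply and rollback share the same overwrite-flush-restart-commit sequence, differing only in which diff is written. A side-effect-aware sketch: the redis-cli and pm2 commands are taken from the steps above, while the git commit message and the injectable `run` hook are illustrative assumptions.

```python
import subprocess

def apply_template(template_path: str, new_yaml: str, run=None) -> None:
    """Deploy a template change: pass diff_after to apply, diff_before to
    roll back. `run` can be injected for testing; by default it shells out."""
    run = run or (lambda cmd: subprocess.run(cmd, shell=True, check=True))
    with open(template_path, "w") as f:                       # 1. overwrite the YAML
        f.write(new_yaml)
    run("redis-cli -n 2 FLUSHDB")                             # 2. flush prompt cache (DB 2)
    run("pm2 restart rules-orchestrator --update-env")        # 3. restart the orchestrator
    run(f"git commit -am 'quality: update {template_path}'")  # 4. record the change
```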

Admin interface

The admin dashboard at /admin/quality shows all runs with:

  • Run status (pending_cc, running, pending_approval, auto_applied, approved, blocked, rolled_back)
  • Number of interactions triaged and flagged
  • Test battery question count
  • Score before and after (delta as a percentage)
  • Decision and decision reason
  • Per-run detail page showing the proposals and their diffs

Manual runs can be triggered from the dashboard. The default cron schedule is daily at 02:00 UTC.

Service details

  • Service name: quality-optimizer
  • Port: 3482
  • Cron schedule: 0 2 * * * (configurable via QUALITY_CRON_SCHEDULE)
  • Default sample size: 120 interactions per run
  • Battery size: 40 questions per proposal