How I Read Rulebooks

Last updated: March 24, 2026

From PDF to searchable knowledge

When a publisher adds a rulebook, I do not immediately chew through the whole thing. Text extraction happens right away, but the real work -- chunking and embedding -- happens on demand: the first time someone asks a question about that game, I process the PDF and store the result. Every query after that is instant.
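A minimal sketch of this lazy, process-on-first-query flow. Everything here is a hypothetical stand-in (the real cache is PostgreSQL, not an in-memory dict, and the extract/chunk functions are placeholders):

```python
import hashlib

# Toy in-memory cache; the real system persists chunks in PostgreSQL.
_processed: dict[str, list[str]] = {}

def extract_text(pdf_path: str) -> str:
    # Stand-in for the Tika extraction step (hypothetical).
    return f"text of {pdf_path}"

def chunk_and_embed(text: str) -> list[str]:
    # Stand-in for chunking + embedding (hypothetical).
    return [text]

def query_rulebook(pdf_path: str, question: str) -> list[str]:
    """Process the PDF on the first query; reuse the stored result afterward."""
    key = hashlib.sha256(pdf_path.encode()).hexdigest()
    if key not in _processed:                 # first question about this game
        _processed[key] = chunk_and_embed(extract_text(pdf_path))
    return _processed[key]                    # instant on every later query
```

The cache key is derived from the PDF identity, so repeated questions about the same game never trigger reprocessing.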

Here is the full processing pipeline:

[Pipeline diagram: PDF upload → Tika text extraction → chunking (~500 tokens, 10% overlap) → jina-v2-small-en embedding → PostgreSQL + pgvector]

The key design decision: I do not batch-process 3,300+ rulebooks speculatively. That would waste compute on PDFs nobody ever asks about.

What Apache Tika does

Tika is the extraction engine. It handles every PDF quirk -- multi-column layouts, scanned pages, embedded fonts -- and produces clean plain text. It also extracts the pdf:charsPerPage metadata I use to map character offsets back to page numbers.
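A sketch of how per-page character counts can map a character offset back to a page number, in pure Python. This assumes the pdf:charsPerPage metadata yields one character count per page; the exact lookup code is illustrative, not the production implementation:

```python
from bisect import bisect_right
from itertools import accumulate

def page_boundaries(chars_per_page: list[int]) -> list[int]:
    """Cumulative character offsets at which each page ends."""
    return list(accumulate(chars_per_page))

def page_of(offset: int, boundaries: list[int]) -> int:
    """1-based page number containing a 0-based character offset."""
    return bisect_right(boundaries, offset) + 1
```

For example, with pages of 1,000, 1,200, and 800 characters, offset 999 falls on page 1 and offset 1,000 is the first character of page 2.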

Chunking strategy

I split the extracted text into overlapping chunks of roughly 500 tokens. The 10% overlap ensures that a sentence split across a chunk boundary is still fully represented in at least one chunk. Each chunk gets a page_start and page_end estimate derived from the cumulative character-per-page data.
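A rough sketch of the overlap logic, using whitespace-separated words as a stand-in for real tokens (the actual tokenizer is not specified here):

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap`
    words with the previous chunk (10% of 500 = 50)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):   # last chunk reached the end
            break
    return chunks
```

Because consecutive chunks share 50 words, a sentence cut at one chunk's boundary appears whole near the start of the next.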

Why 768 dimensions?

I use jina-v2-small-en from the Jina AI embedding model family. 768-dimensional vectors hit a sweet spot: enough semantic resolution to distinguish "when can I interrupt a player action" from "how do player actions work", yet small enough for fast HNSW indexing.
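For intuition, vector search of this kind typically ranks chunks by cosine similarity between the question's embedding and each chunk's embedding. A toy sketch with tiny vectors (the production vectors are 768-dimensional, and pgvector computes this in-database):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = orthogonal (unrelated), -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

An HNSW index approximates the top results of this ranking without scanning every vector, which is what keeps queries fast at this scale.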

Numbers at a glance

Metric                                    Value
------                                    -----
Rulebooks in library                      3,300+
PDFs pending (waiting for first query)    ~3,000 (normal)
Vector dimensions                         768
Chunk size                                ~500 tokens
Embedding model                           jina-v2-small-en
Storage backend                           PostgreSQL + pgvector