How I Read Rulebooks

Last updated: March 24, 2026

From PDF to searchable knowledge

When a publisher adds a rulebook, I do not immediately chew through the whole thing. Text extraction happens right away, but the real work -- chunking and embedding -- happens on demand: the first time someone asks a question about that game, I process the PDF and store the result. Every query after that is instant.
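A minimal sketch of this lazy, process-on-first-query flow. Everything here is a hypothetical stand-in (the real cache is PostgreSQL, not an in-memory dict, and the extract/chunk functions are placeholders):

```python
import hashlib

# Toy in-memory cache; the real system persists chunks in PostgreSQL.
_processed: dict[str, list[str]] = {}

def extract_text(pdf_path: str) -> str:
    # Stand-in for the Tika extraction step (hypothetical).
    return f"text of {pdf_path}"

def chunk_and_embed(text: str) -> list[str]:
    # Stand-in for chunking + embedding (hypothetical).
    return [text]

def query_rulebook(pdf_path: str, question: str) -> list[str]:
    """Process the PDF on the first query; reuse the stored result afterward."""
    key = hashlib.sha256(pdf_path.encode()).hexdigest()
    if key not in _processed:                 # first question about this game
        _processed[key] = chunk_and_embed(extract_text(pdf_path))
    return _processed[key]                    # instant on every later query
```

The cache key is derived from the PDF identity, so repeated questions about the same game never trigger reprocessing.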

Here is the full processing pipeline:

[Pipeline diagram: PDF upload → Tika text extraction → chunking (~500 tokens, 10% overlap) → jina-v2-small-en embedding → PostgreSQL + pgvector]

The key design decision: I do not batch-process 3,300+ rulebooks speculatively. That would waste compute on PDFs nobody ever asks about.

What Apache Tika does

Tika is the extraction engine. It handles every PDF quirk -- multi-column layouts, scanned pages, embedded fonts -- and produces clean plain text. It also extracts the pdf:charsPerPage metadata I use to map character offsets back to page numbers.
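A sketch of how per-page character counts can map a character offset back to a page number, in pure Python. This assumes the pdf:charsPerPage metadata yields one character count per page; the exact lookup code is illustrative, not the production implementation:

```python
from bisect import bisect_right
from itertools import accumulate

def page_boundaries(chars_per_page: list[int]) -> list[int]:
    """Cumulative character offsets at which each page ends."""
    return list(accumulate(chars_per_page))

def page_of(offset: int, boundaries: list[int]) -> int:
    """1-based page number containing a 0-based character offset."""
    return bisect_right(boundaries, offset) + 1
```

For example, with pages of 1,000, 1,200, and 800 characters, offset 999 falls on page 1 and offset 1,000 is the first character of page 2.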

Chunking strategy

I split the extracted text into overlapping chunks of roughly 500 tokens. The 10% overlap ensures that a sentence split across a chunk boundary is still fully represented in at least one chunk. Each chunk gets a page_start and page_end estimate derived from the cumulative character-per-page data.
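A rough sketch of the overlap logic, using whitespace-separated words as a stand-in for real tokens (the actual tokenizer is not specified here):

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap`
    words with the previous chunk (10% of 500 = 50)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):   # last chunk reached the end
            break
    return chunks
```

Because consecutive chunks share 50 words, a sentence cut at one chunk's boundary appears whole near the start of the next.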

Why 768 dimensions?

I use jina-v2-small-en from the Jina AI embedding model family. 768-dimensional vectors hit a sweet spot: enough semantic resolution to distinguish "when can I interrupt a player action" from "how do player actions work", yet small enough for fast HNSW indexing.
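For intuition, vector search of this kind typically ranks chunks by cosine similarity between the question's embedding and each chunk's embedding. A toy sketch with tiny vectors (the production vectors are 768-dimensional, and pgvector computes this in-database):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = orthogonal (unrelated), -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

An HNSW index approximates the top results of this ranking without scanning every vector, which is what keeps queries fast at this scale.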

Numbers at a glance

Metric                                    Value
------                                    -----
Rulebooks in library                      3,300+
PDFs pending (waiting for first query)    ~3,000 (normal)
Vector dimensions                         768
Chunk size                                ~500 tokens
Embedding model                           jina-v2-small-en
Storage backend                           PostgreSQL + pgvector