Planner

Status: TEMPLATE. Sections below are scaffolds for collaborative authoring. This doc covers the execution model — how query-engine turns a predicate AST into a fast scan.

Implemented today: header predicates + a first slice of position predicates (--queens-off, --bishop-pair, --count) ANDed under --min-streak, on the header-only and BB_ONLY paths, with game-bitmap output. Everything else here is target design. See README → Status.

The planner takes a PREDICATE-LANGUAGE AST and decides:

What order to evaluate predicates in
Which engine state (none / mailbox / hash) to maintain
When to materialize intermediate bitmaps
How to parallelize over shards

The goal: minimize wall-clock time for a correct match set, given the known cost structure of the corpus.

Why a planner

Cover: the two-tier cost asymmetry (headers ~7 ms, position scans ~360 ms). Without a planner, position-tier predicates run on the full corpus; with one, they run on the small surviving set after header filtering. The win is ~50× in the typical case.

Cost model

Cover: the per-predicate cost components and how they’re estimated.

Per-game cost decomposes into:

Header read cost — constant. ~7 ns per game (cache-friendly, 36-byte stride). Negligible variation between header predicates.
Move-replay cost — proportional to game length × per-ply work. Per-ply work depends on what engine state the predicate needs:
- None: just decode SMove16 (~3 ns/ply)
- Bitboards only: make_move<BB_ONLY> (~6 ns/ply)
- Bitboards + hash: make_move<WITH_HASH> (~10 ns/ply)
Per-predicate-evaluation cost — varies wildly. Cheap predicates (bishop_pair: ~2 popcounts) are ~5 cycles; expensive predicates (mate_in_one: full movegen) are thousands.

Predicate metadata declares its engine-state requirement and rough evaluation cost; the planner uses these.

Open question: how do we estimate selectivity? Options:

Static hints in PREDICATE-LIBRARY — each predicate declares approximate “matches X% of games” advisory.

Per-corpus profile — first run of a predicate writes its actual selectivity to a small cache; subsequent runs use that.

No estimate — just use predicate-author-supplied hints.

Predicate partitioning

Cover: how the planner splits the AST into tiers.

For an AST of the form and(p1, p2, p3, ...):

Partition: header_preds = [p for p in preds if isinstance(p, HeaderPred)], game_preds = [p for p in preds if isinstance(p, GamePred)].
Header tier: AND all header_preds into one combined HeaderPred.
Game tier: order game_preds (see below) and chain.

For non-AND root nodes (set algebra at top), the planner descends into each sub-scan independently. Set-algebra operations on resulting bitmaps happen post-scan.

Header-tier execution

Cover: the single shared header pass.

All header predicates evaluate in one pass over the corpus:

for each game:
    read 36-byte header
    if all_header_preds(header):
        set bit in survivors bitmap

This is bandwidth-bound (header bytes streaming through cache) at ~7 ms for 10M games. Order among header predicates doesn’t matter — they all evaluate per game regardless.

Output: a survivors bitmap (in-memory, one bit per game) holding the games that passed all header predicates.

Game-tier execution

Cover: the iterative chain over surviving games.

After the header tier, each game-level predicate runs in turn, each filtering the survivor set further:

for each game-pred in ordered(game_preds):
    new_survivors = empty bitmap
    for each game in survivors:
        if game-pred(game):
            set game in new_survivors
    survivors = new_survivors

Per-game-pred evaluation involves:

Mmap the game’s .moves record (via the .gameidx byte offset)
Replay plies with the cheapest engine state covering this predicate’s needs
Apply the predicate’s quantifier (Ever, Streak, AtPly, etc.) to the ply stream
Set the survivor bit iff the predicate’s quantifier accepts

Ordering game-tier predicates

Cover: how the planner orders them.

Each game-pred has:

Cost factor: how expensive per ply to evaluate
Selectivity hint: fraction of games matching Ever(P)

Greedy ordering: sort by cost / (1 - selectivity) ascending — i.e., “how many CPU cycles do I spend to eliminate one game?”

Open question: when selectivity hints are absent (new predicate, no data), fall back to alphabetical / definition order? Or run an initial cheap-sample pass to get rough numbers? Probably alphabetical for V1; revisit if it shows up as a real loss.

Engine-state maintenance

Cover: how the planner picks the cheapest engine state.

Each game-pred declares one of:

NONE — works on header alone (shouldn’t be a GamePred, but some quantifiers like AtPly with a header-only predicate could be)
BITBOARDS_ONLY — needs Board bitboards updated; uses make_move<BB_ONLY>
BITBOARDS_AND_HASH — needs zobrist hash too; uses make_move<WITH_HASH>
FULL_MOVES — needs the full Move struct + flags; uses make_move<ALL>

The planner picks the union of all surviving predicates’ requirements. As-built (2026-05-28): MaintenanceNeeds exists but only sets need_replay (true iff a position predicate is present); ScanPath is a two-path dispatch (header-only / BB_ONLY). WITH_HASH and the richer needs (need_hash, need_halfmove) are designed but unwired — and since no position predicate is currently wired, the BB_ONLY path is unreachable in practice. See README → Status.

Set algebra (post-scan)

Cover: cross-query composition. AND, OR, XOR, SUB, NOT on bitmaps.

When the AST root is a set-algebra node (e.g., or(scan(q1), scan(q2))), the planner:

Recursively evaluates each child scan, producing a bitmap.
Applies the set-algebra operation on the bitmaps.

These are the operations that used to live in bitmap-combine.c. As-built (2026-05-28): not implemented in query-engine. Set algebra lives only in the standalone bitmap-combine binary; the engine emits no bitmap to feed it yet. Target: the kernels move to libedge and are called both in-engine (set-op AST nodes, combined in memory) and by the bitmap-combine CLI.

Open question: do we evaluate child scans in parallel? They’re independent. For two-scan OR, parallelism wins. For a deep tree of set algebra, scheduling matters.

Materialization strategy

Cover: when to keep bitmaps in memory vs spill to disk.

For V1, the simple rule:

All bitmaps produced by a query stay in memory for the duration of the query invocation.
The user can request persistence via --output @name, which writes the final bitmap to <corpus>/bitmaps/<name>.bm.
Intermediate bitmaps in chained predicates are never persisted — they’re transient working state.

Open question: how do we handle “accelerator bitmaps” — pre-computed common subsets (e.g., the engine ships a precomputed @masters alongside the corpus)? They’d be persisted somewhere but not user-managed. (Separate design conversation per the architectural discussion; the planner just consumes whatever’s available.)

Parallelism

Cover: how the planner spawns threads.

Each scan over the corpus parallelizes across shards (one worker per shard, mmap’d independently). The header tier is embarrassingly parallel; the game tier is also per-shard parallel.

Within a single shard, scanning is sequential — games are walked in order.

Across set-algebra operations on independent bitmaps, the operations themselves are O(num_shards × bits_per_shard) byte-bashing — bandwidth- bound, vectorizable, single-threaded today (sub-millisecond at 10M scale, no need to parallelize).

Worked execution examples

Cover: 3-5 end-to-end traces of how a query becomes work.

Example 1: “Sicilian Najdorf, blitz, both 2500+, white wins, white had bishop pair sustained ≥10 plies starting from move 15.”

Plan:

Header tier (one pass): eco_in(B90-B99) ∧ tc_category(blitz) ∧ both_elo_ge(2500) ∧ result(W) → survivors bitmap, ~N_h games

Game tier (one pass over survivors): Streak(white_bishop_pair, 10) restricted to FromPly(30) Engine state: BB_ONLY (popcounts on bitboards) → final bitmap

Example 2: Set algebra. “Sicilian wins OR French wins, all by master.”

Plan:

Scan A: eco_in(B20-B99) ∧ result(W) ∧ both_elo_ge(2200) → bitmap A

Scan B: eco_in(C00-C19) ∧ result(W) ∧ both_elo_ge(2200) → bitmap B

Set-op: A ∪ B → final bitmap Open question: can the planner detect that scans A and B differ only in one predicate and fuse them into a single scan with a disjunctive header check? Yes for V2; not worth the complexity for V1.

Open design decisions

Open list:

Selectivity-hint source: static / dynamic / hybrid?

Empirical predicate-cost calibration (planner self-tunes)?

Parallelism of independent set-algebra sub-scans?

Predicate fusion (multiple scans with shared work merged into one)?

Predicate-output beyond bitmap (counts, histograms, position-list)?

Cancellation / streaming results (if a UI wants progress feedback)?