Use Cases

Worked examples and recipes. Each section starts with the question you’re trying to answer, then walks through the queries that answer it.

For per-flag semantics see QUERY-LANGUAGE.md. For the architectural concepts behind these recipes see ARCHITECTURE.md.

API maturity note. Recipes below mix shipped and target API. Much of the core is now live: the composite position predicates (--queens-off, --bishop-pair, --doubled-pawn, --isolated-pawn, --passed-pawn, --rook-on-seventh, --rook-on-open-file, --king-castled, and --count EXPR), side/window quantifiers, 1-step-past edges (--became/--ceased), in-engine boolean composition (AND via --cond/--and, OR via --or, NOT via --not), bitmap I/O (--output/--input-bitmap), and the standalone bitmap-combine. Still aspirational and labeled inline below where used: --subfen (arbitrary sub-position matching) and the aggregation reducers (square-heatmap, match-record streaming). The v1 source (archived at experiments/c-explorer/old/score-scan-v1.c) remains the reference for predicate semantics still being ported. See ROADMAP.md for the current implementation status.

Throughout, $CORPUS refers to your indexed corpus, e.g. lichess_2017-02.20260524-153045.scoredb.

Terminal window
# One-time: index a PGN. Output goes into a timestamped .scoredb directory.
indexer lichess_2017-02.pgn
# Look at what got produced (just inspect the directory directly).
ls -lh lichess_2017-02.20260524-153045.scoredb/
cat lichess_2017-02.20260524-153045.scoredb/meta

The indexer prints summary stats, shard counts, and the path of the created .scoredb. Save the path for subsequent queries.

For convenience, assign it to a variable for the examples below:

Terminal window
CORPUS=lichess_2017-02.20260524-153045.scoredb

“What does the world play after 1. e4 c5 in this corpus?”

The classic opening-explorer question. For an unfiltered look at the full corpus, query the persisted move tree directly:

Terminal window
# Position after 1. e4 c5: ask for the moves played from here and counts.
explorer $CORPUS --fen "rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 2"
# g1f3 1432104 games (W:51.2% D:4.3% B:44.5%)
# d2d4 345287 games (W:47.8% D:5.1% B:47.1%)
# ...

Result lookup is ~0.15 ms at 10 M-game scale; the entire tree (tree.dat) is mmapped at startup. Output is UCI today (SAN is on the roadmap).

For a filtered look (e.g., “what does the world play after 1.e4 c5 in master games only?”), the current path is to scan-filter first and then either re-index a filtered subcorpus or use a (future) bitmap-aware tree-builder pass:

Terminal window
# Find all games tagged Sicilian by ECO.
query-engine $CORPUS --eco B20-B99
# → 1,137,517 games matched

Save it as a reusable bitmap:

Terminal window
query-engine $CORPUS --eco B20-B99 --output @sicilian

Now ask narrower questions:

Terminal window
# Sicilians in long time controls
query-engine $CORPUS --input-bitmap @sicilian --tc-category classical
# → small subset (Lichess is mostly fast TC)
# Sicilians played by 2200+ on both sides
query-engine $CORPUS --input-bitmap @sicilian --master
# → ~16,000 games

Each filtered query takes ~50 ms because the input bitmap is applied as the first and cheapest filter stage, so only games already in @sicilian are examined further.

“In rook-pawn endgames where the two sides had asymmetric pawn counts, which games did the side with fewer pawns win?”

This is the validation case Edge was designed around. It combines:

  • A structural condition (only K, R, P on the board; equal rook counts ≥ 1; asymmetric pawn counts)
  • A min-streak to filter transient capture-recapture states
  • A result-conditional asymmetry (“the side with fewer pawns won”)

The two branches pair a different result with each pawn-asymmetry (white-fewer + 1-0 vs. black-fewer + 0-1). Because --result is a game-level header filter that applies to the whole query — it can’t vary per OR-branch — this case doesn’t collapse into a single in-engine --or. The natural decomposition is two scans plus a bitmap union, which also leaves each branch persisted as a reusable bitmap.

Terminal window
# Branch A: white has fewer pawns, white wins
query-engine $CORPUS \
--count "QBNqbn=0" \
--count "R=r" --count "R>=1" \
--count "P<p" \
--result 1-0 \
--min-streak 5 \
--output @killer-A
# Branch B: black has fewer pawns, black wins
query-engine $CORPUS \
--count "QBNqbn=0" \
--count "R=r" --count "R>=1" \
--count "P>p" \
--result 0-1 \
--min-streak 5 \
--output @killer-B
# Union them
bitmap-combine or $CORPUS/bitmaps/killer-A.bm \
$CORPUS/bitmaps/killer-B.bm \
-o $CORPUS/bitmaps/killer.bm

On the 10 M Lichess corpus: ~150,000 matches, ~720 ms total.

This is a query no other chess tool can answer at this scale: no mainstream chess database supports result-correlated structural predicates with sustained-pattern semantics.
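To see what the structural predicates test on a single position, the piece counts can be reproduced by hand from a FEN placement field. This is only a sketch: piece_count is a hypothetical helper, not part of Edge, and reading "QBNqbn=0" as "the listed pieces total zero" is inferred from the description above. The sample position is made up for illustration.

```shell
# Hypothetical helper (not part of Edge): total the occurrences of the
# given piece letters in a FEN placement field.
#   $1 = placement field, $2 = set of piece letters to count
piece_count() {
  printf '%s' "$1" | tr -cd "$2" | wc -c | tr -d ' '
}

# Illustrative rook endgame: white K, R, 3 pawns vs. black K, R, 2 pawns.
placement="8/5pk1/6p1/8/2R5/6P1/r4PKP/8"

echo "QBNqbn: $(piece_count "$placement" QBNqbn)"
echo "R=$(piece_count "$placement" R) r=$(piece_count "$placement" r)"
echo "P=$(piece_count "$placement" P) p=$(piece_count "$placement" p)"
# QBNqbn: 0
# R=1 r=1
# P=3 p=2
```

This position has only kings, rooks, and pawns, equal rook counts, and more white pawns than black, so it meets the position conditions of Branch B.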

Restrict to high-rated games:

Terminal window
# Two-step: build the precondition bitmap first
query-engine $CORPUS --master --output @masters
# Then run the killer query with that as a pre-filter
query-engine $CORPUS --input-bitmap @masters \
--count "QBNqbn=0" --count "R=r" --count "R>=1" --count "P<p" \
--result 1-0 --min-streak 5 --output @killer-A-masters

“How often does the Carlsbad pawn structure appear?”

Carlsbad pawn structure (from the Queen’s Gambit Declined Exchange): white pawns on a2, b2, d4, e3, f2, g2, h2; black pawns on a7, b7, c6, d5, f7, g7, h7. White has traded the c-pawn and black the e-pawn, so the structure is typically defined by just the characteristic d4/e3 vs c6/d5 chains plus the half-open c- and e-files.

Use sub-FEN with placement only (planned: --subfen is not yet built; it’s the next position predicate on the roadmap):

Terminal window
query-engine $CORPUS --subfen "8/pp3ppp/2p5/3p4/3P4/4P3/PP3PPP/8" --min-streak 5

The placement string fixes the pawn structure. Non-named squares (kings, other pieces) are unconstrained. --min-streak 5 ensures the structure persists for at least 5 consecutive plies, filtering out games where the pawns only pass through this formation mid-exchange.

For a sustained structure, longer streaks (10–20 plies) tighten the filter further.
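Placement strings are easy to mistype, so it can help to expand one into the squares it pins before running a scan. The sketch below uses portable shell and awk. subfen_squares is a hypothetical helper, not part of Edge, and it assumes standard FEN rank order (rank 8 first) with digits marking unconstrained squares.

```shell
# Hypothetical helper (not part of Edge): list the squares a placement
# string pins, one "piece@square" per line.
# Assumes FEN rank order (rank 8 first); digits skip unconstrained squares.
subfen_squares() {
  printf '%s\n' "$1" | awk -F/ '{
    for (r = 1; r <= NF; r++) {
      file = 0
      n = length($r)
      for (i = 1; i <= n; i++) {
        c = substr($r, i, 1)
        if (c ~ /[1-8]/) { file += c; continue }   # run of unconstrained squares
        file++
        printf "%s@%s%d\n", c, substr("abcdefgh", file, 1), 9 - r
      }
    }
  }'
}

subfen_squares "8/8/8/3p4/3P4/8/8/8"
# p@d5
# P@d4
```

Running the helper on a full pawn-structure string gives a quick check that every pawn is on the intended square.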

Terminal window
# Carlsbad structures in master games
query-engine $CORPUS --master \
--subfen "8/pp3ppp/2p5/3p4/3P4/4P3/PP3PPP/8" --min-streak 5 \
--output @carlsbad-master

“In games with the bishop pair imbalance, who tends to win?”

“Bishop pair imbalance” = one side has two bishops, the other has at most one. Two branches by which side holds it. Because both branches filter on position predicates only (no differing header filter), this collapses into a single in-engine query using --or to OR two AND-groups:

Terminal window
# Either side holds the bishop-pair imbalance, in one pass.
# Group 0 = (B>=2 AND b<2); --or; group 1 = (b>=2 AND B<2).
query-engine $CORPUS \
--count "B>=2" --count "b<2" \
--or \
--count "b>=2" --count "B<2" \
--output @bp-imbalance

If you also want each side’s set kept separately (e.g. to tally results per branch, below), run them as two scans and union the persisted bitmaps instead:

Terminal window
# White has the pair, black doesn't
query-engine $CORPUS --count "B>=2" --count "b<2" --output @bp-white
# Black has the pair, white doesn't
query-engine $CORPUS --count "b>=2" --count "B<2" --output @bp-black
# Union (equivalent to the single --or pass above)
bitmap-combine or $CORPUS/bitmaps/bp-white.bm $CORPUS/bitmaps/bp-black.bm \
-o $CORPUS/bitmaps/bp-imbalance.bm

Then count results in each subset:

Terminal window
query-engine $CORPUS --input-bitmap @bp-white --result 1-0 # → white wins with the pair
query-engine $CORPUS --input-bitmap @bp-white --result 0-1 # → white loses despite the pair
query-engine $CORPUS --input-bitmap @bp-white --result 1/2-1/2 # → draws

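Dividing each count by the size of @bp-white gives white’s win, loss, and draw rates when holding the pair; repeat the three queries with @bp-black for the other side.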
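The three printed counts can be turned into rates with a few lines of awk. This is a minimal sketch: result_rates is a hypothetical helper, not part of Edge, and the numbers passed to it are placeholders rather than measured corpus figures.

```shell
# Hypothetical helper (not part of Edge): convert three result counts
# into percentages. Arguments: white wins, draws, black wins.
result_rates() {
  awk -v w="$1" -v d="$2" -v l="$3" 'BEGIN {
    n = w + d + l
    if (n == 0) { print "no games"; exit }
    printf "W:%.1f%% D:%.1f%% B:%.1f%%\n", 100 * w / n, 100 * d / n, 100 * l / n
  }'
}

# Placeholder counts, not real corpus results.
result_rates 520 60 420
# W:52.0% D:6.0% B:42.0%
```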

“Build a Sicilian Najdorf reference set, filtered to high-rated games with sustained endgame play.”

A pipeline of four bitmaps, three of which are composed via set algebra (the broad @sicilian bitmap from step 1 is kept only as a reusable building block):

Terminal window
# Step 1: Sicilians
query-engine $CORPUS --eco B20-B99 --output @sicilian
# Step 2: Specifically the Najdorf (B90-B99)
query-engine $CORPUS --eco B90-B99 --output @najdorf
# Step 3: High-rated games
query-engine $CORPUS --master --output @master
# Step 4: Games that reach a queenless position for 5+ plies
query-engine $CORPUS \
--count "Q=0" --count "q=0" --min-streak 5 \
--output @queenless
# Compose
bitmap-combine and $CORPUS/bitmaps/najdorf.bm $CORPUS/bitmaps/master.bm \
-o $CORPUS/bitmaps/najdorf-master.bm
bitmap-combine and $CORPUS/bitmaps/najdorf-master.bm $CORPUS/bitmaps/queenless.bm \
-o $CORPUS/bitmaps/najdorf-master-endgame.bm

Each step is a precomputed reusable building block. Mix and match later without re-scanning.

“Find positions where white has a passed pawn on d6 supported by a knight.”

Sub-FEN with the specific squares named (planned — --subfen is not yet built; see the roadmap):

Terminal window
# Pins a white pawn on d6 and a white knight on e4, one of the squares from which a knight defends d6.
query-engine $CORPUS --subfen "8/8/3P4/8/4N3/8/8/8" --min-streak 1

Refine with material:

Terminal window
# Make sure black has no pawn on c7 or e7 (either could capture on d6).
# This requires negative position evidence, which sub-FEN doesn't currently express.
# Coarse workaround: use --count "p<8" to require that black has lost at least one pawn.

(Negative position predicates — “no piece on square X” — are a known gap. Tracked in ROADMAP.md.)

Most aggregations belong outside Edge. The pattern is:

  1. Edge scan emits a bitmap of matching games.
  2. App iterates the bitmap, loads matched games via .moves / .dict reads (or via the source PGN’s pgn_offset field), and computes whatever statistics it wants.

This keeps Edge focused on filtering — the expensive part at scale — and lets app code use chess-typed data structures (move trees, board states with annotations) for the aggregation.

Exceptions: tree builder and planned in-scan aggregations

Two aggregations live in-engine because they’re chess workhorses and they get a meaningful speedup from being computed during the scan:

The tree builder computes the aggregate edge table during indexing and persists it as tree.dat at the scoredb root. For unfiltered queries (“what does the world play at this position?”), explorer looks up the position hash and returns the moves with W/D/B counts in ~0.15 ms at 10 M-game scale.

For filtered queries (“what do GM Sicilians play at this position?”), the tree builder will run over the filtered subset of games, using the bitmap as a pre-filter:

Terminal window
# Planned interface, not yet built:
#
# explorer $CORPUS --input-bitmap @gm-sicilian --position FEN

Both paths use the same reduce machinery; only the input differs.

“Where does white put knights in the Carlsbad pawn structure?”

A heatmap is a 64-element array of “how often was piece X on square Y across all matching positions.” Cheap to maintain during the scan (~30 lines of inner-loop code, ~256 bytes output).

Terminal window
# Planned interface, not yet built:
#
# query-engine $CORPUS \
# --subfen "8/pp3ppp/2p5/3p4/3P4/4P3/PP3PPP/8" \
# --aggregate square-heatmap N \
# --output-aggregate heatmap.bin

The output is a 256-byte file (64 u32 counters); apps render it as a chessboard with square shading.
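Since the heatmap interface is not built yet, the exact file layout is not fixed. The sketch below shows how an app might print such a file as a board using od and awk. render_heatmap is a hypothetical helper, and it rests on two unconfirmed assumptions: the counters are little-endian (read on a little-endian host, because od uses host byte order), and they are indexed a1, b1, ..., h8.

```shell
# Hypothetical renderer (not part of Edge) for a 64 x u32 heatmap file.
# Assumed layout: little-endian counters indexed a1, b1, ..., h8.
render_heatmap() {
  od -An -v -tu4 "$1" | awk '
    { for (i = 1; i <= NF; i++) cell[n++] = $i }   # flatten od rows
    END {
      for (rank = 7; rank >= 0; rank--) {          # print rank 8 first
        line = rank + 1
        for (f = 0; f < 8; f++) line = line sprintf(" %6d", cell[rank * 8 + f])
        print line
      }
      print "       a      b      c      d      e      f      g      h"
    }'
}

# Synthetic 256-byte file: a1 = 1, h8 = 2, everything else 0.
sample=$(mktemp)
printf '\001\000\000\000' > "$sample"
head -c 248 /dev/zero >> "$sample"
printf '\002\000\000\000' >> "$sample"
render_heatmap "$sample"
rm -f "$sample"
```

On the synthetic file, the 2 appears in the top-right corner (h8) and the 1 in the bottom-left (a1). A real renderer would map the counts to shading instead of printing them.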

For aggregations beyond the canonical two, Edge can stream (shard_id, game_id, ply, board_hash) per matching position. Apps consume the stream and aggregate however they like.

# Planned interface, not yet built:
#
# query-engine $CORPUS [rules] --output-match-records records.bin

Apps can then mmap each matched game’s record and replay to the indicated ply for further data extraction. This is slower than baked-in aggregation but works for any aggregation shape.

After a scan, you have:

  • A count of matched games (printed)
  • First-N PGN offsets per shard (printed)
  • An output bitmap (if requested)

To display matches:

Terminal window
# Get a few matched-game PGN offsets
query-engine $CORPUS --input-bitmap @killer 2>&1 | grep "shard"
# shard 0: 6297 12317 22555 ...
# shard 1: 774095634 774097748 ...
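To feed those offsets to other tools, the per-shard lines can be flattened into one "shard offset" pair per line. This sketch assumes the scan prints lines of the form "shard N: offset offset ..." as in the sample output above. shard_offsets is a hypothetical helper, not part of Edge.

```shell
# Hypothetical helper (not part of Edge): flatten "shard N: o1 o2 ..." lines
# into "N offset" pairs, skipping any non-numeric tokens.
shard_offsets() {
  awk '$1 == "shard" {
    id = $2
    sub(/:$/, "", id)                      # drop the trailing colon
    for (i = 3; i <= NF; i++)
      if ($i ~ /^[0-9]+$/) print id, $i
  }'
}

printf 'shard 0: 6297 12317\nshard 1: 774095634\n' | shard_offsets
# 0 6297
# 0 12317
# 1 774095634
```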

Then extract the games from the source PGN:

Terminal window
# Seek to offset 6297 in the original PGN to find the matched game.
# Apps would mmap the PGN and parse from there with Tabia.

The pgn_offset field points at the [Event ...] header of the matched game.
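For a quick look at one match without writing an app, standard tools can cut a game out of the source PGN by offset. This is a minimal sketch: pgn_game_at is a hypothetical helper, not part of Edge. It assumes the offset is a zero-based byte offset pointing at the [Event line and that every game in the file begins with an [Event tag.

```shell
# Hypothetical helper (not part of Edge): print the game that starts at
# byte offset $2 of PGN file $1, stopping before the next [Event header.
pgn_game_at() {
  tail -c "+$(( $2 + 1 ))" "$1" |
    awk 'NR > 1 && /^\[Event / { exit } { print }'
}

# Self-contained demo on a two-game sample file.
sample=$(mktemp)
printf '[Event "A"]\n\n1. e4 e5 1-0\n\n[Event "B"]\n\n1. d4 d5 0-1\n' > "$sample"
pgn_game_at "$sample" 27   # the second game starts at byte 27
rm -f "$sample"
```

Against a real corpus, the call would take the original PGN path and an offset printed by the scan. For bulk extraction, an app that mmaps the PGN will be much faster than spawning tail per game.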

To iterate all matches, walk the bitmap. (A small score-extract utility for this is on the roadmap.)

A .scoredb is a plain directory; standard Unix tools work. (A dedicated corpus-admin utility is deferred — see ROADMAP.md.)

Terminal window
# What's in this corpus?
cat $CORPUS/meta
ls -lh $CORPUS/shards/
ls -lh $CORPUS/bitmaps/
# What bitmaps have been stored?
ls $CORPUS/bitmaps/*.bm 2>/dev/null
# → killer.bm sicilian.bm master.bm
# (or use `bitmap-combine info <file>` for header details)
# Remove a stored bitmap
rm $CORPUS/bitmaps/sicilian.bm

.scoredb is a directory. You can rm -rf it, rsync it, tar it like any other filesystem artifact.

Edge is the wrong tool for:

  • Single-game playback / interactive analysis. That’s Tabia + Rabbit’s domain. Edge strips variations and annotations on ingest; if you need those, parse the original PGN with Tabia.
  • Live game state / move generation. Edge replays moves but doesn’t validate them (the indexer trusts the PGN). For playing positions or generating legal moves, use a chess engine.
  • Small corpora (< 1000 games). Edge will work, but the per-PGN startup cost dominates. A direct PGN parse with Tabia is simpler.
  • Annotation-aware queries. “Games where the commentator wrote ’!!’ for white’s 18th move” — Edge can’t answer this; the annotations were stripped at ingest. Parse the PGN with Tabia or use a different tool.

Benchmarks on the 10 M-game Lichess corpus (M4 Max, 12 cores, warm cache). Use these to spot when something looks off:

Operation                                        Expected wall time
Indexing                                         ~1.8 s
Metadata filter                                  ~50 ms
Single-rule position scan                        ~250–350 ms
Multi-rule position scan                         ~300–500 ms
Same scan with --input-bitmap at 10% density     ~50–80 ms
Same scan with --input-bitmap at 1% density      ~30–50 ms
bitmap-combine set algebra                       < 10 ms

If a metadata filter takes seconds, the cache is probably cold. If a position scan takes minutes, something’s wrong with the corpus or shard layout: check cat $CORPUS/meta and ls -lh $CORPUS/shards/.