Data Format Reference

Wire-level specification of every artifact Edge reads or writes. This is the document you reach for when implementing a reader, debugging a malformed file, or extending the format.

Conventions:

All multi-byte integers are little-endian.
Sizes are bytes unless explicitly noted otherwise.
Bit numbering is LSB=0 unless stated.
All formats are stable at the indicated version; bumping the version field signals an incompatible change.

`.scoredb` directory layout

The on-disk representation of one indexed corpus. Each directory holds exactly one corpus.

<stem>.<YYYYMMDD-HHMMSS>.scoredb/
├── meta                  text, key=value
├── tree.dat              binary, aggregate edge table (optional, see below)
├── shards/
│   ├── 0.moves           binary, per-game game records (see below)
│   ├── 0.dict            text, newline-separated names
│   ├── 0.gameidx         binary, uint64 per game (offsets into 0.moves)
│   ├── 1.moves
│   ├── 1.dict
│   ├── 1.gameidx
│   └── ...
└── bitmaps/              optional, populated by user
    ├── <name>.bm         single bitmap, PMOTE-BM format
    └── ...

Directory naming: the default is <source-pgn-stem>.<YYYYMMDD-HHMMSS>.scoredb, producing a unique name per indexing run. --out PATH overrides.

Required entries: meta and shards/ must exist. tree.dat is emitted by indexer and required by explorer; tools that only need per-game data may ignore it. bitmaps/ is optional but is created empty by the indexer.

File extension note: the per-shard game-record files were named <N>.score prior to 2026-05; the extension was renamed to <N>.moves to better describe the contents (a stream of SMove16, not chess scores). Legacy .score files produced by pre-v14 indexers are readable only by the archived v1 scan engine (experiments/c-explorer/old/score-scan-v1.c); the current query-engine reads .moves only.

`meta` file format

Plain text, ASCII, key=value\n per line. Trailing whitespace is not trimmed; values are taken verbatim up to the newline. Unknown keys are ignored by readers.

version=2
num_shards=16
total_games=10164006
total_moves=199672940
source_pgn_stem=lichess_2017-02
source_pgn_size=9289131923
indexed_at=20260524-153045

Mandatory keys:

Key	Type	Meaning
`version`	int	Format version. Currently `2` (see “Version history” at bottom).
`num_shards`	int	Number of `<N>.moves` / `<N>.dict` / `<N>.gameidx` files under `shards/`.
`total_games`	int	Count of game records across all shards.
`total_moves`	int	Count of moves (plies) across all games.

Optional keys:

Key	Type	Meaning
`source_pgn_stem`	string	Basename of the source PGN without extension.
`source_pgn_size`	int	Byte size of the source PGN at index time.
`indexed_at`	string	Timestamp of the indexing run (`YYYYMMDD-HHMMSS`).

A reader that only needs to locate and walk the shard files can get by with version and num_shards.
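As a minimal sketch of the parsing rule above, the following helper splits one line of meta into key and value. The function name and the buffer-based interface are illustrative assumptions, as is the choice to split at the first `=` on the line; the value is copied verbatim, so trailing whitespace survives.

```c
#include <stddef.h>
#include <string.h>

// Split one "key=value\n" line from `meta` into key and value.
// The value is copied verbatim up to (not including) the newline, so any
// trailing whitespace is preserved. Returns 0 on success, -1 if the line has
// no '=' or an output buffer is too small.
// Assumption: the first '=' on the line separates key from value.
int meta_parse_line(const char *line,
                    char *key, size_t key_cap,
                    char *val, size_t val_cap)
{
    size_t len = strcspn(line, "\n");            // length without the newline
    const char *eq = memchr(line, '=', len);     // first '=' within the line
    if (eq == NULL) return -1;

    size_t klen = (size_t)(eq - line);
    size_t vlen = len - klen - 1;
    if (klen + 1 > key_cap || vlen + 1 > val_cap) return -1;

    memcpy(key, line, klen);   key[klen] = '\0';
    memcpy(val, eq + 1, vlen); val[vlen] = '\0';
    return 0;
}
```

A caller would read the file line by line, pass each line through this helper, and ignore any key it does not recognize.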

`.moves` file format

Per-shard binary file holding game records. One .moves file per shard, named <shard-id>.moves under shards/. The shard IDs are contiguous integers from 0 to num_shards-1.

A .moves file is a flat concatenation of game records: no header, no padding between records, and no per-shard footer.

Game record layout

Each game record is variable-size:

+--------------------------------+
|       ScoreHeader (36 B)       |
+--------------------------------+
|  moves[move_count] (2 B each)  |
+--------------------------------+

move_count is read from the header. The record length is 36 + 2 * move_count bytes. The next record begins immediately after.

ScoreHeader (36 bytes)

Offset	Size	Field	Type	Notes
0	8	`pgn_offset`	u64	Byte offset of `[Event` in the source PGN. Used to retrieve the original game text.
8	4	`move_count`	u32	Low 24 bits used; high 8 reserved. Range 0 to ~16.7 M.
12	4	`packed1`	u32	Result / termination / time control — see below.
16	2	`white_elo`	u16	13 bits used. Saturates at 8191 (0x1FFF). `0` = unknown.
18	2	`black_elo`	u16	Same as white_elo.
20	2	`eco`	u16	10 bits used. `0..499` maps to `A00..E99`. `1023` (0x3FF) = unknown.
22	2	`pad0`	u16	Reserved (zero).
24	4	`date_packed`	u32	Year/month/day packed — see below.
28	4	`white_name_idx`	u32	Index into this shard’s `.dict`. `0xFFFFFFFF` = unknown.
32	4	`black_name_idx`	u32	Same as white_name_idx.

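Total: 36 bytes. Every field is naturally aligned relative to the start of the header, but the header size is a multiple of 4, not 8, and each record is 36 + 2 * move_count bytes long, so a header may begin at any even byte offset within the file. Readers should not assume 4- or 8-byte alignment of a record in memory.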
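The table above can be turned into a decoder that reads each field byte by byte, which works regardless of host endianness or record alignment. This is a sketch: `ScoreHeaderView`, `score_header_decode`, and `record_size` are illustrative names, not the indexer's own definitions, and masking the Elo and ECO fields to their used bits is an assumption about how reserved high bits should be treated.

```c
#include <stdint.h>

enum { SCORE_HEADER_SIZE = 36 };

// Decoded view of the 36-byte on-disk header (illustrative struct).
typedef struct {
    uint64_t pgn_offset;
    uint32_t move_count;                 // masked to the low 24 bits
    uint32_t packed1;
    uint16_t white_elo, black_elo;       // masked to 13 bits
    uint16_t eco;                        // masked to 10 bits
    uint32_t date_packed;
    uint32_t white_name_idx, black_name_idx;
} ScoreHeaderView;

// Little-endian reads that make no alignment assumptions.
uint16_t rd_u16le(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
uint32_t rd_u32le(const uint8_t *p) { return (uint32_t)rd_u16le(p) | ((uint32_t)rd_u16le(p + 2) << 16); }
uint64_t rd_u64le(const uint8_t *p) { return (uint64_t)rd_u32le(p) | ((uint64_t)rd_u32le(p + 4) << 32); }

// Decode a header starting at p; the caller guarantees 36 readable bytes.
ScoreHeaderView score_header_decode(const uint8_t *p)
{
    ScoreHeaderView h;
    h.pgn_offset     = rd_u64le(p + 0);
    h.move_count     = rd_u32le(p + 8) & 0xFFFFFFu;
    h.packed1        = rd_u32le(p + 12);
    h.white_elo      = (uint16_t)(rd_u16le(p + 16) & 0x1FFF);
    h.black_elo      = (uint16_t)(rd_u16le(p + 18) & 0x1FFF);
    h.eco            = (uint16_t)(rd_u16le(p + 20) & 0x3FF);
    h.date_packed    = rd_u32le(p + 24);
    h.white_name_idx = rd_u32le(p + 28);
    h.black_name_idx = rd_u32le(p + 32);
    return h;
}

// Size in bytes of the whole record: header plus move_count 2-byte moves.
uint64_t record_size(const ScoreHeaderView *h)
{
    return SCORE_HEADER_SIZE + 2ull * h->move_count;
}
```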

`packed1` bit layout (32 bits)

Bits	Field	Width	Range / Encoding
0–2	result	3	`0=*` (unknown), `1=1-0`, `2=1/2-1/2`, `3=0-1`, others reserved
3–5	termination	3	`0=Normal`, `1=Time forfeit`, `2=Abandoned`, `3=Rules infraction`, `7=Other`, 4-6 reserved
6–18	tc_base_seconds	13	Base time in seconds; saturates at 8191 (`0x1FFF`)
19–24	tc_increment	6	Increment per move in seconds; saturates at 63 (`0x3F`)
25	tc_is_correspondence	1	`1` if TC string was `-` or base saturated
26–31	reserved	6	Zero

Time-control category is derived from packed1 at query time:

if (tc_is_correspondence) -> correspondence
total = tc_base_seconds + 60 * tc_increment
if total <  180  -> bullet
if total <  600  -> blitz
if total < 3600  -> rapid
else             -> classical
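The bit layout and the category rule above translate directly into shifts and masks. The sketch below uses illustrative names (`p1_*`, `TcCategory`, `tc_category`); `p1_pack` exists only to build example values and masks its inputs rather than saturating them as a real writer would.

```c
#include <stdint.h>

typedef enum { TC_BULLET, TC_BLITZ, TC_RAPID, TC_CLASSICAL, TC_CORRESPONDENCE } TcCategory;

// Field accessors, one per row of the packed1 table.
unsigned p1_result(uint32_t p1)       { return p1 & 0x7; }
unsigned p1_termination(uint32_t p1)  { return (p1 >> 3) & 0x7; }
unsigned p1_tc_base(uint32_t p1)      { return (p1 >> 6) & 0x1FFF; }
unsigned p1_tc_increment(uint32_t p1) { return (p1 >> 19) & 0x3F; }
unsigned p1_tc_is_corr(uint32_t p1)   { return (p1 >> 25) & 0x1; }

// Example-only packer (masks instead of saturating).
uint32_t p1_pack(unsigned result, unsigned term, unsigned base, unsigned inc, unsigned corr)
{
    return (uint32_t)(result & 0x7)
         | ((uint32_t)(term & 0x7) << 3)
         | ((uint32_t)(base & 0x1FFF) << 6)
         | ((uint32_t)(inc & 0x3F) << 19)
         | ((uint32_t)(corr & 0x1) << 25);
}

// Time-control category, following the thresholds listed above.
TcCategory tc_category(uint32_t p1)
{
    if (p1_tc_is_corr(p1)) return TC_CORRESPONDENCE;
    unsigned total = p1_tc_base(p1) + 60 * p1_tc_increment(p1);
    if (total < 180)  return TC_BULLET;
    if (total < 600)  return TC_BLITZ;
    if (total < 3600) return TC_RAPID;
    return TC_CLASSICAL;
}
```

For instance, a 180+2 game has total = 180 + 60 * 2 = 300 and lands in blitz.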

`date_packed` bit layout (32 bits)

Bits	Field	Width	Range
0–11	year	12	0–4095 (0 = unknown)
12–15	month	4	1–12 (0 = unknown)
16–20	day	5	1–31 (0 = unknown)
21–31	reserved	11	Zero

date_packed == 0 indicates a fully unknown date.
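A short sketch of packing and unpacking this field, with illustrative names (`PackedDate`, `date_pack`, `date_unpack`). A zero in any component means that component is unknown.

```c
#include <stdint.h>

typedef struct { unsigned year, month, day; } PackedDate;   // 0 = unknown

// Extract year, month and day from date_packed.
PackedDate date_unpack(uint32_t dp)
{
    PackedDate d;
    d.year  = dp & 0xFFF;           // bits 0-11
    d.month = (dp >> 12) & 0xF;     // bits 12-15
    d.day   = (dp >> 16) & 0x1F;    // bits 16-20
    return d;
}

// Build date_packed; reserved bits 21-31 stay zero.
uint32_t date_pack(unsigned year, unsigned month, unsigned day)
{
    return (uint32_t)(year & 0xFFF)
         | ((uint32_t)(month & 0xF) << 12)
         | ((uint32_t)(day & 0x1F) << 16);
}
```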

`eco` field encoding

10-bit unsigned integer, range 0..499 representing ECO codes A00..E99:

letter = 'A' + (eco / 100)      // 'A' .. 'E'
digits = eco % 100              // 00 .. 99

So B22 is (1 * 100) + 22 = 122. Value 1023 (0x3FF) is reserved for “ECO unknown / not present in PGN.”

Range queries (--eco B20-B99) compile to inclusive [120, 199].
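The mapping can be sketched in both directions as follows. The function names are illustrative, and mapping any malformed string to the unknown value is an assumption rather than documented behavior.

```c
#include <stdint.h>

enum { ECO_UNKNOWN = 0x3FF };

// "B22" -> 122. Anything that is not [A-E][0-9][0-9] maps to ECO_UNKNOWN
// (an assumption for this sketch).
uint16_t eco_encode(const char *s)
{
    if (!s || s[0] < 'A' || s[0] > 'E') return ECO_UNKNOWN;
    if (s[1] < '0' || s[1] > '9' || s[2] < '0' || s[2] > '9') return ECO_UNKNOWN;
    if (s[3] != '\0') return ECO_UNKNOWN;
    return (uint16_t)((s[0] - 'A') * 100 + (s[1] - '0') * 10 + (s[2] - '0'));
}

// 122 -> "B22". Writes 4 bytes including the terminator.
// Returns 0 on success, -1 for the unknown value or any value above 499.
int eco_decode(uint16_t eco, char out[4])
{
    if (eco > 499) return -1;
    out[0] = (char)('A' + eco / 100);
    out[1] = (char)('0' + (eco % 100) / 10);
    out[2] = (char)('0' + eco % 10);
    out[3] = '\0';
    return 0;
}
```

A range such as B20-B99 then becomes the inclusive pair (eco_encode("B20"), eco_encode("B99")).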

Move encoding (SMove16)

Each move in the moves[] array is 16 bits:

Bits	Field	Notes
0–5	from square	0–63, standard chess square encoding (a1=0, h1=7, a8=56, h8=63)
6–11	to square	0–63, same encoding
12–13	promotion	`0=none`, `1=knight`, `2=bishop`, `3=rook`. Queen promotion uses flags=3 (see below).
14–15	flags	`0=normal`, `1=castle`, `2=en-passant capture`, `3=queen promotion`

Square encoding: standard 0–63 with square = rank * 8 + file, where rank=0 is white’s first rank. Engines that internally use 0x88 must convert.

Promotion encoding: the promotion type uses two distinct paths:

For knight/bishop/rook: flags = 0, promo = 1..3.
For queen: flags = 3, promo is ignored (set to 0 by convention).

This compresses the common case (queen promotion is the dominant choice) and frees promo == 0 && flags == 0 for “no promotion.”

Castling: flags = 1. The from and to are the king’s squares; the rook movement is derived (king-side castling: rook from h-file to f-file; queen-side: a-file to d-file).

En passant capture: flags = 2. from and to are the capturing pawn’s squares; the captured pawn is on the same rank as from, same file as to.
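The following sketch decodes an SMove16 according to the table and rules above. All names (`sm_*`, the enum constants) are illustrative; only the bit positions and flag values come from this specification.

```c
#include <stdint.h>

enum { SM_NORMAL = 0, SM_CASTLE = 1, SM_EN_PASSANT = 2, SM_QUEEN_PROMO = 3 };
enum { PROMO_NONE = 0, PROMO_KNIGHT = 1, PROMO_BISHOP = 2, PROMO_ROOK = 3 };

unsigned sm_from(uint16_t m)  { return m & 0x3F; }          // bits 0-5
unsigned sm_to(uint16_t m)    { return (m >> 6) & 0x3F; }   // bits 6-11
unsigned sm_promo(uint16_t m) { return (m >> 12) & 0x3; }   // bits 12-13
unsigned sm_flags(uint16_t m) { return (m >> 14) & 0x3; }   // bits 14-15

// Assemble a move from its four fields.
uint16_t sm_pack(unsigned from, unsigned to, unsigned promo, unsigned flags)
{
    return (uint16_t)((from & 0x3F) | ((to & 0x3F) << 6)
                    | ((promo & 0x3) << 12) | ((flags & 0x3) << 14));
}

// Promoted piece as a letter ('q', 'r', 'b', 'n'), or 0 if not a promotion.
// Queen promotion is signalled by flags; the others by promo with flags == 0.
char sm_promotion_piece(uint16_t m)
{
    if (sm_flags(m) == SM_QUEEN_PROMO) return 'q';
    if (sm_flags(m) == SM_NORMAL) {
        switch (sm_promo(m)) {
        case PROMO_KNIGHT: return 'n';
        case PROMO_BISHOP: return 'b';
        case PROMO_ROOK:   return 'r';
        default:           break;
        }
    }
    return 0;
}

// Square of the pawn removed by an en-passant capture:
// same rank as `from`, same file as `to`.
unsigned sm_ep_captured_square(uint16_t m)
{
    return (sm_from(m) / 8) * 8 + (sm_to(m) % 8);
}
```

With square = rank * 8 + file, e2 is 12 and e4 is 28, so the plain move e2e4 encodes as 12 | (28 << 6) = 1804.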

`.dict` file format

Plain UTF-8 text, one name per line, no header, no escaping. The line number (0-based) is the index value referenced from white_name_idx / black_name_idx in the ScoreHeader.

Example:

DrNykterstein
Magnus Carlsen
opperwezen
penguingm1
nihalsarin2004

Dictionaries are per-shard. Name index 42 in shard 3 refers to line 42 of shards/3.dict, not to a global position. To search by name across the corpus, the scan engine looks up the query string in each shard’s dict independently.

Names are stored verbatim. Unicode is preserved. No case-folding, no whitespace normalization. Matching is exact-string at query time. A name that appears as both "Magnus Carlsen" and "Carlsen, Magnus" in the source PGN will have two separate dict entries.
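Because matching is exact and per shard, a name lookup reduces to a line scan over one shard's dictionary. The sketch below assumes the .dict file is already in memory (read or mmap'd); `dict_find` and `NAME_IDX_UNKNOWN` are illustrative names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAME_IDX_UNKNOWN 0xFFFFFFFFu

// Find `name` in an in-memory .dict image (buf[0..len)).
// Returns the 0-based line number on an exact byte-for-byte match,
// or NAME_IDX_UNKNOWN if the name does not appear in this shard.
uint32_t dict_find(const char *buf, size_t len, const char *name)
{
    size_t name_len = strlen(name);
    uint32_t line = 0;
    size_t pos = 0;
    while (pos < len) {
        const char *nl = memchr(buf + pos, '\n', len - pos);
        size_t line_len = nl ? (size_t)(nl - (buf + pos)) : len - pos;
        if (line_len == name_len && memcmp(buf + pos, name, name_len) == 0)
            return line;
        pos += line_len + 1;    // step over the newline
        line++;
    }
    return NAME_IDX_UNKNOWN;
}
```

A corpus-wide search would call this once per shard and then scan that shard for headers whose white_name_idx or black_name_idx equals the returned index.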

`.gameidx` file format

Per-shard binary sidecar containing one uint64_t per game in the shard: the byte offset of that game’s ScoreHeader within the shard’s .moves file. The flat layout enables O(matches) random access by game-id for bitmap-filtered scans and arbitrary parallel chunking.

+-------------------------------+
|  uint64_t[game_count]         |
|  (game 0 offset, game 1, ...) |
+-------------------------------+

The file size is exactly 8 * game_count bytes; no header, no padding. game_count for each shard can be derived by dividing the file size by 8, or read from the corresponding entry in the bitmap’s shard table (the two must agree).

Offsets are byte offsets relative to the start of <N>.moves. The first record’s offset is always 0; the last record’s offset plus its own 36 + 2*move_count bytes equals the file size of <N>.moves.
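These invariants make a .gameidx easy to cross-check against its .moves file. The sketch below works on in-memory images of both files; `gameidx_validate` is a hypothetical helper, not part of any shipped tool.

```c
#include <stddef.h>
#include <stdint.h>

uint32_t rd_u32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
uint64_t rd_u64le(const uint8_t *p)
{
    return (uint64_t)rd_u32le(p) | ((uint64_t)rd_u32le(p + 4) << 32);
}

// Check a .gameidx image against its .moves image.
// Returns the number of games if every offset points at the start of the
// next record and the last record ends exactly at the end of .moves;
// returns -1 otherwise.
int64_t gameidx_validate(const uint8_t *idx, size_t idx_size,
                         const uint8_t *moves, size_t moves_size)
{
    if (idx_size % 8 != 0) return -1;
    size_t game_count = idx_size / 8;
    uint64_t expect = 0;                          // offset the next record must have
    for (size_t g = 0; g < game_count; g++) {
        uint64_t off = rd_u64le(idx + 8 * g);
        if (off != expect) return -1;
        if (off > moves_size || moves_size - off < 36) return -1;   // header must fit
        uint64_t mc = rd_u32le(moves + off + 8) & 0xFFFFFFu;        // move_count
        expect = off + 36 + 2 * mc;
        if (expect > moves_size) return -1;       // moves run past end of file
    }
    return expect == moves_size ? (int64_t)game_count : -1;
}
```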

The indexer always emits .gameidx, and the current query-engine assumes it is present. Pre-v14 indexers (now archived) did not emit this sidecar; the archived v1 scan engine falls back to scanning headers sequentially in that case.

`.bm` bitmap file format

One bitmap per file. Each bitmap is a set of matched games, dense per-shard, sharded to match the corpus.

File layout

+---------------------------------+
|       Header (32 bytes)         |
+---------------------------------+
|  Shard table (16 B × N shards)  |
+---------------------------------+
|       Bits, shard 0             |
|       Bits, shard 1             |
|       ...                       |
|       Bits, shard N-1           |
+---------------------------------+

Header (32 bytes)

Offset	Size	Field	Notes
0	8	magic	ASCII `"PMOTE-BM"` (exactly 8 bytes, no terminator)
8	4	version	u32, currently `1`
12	4	num_shards	u32
16	8	total_games	u64, summed across all shards
24	8	reserved	u64, zero

Shard table (16 bytes per shard)

Offset	Size	Field	Notes
0	4	shard_id	u32
4	4	game_count	u32, games in this shard
8	8	bits_offset	u64, byte offset from file start to this shard’s bits

Entries are in shard_id order. The shard table immediately follows the 32-byte header (so it starts at offset 32).

Bits payload

Per shard, ceil(game_count / 8) bytes. Within each byte, bit 0 (LSB) corresponds to the lowest game index in the shard, bit 1 to the next, etc. Game index 0 in a shard is the first game record in that shard’s .moves file.

bit_value(game_index_in_shard):
  byte = game_index_in_shard / 8
  bit  = game_index_in_shard % 8
  return (bits[byte] >> bit) & 1

Bits beyond game_count (in the trailing partial byte) must be zero. Readers can assume this; writers must ensure it.

Bitmap semantics

A set bit means “this game matches the predicate the bitmap represents.” Bitmaps are per-corpus — a bitmap produced from one .scoredb is not portable to another .scoredb, even of the same source PGN, because shard counts and game indices depend on indexing parameters.

Set algebra constraints

bitmap-combine operations (and, or, xor, sub, not) require the operand bitmaps to share identical shard layout: same num_shards, same per-shard game_count. Bitmaps from different corpora cannot be combined.

not produces the complement up to each shard’s game_count — bits beyond the last game stay zero per the spec.
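For one shard's bits, the operations above can be sketched as below. The names are illustrative; the point of interest is the final mask in `bm_not`, which keeps the padding bits past game_count at zero.

```c
#include <stddef.h>
#include <stdint.h>

// Bytes needed for one shard: ceil(game_count / 8).
size_t bm_shard_bytes(uint32_t game_count) { return ((size_t)game_count + 7) / 8; }

// Test the bit for one game index within a shard.
int bm_test(const uint8_t *bits, uint32_t game) { return (bits[game / 8] >> (game % 8)) & 1; }

// dst = a AND b for one shard; both operands must share the same game_count.
void bm_and(uint8_t *dst, const uint8_t *a, const uint8_t *b, uint32_t game_count)
{
    size_t n = bm_shard_bytes(game_count);
    for (size_t i = 0; i < n; i++) dst[i] = a[i] & b[i];
}

// dst = NOT src for one shard, clearing the padding bits past game_count.
void bm_not(uint8_t *dst, const uint8_t *src, uint32_t game_count)
{
    size_t n = bm_shard_bytes(game_count);
    for (size_t i = 0; i < n; i++) dst[i] = (uint8_t)~src[i];
    unsigned tail = game_count % 8;               // valid bits in the last byte
    if (n > 0 && tail != 0) dst[n - 1] &= (uint8_t)((1u << tail) - 1);
}
```

With game_count = 10 and only game 0 set (bytes 0x01 0x00), the complement is 0xFE 0x03: games 1 through 9 are set and the six padding bits remain zero.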

`@name` bitmap references

A bitmap argument starting with @ resolves to a path inside the corpus directory:

@sicilian   →   <corpus>/bitmaps/sicilian.bm
@killer-A   →   <corpus>/bitmaps/killer-A.bm

Names may contain alphanumerics and -, _. No / or . in names — they’d break the resolution.

Outside a .scoredb context, @name references fall back to literal file paths (with a warning).
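A sketch of the resolution rule follows. `resolve_bitmap_ref` is a hypothetical helper; restricting names to ASCII alphanumerics is an assumption, and the literal-path fallback is left to the caller, which would handle a -1 return.

```c
#include <stdio.h>
#include <string.h>

// Resolve "@name" to "<corpus>/bitmaps/<name>.bm".
// Returns 0 on success, -1 if arg is not a valid @name or out is too small.
int resolve_bitmap_ref(const char *corpus, const char *arg, char *out, size_t out_cap)
{
    if (arg[0] != '@' || arg[1] == '\0') return -1;
    const char *name = arg + 1;

    // Accept ASCII alphanumerics, '-' and '_' only; '/' and '.' are rejected.
    static const char allowed[] =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_";
    if (strspn(name, allowed) != strlen(name)) return -1;

    int n = snprintf(out, out_cap, "%s/bitmaps/%s.bm", corpus, name);
    return (n >= 0 && (size_t)n < out_cap) ? 0 : -1;
}
```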

Bitmap density and encoding choice

The current encoding is dense — every game contributes a bit whether matched or not. For 10 M games this is 1.27 MB per bitmap regardless of how many games are actually set.

Tradeoff with sparse encodings (Roaring, sorted-ID lists): dense is faster for the prefilter-scan hot path (sequential reads, branch prediction). Sparse is smaller for very-low-density bitmaps. We chose dense because the dominant access pattern is “scan corpus, test bit per game,” which favors dense memory layout.

This is a single localized decision; the format can grow to support encoding flags in a future version without disrupting the rest of the system. For now, no encoding flag is stored in the header; v1 implies dense.

`tree.dat` file format

Single file at the scoredb root holding the aggregate edge table: one row per unique (parent_hash, move_id) observed across the corpus, with W/D/B counts summed over all games that played that move from that position.

The indexer writes this file during its finalize step; the explorer reads it.

File layout

+------------------------------------+
|       Header (32 bytes)            |
+------------------------------------+
|  Bucket offsets (4 × (nbuckets+1)) |
+------------------------------------+
|       Edge[total_edges]            |
|       packed by bucket             |
+------------------------------------+

Header (32 bytes)

Offset	Size	Field	Notes
0	8	magic	ASCII `"EDGE_TRE"` (exactly 8 bytes, no terminator)
8	4	version	u32, currently `1`
12	4	nbuckets	u32. Must be a power of 2. Currently `1024`.
16	8	total_edges	u64, total number of Edge records
24	8	reserved	u64, zero

Bucket offset table

uint32[nbuckets+1] cumulative edge counts (not byte offsets). Bucket b spans [offsets[b], offsets[b+1]) in the Edge array. offsets[0] == 0 and offsets[nbuckets] == total_edges.

Edge record (28 bytes; payload stride is 32 with C padding)

Offset	Size	Field	Notes
0	8	parent_hash	u64. Zobrist of position before the move.
8	4	move_id	u32. SMove16 in low 16 bits; high bits zero.
12	4	total	u32. Game count summing W+D+B.
16	4	w	u32. Games White won from this edge.
20	4	d	u32. Games drawn from this edge.
24	4	b	u32. Games Black won from this edge.

Total: 28 bytes; the C struct pads to 32 bytes on write, so the on-disk record stride is 32. move_id decoding follows the standard SMove16 layout (see Move encoding above).
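To make the 28-byte payload versus 32-byte stride concrete, the sketch below decodes the i-th record from a raw Edge array. `EdgeView` and the function names are illustrative, and the reader simply skips the 4 trailing padding bytes of each record.

```c
#include <stdint.h>

enum { EDGE_STRIDE = 32 };   // 28 payload bytes followed by 4 padding bytes

// Decoded view of one Edge record (illustrative struct).
typedef struct {
    uint64_t parent_hash;
    uint32_t move_id;
    uint32_t total, w, d, b;
} EdgeView;

uint32_t rd_u32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
uint64_t rd_u64le(const uint8_t *p)
{
    return (uint64_t)rd_u32le(p) | ((uint64_t)rd_u32le(p + 4) << 32);
}

// Decode the i-th Edge, where `edges` points at the start of the Edge array.
EdgeView edge_decode(const uint8_t *edges, uint64_t i)
{
    const uint8_t *p = edges + i * EDGE_STRIDE;
    EdgeView e;
    e.parent_hash = rd_u64le(p + 0);
    e.move_id     = rd_u32le(p + 8);
    e.total       = rd_u32le(p + 12);
    e.w           = rd_u32le(p + 16);
    e.d           = rd_u32le(p + 20);
    e.b           = rd_u32le(p + 24);
    return e;
}

// SMove16 carried in the low 16 bits of move_id.
uint16_t edge_move(const EdgeView *e) { return (uint16_t)(e->move_id & 0xFFFF); }

// Sanity check: total is defined as W + D + B.
int edge_counts_consistent(const EdgeView *e) { return e->total == e->w + e->d + e->b; }
```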

Lookup procedure

uint32_t bucket_id = position_hash & (nbuckets - 1);
uint32_t lo = offsets[bucket_id];
uint32_t hi = offsets[bucket_id + 1];

for (uint32_t i = lo; i < hi; i++) {
    if (edges[i].parent_hash == position_hash) {
        // matching move from this position; emit edges[i]
    }
}

Within a bucket the Edges are not sorted by parent_hash; they appear in hash-table iteration order. The bucket linear scan is fast in practice (~73K edges per bucket at 10 M scale → ~0.15 ms warm) because the access pattern streams sequentially through fixed-stride 32-byte records (28 bytes of payload plus 4 bytes of padding). A future format version may sort within buckets to enable binary-search lookup; see ROADMAP.md.

Bucket size guidance

nbuckets = 1024 was chosen so each bucket holds ~73K edges at 10 M-game scale (75M edges / 1024). Smaller corpora have proportionally smaller buckets and faster scans. Larger corpora (>30 M games) may warrant raising nbuckets; the format supports any power of 2.

Position hashing convention

The parent_hash field is the engine’s Zobrist hash of the position before the move was played. Engines reading tree.dat for lookup must compute their hash compatibly:

Standard piece-square Zobrist keys.
Castling-rights, side-to-move, and en-passant-square contributions follow the engine in position.hpp.
FIDE-2024+ EP semantics: the EP-square contribution is only XOR’d in when an enemy pawn can legally capture en passant. Two positions that differ only in whether the FEN records an uncapturable EP “ghost” square therefore hash identically. This makes transpositions like 1.Nf3 d5 2.d4 ≡ 1.d4 d5 2.Nf3 aggregate correctly in the tree.

Zobrist key tables are deterministic (seeded), so cross-machine hashes match for a given engine version.
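The EP rule can be illustrated with a simplified sketch. It tests only whether a pawn of the side to move stands next to the EP square's file on the right rank; it ignores pins and other legality details that the real rule requires, and it is not the engine's code from position.hpp. The bitboard convention (bit = rank * 8 + file) and the names `ep_capturable`, `hash_with_ep`, and `ep_key` are assumptions.

```c
#include <stdint.h>

// Simplified (pseudo-legal) test: can the side to move capture onto ep_sq?
// pawns: bitboard of the side to move's pawns, bit index = rank * 8 + file.
// ep_sq: EP target square, or -1 if the FEN records none.
int ep_capturable(uint64_t pawns, int ep_sq, int white_to_move)
{
    // A valid EP target is on rank 6 (index 5) for White to move,
    // rank 3 (index 2) for Black to move.
    if (ep_sq < 0 || ep_sq / 8 != (white_to_move ? 5 : 2)) return 0;

    int file = ep_sq % 8;
    // Capturing pawns stand on the rank the double-pushed pawn landed on.
    int rank_base = ep_sq - file + (white_to_move ? -8 : 8);

    uint64_t attackers = 0;
    if (file > 0) attackers |= 1ull << (rank_base + file - 1);
    if (file < 7) attackers |= 1ull << (rank_base + file + 1);
    return (pawns & attackers) != 0;
}

// XOR the EP key into the hash only when a capture is possible,
// so an uncapturable "ghost" EP square leaves the hash unchanged.
uint64_t hash_with_ep(uint64_t hash, uint64_t ep_key,
                      uint64_t pawns, int ep_sq, int white_to_move)
{
    if (ep_capturable(pawns, ep_sq, white_to_move)) hash ^= ep_key;
    return hash;
}
```

After 1.Nf3 d5 2.d4 the FEN records d3 as the EP square, but no black pawn stands on c4 or e4, so the hash equals that of the same position reached via 1.d4 d5 2.Nf3, where the FEN records no EP square.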

Backward compatibility and legacy paths

For corpora indexed prior to the .scoredb directory format, the archived v1 scan engine (experiments/c-explorer/old/score-scan-v1.c) accepts a legacy stem path:

Stem path lichess_2017-02.pgn resolves to lichess_2017-02.pgn.t<N>.score and lichess_2017-02.pgn.t<N>.score.dict files in the same directory.
Note: the legacy format kept the original .score extension; only the modern .scoredb layout uses .moves.
Shard count defaults to 12 (the legacy default).

Legacy paths have no meta file and no per-corpus bitmap directory. The @name syntax is not supported in legacy mode.

The current query-engine reads only .scoredb directories. Legacy support is preserved in the archived v1 source for reference and for one-off use against old corpora; new work targets .scoredb.

Version history and stability

.scoredb v2 (meta version=2, current): stable. Edge.move_id holds an SMove16 in its low 16 bits (from indexer onward); earlier v1 stored an FNV-1a hash of the SAN string in the same slot — same byte width, different semantics.
.scoredb v1 (meta version=1): legacy. The byte layout is identical, but readers must treat the move_id slot (a SAN hash in v1) as opaque and must not decode it as an SMove16.
tree.dat v1: stable. Bumping requires re-indexing.
PMOTE-BM v1: stable. Same guarantee.
The SMove16 move encoding: stable.

Bumping a version field signals an incompatible change that requires re-indexing. Tools must check the version field and refuse unknown versions.

The 2026-05 .score → .moves extension rename did not bump the scoredb version: only the filename on disk changed; the byte layout is identical.

Reading a `.scoredb` from scratch (sketch)

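#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// 1. Read meta file, parse num_shards.
//    read_meta_value() stands in for a key=value parser; it is not defined here.
int num_shards = read_meta_value(corpus_dir, "num_shards");

// 2. For each shard:
for (int s = 0; s < num_shards; s++) {
    char path[1024];
    snprintf(path, sizeof(path), "%s/shards/%d.moves", corpus_dir, s);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); continue; }
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); continue; }
    uint8_t *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (data == MAP_FAILED) { perror("mmap"); continue; }

    // 3. Walk records. Records are only 2-byte aligned, so fields are copied
    //    out with memcpy (little-endian host assumed) rather than read
    //    through a ScoreHeader pointer cast.
    const uint8_t *p = data, *end = data + st.st_size;
    while (end - p >= 36) {
        uint32_t mc;
        memcpy(&mc, p + 8, sizeof mc);          // move_count at offset 8
        mc &= 0xFFFFFF;                         // low 24 bits only
        size_t rec = 36 + 2 * (size_t)mc;
        if ((size_t)(end - p) < rec) break;     // truncated record
        const uint16_t *moves = (const uint16_t *)(p + 36);

        // ... process header + moves ...

        p += rec;
    }
    munmap(data, st.st_size);
}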

This is the entire reader. No external dependencies, no parsing beyond field unpacking.
