Data Format Reference

Wire-level specification of every artifact Edge reads or writes. This is the document you reach for when implementing a reader, debugging a malformed file, or extending the format.

Conventions:

  • All multi-byte integers are little-endian.
  • Sizes are bytes unless explicitly noted otherwise.
  • Bit numbering is LSB=0 unless stated.
  • All formats are stable at the indicated version; bumping the version field signals an incompatible change.

A .scoredb directory is the on-disk representation of one indexed corpus. Each directory holds exactly one corpus.

<stem>.<YYYYMMDD-HHMMSS>.scoredb/
├── meta                 text, key=value
├── tree.dat             binary, aggregate edge table (optional, see below)
├── shards/
│   ├── 0.moves          binary, per-game records (see below)
│   ├── 0.dict           text, newline-separated names
│   ├── 0.gameidx        binary, uint64 per game (offsets into 0.moves)
│   ├── 1.moves
│   ├── 1.dict
│   ├── 1.gameidx
│   └── ...
└── bitmaps/             optional, populated by user
    ├── <name>.bm        single bitmap, PMOTE-BM format
    └── ...

Directory naming: the default is <source-pgn-stem>.<YYYYMMDD-HHMMSS>.scoredb, producing a unique name per indexing run. --out PATH overrides.

Required entries: meta and shards/ must exist. tree.dat is emitted by indexer and required by explorer; tools that only need per-game data may ignore it. bitmaps/ is optional but is created empty by the indexer.

File extension note: the per-shard game-record files were named <N>.score prior to 2026-05; the extension was renamed to <N>.moves to better describe the contents (a stream of SMove16, not chess scores). Legacy .score files produced by pre-v14 indexers are readable only by the archived v1 scan engine (experiments/c-explorer/old/score-scan-v1.c); the current query-engine reads .moves only.

The meta file is plain ASCII text, one key=value\n pair per line. Trailing whitespace is not trimmed; values are taken verbatim up to the newline. Unknown keys are ignored by readers.

version=2
num_shards=16
total_games=10164006
total_moves=199672940
source_pgn_stem=lichess_2017-02
source_pgn_size=9289131923
indexed_at=20260524-153045

Mandatory keys:

| Key         | Type | Meaning |
|-------------|------|---------|
| version     | int  | Format version. Currently 2 (see “Version history” at bottom). |
| num_shards  | int  | Number of <N>.moves / <N>.dict / <N>.gameidx files under shards/. |
| total_games | int  | Count of game records across all shards. |
| total_moves | int  | Count of moves (plies) across all games. |

Optional keys:

| Key             | Type   | Meaning |
|-----------------|--------|---------|
| source_pgn_stem | string | Basename of the source PGN without extension. |
| source_pgn_size | int    | Byte size of the source PGN at index time. |
| indexed_at      | string | Timestamp of the indexing run (YYYYMMDD-HHMMSS). |

A reader needing only the shard count needs only version and num_shards.
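
As an illustration of the rules above (verbatim values, unknown keys ignored), here is a minimal sketch of a meta lookup. The function names meta_get and meta_num_shards are hypothetical and not part of Edge, and the sketch assumes the whole file has already been read into a NUL-terminated buffer. Accepting only version=2 is this sketch's choice.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find `key` in meta text and copy its value,
 * verbatim and without the newline, into `out`.
 * Returns 0 on success, -1 if the key is absent or the value does not fit. */
static int meta_get(const char *text, const char *key, char *out, size_t cap)
{
    size_t klen = strlen(key);
    const char *line = text;
    while (*line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);
        /* A match needs "key=" at the very start of the line. */
        if (len > klen && strncmp(line, key, klen) == 0 && line[klen] == '=') {
            size_t vlen = len - klen - 1;
            if (vlen >= cap) return -1;
            memcpy(out, line + klen + 1, vlen);   /* no trimming */
            out[vlen] = '\0';
            return 0;
        }
        if (!eol) break;
        line = eol + 1;
    }
    return -1;
}

/* Returns num_shards, or -1 if a key is missing or version is not 2
 * (assumption: this sketch only understands the current version). */
static long meta_num_shards(const char *text)
{
    char buf[32];
    if (meta_get(text, "version", buf, sizeof buf) != 0) return -1;
    if (strcmp(buf, "2") != 0) return -1;
    if (meta_get(text, "num_shards", buf, sizeof buf) != 0) return -1;
    return strtol(buf, NULL, 10);
}
```

Lines whose key does not match are simply skipped, which is how unknown keys end up ignored.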

Per-shard binary file holding game records. One .moves file per shard, named <shard-id>.moves under shards/. The shard IDs are contiguous integers from 0 to num_shards-1.

A .moves file is a flat concatenation of game records, no header, no padding between records, no per-shard footer.

Each game record is variable-size:

+--------------------------------+
| ScoreHeader (36 B)             |
+--------------------------------+
| moves[move_count] (2 B each)   |
+--------------------------------+

move_count is read from the header. The record length is 36 + 2 * move_count bytes. The next record begins immediately after.

ScoreHeader layout:

| Offset | Size | Field          | Type | Notes |
|--------|------|----------------|------|-------|
| 0      | 8    | pgn_offset     | u64  | Byte offset of [Event in the source PGN. Used to retrieve the original game text. |
| 8      | 4    | move_count     | u32  | Low 24 bits used; high 8 reserved. Range 0 to ~16.7 M. |
| 12     | 4    | packed1        | u32  | Result / termination / time control; see below. |
| 16     | 2    | white_elo      | u16  | 13 bits used. Saturates at 8191 (0x1FFF). 0 = unknown. |
| 18     | 2    | black_elo      | u16  | Same as white_elo. |
| 20     | 2    | eco            | u16  | 10 bits used. 0..499 maps to A00..E99. 1023 (0x3FF) = unknown. |
| 22     | 2    | pad0           | u16  | Reserved (zero). |
| 24     | 4    | date_packed    | u32  | Year/month/day packed; see below. |
| 28     | 4    | white_name_idx | u32  | Index into this shard’s .dict. 0xFFFFFFFF = unknown. |
| 32     | 4    | black_name_idx | u32  | Same as white_name_idx. |

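Total: 36 bytes. Every field sits at an offset that is a multiple of 4, but the header size is not a multiple of 8. Because records are packed back to back and each move adds 2 bytes, a header’s position in the file is only guaranteed to be 2-byte aligned; readers on strict-alignment targets should copy fields out rather than dereference them in place.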
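
The layout above can be decoded without relying on host alignment or endianness. The following is a sketch only: ScoreHeaderView, ld16/ld32/ld64, and score_header_decode are hypothetical names, and masking the Elo and ECO fields to their documented widths is an assumption about how a defensive reader might treat unused bits.

```c
#include <stdint.h>

/* Illustrative in-memory form; field names mirror the ScoreHeader table. */
typedef struct {
    uint64_t pgn_offset;
    uint32_t move_count;      /* already masked to the low 24 bits */
    uint32_t packed1;
    uint16_t white_elo, black_elo, eco;
    uint32_t date_packed;
    uint32_t white_name_idx, black_name_idx;
} ScoreHeaderView;

/* Little-endian loads that work at any alignment and on any host. */
static uint16_t ld16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t ld32(const uint8_t *p) { return ld16(p) | ((uint32_t)ld16(p + 2) << 16); }
static uint64_t ld64(const uint8_t *p) { return ld32(p) | ((uint64_t)ld32(p + 4) << 32); }

/* Decode the 36-byte header at `p`; returns the full record length in bytes. */
static uint64_t score_header_decode(const uint8_t *p, ScoreHeaderView *h)
{
    h->pgn_offset     = ld64(p + 0);
    h->move_count     = ld32(p + 8) & 0xFFFFFFu;   /* high 8 bits reserved */
    h->packed1        = ld32(p + 12);
    h->white_elo      = ld16(p + 16) & 0x1FFF;     /* 13 bits used */
    h->black_elo      = ld16(p + 18) & 0x1FFF;
    h->eco            = ld16(p + 20) & 0x3FF;      /* 10 bits used */
    h->date_packed    = ld32(p + 24);
    h->white_name_idx = ld32(p + 28);
    h->black_name_idx = ld32(p + 32);
    return 36u + 2u * (uint64_t)h->move_count;
}
```

The return value is the distance to the next record, so a walker can advance its cursor by it directly.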

packed1 bit layout:

| Bits  | Field                | Width | Range / Encoding |
|-------|----------------------|-------|------------------|
| 0–2   | result               | 3     | 0=* (unknown), 1=1-0, 2=1/2-1/2, 3=0-1, others reserved |
| 3–5   | termination          | 3     | 0=Normal, 1=Time forfeit, 2=Abandoned, 3=Rules infraction, 7=Other, 4-6 reserved |
| 6–18  | tc_base_seconds      | 13    | Base time in seconds; saturates at 8191 (0x1FFF) |
| 19–24 | tc_increment         | 6     | Increment per move in seconds; saturates at 63 (0x3F) |
| 25    | tc_is_correspondence | 1     | 1 if TC string was - or base saturated |
| 26–31 | reserved             | 6     | Zero |

Time-control category is derived from packed1 at query time:

if (tc_is_correspondence)   -> correspondence
total = tc_base_seconds + 60 * tc_increment
if total < 180              -> bullet
if total < 600              -> blitz
if total < 3600             -> rapid
else                        -> classical

date_packed bit layout:

| Bits  | Field    | Width | Range |
|-------|----------|-------|-------|
| 0–11  | year     | 12    | 0–4095 (0 = unknown) |
| 12–15 | month    | 4     | 1–12 (0 = unknown) |
| 16–20 | day      | 5     | 1–31 (0 = unknown) |
| 21–31 | reserved | 11    | Zero |

date_packed == 0 indicates a fully unknown date.
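
The two bit layouts and the time-control rule translate directly into shifts and masks. This sketch uses hypothetical names (p1_*, tc_category, date_unpack, and the TcCategory enum) chosen for illustration; only the bit positions and thresholds come from the tables and pseudocode above.

```c
#include <stdint.h>

typedef enum {
    TC_BULLET, TC_BLITZ, TC_RAPID, TC_CLASSICAL, TC_CORRESPONDENCE
} TcCategory;

/* Field extractors for packed1. */
static unsigned p1_result(uint32_t p1)       { return p1 & 0x7; }
static unsigned p1_termination(uint32_t p1)  { return (p1 >> 3) & 0x7; }
static unsigned p1_base_seconds(uint32_t p1) { return (p1 >> 6) & 0x1FFF; }
static unsigned p1_increment(uint32_t p1)    { return (p1 >> 19) & 0x3F; }
static unsigned p1_is_corr(uint32_t p1)      { return (p1 >> 25) & 0x1; }

/* Time-control category, following the pseudocode above. */
static TcCategory tc_category(uint32_t p1)
{
    if (p1_is_corr(p1)) return TC_CORRESPONDENCE;
    unsigned total = p1_base_seconds(p1) + 60 * p1_increment(p1);
    if (total < 180)  return TC_BULLET;
    if (total < 600)  return TC_BLITZ;
    if (total < 3600) return TC_RAPID;
    return TC_CLASSICAL;
}

/* date_packed: year in bits 0-11, month in 12-15, day in 16-20.
 * A component of 0 means that component is unknown. */
static void date_unpack(uint32_t d, unsigned *year, unsigned *month, unsigned *day)
{
    *year  = d & 0xFFF;
    *month = (d >> 12) & 0xF;
    *day   = (d >> 16) & 0x1F;
}
```

For example, a 300+3 game has total = 300 + 60 * 3 = 480 and lands in blitz, while 600+0 is exactly at the rapid boundary.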

The eco field is a 10-bit unsigned integer in the range 0..499, representing ECO codes A00..E99:

letter = 'A' + (eco / 100) // 'A' .. 'E'
digits = eco % 100 // 00 .. 99

So B22 is (1 * 100) + 22 = 122. Value 1023 (0x3FF) reserved for “ECO unknown / not present in PGN.”

Range queries (--eco B20-B99) compile to the inclusive range [120, 199].
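
A small sketch of the mapping in both directions. The names eco_encode, eco_decode, and ECO_UNKNOWN are hypothetical, and returning "?" for out-of-range values is this example's choice rather than documented behavior.

```c
#include <stdint.h>
#include <stdio.h>

#define ECO_UNKNOWN 0x3FFu   /* 1023, the documented "unknown" value */

/* "B22" -> 122. Returns ECO_UNKNOWN for anything that is not [A-E][0-9][0-9]. */
static uint16_t eco_encode(const char *s)
{
    if (!s || s[0] < 'A' || s[0] > 'E') return ECO_UNKNOWN;
    if (s[1] < '0' || s[1] > '9' || s[2] < '0' || s[2] > '9' || s[3] != '\0')
        return ECO_UNKNOWN;
    return (uint16_t)((s[0] - 'A') * 100 + (s[1] - '0') * 10 + (s[2] - '0'));
}

/* 122 -> "B22". Writes "?" for values outside 0..499. `out` needs 4 bytes. */
static void eco_decode(uint16_t eco, char out[4])
{
    if (eco > 499) { out[0] = '?'; out[1] = '\0'; return; }
    snprintf(out, 4, "%c%02u", 'A' + eco / 100, (unsigned)(eco % 100));
}
```

With this encoding a range such as B20-B99 becomes a plain integer comparison between eco_encode("B20") and eco_encode("B99").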

Move encoding

Each move in the moves[] array is a 16-bit SMove16 value:

| Bits  | Field       | Notes |
|-------|-------------|-------|
| 0–5   | from square | 0–63, standard chess square encoding (a1=0, h1=7, a8=56, h8=63) |
| 6–11  | to square   | 0–63, same encoding |
| 12–13 | promotion   | 0=none, 1=knight, 2=bishop, 3=rook. Queen promotion uses flags=3 (see below). |
| 14–15 | flags       | 0=normal, 1=castle, 2=en-passant capture, 3=queen promotion |

Square encoding: standard 0–63 with square = rank * 8 + file, where rank=0 is white’s first rank. Engines that internally use 0x88 must convert.

Promotion encoding: the promotion type uses two distinct paths:

  • For knight/bishop/rook: flags = 0, promo = 1..3.
  • For queen: flags = 3, promo is ignored (set to 0 by convention).

This compresses the common case (queen promotion is the dominant choice) and frees promo == 0 && flags == 0 for “no promotion.”

Castling: flags = 1. The from and to are the king’s squares; the rook movement is derived (king-side castling: rook from h-file to f-file; queen-side: a-file to d-file).

En passant capture: flags = 2. from and to are the capturing pawn’s squares; the captured pawn is on the same rank as from, same file as to.
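
The move rules above can be sketched as a handful of helpers. All names here (mv_*, MV_*) are hypothetical. The castling helper assumes standard chess, where the king's destination file tells king-side from queen-side.

```c
#include <stdint.h>

enum { MV_NORMAL = 0, MV_CASTLE = 1, MV_EN_PASSANT = 2, MV_QUEEN_PROMO = 3 };

static unsigned mv_from(uint16_t m)  { return m & 0x3F; }
static unsigned mv_to(uint16_t m)    { return (m >> 6) & 0x3F; }
static unsigned mv_promo(uint16_t m) { return (m >> 12) & 0x3; }
static unsigned mv_flags(uint16_t m) { return (m >> 14) & 0x3; }

/* Promotion piece as a letter, or 0 when the move is not a promotion. */
static char mv_promotion_piece(uint16_t m)
{
    static const char pieces[4] = { 0, 'N', 'B', 'R' };
    if (mv_flags(m) == MV_QUEEN_PROMO) return 'Q';
    if (mv_flags(m) == MV_NORMAL) return pieces[mv_promo(m)];
    return 0;
}

/* Castling: derive the rook's squares from the king's from/to squares. */
static void mv_castle_rook(uint16_t m, unsigned *rook_from, unsigned *rook_to)
{
    unsigned rank_base = mv_from(m) & ~7u;            /* a-file square of the rank */
    int kingside = (mv_to(m) & 7) > (mv_from(m) & 7); /* king moved toward the h-file */
    *rook_from = rank_base + (kingside ? 7 : 0);      /* h-file or a-file */
    *rook_to   = rank_base + (kingside ? 5 : 3);      /* f-file or d-file */
}

/* En passant: the captured pawn sits on from's rank and to's file. */
static unsigned mv_ep_captured_square(uint16_t m)
{
    return (mv_from(m) & ~7u) | (mv_to(m) & 7u);
}
```

For a white pawn on e5 (square 36) capturing en passant to d6 (square 43), the captured pawn is on d5 (square 35).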

Each shard’s .dict file is plain UTF-8 text, one name per line, no header, no escaping. The line number (0-based) is the index value referenced from white_name_idx / black_name_idx in the ScoreHeader.

Example:

DrNykterstein
Magnus Carlsen
opperwezen
penguingm1
nihalsarin2004

Dictionaries are per-shard. Name index 42 in shard 3 refers to line 42 of shards/3.dict, not to a global position. To search by name across the corpus, the scan engine looks up the query string in each shard’s dict independently.

Names are stored verbatim. Unicode is preserved. No case-folding, no whitespace normalization. Matching is exact-string at query time. A name that appears as both "Magnus Carlsen" and "Carlsen, Magnus" in the source PGN will have two separate dict entries.
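
A per-shard name lookup can be as simple as a linear scan over the dict image. This is a sketch under assumptions: dict_find and NAME_NOT_FOUND are hypothetical names, the file is assumed to be fully in memory, and reusing 0xFFFFFFFF as the "not found" result is this example's choice because it matches the header's "unknown" sentinel.

```c
#include <stdint.h>
#include <string.h>

#define NAME_NOT_FOUND 0xFFFFFFFFu

/* Return the 0-based line index of `name` in a .dict image of `size` bytes,
 * or NAME_NOT_FOUND. The comparison is an exact byte match: no case folding
 * and no whitespace normalization. */
static uint32_t dict_find(const char *dict, size_t size, const char *name)
{
    size_t nlen = strlen(name);
    uint32_t line = 0;
    size_t pos = 0;
    while (pos < size) {
        const char *nl = memchr(dict + pos, '\n', size - pos);
        size_t len = nl ? (size_t)(nl - (dict + pos)) : size - pos;
        if (len == nlen && memcmp(dict + pos, name, nlen) == 0)
            return line;
        pos += len + 1;   /* skip the line and its newline */
        line++;
    }
    return NAME_NOT_FOUND;
}
```

The returned index is only meaningful within the shard whose dict was searched, so a corpus-wide search repeats the call once per shard.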

The .gameidx file is a per-shard binary sidecar containing one uint64_t per game in the shard: the byte offset of that game’s ScoreHeader within the shard’s .moves file. The flat layout enables O(matches) random access by game-id for bitmap-filtered scans and arbitrary parallel chunking.

+-------------------------------+
| uint64_t[game_count]          |
| (game 0 offset, game 1, ...)  |
+-------------------------------+

The file size is exactly 8 * game_count bytes; no header, no padding. game_count for each shard can be derived by dividing the file size by 8, or read from the corresponding entry in the bitmap’s shard table (the two must agree).

Offsets are byte offsets relative to the start of <N>.moves. The first record’s offset is always 0; the last record’s offset plus its own 36 + 2*move_count bytes equals the file size of <N>.moves.
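
The invariants above (first offset is 0, each record ends where the next begins, the last record ends the file) can be checked in one pass. A sketch only: gameidx_is_consistent is a hypothetical helper, and it assumes the offsets have already been loaded into host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* `idx` holds game_count offsets; `moves` is the shard's .moves image of
 * `moves_size` bytes. Returns 1 if every offset points at a record start
 * and the records tile the file exactly, 0 otherwise. */
static int gameidx_is_consistent(const uint64_t *idx, size_t game_count,
                                 const uint8_t *moves, uint64_t moves_size)
{
    uint64_t expect = 0;                          /* first record starts at 0 */
    for (size_t g = 0; g < game_count; g++) {
        if (idx[g] != expect) return 0;
        if (moves_size - expect < 36) return 0;   /* header must fit */
        const uint8_t *p = moves + expect + 8;    /* move_count field, low 24 bits */
        uint32_t mc = (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
        uint64_t len = 36u + 2u * (uint64_t)mc;
        if (moves_size - expect < len) return 0;  /* moves must fit */
        expect += len;
    }
    return expect == moves_size;                  /* last record ends the file */
}
```

A check like this is cheap enough to run once when a shard is opened, before trusting the index for random access.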

indexer emits .gameidx by default. Pre-v14 indexers (now archived) did not emit this sidecar; the archived v1 scan engine falls back to scanning headers sequentially in that case. The current query-engine assumes .gameidx is present (the indexer always emits it).

Each .bm file under bitmaps/ holds one bitmap in PMOTE-BM format. Each bitmap is a set of matched games, dense per-shard, sharded to match the corpus.

+---------------------------------+
| Header (32 bytes)               |
+---------------------------------+
| Shard table (16 B × N shards)   |
+---------------------------------+
| Bits, shard 0                   |
| Bits, shard 1                   |
| ...                             |
| Bits, shard N-1                 |
+---------------------------------+

Header (32 bytes):

| Offset | Size | Field       | Notes |
|--------|------|-------------|-------|
| 0      | 8    | magic       | ASCII "PMOTE-BM" (exactly 8 bytes, no terminator) |
| 8      | 4    | version     | u32, currently 1 |
| 12     | 4    | num_shards  | u32 |
| 16     | 8    | total_games | u64, summed across all shards |
| 24     | 8    | reserved    | u64, zero |

Shard table entry (16 bytes):

| Offset | Size | Field       | Notes |
|--------|------|-------------|-------|
| 0      | 4    | shard_id    | u32 |
| 4      | 4    | game_count  | u32, games in this shard |
| 8      | 8    | bits_offset | u64, byte offset from file start to this shard’s bits |

Entries are in shard_id order. The shard table immediately follows the 32-byte header (so it starts at offset 32).

Per shard, ceil(game_count / 8) bytes. Within each byte, bit 0 (LSB) corresponds to the lowest game index in the shard, bit 1 to the next, etc. Game index 0 in a shard is the first game record in that shard’s .moves file.

bit_value(game_index_in_shard):
    byte = game_index_in_shard / 8
    bit  = game_index_in_shard % 8
    return (bits[byte] >> bit) & 1

Bits beyond game_count (in the trailing partial byte) must be zero. Readers can assume this; writers must ensure it.
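
To make the sizing and padding rules concrete, here is a sketch of the checks a careful reader might run. The bm_* names are hypothetical; the magic, version, and bit order come from the tables and pseudocode above.

```c
#include <stdint.h>
#include <string.h>

/* Bytes of bit data for one shard: ceil(game_count / 8). */
static uint64_t bm_shard_bytes(uint32_t game_count)
{
    return ((uint64_t)game_count + 7) / 8;
}

/* 1 if `hdr` (at least 12 bytes) starts a PMOTE-BM file of a known version. */
static int bm_header_ok(const uint8_t *hdr)
{
    uint32_t version = (uint32_t)hdr[8] | ((uint32_t)hdr[9] << 8) |
                       ((uint32_t)hdr[10] << 16) | ((uint32_t)hdr[11] << 24);
    return memcmp(hdr, "PMOTE-BM", 8) == 0 && version == 1;
}

/* 1 if the padding bits past game_count in the final byte are all zero. */
static int bm_tail_is_clean(const uint8_t *bits, uint32_t game_count)
{
    unsigned used = game_count % 8;      /* bits used in the last byte */
    if (used == 0) return 1;             /* no partial byte */
    uint8_t last = bits[bm_shard_bytes(game_count) - 1];
    return (last >> used) == 0;
}

/* Test one game's bit; bit 0 (LSB) is the lowest game index in the byte. */
static int bm_test(const uint8_t *bits, uint32_t game_index)
{
    return (bits[game_index / 8] >> (game_index % 8)) & 1;
}
```

A shard with 9 games therefore occupies 2 bytes, and the upper 7 bits of the second byte are padding.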

A set bit means “this game matches the predicate the bitmap represents.” Bitmaps are per-corpus — a bitmap produced from one .scoredb is not portable to another .scoredb, even of the same source PGN, because shard counts and game indices depend on indexing parameters.

bitmap-combine operations (and, or, xor, sub, not) require the operand bitmaps to share identical shard layout: same num_shards, same per-shard game_count. Bitmaps from different corpora cannot be combined.

not produces the complement up to each shard’s game_count — bits beyond the last game stay zero per the spec.
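
The one operation that needs care is not, because a plain byte-wise complement would set the padding bits. The sketch below shows one way to keep them zero, with and included for contrast. The function names are hypothetical and this is not taken from the bitmap-combine source.

```c
#include <stddef.h>
#include <stdint.h>

/* Complement one shard's bits in place, keeping padding bits zero. */
static void bm_not_shard(uint8_t *bits, uint32_t game_count)
{
    size_t nbytes = ((size_t)game_count + 7) / 8;
    for (size_t i = 0; i < nbytes; i++)
        bits[i] = (uint8_t)~bits[i];
    unsigned used = game_count % 8;
    if (used != 0)                        /* clear bits past the last game */
        bits[nbytes - 1] &= (uint8_t)((1u << used) - 1);
}

/* dst = a AND b for one shard; both operands must share game_count.
 * Padding stays zero automatically because it is zero in both inputs. */
static void bm_and_shard(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                         uint32_t game_count)
{
    size_t nbytes = ((size_t)game_count + 7) / 8;
    for (size_t i = 0; i < nbytes; i++)
        dst[i] = a[i] & b[i];
}
```

The and, or, xor, and sub operations preserve zero padding on their own; only not needs the final mask.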

A bitmap argument starting with @ resolves to a path inside the corpus directory:

@sicilian → <corpus>/bitmaps/sicilian.bm
@killer-A → <corpus>/bitmaps/killer-A.bm

Names may contain alphanumerics and -, _. No / or . in names — they’d break the resolution.

Outside a .scoredb context, @name references fall back to literal file paths (with a warning).
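
The resolution rule can be sketched as follows. resolve_bitmap_arg is a hypothetical function, not the tool's actual code. Passing corpus_dir as NULL stands in for "outside a .scoredb context", and treating the fallback path as the argument exactly as typed is an assumption (the real tool also prints a warning).

```c
#include <stdio.h>
#include <string.h>

/* Writes the file path for bitmap argument `arg` into `out`.
 * Returns 0 on success, -1 on an invalid @name or if `out` is too small. */
static int resolve_bitmap_arg(const char *corpus_dir, const char *arg,
                              char *out, size_t cap)
{
    int n;
    if (arg[0] != '@' || corpus_dir == NULL) {
        /* Plain path, or the fallback to a literal path. */
        n = snprintf(out, cap, "%s", arg);
    } else {
        const char *name = arg + 1;
        const char *ok = "abcdefghijklmnopqrstuvwxyz"
                         "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_";
        /* Reject empty names and anything containing '/', '.', etc. */
        if (*name == '\0' || strspn(name, ok) != strlen(name))
            return -1;
        n = snprintf(out, cap, "%s/bitmaps/%s.bm", corpus_dir, name);
    }
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

Validating the name before building the path is what keeps a reference like @../x from escaping the bitmaps/ directory.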

The current encoding is dense — every game contributes a bit whether matched or not. For 10 M games this is 1.27 MB per bitmap regardless of how many games are actually set.

Tradeoff with sparse encodings (Roaring, sorted-ID lists): dense is faster for the prefilter-scan hot path (sequential reads, branch prediction). Sparse is smaller for very-low-density bitmaps. We chose dense because the dominant access pattern is “scan corpus, test bit per game,” which favors dense memory layout.

This is a single localized decision; the format can grow to support encoding flags in a future version without disrupting the rest of the system. For now, no encoding flag is stored in the header; v1 implies dense.

tree.dat is a single file at the scoredb root holding the aggregate edge table: one row per unique (parent_hash, move_id) observed across the corpus, with W/D/B counts summed over all games that played that move from that position.

The file is produced by indexer during its finalize step and consumed by explorer.

+------------------------------------+
| Header (32 bytes)                  |
+------------------------------------+
| Bucket offsets (4 × (nbuckets+1))  |
+------------------------------------+
| Edge[total_edges]                  |
|   packed by bucket                 |
+------------------------------------+

Header (32 bytes):

| Offset | Size | Field       | Notes |
|--------|------|-------------|-------|
| 0      | 8    | magic       | ASCII "EDGE_TRE" (exactly 8 bytes, no terminator) |
| 8      | 4    | version     | u32, currently 1 |
| 12     | 4    | nbuckets    | u32. Must be a power of 2. Currently 1024. |
| 16     | 8    | total_edges | u64, total number of Edge records |
| 24     | 8    | reserved    | u64, zero |

uint32[nbuckets+1] cumulative edge counts (not byte offsets). Bucket b spans [offsets[b], offsets[b+1]) in the Edge array. offsets[0] == 0 and offsets[nbuckets] == total_edges.
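
The stated invariants of the bucket-offset array are easy to verify after loading. A minimal sketch with hypothetical helper names, assuming the offsets are already in host byte order:

```c
#include <stdint.h>

static int is_power_of_two(uint32_t n) { return n != 0 && (n & (n - 1)) == 0; }

/* `offsets` has nbuckets + 1 entries. Returns 1 if the array satisfies the
 * documented invariants, 0 otherwise. */
static int bucket_offsets_ok(const uint32_t *offsets, uint32_t nbuckets,
                             uint64_t total_edges)
{
    if (!is_power_of_two(nbuckets)) return 0;
    if (offsets[0] != 0 || offsets[nbuckets] != total_edges) return 0;
    for (uint32_t b = 0; b < nbuckets; b++)
        if (offsets[b] > offsets[b + 1]) return 0;   /* counts are cumulative */
    return 1;
}
```

An empty bucket is valid and shows up as two equal consecutive entries.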

Edge record (28 bytes; payload stride is 32 with C padding)

| Offset | Size | Field       | Notes |
|--------|------|-------------|-------|
| 0      | 8    | parent_hash | u64. Zobrist of position before the move. |
| 8      | 4    | move_id     | u32. SMove16 in low 16 bits; high bits zero. |
| 12     | 4    | total       | u32. Game count summing W+D+B. |
| 16     | 4    | w           | u32. Games White won from this edge. |
| 20     | 4    | d           | u32. Games drawn from this edge. |
| 24     | 4    | b           | u32. Games Black won from this edge. |

Total: 28 bytes; the C struct pads to 32 bytes on write, so the on-disk record stride is 32. move_id decoding follows the standard SMove16 layout (see Move encoding above).
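
One way to pin the 28-versus-32 distinction down in code is a struct with compile-time checks. This is a sketch, not the indexer's actual definition: the assertions hold on ABIs where uint64_t is 8-byte aligned (for example x86-64 and AArch64) and would fail to compile on ABIs that align it to 4. edge_counts_ok and edge_move are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t parent_hash;   /* Zobrist of the position before the move */
    uint32_t move_id;       /* SMove16 in the low 16 bits */
    uint32_t total;         /* w + d + b */
    uint32_t w, d, b;
    /* 4 bytes of tail padding follow on common 64-bit ABIs */
} Edge;

_Static_assert(offsetof(Edge, b) == 24, "fields occupy bytes 0..27");
_Static_assert(sizeof(Edge) == 32, "on-disk stride is 32 bytes");

/* 1 if the per-result counts add up to the stored total. */
static int edge_counts_ok(const Edge *e)
{
    return (uint64_t)e->w + e->d + e->b == e->total;
}

/* The SMove16 value carried in the low 16 bits of move_id. */
static uint16_t edge_move(const Edge *e)
{
    return (uint16_t)(e->move_id & 0xFFFF);
}
```

Readers that do not want to depend on compiler padding can instead read 32 bytes per record and decode the first 28 by offset.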

uint32_t bucket_id = position_hash & (nbuckets - 1);
uint32_t lo = offsets[bucket_id];
uint32_t hi = offsets[bucket_id + 1];
for (uint32_t i = lo; i < hi; i++) {
    if (edges[i].parent_hash == position_hash) {
        // matching move from this position; emit edges[i]
    }
}

Within a bucket the Edges are not sorted by parent_hash; they appear in hash-table iteration order. The bucket linear scan is fast in practice (~73K edges per bucket at 10 M scale → ~0.15 ms warm) because the access pattern streams sequentially through fixed-stride records (28 bytes of payload at a 32-byte stride). A future format version may sort within buckets to enable binary-search lookup; see ROADMAP.md.

nbuckets = 1024 was chosen so each bucket holds ~73K edges at 10 M-game scale (75M edges / 1024). Smaller corpora have proportionally smaller buckets and faster scans. Larger corpora (>30 M games) may warrant raising nbuckets; the format supports any power of 2.

The parent_hash field is the engine’s Zobrist hash of the position before the move was played. Engines reading tree.dat for lookup must compute their hash compatibly:

  • Standard piece-square Zobrist keys.
  • Castling-rights, side-to-move, and en-passant-square contributions follow the engine in position.hpp.
  • FIDE-2024+ EP semantics: the EP-square contribution is only XOR’d in when an enemy pawn can legally capture en passant. Two positions that differ only in whether their FEN happens to record an EP “ghost” hash identically. This makes transpositions like 1.Nf3 d5 2.d4 and 1.d4 d5 2.Nf3 aggregate correctly in the tree.

Zobrist key tables are deterministic (seeded), so cross-machine hashes match for a given engine version.

For corpora indexed prior to the .scoredb directory format, the archived v1 scan engine (experiments/c-explorer/old/score-scan-v1.c) accepts a legacy stem path:

  • Stem path lichess_2017-02.pgn resolves to lichess_2017-02.pgn.t<N>.score and lichess_2017-02.pgn.t<N>.score.dict files in the same directory.
  • Note: the legacy format kept the original .score extension; only the modern .scoredb layout uses .moves.
  • Shard count defaults to 12 (the legacy default).

Legacy paths have no meta file and no per-corpus bitmap directory. The @name syntax is not supported in legacy mode.

The current query-engine reads only .scoredb directories. Legacy support is preserved in the archived v1 source for reference and for one-off use against old corpora; new work targets .scoredb.

Version history

  • .scoredb v2 (meta version=2, current): stable. Edge.move_id holds an SMove16 in its low 16 bits (from indexer onward); earlier v1 stored an FNV-1a hash of the SAN string in the same slot — same byte width, different semantics.
  • .scoredb v1 (meta version=1): legacy. Format-compatible byte layout but readers must treat the san_hash slot as opaque (no SMove16 decoding).
  • tree.dat v1: stable. Bumping requires re-indexing.
  • PMOTE-BM v1: stable. Same guarantee.
  • The SMove16 move encoding: stable.

Bumping a version field signals an incompatible change that requires re-indexing. Tools must check the version field and refuse unknown versions.

The 2026-05 .score → .moves extension rename did not bump the scoredb version: only the filename on disk changed; the byte layout is identical.

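#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int read_meta_value(const char *corpus_dir, const char *key);  // supplied by the caller

void walk_corpus(const char *corpus_dir) {
    // 1. Read meta file, parse num_shards.
    int num_shards = read_meta_value(corpus_dir, "num_shards");
    // 2. For each shard:
    for (int s = 0; s < num_shards; s++) {
        char path[1024];
        snprintf(path, sizeof(path), "%s/shards/%d.moves", corpus_dir, s);
        int fd = open(path, O_RDONLY);
        if (fd < 0) continue;                          // missing shard: skip or report
        struct stat st;
        if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); continue; }
        uint8_t *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (data == MAP_FAILED) continue;
        // 3. Walk records.
        const uint8_t *p = data, *end = data + st.st_size;
        while (end - p >= 36) {
            uint32_t mc;                               // records are only 2-byte aligned,
            memcpy(&mc, p + 8, sizeof mc);             // so copy the field instead of casting
            mc &= 0xFFFFFF;                            // low 24 bits (little-endian host assumed)
            size_t len = 36 + 2 * (size_t)mc;
            if ((size_t)(end - p) < len) break;        // truncated final record
            const uint16_t *moves = (const uint16_t *)(p + 36);
            // ... process the header fields at p and moves[0..mc) ...
            p += len;
        }
        munmap(data, st.st_size);
    }
}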

This is the entire reader. No external dependencies, no parsing beyond field unpacking.