Data Format Reference
Wire-level specification of every artifact Edge reads or writes. This is the document you reach for when implementing a reader, debugging a malformed file, or extending the format.
Conventions:
- All multi-byte integers are little-endian.
- Sizes are bytes unless explicitly noted otherwise.
- Bit numbering is LSB=0 unless stated.
- All formats are stable at the indicated version; bumping the version field signals an incompatible change.
.scoredb directory layout
The on-disk representation of one indexed corpus. Each directory holds exactly one corpus.

    <stem>.<YYYYMMDD-HHMMSS>.scoredb/
    ├── meta                  text, key=value
    ├── tree.dat              binary, aggregate edge table (optional, see below)
    ├── shards/
    │   ├── 0.moves           binary, per-game game records (see below)
    │   ├── 0.dict            text, newline-separated names
    │   ├── 0.gameidx         binary, uint64 per game (offsets into 0.moves)
    │   ├── 1.moves
    │   ├── 1.dict
    │   ├── 1.gameidx
    │   └── ...
    └── bitmaps/              optional, populated by user
        ├── <name>.bm         single bitmap, PMOTE-BM format
        └── ...

Directory naming: the default is <source-pgn-stem>.<YYYYMMDD-HHMMSS>.scoredb,
producing a unique name per indexing run. --out PATH overrides.
Required entries: meta and shards/ must exist. tree.dat is
emitted by indexer and required by explorer; tools that
only need per-game data may ignore it. bitmaps/ is optional but is
created empty by the indexer.
File extension note: the per-shard game-record files were named
<N>.score prior to 2026-05; the extension was renamed to <N>.moves
to better describe the contents (a stream of SMove16, not chess
scores). Legacy .score files produced by pre-v14 indexers are
readable only by the archived v1 scan engine
(experiments/c-explorer/old/score-scan-v1.c); the current
query-engine reads .moves only.
meta file format
Plain text, ASCII, key=value\n per line. Trailing whitespace is not
trimmed; values are taken verbatim up to the newline. Unknown keys
are ignored by readers.

    version=2
    num_shards=16
    total_games=10164006
    total_moves=199672940
    source_pgn_stem=lichess_2017-02
    source_pgn_size=9289131923
    indexed_at=20260524-153045

Mandatory keys:
| Key | Type | Meaning |
|---|---|---|
| version | int | Format version. Currently 2 (see “Version history” at bottom). |
| num_shards | int | Number of <N>.moves / <N>.dict / <N>.gameidx files under shards/. |
| total_games | int | Count of game records across all shards. |
| total_moves | int | Count of moves (plies) across all games. |
Optional keys:
| Key | Type | Meaning |
|---|---|---|
| source_pgn_stem | string | Basename of the source PGN without extension. |
| source_pgn_size | int | Byte size of the source PGN at index time. |
| indexed_at | string | Timestamp of the indexing run (YYYYMMDD-HHMMSS). |
A reader needing only the shard count needs only version and
num_shards.
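A minimal sketch of a meta lookup, assuming the whole file has already been read into a NUL-terminated buffer. The helper names `meta_get` and `meta_get_int` are hypothetical and not part of Edge. Values are copied verbatim up to the newline, so a value with trailing whitespace fails integer parsing instead of being silently trimmed.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: find `key` in `text` (full meta contents) and copy its
// value into `out`. Returns 1 if found and it fits, 0 otherwise.
static int meta_get(const char *text, const char *key, char *out, size_t out_size)
{
    size_t key_len = strlen(key);
    const char *line = text;
    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t line_len = eol ? (size_t)(eol - line) : strlen(line);
        // A match needs the exact key followed immediately by '='.
        if (line_len > key_len && strncmp(line, key, key_len) == 0 &&
            line[key_len] == '=') {
            size_t val_len = line_len - key_len - 1;
            if (val_len >= out_size) return 0;      // value does not fit
            memcpy(out, line + key_len + 1, val_len);
            out[val_len] = '\0';
            return 1;
        }
        line += line_len + (eol ? 1 : 0);           // unknown keys are skipped
    }
    return 0;
}

// Hypothetical helper: integer-typed key, or `fallback` if missing/malformed.
static long long meta_get_int(const char *text, const char *key, long long fallback)
{
    char buf[64];
    if (!meta_get(text, key, buf, sizeof buf)) return fallback;
    char *end = NULL;
    long long v = strtoll(buf, &end, 10);
    return (end != buf && *end == '\0') ? v : fallback;
}
```

A reader would typically call `meta_get_int(text, "version", -1)` first and refuse to continue on an unknown version before reading `num_shards`.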
.moves file format
Per-shard binary file holding game records. One .moves file per
shard, named <shard-id>.moves under shards/. The shard IDs are
contiguous integers from 0 to num_shards-1.
A .moves file is a flat concatenation of game records, no header,
no padding between records, no per-shard footer.
Game record layout
Each game record is variable-size:

    +--------------------------------+
    | ScoreHeader (36 B)             |
    +--------------------------------+
    | moves[move_count] (2 B each)   |
    +--------------------------------+

move_count is read from the header. The record length is
36 + 2 * move_count bytes. The next record begins immediately after.
ScoreHeader (36 bytes)
| Offset | Size | Field | Type | Notes |
|---|---|---|---|---|
| 0 | 8 | pgn_offset | u64 | Byte offset of [Event in the source PGN. Used to retrieve the original game text. |
| 8 | 4 | move_count | u32 | Low 24 bits used; high 8 reserved. Range 0 to ~16.7 M. |
| 12 | 4 | packed1 | u32 | Result / termination / time control — see below. |
| 16 | 2 | white_elo | u16 | 13 bits used. Saturates at 8191 (0x1FFF). 0 = unknown. |
| 18 | 2 | black_elo | u16 | Same as white_elo. |
| 20 | 2 | eco | u16 | 10 bits used. 0..499 maps to A00..E99. 1023 (0x3FF) = unknown. |
| 22 | 2 | pad0 | u16 | Reserved (zero). |
| 24 | 4 | date_packed | u32 | Year/month/day packed — see below. |
| 28 | 4 | white_name_idx | u32 | Index into this shard’s .dict. 0xFFFFFFFF = unknown. |
| 32 | 4 | black_name_idx | u32 | Same as white_name_idx. |
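Total: 36 bytes, a multiple of 4 but not of 8. Because records are packed back to back and each move adds 2 bytes, a header is guaranteed only 2-byte alignment within the file; readers should not assume its u32 or u64 fields are naturally aligned.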
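One way to read a header without depending on struct packing or pointer alignment is to decode it field by field from the raw bytes. The sketch below follows the offsets in the table above; the type `ScoreHeaderView` and the function names are illustrative assumptions, not Edge's own definitions.

```c
#include <stdint.h>

enum { SCORE_HEADER_SIZE = 36 };

// In-memory view of one header; field names follow the table above.
typedef struct {
    uint64_t pgn_offset;
    uint32_t move_count;       // already masked to the low 24 bits
    uint32_t packed1;
    uint16_t white_elo;        // 13 bits, 0 = unknown
    uint16_t black_elo;
    uint16_t eco;              // 10 bits, 0x3FF = unknown
    uint32_t date_packed;
    uint32_t white_name_idx;   // 0xFFFFFFFF = unknown
    uint32_t black_name_idx;
} ScoreHeaderView;

// Little-endian loads that work at any alignment and on any host.
static uint16_t rd_u16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t rd_u32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
static uint64_t rd_u64(const uint8_t *p)
{
    return (uint64_t)rd_u32(p) | ((uint64_t)rd_u32(p + 4) << 32);
}

// Decode the 36 bytes starting at `p`.
static ScoreHeaderView score_header_decode(const uint8_t *p)
{
    ScoreHeaderView h;
    h.pgn_offset     = rd_u64(p + 0);
    h.move_count     = rd_u32(p + 8) & 0xFFFFFFu;   // high 8 bits reserved
    h.packed1        = rd_u32(p + 12);
    h.white_elo      = rd_u16(p + 16) & 0x1FFF;
    h.black_elo      = rd_u16(p + 18) & 0x1FFF;
    h.eco            = rd_u16(p + 20) & 0x3FF;
    h.date_packed    = rd_u32(p + 24);
    h.white_name_idx = rd_u32(p + 28);
    h.black_name_idx = rd_u32(p + 32);
    return h;
}

// Total record length: header plus 2 bytes per move.
static uint64_t record_size(const ScoreHeaderView *h)
{
    return SCORE_HEADER_SIZE + 2ull * h->move_count;
}
```

Byte-wise loads cost a little more than a struct cast, but they stay correct when a record begins at an offset that is only 2-byte aligned.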
packed1 bit layout (32 bits)
| Bits | Field | Width | Range / Encoding |
|---|---|---|---|
| 0–2 | result | 3 | 0=* (unknown), 1=1-0, 2=1/2-1/2, 3=0-1, others reserved |
| 3–5 | termination | 3 | 0=Normal, 1=Time forfeit, 2=Abandoned, 3=Rules infraction, 7=Other, 4-6 reserved |
| 6–18 | tc_base_seconds | 13 | Base time in seconds; saturates at 8191 (0x1FFF) |
| 19–24 | tc_increment | 6 | Increment per move in seconds; saturates at 63 (0x3F) |
| 25 | tc_is_correspondence | 1 | 1 if TC string was - or base saturated |
| 26–31 | reserved | 6 | Zero |
Time-control category is derived from packed1 at query time:

    if (tc_is_correspondence) -> correspondence
    total = tc_base_seconds + 60 * tc_increment
    if total < 180   -> bullet
    if total < 600   -> blitz
    if total < 3600  -> rapid
    else             -> classical

date_packed bit layout (32 bits)
| Bits | Field | Width | Range |
|---|---|---|---|
| 0–11 | year | 12 | 0–4095 (0 = unknown) |
| 12–15 | month | 4 | 1–12 (0 = unknown) |
| 16–20 | day | 5 | 1–31 (0 = unknown) |
| 21–31 | reserved | 11 | Zero |
date_packed == 0 indicates a fully unknown date.
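The packed1 and date_packed layouts above reduce to shifts and masks. The following sketch unpacks both words and applies the time-control rule as stated; the accessor names and the `TcCategory` enum are assumptions made for illustration.

```c
#include <stdint.h>

typedef enum {
    TC_BULLET, TC_BLITZ, TC_RAPID, TC_CLASSICAL, TC_CORRESPONDENCE
} TcCategory;

// packed1 accessors, one per row of the bit-layout table.
static unsigned p1_result(uint32_t p1)               { return p1 & 0x7u; }
static unsigned p1_termination(uint32_t p1)          { return (p1 >> 3) & 0x7u; }
static unsigned p1_tc_base_seconds(uint32_t p1)      { return (p1 >> 6) & 0x1FFFu; }
static unsigned p1_tc_increment(uint32_t p1)         { return (p1 >> 19) & 0x3Fu; }
static unsigned p1_tc_is_correspondence(uint32_t p1) { return (p1 >> 25) & 0x1u; }

// Time-control category: base + 60 * increment, thresholds 180 / 600 / 3600.
static TcCategory tc_category(uint32_t p1)
{
    if (p1_tc_is_correspondence(p1)) return TC_CORRESPONDENCE;
    unsigned total = p1_tc_base_seconds(p1) + 60u * p1_tc_increment(p1);
    if (total < 180)  return TC_BULLET;
    if (total < 600)  return TC_BLITZ;
    if (total < 3600) return TC_RAPID;
    return TC_CLASSICAL;
}

// date_packed accessors; each returns 0 when that component is unknown.
static unsigned date_year(uint32_t d)  { return d & 0xFFFu; }
static unsigned date_month(uint32_t d) { return (d >> 12) & 0xFu; }
static unsigned date_day(uint32_t d)   { return (d >> 16) & 0x1Fu; }
```

For example, a 300+3 game has total = 300 + 60 * 3 = 480 and therefore falls in the blitz band.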
eco field encoding
10-bit unsigned integer, range 0..499 representing ECO codes
A00..E99:

    letter = 'A' + (eco / 100)   // 'A' .. 'E'
    digits = eco % 100           // 00 .. 99

So B22 is (1 * 100) + 22 = 122. Value 1023 (0x3FF) is reserved
for “ECO unknown / not present in PGN.”
Range queries (--eco B20-B99) compile to inclusive [120, 199].
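The encoding is simple enough to round-trip in a few lines. This sketch assumes three-character codes only (a letter A..E plus two digits); the function names are hypothetical.

```c
#include <stdint.h>

enum { ECO_UNKNOWN = 0x3FF };   // 1023, reserved for "not present in PGN"

// Encode "B22" -> 122. Anything that is not A..E followed by exactly two
// digits maps to ECO_UNKNOWN.
static uint16_t eco_encode(const char *s)
{
    if (!s || s[0] < 'A' || s[0] > 'E') return ECO_UNKNOWN;
    if (s[1] < '0' || s[1] > '9' || s[2] < '0' || s[2] > '9' || s[3] != '\0')
        return ECO_UNKNOWN;
    return (uint16_t)((s[0] - 'A') * 100 + (s[1] - '0') * 10 + (s[2] - '0'));
}

// Decode 122 -> "B22" into out[4]. Returns 0 for values outside 0..499.
static int eco_decode(uint16_t eco, char out[4])
{
    if (eco > 499) return 0;
    out[0] = (char)('A' + eco / 100);
    out[1] = (char)('0' + (eco % 100) / 10);
    out[2] = (char)('0' + eco % 10);
    out[3] = '\0';
    return 1;
}
```

With these helpers a range bound is just `eco_encode` of each endpoint, so B20 becomes 120 and B99 becomes 199.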
Move encoding (SMove16)
Each move in the moves[] array is 16 bits:
| Bits | Field | Notes |
|---|---|---|
| 0–5 | from square | 0–63, standard chess square encoding (a1=0, h1=7, a8=56, h8=63) |
| 6–11 | to square | 0–63, same encoding |
| 12–13 | promotion | 0=none, 1=knight, 2=bishop, 3=rook. Queen promotion uses flags=3 (see below). |
| 14–15 | flags | 0=normal, 1=castle, 2=en-passant capture, 3=queen promotion |
Square encoding: standard 0–63 with square = rank * 8 + file,
where rank=0 is white’s first rank. Engines that internally use 0x88
must convert.
Promotion encoding: the promotion type uses two distinct paths:
- For knight/bishop/rook: flags = 0, promo = 1..3.
- For queen: flags = 3, promo is ignored (set to 0 by convention).
This compresses the common case (queen promotion is the dominant
choice) and frees promo == 0 && flags == 0 for “no promotion.”
Castling: flags = 1. The from and to are the king’s squares;
the rook movement is derived (king-side castling: rook from h-file to
f-file; queen-side: a-file to d-file).
En passant capture: flags = 2. from and to are the capturing
pawn’s squares; the captured pawn is on the same rank as from, same
file as to.
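Putting the bit layout and the promotion, castling and en passant rules together, a decoder might look like the sketch below. `DecodedMove`, `Promo` and the function names are illustrative assumptions; only the bit positions come from the table above.

```c
#include <stdint.h>

// Values 0..3 match the 2-bit promo field; PROMO_QUEEN comes from flags == 3.
typedef enum { PROMO_NONE, PROMO_KNIGHT, PROMO_BISHOP, PROMO_ROOK, PROMO_QUEEN } Promo;

typedef struct {
    uint8_t from, to;            // 0..63, square = rank * 8 + file
    uint8_t is_castle;           // flags == 1
    uint8_t is_en_passant;       // flags == 2
    Promo   promo;
} DecodedMove;

static DecodedMove smove16_decode(uint16_t m)
{
    DecodedMove d;
    unsigned promo = (m >> 12) & 0x3u;
    unsigned flags = (m >> 14) & 0x3u;
    d.from          = (uint8_t)(m & 0x3Fu);
    d.to            = (uint8_t)((m >> 6) & 0x3Fu);
    d.is_castle     = (flags == 1);
    d.is_en_passant = (flags == 2);
    // Queen promotion is signalled by flags alone; the promo field only
    // carries meaning when flags == 0.
    d.promo = (flags == 3) ? PROMO_QUEEN
            : (flags == 0) ? (Promo)promo
            : PROMO_NONE;
    return d;
}

// Square of the pawn removed by an en passant capture:
// same rank as `from`, same file as `to`.
static uint8_t ep_captured_square(DecodedMove d)
{
    return (uint8_t)((d.from & ~7u) | (d.to & 7u));
}
```

As a check on the square numbering, e2 is 1 * 8 + 4 = 12 and e4 is 3 * 8 + 4 = 28, so 1.e4 encodes as 12 | (28 << 6).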
.dict file format
Plain UTF-8 text, one name per line, no header, no escaping. The
line number (0-based) is the index value referenced from
white_name_idx / black_name_idx in the ScoreHeader.
Example:

    DrNykterstein
    Magnus Carlsen
    opperwezen
    penguingm1
    nihalsarin2004

Dictionaries are per-shard. Name index 42 in shard 3 refers to
line 42 of shards/3.dict, not to a global position. To search by
name across the corpus, the scan engine looks up the query string in
each shard’s dict independently.
Names are stored verbatim. Unicode is preserved. No case-folding,
no whitespace normalization. Matching is exact-string at query time.
A name that appears as both "Magnus Carlsen" and "Carlsen, Magnus"
in the source PGN will have two separate dict entries.
.gameidx file format
Per-shard binary sidecar containing one uint64_t per game in the
shard: the byte offset of that game’s ScoreHeader within the
shard’s .moves file. The flat layout enables O(matches) random
access by game-id for bitmap-filtered scans and arbitrary parallel
chunking.

    +-------------------------------+
    | uint64_t[game_count]          |
    | (game 0 offset, game 1, ...)  |
    +-------------------------------+

The file size is exactly 8 * game_count bytes; no header, no
padding. game_count for each shard can be derived by dividing the
file size by 8, or read from the corresponding entry in the
bitmap’s shard table (the two must agree).
Offsets are byte offsets relative to the start of <N>.moves. The
first record’s offset is always 0; the last record’s offset plus
its own 36 + 2*move_count bytes equals the file size of <N>.moves.
indexer emits .gameidx by default. Pre-v14 indexers
(now archived) did not emit this sidecar; the archived v1 scan engine
falls back to scanning headers sequentially in that case. The current
query-engine assumes .gameidx is present (the indexer always
emits it).
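The invariants above (size divisible by 8, first offset 0, each record ending where the next begins, last record ending at the file size) can be checked in one pass. This is a sketch that assumes both files are already in memory; `gameidx_validate` and its helpers are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

enum { HEADER_SIZE = 36 };

// Load one little-endian u64 without assuming host byte order or alignment.
static uint64_t load_u64_le(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

// Offset of game `g` within the shard's .moves file.
static uint64_t gameidx_offset(const uint8_t *idx, size_t g)
{
    return load_u64_le(idx + 8 * g);
}

// move_count lives at header offset 8; only the low 24 bits are used.
static uint32_t record_move_count(const uint8_t *rec)
{
    return (uint32_t)rec[8] | ((uint32_t)rec[9] << 8) | ((uint32_t)rec[10] << 16);
}

// Returns 1 if the sidecar is consistent with the .moves payload, else 0.
static int gameidx_validate(const uint8_t *idx, size_t idx_size,
                            const uint8_t *moves, size_t moves_size)
{
    if (idx_size % 8 != 0) return 0;
    size_t game_count = idx_size / 8;
    uint64_t expected = 0;                       // first record starts at 0
    for (size_t g = 0; g < game_count; g++) {
        uint64_t off = gameidx_offset(idx, g);
        if (off != expected) return 0;           // gap or overlap
        if (moves_size < HEADER_SIZE || off > moves_size - HEADER_SIZE) return 0;
        expected = off + HEADER_SIZE + 2ull * record_move_count(moves + off);
    }
    return expected == moves_size;               // last record ends at EOF
}
```

Random access to game `g` is then `moves + gameidx_offset(idx, g)`, with no need to walk the preceding headers.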
.bm bitmap file format
One bitmap per file. Each bitmap is a set of matched games, dense per-shard, sharded to match the corpus.
File layout

    +---------------------------------+
    | Header (32 bytes)               |
    +---------------------------------+
    | Shard table (16 B × N shards)   |
    +---------------------------------+
    | Bits, shard 0                   |
    | Bits, shard 1                   |
    | ...                             |
    | Bits, shard N-1                 |
    +---------------------------------+

Header (32 bytes)
| Offset | Size | Field | Notes |
|---|---|---|---|
| 0 | 8 | magic | ASCII "PMOTE-BM" (exactly 8 bytes, no terminator) |
| 8 | 4 | version | u32, currently 1 |
| 12 | 4 | num_shards | u32 |
| 16 | 8 | total_games | u64, summed across all shards |
| 24 | 8 | reserved | u64, zero |
Shard table (16 bytes per shard)
| Offset | Size | Field | Notes |
|---|---|---|---|
| 0 | 4 | shard_id | u32 |
| 4 | 4 | game_count | u32, games in this shard |
| 8 | 8 | bits_offset | u64, byte offset from file start to this shard’s bits |
Entries are in shard_id order. The shard table immediately follows the 32-byte header (so it starts at offset 32).
Bits payload
Per shard, ceil(game_count / 8) bytes. Within each byte, bit 0
(LSB) corresponds to the lowest game index in the shard, bit 1 to the
next, etc. Game index 0 in a shard is the first game record in that
shard’s .moves file.

    bit_value(game_index_in_shard):
        byte = game_index_in_shard / 8
        bit  = game_index_in_shard % 8
        return (bits[byte] >> bit) & 1

Bits beyond game_count (in the trailing partial byte) must be
zero. Readers can assume this; writers must ensure it.
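In C the bit addressing and the trailing-bits rule come out as follows. This is a sketch over one shard's payload; the `bm_*` names are assumptions, not Edge's API.

```c
#include <stddef.h>
#include <stdint.h>

// Payload size for one shard: ceil(game_count / 8).
static size_t bm_shard_bytes(uint32_t game_count)
{
    return ((size_t)game_count + 7) / 8;
}

// Test the bit for one game; bit 0 (LSB) of byte 0 is game index 0.
static int bm_test(const uint8_t *bits, uint32_t game_index)
{
    return (bits[game_index / 8] >> (game_index % 8)) & 1;
}

// Set the bit for one game. The caller keeps game_index < game_count.
static void bm_set(uint8_t *bits, uint32_t game_index)
{
    bits[game_index / 8] |= (uint8_t)(1u << (game_index % 8));
}

// Writer-side check: returns 1 if every bit beyond game_count in the
// trailing partial byte is zero.
static int bm_trailing_bits_clear(const uint8_t *bits, uint32_t game_count)
{
    unsigned used = game_count % 8;
    if (used == 0) return 1;                 // no partial byte
    return (bits[game_count / 8] >> used) == 0;
}
```

A writer could run `bm_trailing_bits_clear` per shard just before flushing, which keeps the reader-side assumption cheap to uphold.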
Bitmap semantics
A set bit means “this game matches the predicate the bitmap
represents.” Bitmaps are per-corpus — a bitmap produced from one
.scoredb is not portable to another .scoredb, even of the same
source PGN, because shard counts and game indices depend on indexing
parameters.
Set algebra constraints
bitmap-combine operations (and, or, xor, sub, not) require
the operand bitmaps to share identical shard layout: same
num_shards, same per-shard game_count. Bitmaps from different
corpora cannot be combined.
not produces the complement up to each shard’s game_count — bits
beyond the last game stay zero per the spec.
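The only operation that needs special care is not, because a plain byte-wise complement would set the padding bits. A hedged sketch of three of the operations over one shard's payload, assuming both operands already share the same game_count (function names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

// dst = a AND b. Padding bits stay zero because they are zero in both inputs.
static void bm_and(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                   uint32_t game_count)
{
    size_t n = ((size_t)game_count + 7) / 8;
    for (size_t i = 0; i < n; i++) dst[i] = a[i] & b[i];
}

// dst = a AND NOT b (set difference). a's zero padding masks the result.
static void bm_sub(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                   uint32_t game_count)
{
    size_t n = ((size_t)game_count + 7) / 8;
    for (size_t i = 0; i < n; i++) dst[i] = (uint8_t)(a[i] & ~b[i]);
}

// dst = NOT a, complemented only up to game_count.
static void bm_not(uint8_t *dst, const uint8_t *a, uint32_t game_count)
{
    size_t n = ((size_t)game_count + 7) / 8;
    for (size_t i = 0; i < n; i++) dst[i] = (uint8_t)~a[i];
    unsigned used = game_count % 8;
    // Clear the bits past the last game in the trailing partial byte.
    if (used != 0) dst[n - 1] &= (uint8_t)((1u << used) - 1u);
}
```

A full implementation would first compare num_shards and every per-shard game_count from the two shard tables and refuse to proceed on any mismatch.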
@name bitmap references
A bitmap argument starting with @ resolves to a path inside the
corpus directory:

    @sicilian → <corpus>/bitmaps/sicilian.bm
    @killer-A → <corpus>/bitmaps/killer-A.bm

Names may contain alphanumerics and -, _. No / or . in names;
they’d break the resolution.
Outside a .scoredb context, @name references fall back to literal
file paths (with a warning).
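A rough sketch of the resolution rule, assuming ASCII names and a known corpus directory. It covers only the in-corpus case; the fallback to literal paths outside a .scoredb context is not modelled. `bm_name_valid` and `bm_resolve` are hypothetical names.

```c
#include <ctype.h>
#include <stdio.h>

// Returns 1 if `name` is non-empty and uses only alphanumerics, '-' and '_'.
static int bm_name_valid(const char *name)
{
    if (*name == '\0') return 0;
    for (const char *c = name; *c != '\0'; c++) {
        if (!isalnum((unsigned char)*c) && *c != '-' && *c != '_') return 0;
    }
    return 1;
}

// Resolve "@name" to "<corpus>/bitmaps/name.bm"; any other argument is
// copied through as a literal path. Returns 1 on success, 0 if the name is
// invalid or the result does not fit in `out`.
static int bm_resolve(const char *corpus_dir, const char *arg,
                      char *out, size_t out_size)
{
    int n;
    if (arg[0] != '@') {
        n = snprintf(out, out_size, "%s", arg);
    } else {
        if (!bm_name_valid(arg + 1)) return 0;   // rejects '/', '.', empty
        n = snprintf(out, out_size, "%s/bitmaps/%s.bm", corpus_dir, arg + 1);
    }
    return n >= 0 && (size_t)n < out_size;
}
```

Rejecting `/` and `.` up front also keeps a reference such as `@../x` from escaping the bitmaps/ directory.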
Bitmap density and encoding choice
The current encoding is dense: every game contributes a bit whether matched or not. For the roughly 10 M-game corpus in the meta example (10,164,006 games) this is about 1.27 MB per bitmap, regardless of how many bits are actually set.
Tradeoff with sparse encodings (Roaring, sorted-ID lists): dense is faster for the prefilter-scan hot path (sequential reads, branch prediction). Sparse is smaller for very-low-density bitmaps. We chose dense because the dominant access pattern is “scan corpus, test bit per game,” which favors dense memory layout.
This is a single localized decision; the format can grow to support encoding flags in a future version without disrupting the rest of the system. For now, no encoding flag is stored in the header; v1 implies dense.
tree.dat file format
Single file at the scoredb root holding the aggregate edge table:
one row per unique (parent_hash, move_id) observed across the
corpus, with W/D/B counts summed over all games that played that
move from that position.
The file is produced by indexer during its finalize step and
consumed by explorer.
File layout

    +------------------------------------+
    | Header (32 bytes)                  |
    +------------------------------------+
    | Bucket offsets (4 × (nbuckets+1))  |
    +------------------------------------+
    | Edge[total_edges]                  |
    | packed by bucket                   |
    +------------------------------------+

Header (32 bytes)
| Offset | Size | Field | Notes |
|---|---|---|---|
| 0 | 8 | magic | ASCII "EDGE_TRE" (exactly 8 bytes, no terminator) |
| 8 | 4 | version | u32, currently 1 |
| 12 | 4 | nbuckets | u32. Must be a power of 2. Currently 1024. |
| 16 | 8 | total_edges | u64, total number of Edge records |
| 24 | 8 | reserved | u64, zero |
Bucket offset table
uint32[nbuckets+1] cumulative edge counts (not byte offsets).
Bucket b spans [offsets[b], offsets[b+1]) in the Edge array.
offsets[0] == 0 and offsets[nbuckets] == total_edges.
Edge record (28 bytes; payload stride is 32 with C padding)
| Offset | Size | Field | Notes |
|---|---|---|---|
| 0 | 8 | parent_hash | u64. Zobrist of position before the move. |
| 8 | 4 | move_id | u32. SMove16 in low 16 bits; high bits zero. |
| 12 | 4 | total | u32. Game count summing W+D+B. |
| 16 | 4 | w | u32. Games White won from this edge. |
| 20 | 4 | d | u32. Games drawn from this edge. |
| 24 | 4 | b | u32. Games Black won from this edge. |
Total: 28 bytes; the C struct pads to 32 bytes on write, so the
on-disk record stride is 32. move_id decoding follows the standard
SMove16 layout (see Move encoding above).
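The 28-versus-32 distinction follows from ordinary C struct layout: a leading uint64_t gives the struct 8-byte alignment, so its size rounds up to 32. A sketch that pins this down with compile-time checks, assuming a C11 compiler on a typical 64-bit ABI (the helper names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t parent_hash;   // Zobrist of the position before the move
    uint32_t move_id;       // SMove16 in the low 16 bits
    uint32_t total;         // w + d + b
    uint32_t w, d, b;
    // 4 bytes of implicit tail padding on ABIs where uint64_t aligns to 8
} Edge;

// Fail the build instead of silently misreading on an ABI that lays the
// struct out differently (e.g. 4-byte alignment for uint64_t).
_Static_assert(sizeof(Edge) == 32, "Edge stride must be 32 bytes");
_Static_assert(offsetof(Edge, b) == 24, "b must sit at offset 24");

// Extract the SMove16 from move_id.
static uint16_t edge_smove16(const Edge *e)
{
    return (uint16_t)(e->move_id & 0xFFFFu);
}

// Sanity check: total should equal w + d + b.
static int edge_counts_consistent(const Edge *e)
{
    return e->total == (uint64_t)e->w + e->d + e->b;
}

// Byte offset of the Edge array: 32-byte header plus the bucket table.
// Assumes no padding between the table and the array.
static uint64_t tree_edges_offset(uint32_t nbuckets)
{
    return 32 + 4ull * ((uint64_t)nbuckets + 1);
}
```

Under the no-padding assumption, nbuckets = 1024 puts the edge array at byte 4132, which is not a multiple of 8, so copying each record out with memcpy is safer than casting a mapped pointer to `Edge *`.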
Lookup procedure

    uint32_t bucket_id = (uint32_t)(position_hash & (nbuckets - 1));
    uint32_t lo = offsets[bucket_id];
    uint32_t hi = offsets[bucket_id + 1];

    for (uint32_t i = lo; i < hi; i++) {
        if (edges[i].parent_hash == position_hash) {
            // matching move from this position; emit edges[i]
        }
    }

Within a bucket the Edges are not sorted by parent_hash;
they appear in hash-table iteration order. The bucket linear scan
is fast in practice (~73K edges per bucket at 10 M scale → ~0.15 ms
warm) because the access pattern streams sequentially through
fixed-stride records (28 bytes of fields, 32 bytes on disk). A future
format version may sort within buckets to enable
binary-search lookup; see ROADMAP.md.
Bucket size guidance
nbuckets = 1024 was chosen so each bucket holds ~73K edges at
10 M-game scale (75M edges / 1024). Smaller corpora have proportionally
smaller buckets and faster scans. Larger corpora (>30 M games) may
warrant raising nbuckets; the format supports any power of 2.
Position hashing convention
The parent_hash field is the engine’s Zobrist hash of the position
before the move was played. Engines reading tree.dat for lookup
must compute their hash compatibly:
- Standard piece-square Zobrist keys.
- Castling-rights, side-to-move, and en-passant-square contributions
  follow the engine in position.hpp.
- FIDE-2024+ EP semantics: the EP-square contribution is only
  XOR’d in when an enemy pawn can legally capture en passant. Two
  positions that differ only in whether their FEN happens to record
  an EP “ghost” hash identically. This makes transpositions like
  1.Nf3 d5 2.d4 ≡ 1.d4 d5 2.Nf3 aggregate correctly in the tree.
Zobrist key tables are deterministic (seeded), so cross-machine hashes match for a given engine version.
Backward compatibility and legacy paths
For corpora indexed prior to the .scoredb directory format, the
archived v1 scan engine
(experiments/c-explorer/old/score-scan-v1.c) accepts a legacy
stem path:
- Stem path lichess_2017-02.pgn resolves to lichess_2017-02.pgn.t<N>.score and lichess_2017-02.pgn.t<N>.score.dict files in the same directory.
- Note: the legacy format kept the original .score extension; only the modern .scoredb layout uses .moves.
- Shard count defaults to 12 (the legacy default).
Legacy paths have no meta file and no per-corpus bitmap directory.
The @name syntax is not supported in legacy mode.
The current query-engine reads only .scoredb directories. Legacy
support is preserved in the archived v1 source for reference and for
one-off use against old corpora; new work targets .scoredb.
Version history and stability
- .scoredb v2 (meta version=2, current): stable. Edge.move_id holds an SMove16 in its low 16 bits (from indexer onward); earlier v1 stored an FNV-1a hash of the SAN string in the same slot (same byte width, different semantics).
- .scoredb v1 (meta version=1): legacy. Format-compatible byte layout but readers must treat the san_hash slot as opaque (no SMove16 decoding).
- tree.dat v1: stable. Bumping requires re-indexing.
- PMOTE-BM v1: stable. Same guarantee.
- The SMove16 move encoding: stable.

Bumping a version field signals an incompatible change that requires
re-indexing. Tools must check the version field and refuse unknown
versions.
The 2026-05 .score → .moves extension rename did not bump
the scoredb version: only the filename on disk changed; the byte
layout is identical.
Reading a .scoredb from scratch (sketch)
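
    #include <fcntl.h>      // open
    #include <stdint.h>
    #include <stdio.h>      // snprintf
    #include <sys/mman.h>   // mmap, munmap
    #include <sys/stat.h>   // fstat
    #include <unistd.h>     // close

    // 1. Read meta file, parse num_shards.
    int num_shards = read_meta_value(corpus_dir, "num_shards");

    // 2. For each shard:
    for (int s = 0; s < num_shards; s++) {
        char path[1024];
        snprintf(path, sizeof(path), "%s/shards/%d.moves", corpus_dir, s);
        int fd = open(path, O_RDONLY);
        if (fd < 0) continue;                        // shard missing or unreadable
        struct stat st;
        if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); continue; }
        uint8_t *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (data == MAP_FAILED) continue;

        // 3. Walk records. ScoreHeader should be a packed 36-byte struct,
        //    since records are only guaranteed 2-byte alignment.
        const uint8_t *p = data, *end = data + st.st_size;
        while (end - p >= 36) {
            const ScoreHeader *h = (const ScoreHeader *)p;
            uint32_t mc = h->move_count & 0xFFFFFF;
            const uint16_t *moves = (const uint16_t *)(p + 36);

            // ... process header + moves ...

            p += 36 + 2 * mc;
        }
        munmap(data, st.st_size);
    }

This is the entire reader. No external dependencies, no parsing beyond field unpacking.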