Text search

Each course is indexed across five fields, each with its own BM25 weight:

| Field | Weight | Source |
|---|---|---|
| code | 5.0 | course code (e.g. 15-122), tokenized into prefix forms so partial codes match |
| name | 2.0 | course title |
| instructor_names | 1.5 | section instructor list |
| description | 1.0 | catalog description |
| prereqs_text | 1.0 | prereq prose, when present |

Tokenization is field-specific. Course codes go through tokenize_code, which emits prefix variants so that 15, 15-1, and 15-122 are all reachable. Description, name, and prereq text use a general tokenizer that lowercases and strips punctuation; instructor names use the same general path with diacritic folding.
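As a rough illustration of the two tokenization paths, here is a minimal sketch. The function names tokenize_code_sketch and tokenize_general_sketch are hypothetical stand-ins, not the real tokenize_code. The exact set of prefixes emitted is an assumption (department alone, then the department plus every prefix of the course number), as is treating punctuation as a token separator. Diacritic folding for instructor names is left out because it would need a Unicode normalization library.

```rust
/// Hypothetical stand-in for `tokenize_code`. Assumes codes look like
/// "DEPT-NUMBER" and emits the department plus every number prefix.
fn tokenize_code_sketch(code: &str) -> Vec<String> {
    let code = code.trim();
    if code.is_empty() {
        return Vec::new();
    }
    match code.split_once('-') {
        // No dash: index the whole code as a single token.
        None => vec![code.to_string()],
        Some((dept, number)) => {
            let mut tokens = vec![dept.to_string()];
            // char_indices yields byte offsets, so slices stay on char boundaries.
            for (i, ch) in number.char_indices() {
                let end = i + ch.len_utf8();
                tokens.push(format!("{}-{}", dept, &number[..end]));
            }
            tokens
        }
    }
}

/// Hypothetical general tokenizer: lowercases and splits on anything that is
/// not alphanumeric, which drops punctuation along the way.
fn tokenize_general_sketch(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}
```

Under these assumptions, tokenize_code_sketch("15-122") yields 15, 15-1, 15-12, and 15-122, so a query for any of those partial codes finds an exact term in the dictionary without needing a prefix scan at query time.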

FST term dictionary plus baked postings

Per field, the index holds an FST mapping each term to an offset in a posting-list arena. A posting list is u32 df followed by df entries of (u32 doc_id, f32 score). Eight bytes per posting, no per-query allocation.
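The arena layout described above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual implementation: little-endian byte order is assumed, the function names write_postings, read_u32, and accumulate are hypothetical, and the FST lookup that produces the offset is omitted.

```rust
/// Appends one posting list to the arena and returns its starting offset.
/// Layout follows the text: u32 df, then df pairs of (u32 doc_id, f32 score).
/// Little-endian byte order is an assumption of this sketch.
fn write_postings(arena: &mut Vec<u8>, postings: &[(u32, f32)]) -> usize {
    let offset = arena.len();
    arena.extend_from_slice(&(postings.len() as u32).to_le_bytes());
    for &(doc_id, score) in postings {
        arena.extend_from_slice(&doc_id.to_le_bytes());
        arena.extend_from_slice(&score.to_le_bytes());
    }
    offset
}

/// Reads a little-endian u32 at `at`, or None if the slice is too short.
fn read_u32(bytes: &[u8], at: usize) -> Option<u32> {
    let chunk = bytes.get(at..at.checked_add(4)?)?;
    Some(u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
}

/// Adds every baked score in the list at `offset` to the per-document
/// accumulator. Returns None on a truncated arena or out-of-range doc_id.
fn accumulate(arena: &[u8], offset: usize, acc: &mut [f32]) -> Option<()> {
    let df = read_u32(arena, offset)? as usize;
    let mut pos = offset.checked_add(4)?;
    for _ in 0..df {
        let doc_id = read_u32(arena, pos)? as usize;
        // The f32 was stored as its raw bits, so reinterpret the u32.
        let score = f32::from_bits(read_u32(arena, pos + 4)?);
        *acc.get_mut(doc_id)? += score;
        pos += 8; // eight bytes per posting
    }
    Some(())
}
```

The accumulator is a caller-owned slice indexed by doc_id, so scoring a term touches no allocator: the loop only reads from the arena and adds into memory that already exists.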

The f32 score is baked at build time, not computed at query time:

field_weight * idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * dl / avg_dl))

with K1 = 1.2 and B = 0.75. Query scoring is “read 8 bytes per posting, add to accumulator.” This is the dominant reason BM25 queries finish in microseconds even with a few thousand candidates per term.
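For reference, the baked score translates directly into code. The sketch below follows the formula above term by term; the function name baked_score is hypothetical. The text does not say which IDF variant is used, so idf_sketch shows one common smoothed form purely as an assumption.

```rust
const K1: f32 = 1.2;
const B: f32 = 0.75;

/// Build-time score for one posting, following the formula in the text:
/// field_weight * idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * dl / avg_dl))
fn baked_score(field_weight: f32, idf: f32, tf: f32, dl: f32, avg_dl: f32) -> f32 {
    // Documents longer than average get a larger denominator, hence a lower score.
    let length_norm = 1.0 - B + B * dl / avg_dl;
    field_weight * idf * (tf * (K1 + 1.0)) / (tf + K1 * length_norm)
}

/// One common smoothed IDF form. Assumption: the source does not specify
/// which IDF definition the index actually uses.
fn idf_sketch(doc_count: f32, df: f32) -> f32 {
    (1.0 + (doc_count - df + 0.5) / (df + 0.5)).ln()
}
```

When dl equals avg_dl and tf is 1, the term-frequency factor reduces to 1, so the stored score is just field_weight * idf. Because every input is known once the corpus is fixed, nothing in this expression needs to be evaluated while a query runs.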

Levenshtein typo tolerance

For tokens at least four characters long, the query runs a Levenshtein automaton against the field's FST with edit distance one (or two for longer tokens). Fuzzy hits get a 0.5 score multiplier so exact matches always rank ahead. The “did you mean” hint surfaces the lowest-distance code variant when the user’s literal query returns nothing.
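The edit-distance policy can be illustrated without an FST. The sketch below swaps in a plain dynamic-programming Levenshtein distance and a linear scan over a term list; it is not an automaton and would be far slower on a real dictionary, but it accepts the same set of terms. The cutoff for "longer tokens" is not given in the text, so LONG_TOKEN_LEN = 8 is a hypothetical value, and the names max_edits and fuzzy_matches are hypothetical as well.

```rust
const FUZZY_MULTIPLIER: f32 = 0.5;
/// Hypothetical cutoff: the text does not say where distance two begins.
const LONG_TOKEN_LEN: usize = 8;

/// Two-row dynamic-programming Levenshtein distance over chars.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Edit budget by token length: none below four chars, then one, then two.
fn max_edits(token: &str) -> usize {
    match token.chars().count() {
        0..=3 => 0,
        n if n < LONG_TOKEN_LEN => 1,
        _ => 2,
    }
}

/// Returns each matching term with its score multiplier:
/// 1.0 for an exact match, 0.5 for a fuzzy match within budget.
fn fuzzy_matches<'a>(token: &str, terms: &[&'a str]) -> Vec<(&'a str, f32)> {
    let budget = max_edits(token);
    terms
        .iter()
        .filter_map(|&term| match levenshtein(token, term) {
            0 => Some((term, 1.0)),
            d if d <= budget => Some((term, FUZZY_MULTIPLIER)),
            _ => None,
        })
        .collect()
}
```

The multiplier would be applied to the baked posting scores of the matched term, so a misspelled query such as "algorthm" still reaches the postings for "algorithm" at half weight.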

Prebuilt text region

The prebuilt_text region in the catalog binary stores both the FST bytes and the posting arena per field, so loaders skip the most expensive build phase. Index::build_with_prebuilt_text reuses these bytes directly; Index::build is only needed when generating a fresh catalog.bin from a Corpus.