Text search

Each course is indexed across five fields with distinct BM25 weights:

Field	Weight	Source
`code`	5.0	course code (e.g. `15-122`), tokenized into prefix forms so partial codes match
`name`	2.0	course title
`instructor_names`	1.5	section instructor list
`description`	1.0	catalog description
`prereqs_text`	1.0	prereq prose, when present

Tokenization is field-specific. Course codes go through `tokenize_code`, which emits prefix variants so that `15`, `15-1`, and `15-122` are all reachable. Description, name, and prereq text use a general tokenizer that lowercases and strips punctuation; instructor names use the same general path with diacritic folding.
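As a rough illustration of the prefix idea, here is a minimal sketch. The function name `code_prefixes` and the exact set of prefixes it emits are assumptions for illustration; the real `tokenize_code` may split codes differently.

```rust
/// Hypothetical stand-in for `tokenize_code` (not the project's actual code).
/// Emits the department part, then every longer prefix of the course number,
/// so a partial query such as `15-1` matches an indexed token.
fn code_prefixes(code: &str) -> Vec<String> {
    let code = code.trim().to_lowercase();
    let mut out = Vec::new();
    if let Some((dept, number)) = code.split_once('-') {
        out.push(dept.to_string());
        // char_indices keeps every slice on a character boundary.
        for (i, ch) in number.char_indices() {
            let end = i + ch.len_utf8();
            out.push(format!("{}-{}", dept, &number[..end]));
        }
    } else if !code.is_empty() {
        // No separator: index the code as a single token.
        out.push(code.clone());
    }
    out
}
```

Under these assumptions, `code_prefixes("15-122")` yields `15`, `15-1`, `15-12`, and `15-122`, which covers the three forms mentioned above.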

FST term dictionary plus baked postings

Per field, the index holds an FST mapping each term to an offset in a posting-list arena. A posting list is a `u32` document frequency `df` followed by `df` entries of `(u32 doc_id, f32 score)`. Each posting is eight bytes, and reading one requires no per-query allocation.
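The layout can be read with a borrowing iterator, sketched below. The function name `postings` and the little-endian byte order are assumptions; the text above does not state the endianness of the arena.

```rust
/// Sketch of a posting-list reader over the arena bytes.
/// Assumes little-endian encoding (not confirmed by the format description).
/// Returns None if the list would run past the end of the arena.
fn postings(arena: &[u8], offset: usize) -> Option<impl Iterator<Item = (u32, f32)> + '_> {
    // Header: u32 df.
    let header = arena.get(offset..offset.checked_add(4)?)?;
    let df = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;

    // Body: df entries of 8 bytes each.
    let start = offset + 4;
    let end = start.checked_add(df.checked_mul(8)?)?;
    let body = arena.get(start..end)?;

    // Decode lazily from the borrowed slice; nothing is allocated here.
    Some(body.chunks_exact(8).map(|p| {
        let doc_id = u32::from_le_bytes([p[0], p[1], p[2], p[3]]);
        let score = f32::from_le_bytes([p[4], p[5], p[6], p[7]]);
        (doc_id, score)
    }))
}
```

The iterator borrows the arena directly, which matches the "no per-query allocation" property; the bounds checks are there so a corrupt offset returns `None` instead of panicking.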

The f32 score is baked at build time, not computed at query time:

field_weight * idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * dl / avg_dl))

with K1 = 1.2 and B = 0.75. Query scoring is “read 8 bytes per posting, add to accumulator.” This is the dominant reason BM25 queries finish in microseconds even with a few thousand candidates per term.
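A minimal sketch of both halves, build-time baking and query-time accumulation, follows. The function names and the dense `acc` slice indexed by `doc_id` are assumptions for illustration, and `idf` is taken as an input because the exact IDF formula is not given here.

```rust
const K1: f32 = 1.2;
const B: f32 = 0.75;

/// Build time: compute the score stored in each posting.
/// `idf` is supplied by the caller; its formula is not specified in this section.
fn baked_score(field_weight: f32, idf: f32, tf: f32, dl: f32, avg_dl: f32) -> f32 {
    // Length normalization: 1.0 when the document has average length.
    let length_norm = 1.0 - B + B * dl / avg_dl;
    field_weight * idf * (tf * (K1 + 1.0)) / (tf + K1 * length_norm)
}

/// Query time: add each posting's baked score to a per-document accumulator.
/// Assumes `acc` is indexed by doc_id; out-of-range ids are skipped.
fn accumulate(acc: &mut [f32], postings: &[(u32, f32)]) {
    for &(doc_id, score) in postings {
        if let Some(slot) = acc.get_mut(doc_id as usize) {
            *slot += score;
        }
    }
}
```

Because the field weight, IDF, and length normalization are all folded into the stored `f32`, the query loop contains a single addition per posting and no floating-point division.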

Levenshtein typo tolerance

For tokens at least four characters long, the term FST is searched with a Levenshtein automaton at edit distance one (or two for longer tokens). Fuzzy hits get a 0.5 score multiplier so exact matches always rank ahead. When the user’s literal query returns nothing, the “did you mean” hint surfaces the lowest-distance code variant.
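The scoring policy can be sketched without the automaton. The code below swaps in a plain dynamic-programming edit distance between two strings instead of a Levenshtein automaton over the FST, so it illustrates the rules, not the lookup strategy. The function names are hypothetical, and `long_threshold` is a parameter because the length at which two edits are allowed is not stated above.

```rust
const FUZZY_MULTIPLIER: f32 = 0.5;

/// Edit budget for a query token: none below four characters, one by default,
/// two once the token reaches `long_threshold` characters (hypothetical parameter).
fn max_edits(token: &str, long_threshold: usize) -> u32 {
    let n = token.chars().count();
    if n < 4 {
        0
    } else if n >= long_threshold {
        2
    } else {
        1
    }
}

/// Classic two-row Levenshtein distance (a stand-in for the automaton).
fn edit_distance(a: &str, b: &str) -> u32 {
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<u32> = (0..=b.len() as u32).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i as u32 + 1];
        for (j, cb) in b.iter().enumerate() {
            let sub = prev[j] + if ca == *cb { 0 } else { 1 };
            let del = prev[j + 1] + 1;
            let ins = cur[j] + 1;
            cur.push(sub.min(del).min(ins));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Exact matches keep the baked score; fuzzy matches within budget are halved;
/// anything further away is not a hit.
fn fuzzy_score(query: &str, term: &str, baked: f32, long_threshold: usize) -> Option<f32> {
    let d = edit_distance(query, term);
    if d == 0 {
        Some(baked)
    } else if d <= max_edits(query, long_threshold) {
        Some(baked * FUZZY_MULTIPLIER)
    } else {
        None
    }
}
```

Against an FST, an automaton avoids comparing the query with every term; the pairwise distance here is only practical for small term sets or tests.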

Prebuilt text region

The `prebuilt_text` region in the catalog binary stores both the FST bytes and the posting arena per field, so loaders skip the most expensive build phase. `Index::build_with_prebuilt_text` reuses these bytes directly; `Index::build` is only needed when generating a fresh `catalog.bin` from a `Corpus`.
