Overview

The CMU Courses project has three layers:

  1. courses-scraper pulls course information, sections, program audits, and syllabi from Stellic and Canvas. Output lands under exported/ as JSON files keyed by course code, program, and term.

  2. courses-index loads exported/ into a single binary catalog with text, facet, numeric, and schedule indexes baked in. The same crate compiles to native (for the build step that produces catalog.bin) and to wasm (for the browser query engine).

  3. Two APIs consume the catalog: courses-api is the public REST surface (versioned under /v1/, in-memory, no database), and courses-web-api is the web app’s backend, which serves the SPA bundle and the catalog binary to the browser on a single origin.

The scraper section below documents the per-source collection logic. The courses index section documents the on-disk catalog format and the query engine.

Running the scraper

CLI

| Flag | Env | Default | Purpose |
| --- | --- | --- | --- |
| --cookie-header | COOKIE_HEADER | (prompt) | Stellic full Cookie: header value. Required for courses, programs, all. The cookie’s session identifies the user; no separate Andrew ID is required. |
| --canvas-token | CANVAS_TOKEN | (none) | Canvas API token. Required for syllabi and all. Generate at https://canvas.cmu.edu/profile/settings. |
| --mode | MODE | courses | One of courses, programs, syllabi, all. Selects which pipeline runs (all runs the three in parallel). |
| --fce-path | FCE_PATH | data/fces.csv | SmartEvals FCE CSV (used in courses and all). |
| --out-dir | OUT_DIR | exported/courses_history | Course pipeline output root. |
| --programs-dir | PROGRAMS_DIR | exported/programs | Programs pipeline output root. |
| --syllabi-dir | SYLLABI_DIR | exported/syllabi | Syllabi pipeline output root. |
| --concurrency | CONCURRENCY | 32 | Worker count for the rayon pool that runs HTTP tasks. |
| --limit | LIMIT | (no limit) | Cap on number of tasks, for smoke tests. Applied to all pipelines that run. |

Log filtering is via RUST_LOG (tracing EnvFilter); the default level is info.

Output layout

Course pipeline:

<out_dir>/
  <course_code_no_dash>/
    info.json                 # /catalog/getcourseinfo/ response with user-state stripped
    ly<lyear>_sm<sem_id>.json # /planner/getcoursesections/ response, one per (lyear, sem)

The course directory uses the dash-stripped form (e.g. 21122 for 21-122). Re-runs overwrite, and there is no incremental skip.

info.json strips student_context, enrollment_action_windows, and alerts from the upstream response before writing, since those reflect user state rather than catalog data. The sections file strips the current field from each data_list entry. When data_list is empty, the scraper writes nothing, so the absence of a sections file means Stellic returned no sections for that (course, lyear, sem).

Programs pipeline:

<programs_dir>/
  <catalog_id>/
    <audit_id>.json   # one audit version of one catalog program

Each file wraps the matching req_tree.programs[] subtree with audit and program metadata, plus the non-personal scalar fields from the audit response:

{
  "catalog_id": <int>,                       // /catalog/getprograms/ id
  "program_name": <str>,                     // catalog program name
  "program_type": <int>,                     // 1=major, 2=minor, 3=add'l major, 4=sub-req bundle, 5=eligibility
  "audit_id": <int>,                         // audit publication id from getauditversions/
  "audit_name": <str>,                       // e.g. "EY2021 Pittsburgh - BS in Mathematical Sciences"
  "requirement_id": <int>,                   // surfaces as req_tree.programs[].id
  "is_combination": <bool>,                  // audit-level flag from the response
  "free_electives_req": <obj|null>,          // free electives spec for this audit version
  "program_reqs": <obj|null>,                // program_reqs[<requirement_id>] only
  "unique_course_parents_mapping": <obj>,    // filtered: only entries that point to this audit version
  "tree": { id, screen_name, constraints, choices, ... }   // the matching req_tree.programs[] node
}

Stripped before writing: audit_data, top-level programs, course_plan_info, placeholders_info, unmatched, notcounted, unmatched_slots, permissions, program_permissions, student_audit, full_gpa, student_enrollment_levels, plan_diplomas, remaining_reqs_details, last_computed, debug_info, uni_req_programs. Cross-reference fields (unique_course_parents_mapping, program_reqs) are filtered to keys that match the requested audit version, since unfiltered they would also include the caller’s auto-attached gen-ed audits and leak the caller’s college affiliation.

Syllabi pipeline:

<syllabi_dir>/
  <term_code>/                   # e.g. S26, F24, M25, N23
    <dept_code>/                 # e.g. ARC, CS, ART
      <course_section>.<ext>     # File items: original PDF/DOC
      <course_section>.url       # Page items: plain-text Canvas page URL

The pipeline only saves items in each sub-course’s “Available Syllabi” module. File items are downloaded with their original filename extension. Page items are saved as a single-line .url file containing the Canvas page URL where the syllabus is rendered, since the page body itself is just a redirect stub and the target course’s syllabus_body is empty most of the time and ambiguous when populated. Downstream consumers can either link directly to the URL or follow it for users who can authenticate. “Unavailable Syllabi” and “Individualized Experiences” modules are skipped because their items are placeholders without retrievable content. See syllabi.md for the registry structure and the rationale.

The syllabi pipeline is rerun-safe: before any HTTP work, save_task checks whether <course_section>.* already exists under <term>/<dept>/ and skips if so. To force a full re-fetch, delete the matching files (or the whole <syllabi_dir>) before running.

Concurrency

A custom rayon thread pool sized by --concurrency runs every pipeline. With --mode all, the three pipelines run concurrently within the same pool via nested rayon::join and share the thread budget; their outputs go to separate directories, so they don’t collide. A shared ureq::Agent provides the HTTP connection pool.

Course pipeline:

  1. Startup discovery uses rayon::join to overlap the FCE parse (single-threaded CSV read) with the four SOC fetches, which themselves run as a rayon par_iter across the four seasons.
  2. Task execution sends one HTTP GET per task (course info or course sections), so --concurrency directly bounds in-flight HTTP requests. Progress is logged every 500 completed tasks.

Programs pipeline:

  1. Audit-version discovery: one GET /planner/getauditversions/ per catalog program in parallel, building the task list. Progress is logged every 200 programs.
  2. Audit fetch: one POST /planner/getauditinfo/ (test-apply mode) per task in parallel. Progress is logged every 200 completed tasks.

Step 2 of the programs pipeline is heavier than any course-pipeline call: each getauditinfo returns ~80 KB and the server takes several seconds to compute the audit, so its tolerable concurrency is lower than the course endpoint’s.

Syllabi pipeline:

  1. Sub-course discovery: walks the master syllabus-registry course’s term modules, resolves each dept’s sis_course_id, and lists each sub-course’s modules to gather Available Syllabi items. Roughly 1797 sub-course module-list calls.
  2. Item save: for each File item, fetches file metadata then downloads the file; for each Page item, writes the Canvas page URL to a single-line .url file.

Empirical concurrency findings:

  • Programs pipeline (getauditinfo test-apply): best throughput at concurrency 32 (~1.2 s per task on a 512-task run). Concurrency 64 dropped to ~3 s per task with sporadic failures, and 128 stalled. The default is 32.
  • Course pipeline (getcourseinfo, getcoursesections): handles concurrency 64 cleanly during full-history scrapes; the per-call cost is small enough that the bottleneck is HTTP latency rather than server compute. If running the course pipeline alone, raising --concurrency to 64 is safe.
  • Syllabi pipeline (Canvas file downloads): plateaus around concurrency 32 at ~28 MB/s on the dev machine. 64 and 128 do not improve throughput, with 128 slightly regressing. The bottleneck is most likely client-side bandwidth or Canvas’s per-IP connection cap.

Course discovery

How the scraper builds the course list and the per-course tasks it sends to Stellic.

Sources

The course set is the union of two feeds.

  1. The FCE CSV at data/fces.csv, exported from SmartEvals. Columns used: Year, Sem (Fall/Spring/Summer), Num (5-digit, e.g. 21122). Each row indicates the course was offered in that (year, sem).

  2. Schedule of Classes dumps at https://enr-apps.as.cmu.edu/assets/SOC/sched_layout_{season}.dat, where season is one of fall, spring, summer_1, summer_2. Tab-delimited; the term comes from a Semester: <Fall|Spring|Summer> <year> header line, course codes from rows starting with a tab.

A failed fetch (404 or parse failure) on a season is treated as meaning that the season has not been published yet, so that season contributes no codes or tuples.

5-digit Num values are normalized to DD-DDD. Rows that don’t match the 5-digit pattern are dropped.

Course code format

Canonical form is 21-122. The catalog and SOC use this form. The FCE CSV uses the 5-digit form. On disk, output directories use the dash-stripped form (21122).
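The three spellings can be converted with two small helpers. This is a minimal sketch rather than the scraper's actual code; the function names normalize_code and dir_name are hypothetical.

```rust
/// Convert a 5-digit FCE `Num` (e.g. "21122") to the canonical dashed
/// form ("21-122"). Anything that is not exactly five ASCII digits is
/// rejected, mirroring the "rows that don't match are dropped" rule.
fn normalize_code(num: &str) -> Option<String> {
    if num.len() == 5 && num.bytes().all(|b| b.is_ascii_digit()) {
        // Safe to slice by byte index: the string is pure ASCII here.
        Some(format!("{}-{}", &num[..2], &num[2..]))
    } else {
        None
    }
}

/// Convert the canonical form to the on-disk directory name
/// ("21-122" -> "21122").
fn dir_name(code: &str) -> String {
    code.replace('-', "")
}
```

Returning an Option lets the caller drop malformed rows with a filter_map over the CSV records.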

Task generation

For each course we emit one info task plus zero or more sections tasks. A sections task is identified by (course, lyear, sem_id).

Section tasks come from the (year, sem) tuples we know the course was offered in: FCE tuples plus the SOC term if that course appears in that season’s SOC. Each tuple is converted to an lyear and dropped if out of range.

lyear

Stellic’s getcoursesections/ URL parameter is named year, but the value is not a calendar year. It is an academic-year offset, 0-indexed from the student’s term_joined. Internally we call the variable lyear so it doesn’t collide with calendar year. On the wire it is still year=<n>.

Computation:

anchor = ay_start(joined_sem, joined_year)        # AY of the user's term_joined
lyear  = ay_start(sem, year) - anchor             # 0 for the AY of term_joined

ay_start returns the calendar year the AY started in: Fall maps to its own year; Spring/Summer map to year - 1.

Bounds: lyear ∈ 0..=3. Out-of-range values (negative or > 3) yield server errors or empty data. Stellic exposes about four academic years from the user’s join term. Scraping further back requires an account with an earlier term_joined.
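A sketch of the computation in Rust, assuming the 0-indexed reading stated above, in which the academic year of term_joined maps to lyear 0. The Sem enum and the function signatures are illustrative, not the scraper's own types.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Sem {
    Fall,
    Spring,
    Summer,
}

/// Calendar year in which the academic year containing (sem, year) started:
/// Fall maps to its own year, Spring and Summer map to year - 1.
fn ay_start(sem: Sem, year: i32) -> i32 {
    match sem {
        Sem::Fall => year,
        Sem::Spring | Sem::Summer => year - 1,
    }
}

/// Academic-year offset from the join term, or None when the offset
/// falls outside the 0..=3 window and the tuple should be dropped.
fn lyear(sem: Sem, year: i32, joined_sem: Sem, joined_year: i32) -> Option<u8> {
    let offset = ay_start(sem, year) - ay_start(joined_sem, joined_year);
    if (0..=3).contains(&offset) {
        Some(offset as u8)
    } else {
        None
    }
}
```

For a user who joined in Fall 2023, Fall 2023 and Spring 2024 both fall in the same academic year and share an offset, while Spring 2023 precedes the join year and is dropped.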

98- courses (StuCo)

StuCo instructors can opt out of FCEs, so a 98- course that ran may have no FCE tuples. When a 98- course has no FCE tuples, we issue section tasks for every lyear ∈ 0..=3 crossed with {Fall, Spring, Summer}. Most come back empty from Stellic and the section save no-ops on empty data.

Deduplication

| Boundary | Mechanism |
| --- | --- |
| FCE rows to (year, sem) tuples per course | HashSet<(year, sem)> |
| FCE codes union SOC codes | HashSet<String> |
| FCE tuples union SOC term per course | HashSet<(year, sem)> |
| Re-runs (output files) | None; re-runs overwrite the output files. |

Stellic

CMU’s Stellic deployment lives at academicaudit.andrew.cmu.edu. None of this API is publicly documented; the catalog below is reverse-engineered from the live JS bundle and verified by direct request.

Auth

The user authenticates in a browser via Shibboleth, then copies the full Cookie: header from a network request to the site and pastes it on first run. From that cookie we extract the value of csrftoken= and send it as the X-CSRFToken request header, while the _shibsession_* and sessionid cookies are passed through verbatim. There is no API token or refresh flow, so once Shibboleth expires the session the scraper re-prompts.
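Pulling the CSRF token out of the pasted header is a plain string scan. The sketch below assumes the usual "name=value; name=value" cookie layout; the function name csrf_token is hypothetical.

```rust
/// Extract the value of the `csrftoken` cookie from a full `Cookie:`
/// header value. The rest of the header is left untouched so it can be
/// passed through verbatim.
fn csrf_token(cookie_header: &str) -> Option<&str> {
    cookie_header
        .split(';')
        .map(str::trim)
        // Match the exact cookie name, not names that merely contain it.
        .find_map(|pair| pair.strip_prefix("csrftoken="))
        .filter(|value| !value.is_empty())
}
```

A None result is a reasonable trigger for re-prompting, since the X-CSRFToken header cannot be built without it.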

XSSI prefix

Every JSON response begins with )]}',\n, an XSSI-prevention prefix that we strip before parsing.
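A minimal sketch of the strip step; the names XSSI_PREFIX and strip_xssi are illustrative. Returning None when the prefix is missing gives the caller a cheap signal that the response is not Stellic JSON (for example, a login page).

```rust
/// The XSSI-prevention prefix: `)]}',` followed by a newline.
const XSSI_PREFIX: &str = ")]}',\n";

/// Return the JSON payload with the prefix removed, or None when the
/// body does not start with the prefix.
fn strip_xssi(body: &str) -> Option<&str> {
    body.strip_prefix(XSSI_PREFIX)
}
```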

Headers

All requests need:

  • Cookie: <full pasted header>
  • X-CSRFToken: <csrftoken value>
  • X-Requested-With: XMLHttpRequest
  • Referer: https://academicaudit.andrew.cmu.edu/app/... (some endpoints 403 without a same-origin Referer)

POST requests additionally need a content type, but Stellic is inconsistent about which type each endpoint accepts: some require form-urlencoded, others require JSON. The per-endpoint sections below specify which.

Endpoints

POST /planner/getstudentprofile/

Body (form): empty, or student_username=<anything>. The session cookie identifies the user; the body’s student_username value is ignored by the server for student-role callers (Stellic returns the cookie’s profile regardless of the value sent). Returns the student profile, of which the scraper uses username (echoed back in subsequent calls’ student_username field), default_plan_id, and term_joined: {semester, year} (the latter anchors the lyear window). The scraper also uses this call as its auth check on startup: if the response is not XSSI-prefixed JSON, the cookie is bad and we re-prompt.

GET /catalog/getcourseinfo/

Query: campus_id=1&course_code=21-122&physical_year=2026. Returns metadata for one course; before saving, the scraper strips student_context, enrollment_action_windows, and alerts, which all reflect user state rather than catalog data.

GET /planner/getcoursesections/

Query: campus_id=1&course_code=21-122&physical_year=2026&plan_id=<uuid>&sem_id=<1|2|3>&year=<lyear>. Returns sections for one course in one (lyear, sem). plan_id is required even though the data is not plan-specific, so we pass the user’s default_plan_id, and an empty data_list is the normal response for a course that was not offered (the scraper writes nothing in that case).

The year URL parameter is the lyear offset (see discovery.md), not a calendar year, and sem_id maps 1 to Fall, 2 to Spring, 3 to Summer.
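The query string for one sections task could be assembled as follows. This is a sketch: the function names are hypothetical, and no percent-encoding is applied, on the assumption that course codes and plan UUIDs contain only URL-safe characters.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Sem {
    Fall,
    Spring,
    Summer,
}

/// Wire value of sem_id: 1 is Fall, 2 is Spring, 3 is Summer.
fn sem_id(sem: Sem) -> u8 {
    match sem {
        Sem::Fall => 1,
        Sem::Spring => 2,
        Sem::Summer => 3,
    }
}

/// Build the getcoursesections/ query string. Note that the `year`
/// parameter carries the lyear offset, not a calendar year.
fn sections_query(
    course_code: &str,
    physical_year: u16,
    plan_id: &str,
    sem: Sem,
    lyear: u8,
) -> String {
    format!(
        "campus_id=1&course_code={}&physical_year={}&plan_id={}&sem_id={}&year={}",
        course_code,
        physical_year,
        plan_id,
        sem_id(sem),
        lyear
    )
}
```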

GET /catalog/getprograms/

Query: campus_id=1. Returns the flat list of all programs in the catalog (currently 2129), where each entry is {id, name, type}. The type integer is only exposed here, and once a program is added to a plan the label is gone. See programs.md for what the type values mean.

POST /planner/createplan/

Body (JSON): {name, programs: [<catalog_id>, ...], visibility: "private"|"advisor"} (the programs list may be empty). Returns {success, new_plan_id, new_plan_name}.

POST /planner/deleteplan/

Body (JSON): {plan_id, reason}. JSON only; the form-urlencoded variant returns 500.

POST /planner/add-program/

Body (form): plan_id=<uuid>&program_id=<catalog_id>, adding one program to a plan. Success returns the program’s metadata; failure returns {success: false, message: {code, text}} with one of:

  • 493, “Too many programs added. For now, you can’t add more than 5 majors.” See “Limits”.
  • 400, “We are unable to process your request at the moment. Please try again later.” This usually means the program has no audit version compatible with the plan’s campus or the student’s level (for example, a graduate-only program on an undergrad plan).

POST /planner/getauditinfo/

The same endpoint serves two distinct flows depending on the body shape.

Plan audit. Body (JSON): {plan_id}. Returns the full audit for whatever programs are in the plan. Fields of interest:

  • audit_data: per-requirement, the courses chosen toward it and the remaining count.
  • req_tree: the nested requirement tree, with course satisfiers attached as choices nodes carrying course codes.
  • unique_course_parents_mapping: course requirement id to a {audit_version_id: padded_id} map, used to determine which audit version a course satisfies.
  • programs: the catalog programs currently in the plan, including auto-attached ones (see “Auto-attached programs”).

The JS sends additional fields (mainaudit, official, uids, isTemplate, etc.) but the server accepts the body with plan_id alone. Response size scales with program count.

Test-apply audit (no plan). Body (JSON): {student_username, audit, default_audit_version: {id}, official: true} where audit is the audit publication id from getauditversions/. Returns the same shape as the plan audit but for a single audit version, against the calling user’s profile, without creating or modifying any plan. The user’s currently declared programs are still returned alongside the requested one in req_tree.programs[]; the scraper filters those out by matching req_tree.programs[].id == <audit_version.requirement>. This is the path the requirement scraper takes, since it bypasses both the 3-plan and 5-major caps.

GET /planner/getauditversions/

Query: program=<catalog_id>&status=published. Returns {audits: [{id, requirement, name, ...}, ...]} listing every published audit version of a program. The id is the audit publication id (use it as the audit field in getauditinfo’s test-apply body); requirement is the audit-version-id that surfaces as req_tree.programs[].id and is the value to match on when filtering the audit response. The two are distinct integers and both are needed.

POST /planner/getauditinfobulk/

Body (JSON): {student_usernames: [<username>, ...], audits_to_print: [{auditId, reqId, include}, ...], official, without_planned, grades, force, purpose: "print"}. Used by advisors to compute audits for many students at once for PDF printing. Returns {student_audits: {<username>: {req_trees: {<auditId>: ...}, audit_data, ...}, ...}, audit_reqs: {...}}.

The endpoint is user-scoped: the server runs each listed student’s own declared-program audit, and audits_to_print filters which of those declared audits show up in the response. Audit ids in audits_to_print that aren’t in the student’s declared set are silently dropped, so this endpoint cannot be used to fetch arbitrary catalog programs. The requirement scraper does not use it.

GET /planner/getprogramsrequirements/

Query: program_id=<id>&programs=<id> (both required, both the same value). Returns {requirement_dict: {<program_id>: [{id, name, level, parent_id}, ...]}}. The tree is a skeleton without course satisfiers; the scraper does not currently use this endpoint because the test-apply audit returns the full tree with courses.

Limits

| Resource | Cap | Notes |
| --- | --- | --- |
| Plans per user | 3 | Server-enforced; the 4th createplan returns code 493 with text “You can’t have more than 3 plans at a time.” |
| Majors (type 1) per plan | 5 | Server-enforced and counts the user’s existing declared majors, so a student with two declared majors gets only three free major slots on a fresh plan. |
| Minors, additional majors, sub-requirement bundles, eligibility | None observed | 22 mixed programs of these types fit in one plan without rejection. |
| lyear on getcoursesections/ | 0..=3 | Out-of-range values fail or return empty. |

The plan and major caps only matter for the plan-based audit flow. The requirement scraper uses the test-apply path, which is plan-free, so neither cap applies to it.

Retries

The scraper retries 5xx responses with exponential backoff capped at 5 seconds, while 4xx fails immediately. No explicit rate limit has been observed, but high concurrency occasionally produces transient 502/504 from the upstream proxy and the retry handles those.
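The retry policy can be sketched as two helpers. The doubling schedule and the base delay passed in by the caller are assumptions; the text above fixes only the 5-second cap and the 5xx-only rule. Function names are hypothetical.

```rust
use std::time::Duration;

/// Upper bound on any single backoff delay.
const MAX_DELAY: Duration = Duration::from_secs(5);

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// capped at MAX_DELAY. Overflow at large attempt counts saturates to
/// the cap instead of panicking.
fn backoff_delay(attempt: u32, base: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(MAX_DELAY).min(MAX_DELAY)
}

/// Only server errors are retried; 4xx responses fail immediately.
fn should_retry(status: u16) -> bool {
    (500..=599).contains(&status)
}
```

With a 250 ms base, the delays grow 250 ms, 500 ms, 1 s, 2 s, 4 s and then stay at 5 s.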

Auto-attached programs

Adding a major (catalog type 1) to a plan auto-attaches that major’s college gen-ed and degree-check programs (catalog type 4) into the audit. They surface in the getauditinfo response with is_uni_req: true and is_shared: true and contribute their full requirement trees, so the scraper does not need a separate add-program call for those.

Programs

The Stellic catalog exposes 2129 programs through GET /catalog/getprograms/. This doc covers the type vocabulary, the two ID spaces a program lives in, and the shape of the requirement tree.

Catalog vs audit-version IDs

Each program has two distinct integer IDs that surface in different responses:

  • catalog id, returned by getprograms/ and accepted by add-program/. Small integers in the low thousands.
  • audit version id, the internal identifier of the requirement tree the audit engine uses for the program. Larger integers in the millions, showing up as req_tree.programs[].id, as the inner keys of unique_course_parents_mapping, and inside the padded ids described below.

The getauditinfo response uses both: its top-level programs list is keyed by catalog id, while req_tree.programs[].id is the audit version id, with a separate program_id field carrying the catalog id back.

A single catalog program resolves to one audit version at a time, the one matching the student’s level and campus. For example, catalog id 428 (“BS in Mathematical Sciences”) resolves to audit version 2881253 for the EY2021 Pittsburgh undergraduate audit. Catalog programs without a compatible audit version for the requesting student fail add-program/ with code 400.

Types

The type integer is only exposed by getprograms/, and once a program is added to a plan the label disappears. The five values:

| type | name | count | examples |
| --- | --- | --- | --- |
| 1 | Major | 1139 | “BS in Mathematical Sciences”, “BFA in Music Performance (Tuba)”, “PHD in Architecture-Engineering-Construction Management”, “MS in Software Engineering” |
| 2 | Minor | 211 | “Minor in Computer Science”, “Minor in Decision Making and Intelligent Systems”, “Minor in Music Theory” |
| 3 | Additional Major | 144 | “Additional Major in Computer Science”, “Additional Major in International Relations and Politics” |
| 4 | Sub-requirement bundle | 633 | “Mellon College of Science - General Education - EY2021+”, “EY2019 - SCS General Education - Cognitive Studies Requirement (AI)”, “Health Information Systems Concentration - BS in Information Systems”, “EY2017 Drama - MFA PTM Requirements” |
| 5 | Eligibility track | 2 | “Undergraduate Pre-Health”, “Sigma Tau Delta Eligibility” |

Type 1 is the actual degree program a student majors in. Type 3 is a way to declare a second major and is typically a slimmer requirement set than the equivalent type 1. Type 4 is a partial requirement set intended to combine with a major: per-college and per-AY-version gen-ed, concentration tracks within a major, and college-wide degree checks. Type 5 is a reusable eligibility track that lives alongside a major.

Auto-attach

Adding a major (type 1) to a plan does not just include the major’s own requirements: the server also auto-attaches the relevant college gen-ed and degree-check programs (both type 4) into the audit. They appear in getauditinfo.programs[] with is_uni_req: true and is_shared: true, and their requirement trees show up under req_tree.programs[] like any other added program.

Concentration-specific type 4 bundles (e.g. “Health Information Systems Concentration …”) do not auto-attach into a plan-based audit; they only enter that audit if explicitly added or selected as the major’s concentration. The requirement scraper avoids this entirely by using the test-apply audit path against each catalog program’s published audit version directly (see stellic.md’s getauditinfo section), which works for every type uniformly.

Coverage ceiling

Of the 2129 catalog programs, 1291 are reachable from a student-role account via the test-apply audit path. The remaining 838 return zero audit versions at every status (published, draft, archived, unpublished, obsolete). Two separate undergraduate accounts produced identical coverage sets, so visibility on this Stellic deployment does not appear to depend on the student’s college or declared major within the undergraduate role.

Among the 838 unreachable:

  • 130 are “Student Defined” programs, which are per-student custom degrees with no shared audit by design.
  • 22 type-4 entries return a skeleton (no course satisfiers) via getprogramsrequirements/ even though getauditversions/ returns nothing. These are mostly older BSA/BHA/BEA gen-ed and concentration requirement bundles.
  • 686 return nothing on either endpoint. These are typically MS/PHD programs (MS in Economics, PHD in Applied Physics, MS in Computer Science, etc.), combined-degree programs (BSA in Mathematical Sciences and Music Performance), and a small set of older sub-requirement bundles.

It is not yet confirmed whether graduate-role accounts see additional audits for graduate programs that are unreachable from undergrad accounts. If they do, a multi-account scrape would broaden coverage. If they don’t, the remaining 838 require an admin/registrar role and are out of reach from any student-role account.

Requirement tree shape

The req_tree returned by getauditinfo is a recursive node structure:

req_tree
├── id, screen_name, constraints
└── programs: [
    {
      id (audit_version_id), screen_name, program_id (catalog_id),
      is_concentration, is_uni_req, is_shared, ...,
      constraints: [...],
      choices: [
        { id, screen_name, constraints, choices: [...] },        // sub-bucket
        { id, screen_name: "21-122", constraints },              // course leaf
        ...
      ]
    },
    ...
  ]

Course satisfiers are leaf nodes inside choices whose screen_name is a course code (e.g. “21-122”). Buckets like “Discrete Mathematics” carry their own choices array of further buckets and course leaves. The constraints array on a node holds unit minimums, GPA caps, and the other rule data for that node.

Padded requirement IDs

In unique_course_parents_mapping, each course requirement id maps to a {audit_version_id: padded_full_id} dict. The padded full id is the audit version id concatenated with the requirement id zero-padded to 13 digits, so a 7-digit audit version id produces a 20-digit padded id:

audit_version_id = 2881254
requirement_id   = 2628
padded full id   = 28812540000000002628

The padded form lets the engine identify a (program, requirement) pair with one integer. The scraper preserves padded ids inside the saved unique_course_parents_mapping, filtered to the requested audit version, so each course leaf can be cross-referenced back to its requirement node by integer id.
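Building and splitting a padded id is straightforward string work. One caveat worth noting: a 20-digit value such as the one above is larger than u64::MAX (18446744073709551615), so this sketch keeps the padded form as a string. The function names are hypothetical.

```rust
/// Build the padded full id: the audit version id followed by the
/// requirement id zero-padded to 13 digits.
fn padded_id(audit_version_id: u64, requirement_id: u64) -> String {
    format!("{}{:013}", audit_version_id, requirement_id)
}

/// Split a padded id back into (audit_version_id, requirement_id).
/// The last 13 digits are the requirement id; everything before them
/// is the audit version id.
fn split_padded_id(padded: &str) -> Option<(u64, u64)> {
    if padded.len() <= 13 || !padded.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let (audit, requirement) = padded.split_at(padded.len() - 13);
    Some((audit.parse().ok()?, requirement.parse().ok()?))
}
```

Calling padded_id(2881254, 2628) should reproduce the 20-digit value in the example above, and split_padded_id reverses it.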

Syllabi (Canvas)

CMU’s Canvas instance hosts a “Syllabus Registry” course that aggregates instructor-uploaded syllabi across most departments and recent terms.

Source

Master course: https://canvas.cmu.edu/courses/sis_course_id:syllabus-registry (Canvas course id 3769). The course’s metadata reports is_public: true, but listing files and pages still requires an authenticated Canvas API token.

Token

Authenticate API requests with Authorization: Bearer <token> where the token is generated from https://canvas.cmu.edu/profile/settings under “Approved Integrations” -> “+ New Access Token”.

The token currently in use for development expires August 25 at 12:00 AM. After that, a new token is needed.

Structure

The registry is a three-level Canvas hierarchy:

  1. The master course (3769) contains 30 modules, one per term, named like Spring 2026 (S26) or Summer 2024 (M24). Term codes follow <season><yy> where season is S (spring), M (summer-1), N (summer-2), or F (fall). Coverage runs from Summer 2019 (M19).
  2. Each term module’s items are 60 ExternalUrl entries, one per department, with titles like Architecture (48XXX). Each item’s external_url points to a per-(term, department) sub-course identified by sis_course_id:syllabus-registry-<TERM>-<DEPT> (e.g. syllabus-registry-S26-ARC). At the time of writing, the registry contains 1797 such sub-courses across all terms.
  3. Each dept sub-course has up to four modules: Notice to Users, Available Syllabi, Unavailable Syllabi, Individualized Experiences. Items inside are either File (a Canvas file object that resolves to a downloadable PDF/DOC) or Page (a Canvas page).
  • “Available Syllabi” File items are each a real downloadable file (e.g. 48649_S26_designleadership_mcnutt.pdf, content type application/pdf).
  • “Available Syllabi” Page items contain only a small JSON redirect blob like <div id="syllabus-source" style="display: none;">{"canvas_course_id":"52739",...}</div>. The actual syllabus lives on the regular Canvas course site for that class, which is generally enrollment-restricted, so most of these pointers are not retrievable from a normal student token.
  • “Unavailable Syllabi” pages are placeholders the registry generates for courses where the instructor never uploaded a syllabus. They contain no syllabus content.
  • “Individualized Experiences” entries are independent-study and similar courses. Most are placeholder pages.

Retrieval

To list every file in the registry, we walk:

  1. GET /api/v1/courses/3769/modules?include[]=items&per_page=100 to get every term module with its dept items.
  2. For each term-module item, the external_url ends in sis_course_id:<id>. Hit GET /api/v1/courses/sis_course_id:<id>/modules?include[]=items&per_page=20 for each.
  3. Within each sub-course’s Available Syllabi module, every File-typed item has a url like /api/v1/courses/<id>/files/<file_id>. Fetch that to get the file metadata, including a url field that is the actual downloadable URL with a verifier query parameter.

Pagination is via the standard Canvas Link header; modules and items use per_page up to 100. The per_page=100 parameter is a soft hint, so Canvas may return fewer items per page and the walker must follow the rel="next" link until none is returned.
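Following pagination means pulling the rel="next" target out of each response's Link header. The sketch below is a simplified parser: it splits entries on commas, which assumes the URLs themselves contain no raw commas. The function name next_page_url is hypothetical.

```rust
/// Return the URL tagged rel="next" in a Link header value, if any.
/// Each entry looks like `<https://...>; rel="next"`.
fn next_page_url(link_header: &str) -> Option<&str> {
    link_header.split(',').find_map(|entry| {
        // Separate the <url> target from its parameters.
        let (target, params) = entry.split_once(';')?;
        let is_next = params.split(';').any(|p| p.trim() == "rel=\"next\"");
        if !is_next {
            return None;
        }
        target.trim().strip_prefix('<')?.strip_suffix('>')
    })
}
```

A walker would keep requesting the returned URL until this function yields None, which marks the last page.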

Direct file listing on /api/v1/courses/3769/files returns 403, so the module-walk is the only way to enumerate syllabi from this course.

CLI

The scraper has a syllabi mode that walks the registry and saves every File and Page item from each sub-course’s Available Syllabi module:

cargo run --release -- --mode syllabi --canvas-token <token>

The token can also be supplied via the CANVAS_TOKEN environment variable. Output goes to exported/syllabi/<term>/<dept>/<course_section>.<ext> by default (configurable via --syllabi-dir). File items are saved as the original PDF/DOC. Page items are saved as <course_section>.url, a plain-text file containing the Canvas page URL where the syllabus is rendered. We do not try to dereference the page’s syllabus-source redirect or extract a single canonical URL from the target course’s syllabus_body, because in a 100-item sample 72% of pages had an empty syllabus_body, only 5% had a single href, and 24% had between 2 and 28 hrefs that mixed the actual syllabus with mailto links, in-page anchors, Zoom URLs, and supplementary readings. There is no reliable way to identify “the” syllabus URL from those, so we save the Canvas page link and let downstream consumers decide.

Catalog binary format

The catalog file (catalog.bin) is a region-based bincode encoding wrapped in an outer gzip layer. The browser unwraps the gzip with the native DecompressionStream API; native callers use flate2. Region bodies inside the wrap are uncompressed, so wasm reads them without an additional decode pass.

File layout

After gzip is removed:

[16-byte header]
  "CIDX"             4 bytes
  version u32 LE     4 bytes (= 4)
  flags u32 LE       4 bytes (reserved, currently 0)
  region_count u32   4 bytes

[region table, region_count * 40 bytes]
  name              16 bytes ASCII, NUL-padded
  body_offset u64    8 bytes
  body_len u64       8 bytes
  flags u32          4 bytes (reserved)
  reserved u32       4 bytes

[region bodies, concatenated, in table order]

The magic number is CIDX and the current format version is 4. Loading rejects any other version.
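Reading the header and region table from the gunzipped bytes might look like the sketch below. It assumes every integer field is little-endian (the layout marks only version and flags explicitly) and does not interpret what body_offset is relative to. The struct and function names are illustrative, not the crate's API.

```rust
const MAGIC: &[u8] = b"CIDX";
const FORMAT_VERSION: u32 = 4;
const HEADER_LEN: usize = 16;
const REGION_ENTRY_LEN: usize = 40;

#[derive(Debug, PartialEq)]
struct RegionEntry {
    name: String,
    body_offset: u64,
    body_len: u64,
}

fn read_u32(buf: &[u8], at: usize) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&buf[at..at + 4]);
    u32::from_le_bytes(a)
}

fn read_u64(buf: &[u8], at: usize) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&buf[at..at + 8]);
    u64::from_le_bytes(a)
}

/// Validate the 16-byte header and decode the region table.
fn parse_region_table(buf: &[u8]) -> Result<Vec<RegionEntry>, String> {
    if buf.len() < HEADER_LEN || !buf.starts_with(MAGIC) {
        return Err("bad magic or truncated header".to_string());
    }
    let version = read_u32(buf, 4);
    if version != FORMAT_VERSION {
        return Err(format!("unsupported format version {}", version));
    }
    let count = read_u32(buf, 12) as usize;
    let table_end = count
        .checked_mul(REGION_ENTRY_LEN)
        .and_then(|n| n.checked_add(HEADER_LEN))
        .ok_or("region table too large")?;
    if buf.len() < table_end {
        return Err("truncated region table".to_string());
    }
    let mut regions = Vec::with_capacity(count);
    for i in 0..count {
        let start = HEADER_LEN + i * REGION_ENTRY_LEN;
        let entry = &buf[start..start + REGION_ENTRY_LEN];
        // Names are 16 bytes of ASCII, NUL-padded on the right.
        let name_len = entry[..16].iter().position(|&b| b == 0).unwrap_or(16);
        regions.push(RegionEntry {
            name: String::from_utf8_lossy(&entry[..name_len]).into_owned(),
            body_offset: read_u64(entry, 16),
            body_len: read_u64(entry, 24),
        });
    }
    Ok(regions)
}
```

Rejecting any version other than 4 up front matches the loader behavior described above.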

Regions

| Region | Body |
| --- | --- |
| courses | bincode of Vec<Course> |
| professors | bincode of Vec<Professor> |
| sections | bincode of Vec<SectionTime> |
| fce_rows | bincode of Vec<FceRow> |
| prebuilt_text | bincode of PrebuiltText (optional) |

The prebuilt_text region carries the FST bytes plus the postings arena that the text index would otherwise rebuild on each load. With it present, building the in-memory index skips the most expensive phase.

Cold-start subset

read_catalog_minimal_from_slice reads only the regions needed to drive search and lookup: courses, sections, prebuilt_text. The browser uses this on initial load to skip the bincode work for professors and fce_rows, which the search UI does not consult. A subsequent call to read_catalog_from_slice (or a targeted region read) hydrates the rest if it is needed.

Storage abstractions

The reader is generic over a CatalogStorage trait so the same code path works for in-memory slices and on-disk files. read_range returns Cow<[u8]>, so backends that already hold the bytes hand back a borrowed slice with no copy or allocation. Three concrete impls live in binary/storage.rs:

  • MemoryStorage borrows a &[u8], used by the wasm crate after DecompressionStream decodes into a typed array.
  • OwnedMemoryStorage owns a Vec<u8>, used when the native side has already pulled the entire file into memory (e.g. after gunzipping a transit blob).
  • FileStorage mmaps the file. Region reads return borrowed slices into the mapped buffer; the kernel pages bytes in on demand and shares the mapping across reads.
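The borrowed-slice behaviour of the two in-memory backends can be sketched as follows. The trait shape here is an assumption: the source only states that read_range returns Cow<[u8]>, so the io::Result wrapper, the parameter types, and the slice_range helper are illustrative. FileStorage is omitted because it would need an mmap crate.

```rust
use std::borrow::Cow;
use std::io;

/// Assumed trait shape; only the Cow<[u8]> return type comes from the text.
pub trait CatalogStorage {
    fn read_range(&self, offset: u64, len: u64) -> io::Result<Cow<'_, [u8]>>;
}

/// Hypothetical helper: bounds-checked sub-slice, no copy.
fn slice_range(bytes: &[u8], offset: u64, len: u64) -> io::Result<&[u8]> {
    let end = offset
        .checked_add(len)
        .filter(|&end| end <= bytes.len() as u64)
        .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"))?;
    Ok(&bytes[offset as usize..end as usize])
}

/// Borrows bytes owned elsewhere.
pub struct MemoryStorage<'a>(pub &'a [u8]);

/// Owns the whole file in memory.
pub struct OwnedMemoryStorage(pub Vec<u8>);

impl<'a> CatalogStorage for MemoryStorage<'a> {
    fn read_range(&self, offset: u64, len: u64) -> io::Result<Cow<'_, [u8]>> {
        slice_range(self.0, offset, len).map(Cow::Borrowed)
    }
}

impl CatalogStorage for OwnedMemoryStorage {
    fn read_range(&self, offset: u64, len: u64) -> io::Result<Cow<'_, [u8]>> {
        slice_range(&self.0, offset, len).map(Cow::Borrowed)
    }
}
```

Both backends return Cow::Borrowed, which is what keeps region reads free of copies; a backend that had to materialize bytes could return Cow::Owned through the same signature.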

Format versioning

FORMAT_VERSION bumps any time the wire format changes (region layout, header, or any region body that is not strictly additive). The reader compares it on every load and returns an error rather than silently misreading a future version. The public courses-api exposes the active version under /v1/version so external consumers can pin against it.

Text search

Each course is indexed across five fields with distinct BM25 weights:

| Field | Weight | Source |
| --- | --- | --- |
| code | 5.0 | course code (e.g. 15-122), tokenized into prefix forms so partial codes match |
| name | 2.0 | course title |
| instructor_names | 1.5 | section instructor list |
| description | 1.0 | catalog description |
| prereqs_text | 1.0 | prereq prose, when present |

Tokenization is field-specific. Course codes go through tokenize_code, which emits prefix variants so that 15, 15-1, and 15-122 are all reachable. Description, name, and prereq text use a general tokenizer that lowercases and strips punctuation; instructor names use the same general path with diacritic folding.
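As an illustration of prefix emission, the sketch below produces every prefix of a code that ends on a digit and is at least two characters long. The exact rule used by the real tokenize_code is not documented here, so this rule is an assumption chosen to reproduce the 15, 15-1, and 15-122 examples.

```rust
/// Hypothetical sketch of a prefix-emitting code tokenizer.
/// Assumed rule: emit each prefix that ends on an ASCII digit and has
/// at least two characters, so a lone leading digit is not indexed.
pub fn tokenize_code(code: &str) -> Vec<String> {
    let code = code.trim();
    let mut out = Vec::new();
    for (i, ch) in code.char_indices() {
        let end = i + ch.len_utf8();
        if ch.is_ascii_digit() && end >= 2 {
            out.push(code[..end].to_string());
        }
    }
    out
}
```

Under this rule, 15-122 yields the terms 15, 15-1, 15-12, and 15-122, so a user who has typed only part of a code still hits an indexed term.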

FST term dictionary plus baked postings

Per field, the index holds an FST mapping each term to an offset in a posting-list arena. A posting list is u32 df followed by df entries of (u32 doc_id, f32 score). Eight bytes per posting, no per-query allocation.
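The posting-list layout can be sketched with two small functions: one that appends a list to an arena and returns its offset (the value the FST would map a term to), and one that walks a list and adds each baked score to an accumulator. Little-endian byte order and the function names are assumptions.

```rust
/// Appends one posting list to the arena and returns its byte offset.
/// Layout: u32 df, then df entries of (u32 doc_id, f32 score).
/// Assumption: little-endian encoding.
pub fn encode_postings(postings: &[(u32, f32)], arena: &mut Vec<u8>) -> usize {
    let offset = arena.len();
    arena.extend_from_slice(&(postings.len() as u32).to_le_bytes());
    for &(doc_id, score) in postings {
        arena.extend_from_slice(&doc_id.to_le_bytes());
        arena.extend_from_slice(&score.to_le_bytes());
    }
    offset
}

fn le4(arena: &[u8], at: usize) -> [u8; 4] {
    let mut b = [0u8; 4];
    b.copy_from_slice(&arena[at..at + 4]);
    b
}

/// Reads the list at `offset` and adds each score into `acc[doc_id]`.
/// Performs no heap allocation.
pub fn accumulate(arena: &[u8], offset: usize, acc: &mut [f32]) {
    let df = u32::from_le_bytes(le4(arena, offset)) as usize;
    let mut at = offset + 4;
    for _ in 0..df {
        let doc_id = u32::from_le_bytes(le4(arena, at)) as usize;
        let score = f32::from_le_bytes(le4(arena, at + 4));
        acc[doc_id] += score;
        at += 8; // eight bytes per posting
    }
}
```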

The f32 score is baked at build time, not computed at query time:

field_weight * idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * dl / avg_dl))

with K1 = 1.2 and B = 0.75. Query scoring is “read 8 bytes per posting, add to accumulator.” This is the dominant reason BM25 queries finish in microseconds even with a few thousand candidates per term.
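The baked score is a direct transcription of the formula above. In this sketch the function name and parameter list are illustrative, and idf is passed in because the text does not specify which IDF variant the index uses.

```rust
const K1: f32 = 1.2;
const B: f32 = 0.75;

/// Build-time BM25 score for one (term, document, field) posting.
/// `idf` is supplied by the caller; its exact definition is not assumed here.
pub fn baked_score(field_weight: f32, idf: f32, tf: f32, dl: f32, avg_dl: f32) -> f32 {
    field_weight * idf * (tf * (K1 + 1.0)) / (tf + K1 * (1.0 - B + B * dl / avg_dl))
}
```

When a document has average length (dl = avg_dl) and tf = 1, the length-normalization term collapses and the score reduces to field_weight * idf. Longer documents score lower for the same tf, and repeated terms give diminishing returns.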

Levenshtein typo tolerance

For tokens at least four characters long, the FST runs a Levenshtein automaton with edit distance one (or two for longer tokens). Fuzzy hits get a 0.5 score multiplier so that, all else being equal, exact matches rank ahead. The “did you mean” hint surfaces the lowest-distance code variant when the user’s literal query returns nothing.
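The length gate and the 0.5 multiplier can be illustrated with a plain dynamic-programming edit distance. This deliberately swaps in a simpler technique: the index walks a Levenshtein automaton over the FST, whereas this sketch compares two strings directly. The length at which distance two becomes allowed is not stated, so it is a hypothetical parameter (two_edit_min_len).

```rust
const MIN_FUZZY_LEN: usize = 4;
const FUZZY_MULTIPLIER: f32 = 0.5;

/// Classic two-row Levenshtein distance over chars.
pub fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut cur = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let substitute = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            cur[j] = substitute.min(prev[j] + 1).min(cur[j - 1] + 1);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Maximum edits allowed for a query token.
/// `two_edit_min_len` is a hypothetical threshold, not taken from the source.
pub fn max_edits(token: &str, two_edit_min_len: usize) -> usize {
    let n = token.chars().count();
    if n < MIN_FUZZY_LEN {
        0
    } else if n >= two_edit_min_len {
        2
    } else {
        1
    }
}

/// Returns the score multiplier for `term` given `query`, or None if no match.
pub fn match_multiplier(query: &str, term: &str, two_edit_min_len: usize) -> Option<f32> {
    let d = edit_distance(query, term);
    if d == 0 {
        Some(1.0)
    } else if d <= max_edits(query, two_edit_min_len) {
        Some(FUZZY_MULTIPLIER)
    } else {
        None
    }
}
```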

Prebuilt text region

The prebuilt_text region in the catalog binary stores both the FST bytes and the posting arena per field, so loaders skip the most expensive build phase. Index::build_with_prebuilt_text reuses these bytes directly; Index::build is only needed when generating a fresh catalog.bin from a Corpus.

Facets and numeric filters

The index splits filters into two shapes by data type. Categorical filters live in FacetIndex; numeric range filters live in NumericIndex.

Categorical facets

Each facet axis interns its values to dense u16 ids at build time and stores a Vec<RoaringBitmap> indexed by id. Lookup is one HashMap<String, u16> probe (or HashMap<T, u16> for integer axes) to translate user input into an id, followed by one indexed access into the Vec. RoaringBitmap intersections drive the AND across multiple selected values within an axis, and across multiple axes within a query.

Axes the index currently exposes:

  • dept (string) — first segment of the course code, e.g. 15, 21
  • level (string) — 100, 200, …, or grad
  • school (string) — SCS, CIT, MCS, etc.
  • attribute_tags, gened_tags, skills (string, multi-valued per course)
  • has_syllabus_terms (string, multi-valued)
  • units_int (integer) — discrete units bucket

Multi-valued axes simply OR together the per-value bitmaps within a single course before pushing into the postings layer.
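A minimal sketch of one facet axis follows. It uses BTreeSet<u32> as a stand-in for RoaringBitmap so that it runs with the standard library alone; the type name AxisIndex and its methods are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

/// One facet axis: value -> dense u16 id -> set of doc ids.
/// BTreeSet stands in for RoaringBitmap in this sketch.
#[derive(Default)]
pub struct AxisIndex {
    ids: HashMap<String, u16>,
    docs: Vec<BTreeSet<u32>>,
}

impl AxisIndex {
    /// Records that `doc_id` carries `value`, interning the value on first use.
    /// The sketch assumes fewer than 65,536 distinct values per axis.
    pub fn insert(&mut self, value: &str, doc_id: u32) {
        let next = self.docs.len() as u16;
        let id = *self.ids.entry(value.to_string()).or_insert(next);
        if id == next {
            self.docs.push(BTreeSet::new());
        }
        self.docs[id as usize].insert(doc_id);
    }

    /// One hash probe to get the id, then one indexed access.
    pub fn docs_for(&self, value: &str) -> Option<&BTreeSet<u32>> {
        self.ids.get(value).map(|&id| &self.docs[id as usize])
    }
}

/// AND of two doc sets, as used across axes within a query.
pub fn intersect(a: &BTreeSet<u32>, b: &BTreeSet<u32>) -> BTreeSet<u32> {
    a.intersection(b).copied().collect()
}
```

A query for dept 15 and school SCS would then be intersect(dept.docs_for("15"), school.docs_for("SCS")) after unwrapping the two lookups.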

Smart facet pruning

When a query asks for facet counts (Query.with_facet_counts(true)), the index sorts axis values by current set size and returns at most 100 entries per axis. Long-tail values stay reachable via direct lookup but do not flood the response payload.
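The pruning step amounts to a sort and a truncate. The sketch below assumes descending order by count and breaks ties by value name for determinism; both choices, and the function name, are assumptions.

```rust
/// Keeps the `cap` largest facet values for one axis.
/// Assumptions: largest counts first, ties broken by value name.
pub fn prune_counts(mut counts: Vec<(String, u64)>, cap: usize) -> Vec<(String, u64)> {
    counts.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    counts.truncate(cap);
    counts
}
```

With the cap from the text, a caller would pass cap = 100.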

Numeric ranges

NumericIndex keeps three flavors that match the catalog data:

  • NumericFieldF32 — used for FCE-derived means (workload, instruction quality).
  • NumericFieldU32 — used for pagerank (scaled), section IDs, etc.
  • NumericFieldU16Optional — used for units (some courses have variable/no units).

Each variant stores values in a flat Vec keyed by doc id, so a range filter is a linear scan with predicate, intersected with whatever bitmap the rest of the query produces.
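For the optional-units case, a range filter over a flat Vec can be sketched as a single pass. The function name is hypothetical, and the result is returned as a sorted list of doc ids rather than a bitmap to keep the example dependency-free.

```rust
/// Linear scan over a doc-id-indexed column of optional values.
/// Docs with no value (None) never match. Bounds are inclusive.
pub fn range_filter(values: &[Option<u16>], lo: u16, hi: u16) -> Vec<u32> {
    let mut out = Vec::new();
    for (doc_id, value) in values.iter().enumerate() {
        if let Some(v) = *value {
            if lo <= v && v <= hi {
                out.push(doc_id as u32);
            }
        }
    }
    out
}
```

Because the column is indexed by doc id, the output is already sorted, which makes the later intersection with the facet result straightforward.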

Interning of low-cardinality fields

Course fields with few distinct values use Arc<str> interning across the catalog: description (rare duplicates from cross-listed courses), level, school, and the multi-valued tag fields. The interning pool lives in the courses region itself; bincode emits each unique string once and references it by index.
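In-memory interning with Arc<str> can be sketched as below. The Interner type is hypothetical and only illustrates how repeated values end up sharing one allocation; it says nothing about how the pool is laid out in the serialized region.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical string pool: equal strings share one Arc<str> allocation.
#[derive(Default)]
pub struct Interner {
    pool: HashSet<Arc<str>>,
}

impl Interner {
    /// Returns the pooled Arc for `s`, allocating only on first sight.
    pub fn intern(&mut self, s: &str) -> Arc<str> {
        if let Some(existing) = self.pool.get(s) {
            return Arc::clone(existing);
        }
        let fresh: Arc<str> = Arc::from(s);
        self.pool.insert(Arc::clone(&fresh));
        fresh
    }

    /// Number of distinct strings seen so far.
    pub fn len(&self) -> usize {
        self.pool.len()
    }
}
```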

Query engine

Queries flow through Searcher::query. A Searcher owns reusable scratch buffers (a per-doc f32 accumulator with epoch-based dirty tracking, plus a Vec<u32> of touched doc ids), so successive queries do not re-allocate or zero 9k+ entries.
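Epoch-based dirty tracking can be sketched as follows. Each slot carries the epoch in which it was last written, so starting a new query only bumps a counter instead of zeroing the whole accumulator. The Scratch type and its method names are hypothetical.

```rust
/// Reusable per-doc accumulator with epoch-based dirty tracking.
pub struct Scratch {
    scores: Vec<f32>,
    stamps: Vec<u32>, // epoch in which each slot was last written
    epoch: u32,
    touched: Vec<u32>, // doc ids written during the current query
}

impl Scratch {
    pub fn new(num_docs: usize) -> Self {
        Scratch {
            scores: vec![0.0; num_docs],
            stamps: vec![0; num_docs],
            epoch: 0,
            touched: Vec::new(),
        }
    }

    /// Must be called before the first `add` of every query.
    pub fn begin_query(&mut self) {
        self.touched.clear();
        self.epoch = self.epoch.wrapping_add(1);
        if self.epoch == 0 {
            // The counter wrapped: old stamps could collide, so reset once.
            self.stamps.iter_mut().for_each(|s| *s = 0);
            self.epoch = 1;
        }
    }

    /// Adds `score` to `doc`, lazily clearing a slot left over from an older query.
    pub fn add(&mut self, doc: u32, score: f32) {
        let i = doc as usize;
        if self.stamps[i] != self.epoch {
            self.stamps[i] = self.epoch;
            self.scores[i] = 0.0;
            self.touched.push(doc);
        }
        self.scores[i] += score;
    }

    /// Iterates (doc, score) pairs in first-touch order.
    pub fn hits(&self) -> impl Iterator<Item = (u32, f32)> + '_ {
        self.touched.iter().map(move |&d| (d, self.scores[d as usize]))
    }
}
```

The cost of a query is proportional to the number of postings it touches, not to the catalog size.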

Query shape

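pub struct Query {
    /// Free-text query. When None, BM25 is skipped entirely.
    pub text: Option<String>,
    pub fuzzy: bool,
    pub facets: FacetFilters,
    pub numeric: NumericFilters,
    /// Ordering applied to the result set.
    pub sort: SortOrder,
    pub limit: usize,
    pub offset: usize,
    /// Axes to tally; an empty Vec disables facet counts.
    pub count_facets: Vec<FacetAxis>,
}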

The wasm and REST surfaces both serialize this struct directly, and both accept text-only, filter-only, and combined queries. With no text, BM25 is skipped entirely and the engine sorts the filter intersection by the chosen SortOrder.

Sort orders

| Variant | Behavior |
| --- | --- |
| Relevance | BM25 with PageRank tiebreak. Falls back to PageRankDesc when no text query is present. |
| PageRankDesc | Sort by precomputed PageRank, descending. Used for browse-style listings. |
| FceHrsPerWeekAsc | Lightest workload first, requires the FCE hours field to be present. |
| FceInterestDesc, FceOverallTeachingDesc | FCE rating ranks. |
| CourseNumAsc, CodeAsc | Alphabetical / numeric on the course identifier. |

Score blending

For text queries, BM25 scores from the postings layer are multiplied by 1 + alpha * pagerank_normalized with alpha = 0.2. PageRank therefore breaks ties between otherwise equal BM25 scores, so a query for algorithms ranks 15-451 ahead of an obscure 1-credit elective with the same baseline score. Pure browse (no text) uses raw PageRank descending.
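The blend is a one-line function. In this sketch the name blend is illustrative, and pagerank_normalized is assumed to lie in the range 0 to 1, which the text does not state.

```rust
const ALPHA: f32 = 0.2;

/// Multiplies a BM25 score by a PageRank boost.
/// Assumption: `pagerank_normalized` lies in [0, 1], so the boost is at most 20%.
pub fn blend(bm25: f32, pagerank_normalized: f32) -> f32 {
    bm25 * (1.0 + ALPHA * pagerank_normalized)
}
```

Under that assumption, two courses with the same BM25 score are separated by at most a factor of 1.2, which is enough to order them but not enough to let a weak text match overtake a much stronger one.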

Top-K early termination

With Relevance sort and a known limit, the engine maintains a BinaryHeap<(NotNan<f32>, doc_id)> of size limit and prunes any candidate whose maximum possible remaining score cannot displace the current heap minimum. For typical limits (10-50), the heap fills almost immediately, after which pruning short-circuits the linear pass over candidates.
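The bounded-heap part of this can be sketched with the standard library. Two things differ from the description above and are worth naming: the sketch orders floats with f32::total_cmp in a small wrapper instead of NotNan, and it omits the upper-bound pruning, keeping only the "displace the current minimum" step. The names Scored and top_k are hypothetical.

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

/// Score/doc pair with a total order: higher score wins, then lower doc id.
#[derive(Clone, Copy)]
struct Scored {
    score: f32,
    doc: u32,
}

impl Ord for Scored {
    fn cmp(&self, other: &Self) -> Ordering {
        self.score
            .total_cmp(&other.score)
            .then_with(|| other.doc.cmp(&self.doc))
    }
}
impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Scored {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Scored {}

/// Returns the `k` best (doc, score) pairs, best first.
pub fn top_k(candidates: impl IntoIterator<Item = (u32, f32)>, k: usize) -> Vec<(u32, f32)> {
    if k == 0 {
        return Vec::new();
    }
    // Reverse turns the max-heap into a min-heap, so peek() is the weakest kept hit.
    let mut heap: BinaryHeap<Reverse<Scored>> = BinaryHeap::with_capacity(k);
    for (doc, score) in candidates {
        let cand = Scored { score, doc };
        if heap.len() < k {
            heap.push(Reverse(cand));
        } else if heap.peek().map_or(false, |Reverse(min)| cand > *min) {
            heap.pop();
            heap.push(Reverse(cand));
        }
    }
    let mut kept: Vec<Scored> = heap.into_iter().map(|Reverse(s)| s).collect();
    kept.sort_by(|a, b| b.cmp(a));
    kept.into_iter().map(|s| (s.doc, s.score)).collect()
}
```

Memory stays at k entries regardless of how many candidates stream through, which is the property the early-termination logic builds on.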

Facet counts and “did you mean”

When count_facets is non-empty, the engine produces per-axis tallies after applying the filter intersection. Counts are smart-pruned (sorted by size, capped at 100). The did_you_mean_codes field on the result populates when the literal query parses as a course code and one or more digit permutations of the trailing identifier match courses in the catalog (15-122 returning 15-122 and 15-127 for example).

Result cache

Searcher::with_cache_capacity enables an LRU result cache keyed by the bincode-serialized Query. Cache hits skip the entire query pipeline and return a clone of the previous result. Cache size is per-searcher and explicit; the wasm surface uses a 64-entry cache and the REST API a larger one.
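A byte-keyed LRU cache can be sketched with a HashMap plus a VecDeque that records recency. This is a deliberately simple stand-in (touching an entry is linear in the cache size, which is acceptable at 64 entries), not the cache the Searcher actually uses; the LruCache type here is hypothetical, and the key is assumed to be the serialized Query bytes.

```rust
use std::collections::{HashMap, VecDeque};

/// Small LRU cache keyed by serialized query bytes.
pub struct LruCache<V> {
    capacity: usize,
    map: HashMap<Vec<u8>, V>,
    order: VecDeque<Vec<u8>>, // front = least recently used
}

impl<V: Clone> LruCache<V> {
    pub fn new(capacity: usize) -> Self {
        LruCache {
            capacity,
            map: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// Returns a clone of the cached value and marks the key as most recent.
    pub fn get(&mut self, key: &[u8]) -> Option<V> {
        let hit = self.map.get(key).cloned()?;
        self.touch(key);
        Some(hit)
    }

    /// Inserts or replaces a value, evicting the least recently used entry if full.
    pub fn put(&mut self, key: Vec<u8>, value: V) {
        if self.capacity == 0 {
            return;
        }
        if self.map.insert(key.clone(), value).is_some() {
            self.touch(&key);
            return;
        }
        self.order.push_back(key);
        if self.order.len() > self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
    }

    /// Moves `key` to the most-recent end of the recency queue.
    fn touch(&mut self, key: &[u8]) {
        if let Some(pos) = self.order.iter().position(|k| k.as_slice() == key) {
            if let Some(k) = self.order.remove(pos) {
                self.order.push_back(k);
            }
        }
    }
}
```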