Colab tutorials

Notebooks

Runnable Colab notebooks that show how to work with the BUDOVA corpus: loading data, training baselines, and evaluating models.

Quickstart

Load BUDOVA text corpus

Load the corpus in five lines with datasets.load_dataset(), then filter by domain and inspect the NER annotations.

Runtime: 5 min | Hardware: CPU

Coming soon

Training

Train a baseline named-entity recognition model on the eight BUDOVA entity types and evaluate it on the held-out test set.

Runtime: 45 min | Hardware: T4 GPU

Coming soon

Evaluation

Evaluate an off-the-shelf Ukrainian ASR model on BUDOVA speech samples and compare word error rate (WER) across dialects and domains.

Runtime: 25 min | Hardware: T4 GPU

Coming soon
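
Until the evaluation notebook is released, the per-dialect comparison it describes can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the record fields `reference`, `hypothesis`, and `dialect`, the helper names, and the toy transcripts are all assumptions made for the example, and no BUDOVA data is used. WER is computed here as the word-level edit distance divided by the number of reference words, pooled per group.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(samples, key):
    """Pool errors and reference words per group, then divide."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for sample in samples:
        ref = sample["reference"].split()
        hyp = sample["hypothesis"].split()
        errors[sample[key]] += word_edit_distance(ref, hyp)
        words[sample[key]] += len(ref)
    return {group: errors[group] / words[group]
            for group in words if words[group] > 0}


# Toy records with hypothetical field names; real samples would pair
# BUDOVA reference transcripts with ASR output.
samples = [
    {"dialect": "A", "reference": "a b c d", "hypothesis": "a b x d"},
    {"dialect": "A", "reference": "e f", "hypothesis": "e f"},
    {"dialect": "B", "reference": "g h i j", "hypothesis": "g h"},
]

per_dialect = wer_by_group(samples, "dialect")
for dialect, wer in sorted(per_dialect.items()):
    print(f"{dialect}: WER = {wer:.3f}")
```

Pooling errors over all reference words in a group, rather than averaging per-utterance WER, keeps short utterances from dominating the score. Passing a different key (for example a hypothetical `domain` field) would give the per-domain breakdown in the same way.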

Utility

A minimal BPE tokeniser primed on the BUDOVA lexicon that reduces the token count for construction-domain prompts.

Runtime: 10 min | Hardware: CPU

Coming soon

Analysis

Reproduce every number shown on the landing page: coverage, per-domain counts, and inter-annotator agreement (IAA).

Runtime: 15 min | Hardware: CPU

Coming soon

Publishing

Package an annotated subset from the platform export and publish it to a Hugging Face dataset repository.

Runtime: 20 min | Hardware: CPU

Coming soon

Notebooks are authored and released alongside each dataset version. The links above become active once v1.0 ships.