Colab tutorials

Notebooks

Runnable Colab notebooks that demonstrate how to work with the BUDOVA corpus: loading data, training baselines, and evaluating models.

Quickstart

Load BUDOVA text corpus

Load the corpus in five lines with datasets.load_dataset(), then filter by domain and inspect the NER annotations.

Runtime: 5 min
Hardware: CPU
Coming soon
Training

Fine-tune XLM-R on BUDOVA NER

Baseline named-entity recognition on the eight BUDOVA entity types. Includes evaluation on the held-out test set.

Runtime: 45 min
Hardware: T4 GPU
Coming soon
Evaluation

ASR evaluation on construction speech

Evaluate an off-the-shelf Ukrainian ASR model on BUDOVA speech samples and compare word error rate (WER) across dialects and domains.

Runtime: 25 min
Hardware: T4 GPU
Coming soon
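
Until the ASR evaluation notebook is released, the per-group WER comparison it describes can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the function names, the (group, reference, hypothesis) tuple layout, and the sample transcripts are all assumptions made for this example, and a real evaluation would also normalise casing and punctuation first.

```python
from collections import defaultdict


def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp_words[:j].
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Corpus-level WER per group: total word errors divided by total reference words.

    `samples` is an iterable of (group, reference, hypothesis) tuples.
    This layout is hypothetical, not the BUDOVA schema.
    """
    errors = defaultdict(int)
    ref_lengths = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref_words = reference.split()
        errors[group] += word_edit_distance(ref_words, hypothesis.split())
        ref_lengths[group] += len(ref_words)
    # Groups with no reference words are skipped to avoid dividing by zero.
    return {g: errors[g] / ref_lengths[g] for g in errors if ref_lengths[g] > 0}


# Made-up transcripts, for illustration only.
example_samples = [
    ("dialect-A", "залити бетон у фундамент", "залити бетон у фундамент"),
    ("dialect-B", "встановити опалубку під плиту", "встановити опалубку плиту"),
]
print(wer_by_group(example_samples))
```

Aggregating errors and reference lengths before dividing gives a corpus-level WER per group, which avoids short utterances dominating an average of per-utterance rates.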
Utility

Lexicon-aware tokeniser

A minimal BPE tokeniser primed on the BUDOVA lexicon that reduces token count for construction-domain prompts.

Runtime: 10 min
Hardware: CPU
Coming soon
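
As a rough illustration of why a lexicon-primed BPE tokeniser can shorten domain text, the sketch below learns merge rules from a small word list and applies them to a term. It assumes that "primed" means learning merges from lexicon entries. The function names and the five sample terms are hypothetical, and the released notebook may work differently.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Start from single characters; repeated words count towards pair frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenise one word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical lexicon entries, for illustration only.
sample_lexicon = ["бетон", "бетонування", "арматура", "армування", "опалубка"]
sample_merges = learn_bpe(sample_lexicon, num_merges=30)
tokens = encode("бетонування", sample_merges)
print(len("бетонування"), "characters ->", len(tokens), "tokens")
```

A production tokeniser would also handle whitespace, unknown characters, and byte-level fallback; this sketch only shows the merge-learning step that makes frequent domain terms collapse into fewer tokens.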
Analysis

Corpus statistics and coverage

Reproduce every number shown on the landing page: coverage, per-domain counts, and inter-annotator agreement (IAA).

Runtime: 15 min
Hardware: CPU
Coming soon
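
The page does not say which agreement statistic BUDOVA reports as IAA. One common choice for two annotators is Cohen's kappa, sketched below purely as an assumption about what such a computation could look like. The function name and the placeholder labels are hypothetical and are not the BUDOVA entity types.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of matching if each annotator labelled
    # independently according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Placeholder labels, for illustration only.
annotator_1 = ["A", "A", "B", "O", "O", "O"]
annotator_2 = ["A", "B", "B", "O", "O", "O"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for matches expected by chance, which matters for NER-style data where one label (such as "outside any entity") dominates.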
Publishing

Push your own subset to Hugging Face

Package an annotated subset from the platform export and publish it to a Hugging Face dataset repository.

Runtime: 20 min
Hardware: CPU
Coming soon

Notebooks are authored and released alongside each dataset version. Links become active once v1.0 ships.

Collaboration

Join BUDOVA

We are looking for researchers, construction professionals, and language specialists to participate in the project.

Supported by
Microsoft AI for Good Lab