Load BUDOVA text corpus
Five lines of datasets.load_dataset() — plus how to filter by domain and inspect the NER annotations.
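Until the notebook is published, the workflow described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the repository ID, the column names (`domain`, `entities`), and the shape of each annotation (a dict with a `label` key) are placeholders, not the released BUDOVA schema.

```python
from collections import Counter

# Placeholders: the corpus has not shipped yet, so the repository ID and
# column names below are assumptions, not the published schema.
REPO_ID = "your-org/budova"
DOMAIN_COLUMN = "domain"
ENTITIES_COLUMN = "entities"


def load_budova(split="train"):
    """Load one split from the Hugging Face Hub (requires the `datasets` package)."""
    # Imported lazily so the helpers below also work on plain lists of dicts.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)


def filter_by_domain(records, domain):
    """Keep only the records whose domain column equals `domain`."""
    return [record for record in records if record[DOMAIN_COLUMN] == domain]


def entity_label_counts(records):
    """Count how often each NER label occurs across the given records."""
    counts = Counter()
    for record in records:
        for entity in record[ENTITIES_COLUMN]:
            counts[entity["label"]] += 1
    return counts
```

Both helpers accept any iterable of dict-like rows, so they can be tried on a hand-written list before the dataset is available. On a loaded `datasets.Dataset`, the built-in `Dataset.filter` method is likely the more efficient way to filter.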
Coming soon
Runnable Colab notebooks that demonstrate how to work with the BUDOVA corpus: load data, train baselines, evaluate.
Coming soon
Baseline named-entity recognition on the eight BUDOVA entity types. Includes evaluation on the held-out test set.
Coming soon
Evaluate an off-the-shelf Ukrainian ASR on BUDOVA speech samples and compare WER across dialects and domains.
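As a rough illustration of the comparison described above, word error rate (WER) can be computed as the word-level edit distance between reference and hypothesis, divided by the number of reference words, and then pooled per group. The sketch below assumes each sample is a dict with `reference` and `hypothesis` strings plus a grouping field such as `dialect`; those field names are hypothetical, and no text normalisation is applied.

```python
from collections import defaultdict


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER for one utterance: edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def wer_by_group(samples, group_key):
    """Pooled WER per group: total errors divided by total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for sample in samples:
        ref_words = sample["reference"].split()
        hyp_words = sample["hypothesis"].split()
        errors[sample[group_key]] += edit_distance(ref_words, hyp_words)
        words[sample[group_key]] += len(ref_words)
    return {group: errors[group] / words[group] for group in errors if words[group]}
```

Pooling errors over all reference words in a group weights long utterances more heavily than averaging per-utterance WER would. Either convention is defensible as long as it is reported.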
Coming soon
A minimal BPE tokeniser primed on BUDOVA lexicon that reduces token count for construction-domain prompts.
Coming soon
Reproduce every number shown on the landing page: coverage, per-domain counts, IAA.
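The page does not say which inter-annotator agreement (IAA) statistic is reported. As one common option for two annotators labelling the same items, Cohen's kappa can be sketched as below. Treat the choice of metric as an assumption rather than the project's method.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in order."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same number of items")
    if not labels_a:
        raise ValueError("at least one labelled item is required")

    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance. For span-level NER annotations, agreement is often reported as pairwise F1 instead, since the set of "items" is not fixed.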
Coming soon
Package an annotated subset from the platform export and publish it to a Hugging Face dataset repository.
Coming soon
Notebooks are authored and released alongside each dataset version. Links become active once v1.0 ships.
We are looking for researchers, construction professionals, and language specialists to participate in the project.