Hugging Face Datasets

Dataset Explorer

Name: BUDOVA Ukrainian Construction Speech Corpus
Creator: BUDOVA
License: https://creativecommons.org/licenses/by/4.0/

Preview and explore BUDOVA datasets before downloading.

budova/speech-uk-construction

Annotated construction speech — collection in progress, target 100+ hours

Size: ~45 GB | Format: WAV/FLAC + JSON-lines | License: CC-BY 4.0 | Splits: train / validation / test | Tasks: ASR, Speaker ID

To be published with v1.0; collection is in progress.

budova/text-uk-construction

Construction documentation with NER — collection in progress, target 100M+ tokens

Size: ~2.1 GB | Format: JSON-lines | License: CC-BY 4.0 | Splits: train / validation / test | Tasks: NER, Text Classification

To be published with v1.0; collection is in progress.

budova/construction-lexicon

5-10K bilingual construction terms

Size: ~12 MB | Format: JSON | License: CC-BY 4.0 | Splits: full | Tasks: Translation, Terminology

To be published with v1.0; collection is in progress.

Sample Records

Preview of the text corpus data format.

No merged entries yet — showing illustrative samples.

ID	Text	Source	Entities	Category
doc_001	Застосування арматури класу A500C для фундаментних конструкцій згідно ДБН...	dbn	4	materials
doc_042	Монтаж опалубки перекриття виконується після перевірки несучої здатності...	spec	3	process
doc_108	Протипожежний захист сталевих колон забезпечується нанесенням вогнезахисного...	safety	5	safety
doc_215	Бетонна суміш класу C25/30 з додаванням пластифікатора для підвищення...	dbn	3	materials
doc_330	Геодезичний контроль вертикальності стін здійснюється теодолітом з точністю...	spec	2	process
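
The preview rows above are tab-separated, so they can be read with Python's standard csv module. The following is a minimal sketch, assuming the column names from the header row and that Entities is an integer entity count. The two rows are copied from the illustrative samples, and parse_preview is a hypothetical helper name, not part of any BUDOVA tooling.

```python
import csv
import io
from collections import Counter

# Two illustrative rows copied from the preview table (tab-separated).
PREVIEW_TSV = (
    "ID\tText\tSource\tEntities\tCategory\n"
    "doc_001\tЗастосування арматури класу A500C для фундаментних конструкцій згідно ДБН...\tdbn\t4\tmaterials\n"
    "doc_042\tМонтаж опалубки перекриття виконується після перевірки несучої здатності...\tspec\t3\tprocess\n"
)


def parse_preview(tsv_text):
    """Parse tab-separated preview rows into a list of dicts (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        # Assumption: the Entities column holds an integer count of annotated entities.
        row["Entities"] = int(row["Entities"])
        rows.append(row)
    return rows


rows = parse_preview(PREVIEW_TSV)

# Count documents per category and sum the entity counts.
by_category = Counter(row["Category"] for row in rows)
total_entities = sum(row["Entities"] for row in rows)
```

The same pattern should carry over to a full export as long as the column layout matches the preview.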

Quick Start

pip install datasets

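from datasets import load_dataset

# Load the speech corpus from the Hugging Face Hub (available once v1.0 is published).
ds = load_dataset("budova/speech-uk-construction")
# Print the first record of the train split.
print(ds["train"][0])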

Data Format

{
  "text": "Застосування арматури класу A500C...",
  "labels": [
    {"start": 28, "end": 33, "label": "material", "text": "A500C"}
  ],
  "domain": "reinforcement",
  "validation_score": 0.95
}
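
A record in this shape can be sanity-checked by slicing the text with each label's offsets and comparing the result to the label's own text field. The sketch below assumes that start and end are character offsets with an exclusive end, which the page does not state explicitly. find_span_mismatches is a hypothetical helper, and both records are made up for illustration.

```python
import json


def find_span_mismatches(record):
    """Return the labels whose [start, end) slice of the text differs from label["text"]."""
    text = record["text"]
    mismatches = []
    for label in record.get("labels", []):
        # Assumption: offsets count characters (not bytes) and `end` is exclusive.
        if text[label["start"]:label["end"]] != label["text"]:
            mismatches.append(label)
    return mismatches


# Two JSON-lines records: the first has consistent offsets, the second does not.
jsonl = "\n".join([
    json.dumps({
        "text": "Застосування арматури класу A500C",
        "labels": [{"start": 28, "end": 33, "label": "material", "text": "A500C"}],
    }, ensure_ascii=False),
    json.dumps({
        "text": "Застосування арматури класу A500C",
        "labels": [{"start": 20, "end": 30, "label": "material", "text": "A500C"}],
    }, ensure_ascii=False),
])

# JSON-lines: one JSON object per non-empty line.
records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
results = [find_span_mismatches(record) for record in records]
```

An empty list means every label in the record lines up with its text. A check like this is cheap to run over a whole split before training an NER model.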

Citation

If you use BUDOVA in research, please cite:

@misc{budova2026,
  title     = {BUDOVA: Building Ukrainian Domain-Specific, Open Voice \& Text Archives},
  author    = {Dolhopolov, Serhii},
  year      = {2026},
  publisher = {Hugging Face \& Zenodo},
  license   = {CC-BY-4.0},
  url       = {https://huggingface.co/datasets/budova}
}

Datasets are in active collection. Public Hugging Face releases ship with v1.0; live progress is on the Coverage page.
