Up to $50,000 Funding
Grant support for language data collection for low-resource European languages.
BUDOVA: Building Ukrainian Domain-Specific, Open Voice & Text Archives — the first open Ukrainian construction-domain language dataset for speech and text to advance AI technologies in Ukraine.
A Microsoft AI for Good Lab initiative under EU Digital Unlock, supporting digital inclusion for low-resource European languages and building language resources for 10 European languages.
Cloud computing credits for up to 2 years for data processing and validation.
Research collaboration with AI for Good Lab, EPFL, and ETH Zürich.
Integration with Apertus, EuroLLM, SmolLM3 and other multilingual models.
Despite 30–46 million speakers, Ukrainian remains critically underrepresented in AI technologies — especially in specialized domains.
There are no Ukrainian technical language datasets, no speech recognition systems built for construction sites, and no AI tools that support Ukrainian building codes.
Northern, Southwestern (Volynian, Galician, Podilian, Transcarpathian, Bukovynian, Hutsul, Boiko, Lemko), and Southeastern dialect groups remain undocumented in technical contexts, so AI systems risk failing for speakers of non-standard varieties.
Cyrillic script, 7 cases, 3 genders, and movable stress pose distinctive tokenization challenges and contribute to a performance gap relative to English.
The first comprehensive Ukrainian technical language dataset with dialectal coverage across all major regions of Ukraine.
100+ hours of annotated construction speech with transcriptions, covering Northern, Southwestern, and Southeastern Ukrainian dialects.
100M+ tokens of construction documentation — DBN codes, technical specifications, safety protocols with NER annotations.
Baseline ASR and NER models, data processing pipelines, documentation per Datasheets for Datasets standards. Available on Hugging Face & Zenodo.
Phased deployment with quarterly releases for community feedback.
Infrastructure setup, partnership agreements, GDPR protocols, participant recruitment.
Phased collection: 25h speech + 25M tokens quarterly until reaching 100+h and 100M+ tokens.
Two-stage validation: crowdsourced transcription, then expert review. Inter-annotator agreement >0.70.
Training ASR and NER models on collected data. Achieving WER <25% for construction terminology.
Final dataset release on Hugging Face and Zenodo with DOIs. Documentation, sustainability transition.
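The two numeric targets in the timeline above (inter-annotator agreement above 0.70 and WER below 25%) can be checked with a few lines of code. The sketch below is illustrative only: the page does not say which agreement statistic is used, so Cohen's kappa for two annotators is an assumption, and the function names are hypothetical rather than part of any BUDOVA pipeline. WER is computed in the usual way, as word-level edit distance divided by the number of reference words.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cohen_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the source only states "agreement > 0.70"; kappa is one
    common choice, not necessarily the statistic the project uses.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's label proportions.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A batch would meet the stated targets when `cohen_kappa(...) > 0.70` on the doubly annotated subset and `word_error_rate(...) < 0.25` on held-out construction speech. In practice WER is usually aggregated over a whole test set (total errors over total reference words) rather than averaged per utterance.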
Large-scale datasets and benchmarks for training, evaluating, and testing Ukrainian construction-domain NLP models.
Annotated construction speech recordings with transcriptions covering all major Ukrainian dialects — on-site discussions, briefings, and consultations from across Ukraine.
Construction documentation with NER annotations: DBN codes, technical specifications, safety protocols, and project documentation.
Domain-specific terminology covering building materials, techniques, safety standards, and regulatory language across construction sub-domains.
How BUDOVA annotates Ukrainian construction language: entity, register, and context labels (illustrative examples; the fine-tuned model ships with v1.0).
«Monolithic reinforced-concrete load-bearing structures of elevated responsibility class must be designed accounting for seismic loads per DBN V.1.1-12:2014.»
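To make the idea of an entity label concrete, here is a minimal sketch of how a regulation reference in the sentence above could be stored as a character-offset span. Everything in it is hypothetical: the `Entity` class, the `REGULATION` label name, and the regular expression are illustrations, not the BUDOVA annotation schema, and the pattern only covers the Latin-transliterated form of the code shown in the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # hypothetical label name, not the project's schema
    text: str    # surface form copied from the sentence


# Hypothetical pattern: "DBN", a section code such as "V.1.1-12",
# a colon, and a four-digit year.
DBN_CODE = re.compile(r"DBN\s+[A-Z]\.\d+(?:\.\d+)*-\d+:\d{4}")


def tag_regulations(sentence: str) -> list[Entity]:
    """Return one Entity per DBN-style code found in the sentence."""
    return [
        Entity(m.start(), m.end(), "REGULATION", m.group())
        for m in DBN_CODE.finditer(sentence)
    ]


sentence = (
    "Monolithic reinforced-concrete load-bearing structures of elevated "
    "responsibility class must be designed accounting for seismic loads "
    "per DBN V.1.1-12:2014."
)
for entity in tag_regulations(sentence):
    print(entity.label, entity.text)
```

Storing offsets rather than only the matched string keeps the annotation aligned with the source text, which is what most NER training formats expect. A rule like this could at most pre-annotate codes for human review; materials, structural elements, and register labels need trained annotators or a model.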
All outputs — data, models, code — under open licenses.
Speech and text corpora with full annotations, metadata, and a construction lexicon of 5–10K terms.
Fine-tuned ASR model (Whisper / Wav2Vec2) and NER models for the construction domain with reproducible scripts.
Processing pipelines, anonymization utilities, validation workflows — for replication by other low-resource projects.
Download our datasets from Hugging Face and start building with Ukrainian construction language data.
If you use BUDOVA resources in your research, please cite:
@misc{budova2026,
  title     = {BUDOVA: Building Ukrainian Domain-Specific, Open Voice \& Text Archives},
  author    = {Dolhopolov, Serhii},
  year      = {2026},
  publisher = {Hugging Face \& Zenodo},
  license   = {CC-BY-4.0},
  url       = {https://huggingface.co/datasets/budova}
}
Concrete research outputs and open resources produced throughout the project lifecycle.
Versioned speech and text datasets published on Hugging Face and Zenodo with full documentation and DOIs.
Fine-tuned ASR (Whisper / Wav2Vec2) and NER models for Ukrainian construction domain with reproducible scripts.
Peer-reviewed publications at NLP and AI conferences documenting methodology, benchmarks, and findings.
Processing pipelines, anonymization utilities, validation workflows, and documentation per Datasheets for Datasets standards.
An interdisciplinary team with expertise in AI, NLP, construction, cybersecurity, and data management.
AI researcher and entrepreneur specializing in Natural Language Processing. Founder of KernelGlide — AI solutions for 10+ clients in construction. Principal Investigator on state-funded research (UAH 2.9M) in multimodal content analysis. Author of the textbook "Modeling AI Tasks" (546 pp.).
We are looking for researchers, construction professionals, and language specialists to participate in the project.