Introduction
We are excited to announce the official launch of the BUDOVA project — Building Ukrainian Domain-Specific, Open Voice & Text Archives. This initiative aims to create the first comprehensive, open-access Ukrainian construction-domain language dataset for AI research and applications.
Supported by the Microsoft AI for Good Lab through the LINGUA Open Call, in collaboration with EPFL and ETH Zürich, BUDOVA addresses a critical gap in Ukrainian-language AI resources, particularly in specialized technical domains.
Why BUDOVA Matters
Although Ukrainian has an estimated 30–46 million speakers, it remains critically underrepresented in AI technologies. Less than 0.6% of web content is in Ukrainian, and virtually no technical language datasets exist for the construction domain.
Ukraine’s reconstruction needs, estimated at $524 billion, make AI-powered construction tools not just useful but essential. BUDOVA will provide the foundational language resources needed to build these tools.
What We Are Building
BUDOVA will produce three key resources:
- Speech Dataset: 100+ hours of annotated, transcribed construction speech recordings covering all major Ukrainian dialect groups (Northern, Southwestern, and Southeastern).
- Text Corpus: 100M+ tokens of construction documentation (DBN state building codes, technical specifications, and safety protocols) with NER annotations.
- Construction Lexicon: 5,000–10,000 domain-specific terms covering building materials, techniques, safety standards, and regulatory language.
All outputs will be released under open licenses (CC-BY 4.0 for data, Apache 2.0 for models, MIT for code) and hosted on Hugging Face and Zenodo.
Project Timeline
The project spans 24 months (January 2026 – December 2027) across five partially overlapping phases:
- Phase 1 (Months 1–3): Infrastructure setup, partnership agreements, GDPR protocols
- Phase 2 (Months 4–15): Phased data collection at 25 hours of speech and 25M tokens per quarter (four quarters, reaching the 100-hour and 100M-token targets)
- Phase 3 (Months 6–18): Two-stage annotation and validation
- Phase 4 (Months 12–21): Baseline ASR and NER model development
- Phase 5 (Months 21–24): Final release on Hugging Face and Zenodo with DOIs
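For readers who want to check the schedule arithmetic, here is a minimal sketch in Python. It assumes that Month 1 corresponds to January 2026 and that Phase 2 is divided into whole three-month quarters. The names `PHASES`, `month_to_date`, and `collection_totals` are illustrative only and are not part of any BUDOVA tooling.

```python
from datetime import date

# Phase windows as (first month, last month), 1-indexed and inclusive,
# copied from the timeline above.
PHASES = {
    "Phase 1": (1, 3),
    "Phase 2": (4, 15),
    "Phase 3": (6, 18),
    "Phase 4": (12, 21),
    "Phase 5": (21, 24),
}

# Assumption: project Month 1 is January 2026.
PROJECT_START = date(2026, 1, 1)


def month_to_date(project_month: int) -> date:
    """Return the first day of the calendar month for a 1-indexed project month."""
    total = (PROJECT_START.month - 1) + (project_month - 1)
    return date(PROJECT_START.year + total // 12, total % 12 + 1, 1)


def collection_totals(first_month: int, last_month: int,
                      hours_per_quarter: int = 25,
                      tokens_per_quarter: int = 25_000_000):
    """Return (quarters, speech hours, tokens) collected over an inclusive month range."""
    quarters = (last_month - first_month + 1) // 3
    return quarters, quarters * hours_per_quarter, quarters * tokens_per_quarter


# Print each phase with its calendar window.
for name, (first, last) in PHASES.items():
    start, end = month_to_date(first), month_to_date(last)
    print(f"{name}: {start:%B %Y} to {end:%B %Y}")

# Check that the quarterly collection rate reaches the stated dataset targets.
quarters, hours, tokens = collection_totals(*PHASES["Phase 2"])
print(f"Phase 2: {quarters} quarters, {hours} h speech, {tokens:,} tokens")
```

Under these assumptions, Phase 2 runs from April 2026 to March 2027, and its four quarters match the 100-hour and 100M-token targets listed under What We Are Building.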
Get Involved
We are looking for researchers, construction professionals, and language specialists to participate in the project. Whether you are an NLP researcher, a construction engineer, or a linguist interested in Ukrainian, we welcome your contribution.
Visit our contact section to get in touch with the team, or explore the documentation to learn more about our datasets and methodology.