LINGUA Open Call · Microsoft AI for Good

Building Ukrainian Construction Language for AI

BUDOVA: Building Ukrainian Domain-Specific, Open Voice & Text Archives — the first open Ukrainian construction-domain language dataset for speech and text to advance AI technologies in Ukraine.

70+ hours of speech (target)
70M+ text tokens (target)
367B USD reconstruction
CC-BY 4.0 license
BUDOVA Pipeline (interactive demo)
tokens: арматурний · каркас · фундаменту ("foundation reinforcement cage")
entities: MAT арматурний · ELM каркас · ELM фундаменту
Initiative

LINGUA Open Call

A Microsoft AI for Good Lab initiative under EU Digital Unlock, supporting digital inclusion for low-resource European languages and building language resources for 10 European languages.

Microsoft

Up to $50,000 Funding

Grant support for language data collection for low-resource European languages.

Azure Compute Resources

Cloud computing credits for up to 2 years for data processing and validation.

Technical Support

Research collaboration with AI for Good Lab, EPFL, and ETH Zürich.

Open Models

Integration with Apertus, EuroLLM, SmolLM3 and other multilingual models.

Challenge

Why This Is Critical

Despite 30–46 million speakers, Ukrainian remains critically underrepresented in AI technologies — especially in specialized domains.

<0.6%
of web content in Ukrainian

Digital Exclusion

No technical language datasets, speech recognition systems for construction sites, or AI tools supporting Ukrainian building codes.

3
dialect groups, 15+ subdialects

Dialect Diversity

Northern, Southwestern (Volynian, Galician, Podilian, Transcarpathian, Bukovynian, Hutsul, Boiko, Lemko), and Southeastern dialect groups remain undocumented in technical contexts — AI systems risk failure for non-standard speakers.

15–25%
performance gap

Morphological Complexity

Cyrillic script, 7 cases, 3 genders, and mobile stress create distinct tokenization challenges and a 15–25% performance gap vs. English.
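One part of the tokenization cost is easy to check directly: in UTF-8, each Ukrainian Cyrillic letter takes two bytes, so byte-level tokenizers start from longer sequences than they do for Latin-script text. The sketch below only measures that byte overhead; the word pair is illustrative and not taken from the BUDOVA corpus, and it does not by itself account for the performance gap quoted above.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`."""
    if not text:
        raise ValueError("text must not be empty")
    return len(text.encode("utf-8")) / len(text)


# Illustrative word pair (an assumption, not corpus data).
uk_word = "фундамент"   # Ukrainian for "foundation"
en_word = "foundation"

uk_ratio = utf8_bytes_per_char(uk_word)
en_ratio = utf8_bytes_per_char(en_word)
print(f"uk: {uk_ratio} bytes/char, en: {en_ratio} bytes/char")
```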

What We Build

Open Resources for Construction

The first comprehensive Ukrainian technical language dataset with dialectal coverage across all major regions of Ukraine.

Speech Dataset

100+ hours of annotated construction speech with transcriptions, covering Northern, Southwestern, and Southeastern Ukrainian dialects.

<25% WER
>0.70 Cohen's κ

Text Corpus

100M+ tokens of construction documentation — DBN codes, technical specifications, safety protocols with NER annotations.

>0.85 NER F1
5–10K new terms
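For readers unfamiliar with the NER F1 target, here is a minimal sketch of entity-level F1 under exact span-and-label matching. The page does not say which matching scheme BUDOVA uses, so exact matching is an assumption; the spans reuse the MAT/ELM labels from the pipeline demo, and the prediction is invented for illustration.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def entity_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Entity-level F1: a prediction counts only if span and label both match."""
    if not gold and not pred:
        return 1.0
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Character offsets into "арматурний каркас фундаменту".
gold_spans = {(0, 10, "MAT"), (11, 17, "ELM"), (18, 28, "ELM")}
# Hypothetical model output with one wrong label on the last span.
pred_spans = {(0, 10, "MAT"), (11, 17, "ELM"), (18, 28, "MAT")}

f1 = entity_f1(gold_spans, pred_spans)
print(f"entity-level F1 = {f1:.3f}")
```

With two of three entities correct, precision and recall are both 2/3, so this toy prediction would fall short of the >0.85 target.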

Technical Infrastructure

Baseline ASR and NER models, data processing pipelines, documentation per Datasheets for Datasets standards. Available on Hugging Face & Zenodo.

WAV/FLAC · Lossless
DOI · Persistent
Projected Impact
Speech: 100h+
Text: 100M+
NER F1: >0.85
ASR WER: <25%
Roadmap

24 Months, 5 Phases

Phased deployment with quarterly releases for community feedback.

01 / 05 · Phase 1

Infrastructure & Preparation

Months 1–3 · Jan – Mar 2026

Infrastructure setup, partnership agreements, GDPR protocols, participant recruitment.

02 / 05 · Phase 2

Data Collection

Months 4–15 · Apr 2026 – Mar 2027

Phased collection: 25h speech + 25M tokens quarterly until reaching 100+h and 100M+ tokens.
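The quarterly arithmetic can be checked with a short sketch. It assumes collection runs in four equal quarters across months 4 to 15 and that each quarter hits exactly 25 hours and 25M tokens; the constant and function names are illustrative.

```python
QUARTERLY_SPEECH_HOURS = 25      # from the roadmap
QUARTERLY_TEXT_TOKENS_M = 25     # millions of tokens, from the roadmap
COLLECTION_MONTHS = range(4, 16)  # project months 4..15 inclusive


def cumulative_targets():
    """Return (end_month, cumulative_hours, cumulative_tokens_M) per quarter."""
    n_quarters = len(COLLECTION_MONTHS) // 3
    rows = []
    for quarter in range(1, n_quarters + 1):
        end_month = COLLECTION_MONTHS[0] + 3 * quarter - 1
        rows.append((
            end_month,
            quarter * QUARTERLY_SPEECH_HOURS,
            quarter * QUARTERLY_TEXT_TOKENS_M,
        ))
    return rows


schedule = cumulative_targets()
for end_month, hours, tokens_m in schedule:
    print(f"by end of month {end_month}: {hours} h speech, {tokens_m}M tokens")
```

Under these assumptions the 100 h and 100M-token marks are reached at the end of month 15, the last month of Phase 2.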

03 / 05 · Phase 3

Annotation & Validation

Months 6–18 · Parallel with collection

Two-stage validation: crowdsourced transcription, then expert review. Inter-annotator agreement >0.70.
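As a reference point for the agreement threshold, here is a minimal sketch of Cohen's κ for two annotators labeling the same items. The label sequences are invented for illustration, and the page does not specify whether agreement is computed per token, per span, or per document; per-item labels are an assumption here.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical per-token labels from two annotators ("O" = outside any entity).
annotator_1 = ["MAT", "ELM", "ELM", "O", "O", "MAT", "O", "ELM", "O", "O"]
annotator_2 = ["MAT", "ELM", "O", "O", "O", "MAT", "O", "ELM", "O", "ELM"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}, passes 0.70 threshold: {kappa > 0.70}")
```

In this toy case the annotators agree on 8 of 10 items, yet κ comes out just under 0.70 once chance agreement is discounted, which is why raw percent agreement is not used as the threshold.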

04 / 05 · Phase 4

Baseline Model Development

Months 12–21 · Dec 2026 – Sep 2027

Training ASR and NER models on collected data. Achieving WER <25% for construction terminology.
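For reference, word error rate is the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the reference length. The sketch below is a plain dynamic-programming version with whitespace tokenization; it makes no claim about BUDOVA's evaluation scripts or text normalization, and the transcripts are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_words)


# Illustrative transcripts: the hypothesis gets one case ending wrong.
reference_text = "арматурний каркас фундаменту"
hypothesis_text = "арматурний каркас фундамент"

wer = word_error_rate(reference_text, hypothesis_text)
print(f"WER = {wer:.2%}")
```

A single wrong inflection counts as a full word error here, which is one reason morphologically rich languages tend to score worse on WER than the same audio quality would suggest.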

05 / 05 · Phase 5

Release & Transfer

Months 21–24 · Sep – Dec 2027

Final dataset release on Hugging Face and Zenodo with DOIs. Documentation, sustainability transition.

Datasets

Open Research Resources

Large-scale datasets and benchmarks for training, evaluating, and testing Ukrainian construction-domain NLP models.

Speech Dataset

100+ hours of speech

Annotated construction speech recordings with transcriptions covering all major Ukrainian dialects — on-site discussions, briefings, and consultations from across Ukraine.

WAV / MP3 / WebM · 48 kHz mono · Target SNR ≥ 20 dB
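As a rough illustration of the SNR target, signal-to-noise ratio in decibels is 10·log10 of the ratio between signal power and noise power. The sketch assumes a recording has already been split into a speech segment and a noise-only segment; how BUDOVA actually estimates SNR is not stated on this page, and the sample values are synthetic.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a block of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in dB from a speech segment and a noise-only segment."""
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise segment has zero power")
    return 10 * math.log10(mean_power(speech) / noise_power)


# Synthetic segments: speech amplitude is 10x the noise amplitude.
speech_segment = [0.5, -0.5] * 100
noise_segment = [0.05, -0.05] * 100

snr = snr_db(speech_segment, noise_segment)
print(f"SNR = {snr:.1f} dB, meets target: {snr >= 20 - 1e-9}")
```

A tenfold amplitude ratio is a hundredfold power ratio, which is exactly the 20 dB boundary.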
Hugging Face

Text Corpus

100M+ text tokens

Construction documentation with NER annotations — DBN codes, technical specifications, safety protocols, and project documentation.

JSON-lines · NER annotations · Structured
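To make the JSON-lines format concrete, here is a sketch of reading one annotated record. The schema is hypothetical: the field names `text`, `entities`, `start`, `end`, and `label` are assumptions for illustration, not the published BUDOVA schema; only the MAT/ELM labels and the example phrase come from the pipeline demo on this page.

```python
import json

# One record per line; this schema is an assumption, not the released format.
SAMPLE_JSONL = (
    '{"text": "арматурний каркас фундаменту", "entities": ['
    '{"start": 0, "end": 10, "label": "MAT"}, '
    '{"start": 11, "end": 17, "label": "ELM"}, '
    '{"start": 18, "end": 28, "label": "ELM"}]}\n'
)


def read_spans(jsonl_text: str):
    """Parse JSON-lines text into a list of [(surface form, label), ...] per record."""
    records = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        text = record["text"]
        spans = []
        for entity in record["entities"]:
            start, end = entity["start"], entity["end"]
            # Offsets are treated as Python string (code point) indices.
            if not (0 <= start < end <= len(text)):
                raise ValueError(f"span {start}:{end} outside text of length {len(text)}")
            spans.append((text[start:end], entity["label"]))
        records.append(spans)
    return records


parsed = read_spans(SAMPLE_JSONL)
for surface, label in parsed[0]:
    print(label, surface)
```

Validating offsets on load is a cheap guard: Cyrillic text makes byte offsets and character offsets diverge, so a mismatch in convention shows up immediately as an out-of-range or misaligned span.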
Hugging Face

Construction Lexicon

5–10K domain terms

Domain-specific terminology covering building materials, techniques, safety standards, and regulatory language across construction sub-domains.

Bilingual · Structured · Searchable
Hugging Face
Language: Ukrainian (uk-UA) · Northern, Southwestern & Southeastern dialects
License: CC-BY 4.0 (data) · Apache 2.0 (models) · MIT (code)
Hosting: Hugging Face · Zenodo (DOI) · GitHub
Privacy: GDPR · Voice anonymization · PII removal
Examples

Live examples from the corpus.

How BUDOVA annotates Ukrainian construction language — entity, register, and context labels (illustrative examples; the fine-tuned model ships with v1.0).

«Monolithic reinforced-concrete load-bearing structures of elevated responsibility class must be designed accounting for seismic loads per DBN V.1.1-12:2014.»

Register: Formal · normative
Material: Monolithic reinforced concrete
Source: DBN V.1.1-12:2014
NER tags: 4 entities · 0 ambiguities
Open Resources

Full Transparency. No Restrictions.

All outputs — data, models, code — under open licenses.

Datasets

Speech and text corpora with full annotations, metadata, and a construction lexicon of 5–10K terms.

CC-BY 4.0 · Hugging Face · Zenodo

Baseline Models

Fine-tuned ASR model (Whisper / Wav2Vec2) and NER models for the construction domain with reproducible scripts.

Apache 2.0 · GitHub · HF Models

Infrastructure

Processing pipelines, anonymization utilities, validation workflows — for replication by other low-resource projects.

MIT License · GitHub · Docs
Ready to Get Started?

Download our datasets from Hugging Face and start building with Ukrainian construction language data.

Browse All Datasets
How to Cite

If you use BUDOVA resources in your research, please cite:

@misc{budova2026,
  title     = {BUDOVA: Building Ukrainian Domain-Specific, Open Voice \& Text Archives},
  author    = {Dolhopolov, Serhii},
  year      = {2026},
  publisher = {Hugging Face \& Zenodo},
  license   = {CC-BY-4.0},
  url       = {https://huggingface.co/datasets/budova}
}
Deliverables

Project Outputs

Concrete research outputs and open resources produced throughout the project lifecycle.

01 / 04
3 datasets

Dataset Releases

Versioned speech and text datasets published on Hugging Face and Zenodo with full documentation and DOIs.

25% · ETA Q4 2026
  • Infrastructure & GDPR protocols
  • First 25h speech collected
  • Text corpus v0.1
02 / 04
2+ models

Baseline Models

Fine-tuned ASR (Whisper / Wav2Vec2) and NER models for Ukrainian construction domain with reproducible scripts.

5% · ETA Q2 2027
  • Model architecture selected
  • Training pipeline setup
  • Benchmark evaluation
03 / 04
3+ papers

Scientific Papers

Peer-reviewed publications at NLP and AI conferences documenting methodology, benchmarks, and findings.

10% · ETA Q3 2027
  • Literature review completed
  • Methodology paper draft
  • Conference submission
04 / 04
100% open-source

Open-Source Tools

Processing pipelines, anonymization utilities, validation workflows, and documentation per Datasheets for Datasets standards.

15% · ETA Q4 2027
  • Repository structure created
  • Anonymization pipeline
  • Validation workflows
Research Group

Project Team

An interdisciplinary team with expertise in AI, NLP, construction, cybersecurity, and data management.

01 / 01 · Project Team Lead

Serhii Dolhopolov

KNUCA AI Lab · Ph.D. Student in Computer Science

AI researcher and entrepreneur specializing in Natural Language Processing. Founder of KernelGlide — AI solutions for 10+ clients in construction. Principal Investigator on state-funded research (UAH 2.9M) in multimodal content analysis. Author of the textbook "Modeling AI Tasks" (546 pp.).

Host institution
Kyiv National University of Construction and Architecture
Founded in 1930 · 95 years of leadership in construction education & research
knuba.edu.ua
Join the work

Three ways in

01

Annotate texts

Work through pending NER tasks in the annotation platform — highlight material, tool, process, measurement, structure and safety spans in real construction documents.

02

Submit source text

Share construction documents you have rights to distribute: DBN excerpts, specifications, estimates, field notes. Sources are reviewed by an admin and turned into annotation tasks.

03

Record dialect speech

Contribute 30-second to 2-minute recordings of on-site speech in your local dialect. Each recording gets an anonymous speaker ID, requires clear consent, and contributes to an open, dialect-balanced corpus.

Where it comes from

Regional corpus coverage

Coverage map: oblasts covered (of 25), speech records, hours of audio, unique speakers.
Versus alternatives

Why BUDOVA

Criterion · BUDOVA · UberText 2.0 · CC-100 UK · UA-GEC
Speech + text
NER annotations · Partial
Multi-dialect speech · Planned: 27 regions
Annotation platform
What we under-represent

Honest limitations

Well represented
Concrete & reinforced structures: 92%
Normative formal register: 88%
Central dialect region: 81%
Under-represented
Western dialect region: 28%
Field speech register: 24%
Female speakers: 18%
Recent

Latest releases

v0.4 · Apr 2026
Platform hardening. Password reset flow, inter-annotator agreement aggregation endpoint, coverage/provenance APIs, redesigned auth pages, email delivery via ACS…
Current
v0.3 · Apr 2026
Azure deployment complete: custom domain budov.org, nightly Postgres backups, Application Insights, Playwright smoke tests in CI/CD, PostHog product analytics…
v0.2 · Mar 2026
Annotation platform v1: task creation, NER annotator with span editing, skipped-task unskip flow, admin panel, speech upload + in-browser recording, lexicon edi…
Collaboration

Join BUDOVA

We are looking for researchers, construction professionals, and language specialists to participate in the project.

Supported by
Microsoft AI for Good Lab