LINGUA Open Call · Microsoft AI for Good

Building Ukrainian Construction Language for AI

BUDOVA: Building Ukrainian Domain-Specific, Open Voice & Text Archives — the first open Ukrainian construction-domain language dataset for speech and text to advance AI technologies in Ukraine.

65+ hours of speech (target)

65M+ text tokens (target)

341B USD reconstruction

CC-BY 4.0 license

BUDOVA Pipeline (01 / 04)

Input tokens: арматурний · каркас · фундаменту ("reinforcement cage of the foundation")

Entities: MAT арматурний · ELM каркас · ELM фундаменту

Initiative

LINGUA Open Call

A Microsoft AI for Good Lab initiative under EU Digital Unlock, supporting digital inclusion for low-resource European languages and building language resources for 10 European languages.

Microsoft

Up to $50,000 Funding

Grant support for language data collection for low-resource European languages.

Azure Compute Resources

Cloud computing credits for up to 2 years for data processing and validation.

Technical Support

Research collaboration with AI for Good Lab, EPFL, and ETH Zürich.

Open Models

Integration with Apertus, EuroLLM, SmolLM3 and other multilingual models.

Challenge

Why This Is Critical

Despite 30–46 million speakers, Ukrainian remains critically underrepresented in AI technologies — especially in specialized domains.

<0.6%

of web content in Ukrainian

Digital Exclusion

No technical language datasets, speech recognition systems for construction sites, or AI tools supporting Ukrainian building codes.

3 dialect groups, 15+ subdialects

Dialect Diversity

Northern, Southwestern (Volynian, Galician, Podilian, Transcarpathian, Bukovynian, Hutsul, Boiko, Lemko), and Southeastern dialect groups remain undocumented in technical contexts — AI systems risk failure for non-standard speakers.

15–25%

performance gap

Morphological Complexity

Cyrillic script, 7 cases, 3 genders, and movable stress pose distinctive tokenization challenges, with a 15–25% performance gap vs. English.

What We Build

Open Resources for Construction

The first comprehensive Ukrainian technical language dataset with dialectal coverage across all major regions of Ukraine.

Speech

Speech Dataset

100+ hours of annotated construction speech with transcriptions, covering Northern, Southwestern, and Southeastern Ukrainian dialects.

<25% WER

>0.70 Cohen's κ

Text

Text Corpus

100M+ tokens of construction documentation (DBN codes, technical specifications, and safety protocols) with NER annotations.

>0.85 NER F1

5–10K new terms
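The NER F1 target is usually read as an entity-level score, where a predicted span counts only if both its boundaries and its label match a gold span. The sketch below assumes that exact-match convention; the function name, the span format, and the example spans are hypothetical and are not taken from BUDOVA's evaluation scripts.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start, end, label); hypothetical span format


def entity_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    """Exact-match, entity-level F1 over two collections of spans."""
    gold_set, pred_set = set(gold), set(pred)
    if not gold_set and not pred_set:
        return 1.0  # nothing to find and nothing predicted
    true_pos = len(gold_set & pred_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative character spans over "арматурний каркас фундаменту"
gold = [(0, 10, "MAT"), (11, 17, "ELM"), (18, 28, "ELM")]
pred = [(0, 10, "MAT"), (11, 17, "ELM"), (18, 28, "MAT")]  # last label is wrong
score = entity_f1(gold, pred)  # 2 of 3 spans match, so precision = recall = 2/3
```

Under this convention a span with the right boundaries but the wrong label counts as both a false positive and a false negative, which is why a single labeling error already pulls the score well below the 0.85 target on a three-entity example.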

Infrastructure

Technical Infrastructure

Baseline ASR and NER models, data processing pipelines, documentation per Datasheets for Datasets standards. Available on Hugging Face & Zenodo.

WAV/FLAC · Lossless

DOI · Persistent

Projected Impact

Speech

100h+

Text

100M+

NER F1

>0.85

ASR WER

<25%

Roadmap

24 Months, 5 Phases

Phased deployment with quarterly releases for community feedback.

01 / 05 · Phase 1

Infrastructure & Preparation

Months 1–3 · Jan – Mar 2026

Infrastructure setup, partnership agreements, GDPR protocols, participant recruitment.

02 / 05 · Phase 2

Data Collection

Months 4–15 · Apr 2026 – Mar 2027

Phased collection: 25h speech + 25M tokens quarterly until reaching 100+h and 100M+ tokens.

03 / 05 · Phase 3

Annotation & Validation

Months 6–18 · Parallel with collection

Two-stage validation: crowdsourced transcription, then expert review. Inter-annotator agreement >0.70.
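Inter-annotator agreement of this kind is commonly reported as Cohen's κ, which corrects raw agreement for the agreement two annotators would reach by chance. A minimal sketch for two annotators labeling the same items follows; the function name and the label sequences are hypothetical, and the project's own aggregation may differ.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(a)
    # Observed agreement: share of items with identical labels
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' label frequencies
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative token-level labels from two annotators
annotator_1 = ["MAT", "ELM", "ELM", "O", "MAT", "O", "O", "ELM"]
annotator_2 = ["MAT", "ELM", "MAT", "O", "MAT", "O", "ELM", "ELM"]
kappa = cohens_kappa(annotator_1, annotator_2)
passes = kappa > 0.70  # threshold stated for the validation phase
```

In this example the annotators agree on 6 of 8 items (75%), yet κ comes out lower than that because part of the agreement is expected by chance, so the batch would fall short of the 0.70 threshold.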

04 / 05 · Phase 4

Baseline Model Development

Months 12–21 · Dec 2026 – Sep 2027

Training ASR and NER models on collected data. Achieving WER <25% for construction terminology.
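Word error rate (WER) is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The sketch below is a plain dynamic-programming version with whitespace tokenization; the function name and example strings are hypothetical, and a real evaluation would likely normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative pair: the hypothesis gets the case ending of the last word wrong
reference = "арматурний каркас фундаменту"
hypothesis = "арматурний каркас фундамент"
wer = word_error_rate(reference, hypothesis)  # one substitution in three words
```

The example also shows why inflection matters for Ukrainian: a single wrong case ending counts as a full word error, so short technical phrases can exceed the 25% target with one slip.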

05 / 05 · Phase 5

Release & Transfer

Months 21–24 · Sep – Dec 2027

Final dataset release on Hugging Face and Zenodo with DOIs. Documentation, sustainability transition.

Datasets

Open Research Resources

Large-scale datasets and benchmarks for training, evaluating, and testing Ukrainian construction-domain NLP models.

Speech Dataset

100+ hours of speech

Annotated construction speech recordings with transcriptions covering all major Ukrainian dialects — on-site discussions, briefings, and consultations from across Ukraine.

WAV / MP3 / WebM · 48 kHz mono · Target SNR ≥ 20 dB

Hugging Face
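The target SNR listed for the speech dataset can be checked per recording by comparing the mean power of a speech segment with that of a speech-free segment. The sketch below assumes the noise floor is estimated from such a silent stretch and uses synthetic tones as stand-ins for real audio; the function names and constants are hypothetical, not the project's validation code.

```python
import math
from typing import Sequence

TARGET_SNR_DB = 20.0   # target from the speech dataset spec
SAMPLE_RATE = 48_000   # Hz, mono


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of a segment."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Approximate SNR in dB from a speech segment and a noise-only segment."""
    noise_power = mean_power(noise)
    speech_power = mean_power(speech)
    if noise_power == 0:
        return math.inf
    if speech_power == 0:
        return -math.inf
    return 10 * math.log10(speech_power / noise_power)


# Synthetic stand-ins: 0.1 s of a 220 Hz tone at two amplitudes
n = SAMPLE_RATE // 10
tone = [math.sin(2 * math.pi * 220 * t / SAMPLE_RATE) for t in range(n)]
speech = [0.5 * s for s in tone]
noise = [0.04 * s for s in tone]
snr = snr_db(speech, noise)
meets_target = snr >= TARGET_SNR_DB
```

Because the speech segment of a real recording also contains the background noise, this ratio slightly overstates the true SNR at low levels; it is meant as a quick acceptance check, not a calibrated measurement.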

Text Corpus

100M+ text tokens

Construction documentation with NER annotations — DBN codes, technical specifications, safety protocols, and project documentation.

JSON Lines · NER annotations · Structured

Hugging Face
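A JSON Lines corpus stores one JSON object per line, which makes NER records easy to stream and validate. The sketch below builds one such record for the phrase and MAT/ELM tags shown in the pipeline demo; the field names (`id`, `lang`, `text`, `entities`, `start`, `end`, `label`) are assumptions for illustration, not the published BUDOVA schema.

```python
import json

text = "арматурний каркас фундаменту"
labels = ["MAT", "ELM", "ELM"]  # tags as shown in the pipeline demo

# Derive character offsets from the text instead of hard-coding them
entities = []
cursor = 0
for token, label in zip(text.split(), labels):
    start = text.index(token, cursor)
    end = start + len(token)
    entities.append({"start": start, "end": end, "label": label, "text": token})
    cursor = end

record = {
    "id": "example-0001",  # hypothetical identifier
    "lang": "uk-UA",
    "text": text,
    "entities": entities,
}

# ensure_ascii=False keeps Cyrillic readable; each record occupies one line
line = json.dumps(record, ensure_ascii=False)

# Round-trip check: every offset pair must reproduce its surface form
restored = json.loads(line)
offsets_ok = all(
    restored["text"][e["start"]:e["end"]] == e["text"]
    for e in restored["entities"]
)
```

Storing the surface form next to the offsets is redundant, but it gives a cheap integrity check when records pass through tools that may alter whitespace or Unicode normalization.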

Construction Lexicon

5–10K domain terms

Domain-specific terminology covering building materials, techniques, safety standards, and regulatory language across construction sub-domains.

Bilingual · Structured · Searchable

Hugging Face

Language: Ukrainian (uk-UA) · Northern, Southwestern & Southeastern dialects

License: CC-BY 4.0 (data) · Apache 2.0 (models) · MIT (code)

Hosting: Hugging Face · Zenodo (DOI) · GitHub

Privacy: GDPR · Voice anonymization · PII removal

Examples

Live examples from the corpus.

How BUDOVA annotates Ukrainian construction language — entity, register, and context labels (illustrative examples; the fine-tuned model ships with v1.0).

«Monolithic reinforced-concrete load-bearing structures of elevated responsibility class must be designed accounting for seismic loads per DBN V.1.1-12:2014.»

Register: Formal · normative
Material: Monolithic reinforced concrete
Source: DBN V.1.1-12:2014
NER tags: 4 entities · 0 ambiguities

Open Resources

Full Transparency. No Restrictions.

All outputs — data, models, code — under open licenses.

Datasets

Speech and text corpora with full annotations, metadata, and a construction lexicon of 5–10K terms.

CC-BY 4.0 · Hugging Face · Zenodo

Baseline Models

Fine-tuned ASR model (Whisper / Wav2Vec2) and NER models for the construction domain with reproducible scripts.

Apache 2.0 · GitHub · HF Models

Infrastructure

Processing pipelines, anonymization utilities, validation workflows — for replication by other low-resource projects.

MIT License · GitHub · Docs

Ready to Get Started?

Download our datasets from Hugging Face and start building with Ukrainian construction language data.

How to Cite

If you use BUDOVA resources in your research, please cite:

@misc{budova2026,
  title     = {BUDOVA: Building Ukrainian Domain-Specific, Open Voice \& Text Archives},
  author    = {Dolhopolov, Serhii},
  year      = {2026},
  publisher = {Hugging Face \& Zenodo},
  license   = {CC-BY-4.0},
  url       = {https://huggingface.co/datasets/budova}
}

Deliverables

Project Outputs

Concrete research outputs and open resources produced throughout the project lifecycle.

01 / 04

3 datasets

Dataset Releases

Versioned speech and text datasets published on Hugging Face and Zenodo with full documentation and DOIs.

25% · ETA Q4 2026

Infrastructure & GDPR protocols
First 25h speech collected
Text corpus v0.1

02 / 04

2+ models

Baseline Models

Fine-tuned ASR (Whisper / Wav2Vec2) and NER models for Ukrainian construction domain with reproducible scripts.

5% · ETA Q2 2027

Model architecture selected
Training pipeline setup
Benchmark evaluation

03 / 04

3+ papers

Scientific Papers

Peer-reviewed publications at NLP and AI conferences documenting methodology, benchmarks, and findings.

10% · ETA Q3 2027

Literature review completed
Methodology paper draft
Conference submission

04 / 04

100% open-source

Open-Source Tools

Processing pipelines, anonymization utilities, validation workflows, and documentation per Datasheets for Datasets standards.

15% · ETA Q4 2027

Repository structure created
Anonymization pipeline
Validation workflows

Research Group

Project Team

An interdisciplinary team with expertise in AI, NLP, construction, cybersecurity, and data management.

01 / 01 · Project Team Lead

Serhii Dolhopolov

KNUCA AI Lab · Ph.D. Student in Computer Science

AI researcher and entrepreneur specializing in Natural Language Processing. Founder of KernelGlide — AI solutions for 10+ clients in construction. Principal Investigator on state-funded research (UAH 2.9M) in multimodal content analysis. Author of the textbook "Modeling AI Tasks" (546 pp.).

Host institution

Kyiv National University of Construction and Architecture

Founded in 1930 · 95 years of leadership in construction education & research

knuba.edu.ua

Join the work

Three ways in

Annotate texts

Work through pending NER tasks in the annotation platform — highlight material, tool, process, measurement, structure and safety spans in real construction documents.

Submit source text

Share construction documents you have rights to distribute: DBN excerpts, specifications, estimates, field notes. Sources are reviewed by an admin and turned into annotation tasks.

Record dialect speech

Contribute 30-second to 2-minute recordings of on-site speech in your local dialect. Each recording is stored under an anonymous speaker ID, requires clear consent, and contributes to an open, dialect-balanced corpus.

Where it comes from

Regional corpus coverage

Oblasts covered: — / 25 · Speech records: — · Hours of audio: — · Unique speakers: —

Versus alternatives

Why BUDOVA

Criterion	BUDOVA	UberText 2.0
Speech + text
NER annotations		Partial
Multi-dialect speech	Planned: 27 regions
Annotation platform

What we under-represent

Honest limitations

Well represented

Concrete & reinforced structures: 92%

Normative formal register: 88%

Central dialect region: 81%

Under-represented

Western dialect region: 28%

Field speech register: 24%

Female speakers: 18%

Recent

Latest releases

v0.4 · Apr 2026 · Current

Platform hardening. Password reset flow, inter-annotator agreement aggregation endpoint, coverage/provenance APIs, redesigned auth pages, email delivery via ACS…

v0.3 · Apr 2026

Azure deployment complete: custom domain budov.org, nightly Postgres backups, Application Insights, Playwright smoke tests in CI/CD, PostHog product analytics…

v0.2 · Mar 2026

Annotation platform v1: task creation, NER annotator with span editing, skipped-task unskip flow, admin panel, speech upload + in-browser recording, lexicon edi…
