Positioning

How BUDOVA compares

Against other Ukrainian-language corpora, across the criteria that matter for domain NLP research.

Criterion | BUDOVA | UberText 2.0 | CC-100 UK | UA-GEC
Domain coverage (construction terminology) | Dedicated | General web | General web | Grammar errors
Speech + text
NER annotations | 8 entity types | Partial
Inter-annotator agreement | Fleiss κ reported | Not reported | Not applicable | κ reported
Multi-dialect speech | Planned: 27 regions
License | CC-BY 4.0 | CC-BY 4.0 | Mixed / fair use | CC-BY 4.0
Size (approx.) | Target: 100M tokens, 100h speech | 6B tokens | ~2B tokens | 30k sentences
Annotation platform | Open contribution
Hugging Face availability

Corpus sizes as of latest release. BUDOVA targets narrow-domain depth; general-purpose corpora target breadth. The two complement each other for downstream Ukrainian NLP.

Collaboration

Join BUDOVA

We are looking for researchers, construction professionals, and language specialists to participate in the project.

Supported by
Microsoft AI for Good Lab