Positioning
How BUDOVA compares
BUDOVA against other Ukrainian-language corpora, on the criteria that matter for domain NLP research.
| Criterion | BUDOVA | UberText 2.0 | CC-100 UK | UA-GEC |
|---|---|---|---|---|
| Domain coverage (construction terminology) | Dedicated | General web | General web | Grammar errors |
| Speech + text | | | | |
| NER annotations (8 entity types) | Partial | | | |
| Inter-annotator agreement | Fleiss κ reported | Not reported | Not applicable | κ reported |
| Multi-dialect speech | Planned: 27 regions | | | |
| License | CC-BY 4.0 | CC-BY 4.0 | Mixed / fair use | CC-BY 4.0 |
| Size (approx.) | Target: 100M tokens, 100h speech | 6B tokens | ~2B tokens | 30k sentences |
| Annotation platform (open contribution) | | | | |
| Hugging Face availability | | | | |
Corpus sizes are as of the latest release. BUDOVA targets narrow-domain depth, while general-purpose corpora target breadth. The two complement each other for downstream Ukrainian NLP.