Positioning

How BUDOVA compares

Against other Ukrainian-language corpora, across the criteria that matter for domain NLP research.

Criterion	BUDOVA	UberText 2.0	CC-100 UK	UA-GEC
Domain coverage (construction terminology)	Dedicated	General web	General web	Grammar errors
Speech + text
NER annotations (8 entity types)		Partial
Inter-annotator agreement	Fleiss κ reported	Not reported	Not applicable	κ reported
Multi-dialect speech	Planned: 27 regions
License	CC-BY 4.0	CC-BY 4.0	Mixed / fair use	CC-BY 4.0
Size (approx.)	Target: 100M tokens, 100h speech	6B tokens	~2B tokens	30k sentences
Annotation platform (open contribution)
Hugging Face availability

Corpus sizes as of latest release. BUDOVA targets narrow-domain depth; general-purpose corpora target breadth. The two complement each other for downstream Ukrainian NLP.

Join BUDOVA