Evaluation

Benchmark leaderboard

Model performance on the BUDOVA held-out test set. The leaderboard ships alongside v1.0 with full fine-tune runs; the numbers below are indicative targets, not measured results.

Preview · v1.0 will have real runs

#	Model	Description	NER F1 (Δ)	Perplexity (Δ)	Term acc. (Δ)
1	BUDOVA-XLM-R-base (ours)	Domain-adapted XLM-R, LoRA fine-tune on BUDOVA v1.0	0.891 (+0.27)	11.2 (-18.10)	0.880 (+0.22)
2	Liberta-UK-large	Ukrainian LM with supervised NER head	0.782 (+0.16)	15.4 (-13.90)	0.740 (+0.08)
3	XLM-R-large (zero-shot)	No BUDOVA fine-tune	0.651 (+0.03)	22.4 (-6.90)	0.520 (-0.14)
4	mBERT-base	Multilingual baseline	0.612	26.8 (-2.50)	0.480 (-0.18)
5	GPT-4o (few-shot)	5-shot prompt, no fine-tune	0.598	n/a	0.610 (-0.05)
6	Random baseline	Uniform label assignment	0.072	n/a	0.160
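
The page does not say how the three metric columns are computed. As a rough sketch only, the conventional definitions look like the following. It assumes F1 is derived from true-positive, false-positive and false-negative entity counts, that perplexity is the exponential of the mean negative log-likelihood per token (natural log), and that "Term acc." is the share of terms predicted exactly right. All function names here are hypothetical and are not part of any BUDOVA tooling.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

Under these definitions a higher F1 and accuracy are better, while a lower perplexity is better, which matches the sign of the deltas in the table.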
