Evaluation
Benchmark leaderboard
Model performance on the BUDOVA held-out test set. The leaderboard ships alongside v1.0 with full fine-tune runs; the numbers below are indicative targets, not measured results.
Preview: v1.0 will have real runs.

| # | Model | Description | NER F1 | Perplexity | Term acc. |
|---|---|---|---|---|---|
| 1 | BUDOVA-XLM-R-base (ours) | Domain-adapted XLM-R, LoRA fine-tune on BUDOVA v1.0 | 0.891 (+0.27) | 11.2 (-18.10) | 0.880 (+0.22) |
| 2 | Liberta-UK-large | Ukrainian LM with supervised NER head | 0.782 (+0.16) | 15.4 (-13.90) | 0.740 (+0.08) |
| 3 | XLM-R-large (zero-shot) | No BUDOVA fine-tune | 0.651 (+0.03) | 22.4 (-6.90) | 0.520 (-0.14) |
| 4 | mBERT-base | Multilingual baseline | 0.612 | 26.8 (-2.50) | 0.480 (-0.18) |
| 5 | GPT-4o (few-shot) | 5-shot prompt, no fine-tune | 0.598 | n/a | 0.610 (-0.05) |
| 6 | Random baseline | Uniform label assignment | 0.072 | n/a | 0.160 |
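
The page does not define how the three columns are computed, so the following is only a minimal sketch under assumed, commonly used definitions: NER F1 as micro-averaged exact-match F1 over entity spans, perplexity as the exponential of the mean negative log-likelihood per token, and term accuracy as the fraction of exact term matches. The function names (`span_f1`, `perplexity`, `term_accuracy`) and the span tuple layout are hypothetical and are not taken from the BUDOVA evaluation code. Under these definitions, higher is better for F1 and term accuracy, and lower is better for perplexity.

```python
import math


def span_f1(gold_spans, pred_spans):
    """Micro-averaged F1 over entity spans (assumed definition).

    Each span is a hashable tuple such as (sentence_id, start, end, label).
    A prediction counts as correct only if the whole tuple matches.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs):
    """exp(mean negative natural-log probability per token) (assumed definition)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def term_accuracy(gold_terms, pred_terms):
    """Fraction of positions where the predicted term equals the gold term."""
    if len(gold_terms) != len(pred_terms):
        raise ValueError("gold and predicted lists must have the same length")
    if not gold_terms:
        raise ValueError("need at least one term")
    correct = sum(g == p for g, p in zip(gold_terms, pred_terms))
    return correct / len(gold_terms)


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    gold = [(0, 0, 2, "TERM"), (0, 5, 6, "ORG")]
    pred = [(0, 0, 2, "TERM"), (0, 3, 4, "ORG")]
    print(span_f1(gold, pred))
    print(perplexity([math.log(0.25)] * 4))
    print(term_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))
```

Exact span matching is the stricter of the usual NER scoring choices; a token-level or partial-match F1 would generally give higher numbers for the same predictions, so the scoring mode matters when comparing against the table.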