Transparency

Honest bias audit

We publish what the corpus over- and under-represents, so downstream models inherit known blind spots instead of hidden ones.

Well represented

~78% of records

Concrete & reinforced structures (DBN V.2.6-98 family): 92%

Normative formal register (DBN / DSTU text): 88%

Central dialect region (Kyiv, Cherkasy, Vinnytsia): 81%

Material terminology (NER label: material): 76%

Structural engineering (load-bearing, foundations): 72%

Under-represented

~22% of records

Western dialect region (Zakarpattia, Halychyna, Bukovyna): 28%

Field speech register (on-site informal): 24%

Female speakers (speech subset gender balance): 18%

Restoration / heritage (domain underrepresented): 14%

Southern dialect region (Odesa, Mykolaiv, Kherson): 11%

How we measure this. Strength ratio = the share of the corpus in which a label, register or region is present, relative to the balanced expectation (uniform across 27 regions, uniform across 8 NER types, 50/50 gender split). We publish the underweighted categories up front.
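As a rough illustration of the ratio described above, the sketch below divides each category's observed share by a uniform expected share. The function name, the record layout (a list of dicts) and the `field` argument are assumptions made for this example. The page does not say how the published percentages are normalised, so the values computed here may not match them.

```python
from collections import Counter


def strength_ratios(records, field, n_categories):
    """Return observed share / uniform expected share per category value.

    Hypothetical helper: `records` is assumed to be a list of dicts and
    `n_categories` the size of the balanced set (e.g. 27 regions,
    8 NER types, 2 genders). Records missing `field` are skipped.
    """
    counts = Counter(r[field] for r in records if r.get(field) is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    expected_share = 1.0 / n_categories
    return {
        value: (count / total) / expected_share
        for value, count in counts.items()
    }
```

Under this reading, a ratio of 1.0 means a category sits exactly at the balanced expectation, and values below 1.0 mark it as underweighted. Categories with no records at all do not appear in the result and would need to be added with a ratio of 0.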

How we close the gap. The annotation platform deliberately boosts tasks from underweighted categories in the queue, and the Contribute pathways for speech explicitly prioritise underrepresented regions.
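One simple way such queue boosting could work is to order tasks by the inverse of their category's strength, so weaker categories surface first. This is only a sketch under stated assumptions: the platform's actual mechanism is not described here, and `boost_queue`, the task layout, the `category` key and the `floor` parameter are all hypothetical.

```python
def boost_queue(tasks, strength, floor=0.05):
    """Order tasks so those from weaker categories come first.

    Hypothetical helper: `tasks` is assumed to be a list of dicts with a
    "category" key, and `strength` a mapping from category to a strength
    value in (0, 1]. Unknown categories are treated as balanced (1.0).
    `floor` caps the boost for categories with near-zero strength.
    """
    def priority(task):
        s = max(strength.get(task["category"], 1.0), floor)
        return 1.0 / s  # lower strength gives higher priority

    # sorted() is stable, so tasks with equal priority keep their order.
    return sorted(tasks, key=priority, reverse=True)
```

For example, with strengths of 0.81 for a central-region category, 0.28 for a western one and 0.11 for a southern one, the southern task would be served first and the central task last.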


Join BUDOVA