Transparency
Honest bias audit
We publish what the corpus over- and under-represents, so downstream models inherit known blind spots instead of hidden ones.
Well represented (~78% records)
Concrete & reinforced structures (DBN V.2.6-98 family): 92%
Normative formal register (ДБН / ДСТУ text): 88%
Central dialect region (Kyiv, Cherkasy, Vinnytsia): 81%
Material terminology (NER label: material): 76%
Structural engineering (Load-bearing, foundations): 72%
Under-represented (~22% records)
Western dialect region (Zakarpattia, Halychyna, Bukovyna): 28%
Field speech register (On-site informal): 24%
Female speakers (Speech subset gender balance): 18%
Restoration / heritage (Domain underrepresented): 14%
Southern dialect region (Odesa, Mykolaiv, Kherson): 11%
How we measure this. Strength ratio = share of corpus where a label, register or region is present vs. the balanced expectation (uniform across 27 regions, uniform across 8 NER types, 50/50 gender split). We publish the underweighted categories up front.
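As a rough illustration of the measurement described above, here is a minimal sketch. It assumes the simplest reading of "vs.": the observed share divided by the balanced expectation. The names `BALANCED_EXPECTATION` and `strength_ratios` are hypothetical, the data is a toy example rather than corpus figures, and the way the published percentages are normalised is not stated here, so the output scale may differ from the numbers shown on this page.

```python
from collections import Counter

# Balanced expectations taken from the text: uniform over 27 regions,
# uniform over 8 NER types, and a 50/50 gender split.
BALANCED_EXPECTATION = {
    "region": 1 / 27,
    "ner_type": 1 / 8,
    "gender": 1 / 2,
}


def strength_ratios(values, dimension):
    """Return observed share / balanced expectation for each category.

    ASSUMPTION: "vs." in the definition is read as a plain division.
    1.0 means balanced; below 1.0 means under-represented.
    Categories that never occur in `values` are not listed.
    """
    expected = BALANCED_EXPECTATION[dimension]
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n / total) / expected for cat, n in counts.items()}


# Toy data for illustration only, not corpus figures.
speakers = ["male"] * 3 + ["female"] * 1
ratios = strength_ratios(speakers, "gender")
```

Under this reading, a category that holds exactly its balanced share scores 1.0, so the under-represented list would be everything that falls clearly below that line.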
How we close the gap. The annotation platform deliberately boosts tasks from underweighted categories in the queue, and the Contribute pathways for speech explicitly prioritise underrepresented regions.
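One way such queue boosting could work is weighted sampling, where a task's chance of being served grows as its category's strength shrinks. The sketch below is an assumption about the general approach, not the platform's actual scheduler; `boost_weight`, `sample_queue`, the `floor` parameter, and the strength values are all hypothetical.

```python
import random


def boost_weight(strength, floor=0.05):
    """Inverse-strength weight: weaker categories get larger weights.

    `floor` is a hypothetical guard against dividing by very small values.
    """
    return 1.0 / max(strength, floor)


def sample_queue(tasks, strength_by_category, k, seed=None):
    """Draw up to k tasks without replacement, favouring weak categories.

    tasks: list of (task_id, category) pairs.
    strength_by_category: category -> strength; categories missing from
    the mapping are treated as balanced (1.0).
    """
    rng = random.Random(seed)
    pool = list(tasks)
    picked = []
    while pool and len(picked) < k:
        weights = [
            boost_weight(strength_by_category.get(cat, 1.0)) for _, cat in pool
        ]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        picked.append(pool.pop(idx))
    return picked


# Illustrative values only.
tasks = [(1, "central"), (2, "western"), (3, "southern"), (4, "central")]
strength = {"central": 0.8, "western": 0.25, "southern": 0.1}
batch = sample_queue(tasks, strength, k=2, seed=0)
```

Sampling rather than strict sorting keeps well-represented tasks in circulation while still tilting the queue toward the gaps; a hard sort by strength would be the more aggressive alternative.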