Transparency

Honest bias audit

We publish what the corpus over- and under-represents, so downstream models inherit known blind spots instead of hidden ones.

Well represented

~78% of records

Concrete & reinforced structures (DBN V.2.6-98 family): 92%
Normative formal register (ДБН / ДСТУ text): 88%
Central dialect region (Kyiv, Cherkasy, Vinnytsia): 81%
Material terminology (NER label: material): 76%
Structural engineering (load-bearing, foundations): 72%

Under-represented

~22% of records

Western dialect region (Zakarpattia, Halychyna, Bukovyna): 28%
Field speech register (on-site informal): 24%
Female speakers (speech subset gender balance): 18%
Restoration / heritage (domain underrepresented): 14%
Southern dialect region (Odesa, Mykolaiv, Kherson): 11%

How we measure this. The strength ratio is the share of the corpus in which a label, register, or region is present, compared with the balanced expectation (uniform across 27 regions, uniform across 8 NER types, and a 50/50 gender split). We publish the underweighted categories up front.
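
As a rough illustration of the measurement described above, the sketch below divides an observed share by the share expected under a uniform split. It assumes that "vs." means a simple ratio; the page does not say how the published percentages are derived from it. The function name `strength_ratio` and all counts are hypothetical and are not BUDOVA figures.

```python
def strength_ratio(count, total, n_categories):
    """Observed share of the corpus divided by the uniform expected share.

    Assumption: the ratio is a plain division; a value above 1 means the
    category is over-represented, a value below 1 means under-represented.
    """
    observed = count / total
    expected = 1 / n_categories
    return observed / expected


# Hypothetical counts, for illustration only.
total_records = 10_000
region_counts = {"region_a": 900, "region_b": 120}

# Balanced expectation for regions: uniform across 27 regions.
region_ratios = {
    name: strength_ratio(count, total_records, 27)
    for name, count in region_counts.items()
}

# Balanced expectation for gender: a 50/50 split, i.e. 2 categories.
female_ratio = strength_ratio(180, 1_000, 2)

for name, ratio in region_ratios.items():
    status = "over" if ratio > 1 else "under"
    print(f"{name}: {ratio:.2f} ({status}-represented)")
print(f"female speakers: {female_ratio:.2f}")
```

Under this reading, a region holding 9% of records against an expected 1/27 (about 3.7%) scores well above 1, while a 50/50 expectation makes an 18% share score well below 1.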

How we close the gap. The annotation platform deliberately boosts tasks from underweighted categories in the queue, and the Contribute pathways for speech explicitly prioritise underrepresented regions.
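
One simple way to boost underweighted categories in a task queue is to order tasks by the strength ratio of their category, lowest first. The sketch below is only an assumption about how such a boost could work; the page does not describe the platform's actual mechanism. The `build_queue` function, the task fields, and the ratio values are all hypothetical.

```python
import heapq


def build_queue(tasks, ratios):
    """Return tasks ordered so the lowest-ratio categories come first.

    Tasks whose category has no known ratio are treated as balanced (1.0).
    The enumeration index breaks ties, so arrival order is kept within a
    category and the task dicts themselves are never compared.
    """
    heap = [
        (ratios.get(task["category"], 1.0), index, task)
        for index, task in enumerate(tasks)
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Hypothetical strength ratios and tasks, for illustration only.
ratios = {"central": 2.4, "western": 0.6, "southern": 0.3}
tasks = [
    {"id": 1, "category": "central"},
    {"id": 2, "category": "southern"},
    {"id": 3, "category": "western"},
    {"id": 4, "category": "southern"},
]

queue = build_queue(tasks, ratios)
ordered_ids = [task["id"] for task in queue]
print(ordered_ids)
```

A strict ordering like this would starve well-represented categories entirely while underweighted tasks remain, so a production queue would more likely mix in some weighting or sampling; that choice is outside what the page states.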

Collaboration

Join BUDOVA

We are looking for researchers, construction professionals, and language specialists to participate in the project.

Supported by
Microsoft AI for Good Lab