Annotation protocol
Guidelines
The public rulebook for annotating BUDOVA. Every rule is citable by anchor — deep-link individual conventions in papers and discussions.
Version: v0.2 | Sections: 12 | Rules: 43 | License: CC-BY 4.0
Scope
Guidelines for annotating Ukrainian construction-domain text with named-entity, register, and dialect metadata.
scope-1
What the corpus covers
BUDOVA covers construction-domain Ukrainian across four registers: normative documents (ДБН / ДСТУ), specifications, estimates, and field speech. Out-of-domain text is rejected before annotation.
Монолітні залізобетонні конструкції з класом бетону C25/30.
Учорашній матч чемпіонату України. Out of domain: reject at intake.
scope-2
What is annotated
Every source fragment receives NER labels (8 entity types), a register label, and — for speech samples — dialect region and speaker ID.
Text + NER spans + register=normative + source=ДБН В.1.1-12:2014
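The record above could be represented roughly as follows. This is a minimal sketch: the class and field names (`Span`, `Fragment`, `dialect_region`, `speaker_id`, and so on) are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # an NER label such as "material" or "measurement"


@dataclass
class Fragment:
    text: str
    register: str                          # e.g. "normative"
    source: str                            # e.g. "ДБН В.1.1-12:2014"
    spans: list[Span] = field(default_factory=list)
    dialect_region: Optional[str] = None   # speech samples only
    speaker_id: Optional[str] = None       # speech samples only

    def span_text(self, span: Span) -> str:
        """Return the surface text covered by a span."""
        return self.text[span.start:span.end]


text = "Монолітні залізобетонні конструкції з класом бетону C25/30."
frag = Fragment(text=text, register="normative", source="ДБН В.1.1-12:2014")

# Mark the class code as a measurement span.
start = text.index("C25/30")
frag.spans.append(Span(start, start + len("C25/30"), "measurement"))
```

Keeping the speech-only fields optional lets one record type serve both written and spoken fragments.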
scope-3
What is not annotated
Figure captions, formulas, table headers, and numeric-only fragments without surrounding context are excluded. Annotate sentences that carry domain information, not typography.
Табл. 3: skip table markers.
ΣM_x = 0: skip isolated formulas.
Entity types
Eight NER labels cover the BUDOVA corpus. The definitions below are the ground truth for every annotator.
entities-material
material
Construction substances, composites, and named products — concrete, rebar, insulation, window units, drywall, cement. Multi-word canonical names are kept whole.
монолітний залізобетон
двокамерний склопакет
бетон: a bare noun without a qualifier is borderline; accept it if it is canonical in context.
entities-tool
tool
Machinery, equipment, and instruments used on-site — excavator, crane, drill, trowel, scaffolding, formwork.
роблять бетон: colloquial verb form; normalize to "бетонування".
entities-measurement
measurement
Numeric quantities with units, class codes, dimensions, and concentration percentages. Always mark the unit as part of the span.
переріз 400×400 мм
клас бетону C25/30
400: a bare number without a unit is not a measurement.
entities-structure
structure
Structural and architectural elements — foundations, columns, beams, slabs, roofs, stairs, ramps.
несуча стіна підвалу
фундаментна плита
entities-safety
safety
Safety equipment, procedures, and protective measures — PPE, helmets, harnesses, fire extinguishers, evacuation routes, protective railings. Standard citations go to regulation; numeric protection classes go to property.
засоби індивідуального захисту
аварійне освітлення шляхів евакуації
ДБН А.3.1-5:2016 is a standard citation, so it is regulation, not safety.
entities-regulation
regulation
Normative references — ДБН, ДСТУ, СОУ, ISO and EN standards, Eurocodes, technical specifications (ТУ). Keep the full code with the year as one span (see boundaries-3).
ДБН В.2.2-40:2018
ДСТУ EN 81-70
entities-property
property
Named technical properties and performance classes — fire-resistance class, thermal conductivity, sound-insulation index, load-bearing capacity, strength grade. A numeric value with its unit is a measurement; the property name itself is property.
клас вогнестійкості R120
несуча здатність
0.045 Вт/(м·К): a value with its unit is measurement, not property.
Span boundaries
Where the annotation starts and ends — the single most common source of disagreement.
boundaries-1
Include modifiers
When a domain adjective or quantifier changes the meaning of the head noun, include it in the span.
монолітний залізобетон: "monolithic" is domain-salient, so it stays in the span.
монолітний залізобетон: marking only "залізобетон" loses the modifier and changes the sense.
boundaries-2
Exclude articles and fillers
Do not include bare articles, prepositions, or discourse fillers at the edges of a span.
у фундаменті: the span covers "фундаменті" only.
у фундаменті: a span that includes the preposition "у" is incorrect.
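A pre-submission check for this rule might trim filler tokens from both edges of a span. The sketch below is only an illustration: the `EDGE_FILLERS` stop list and the `trim_span` helper are hypothetical, since the guidelines do not publish a list of excluded words.

```python
# Hypothetical stop list of prepositions and fillers; not part of the guidelines.
EDGE_FILLERS = {"у", "в", "на", "з", "до", "під", "ну"}


def trim_span(text: str, start: int, end: int) -> tuple[int, int]:
    """Shrink [start, end) so it neither starts nor ends with a filler token."""
    tokens = []  # (token_start, token_end) for each whitespace-separated token
    pos = start
    for tok in text[start:end].split():
        tok_start = text.index(tok, pos)
        tokens.append((tok_start, tok_start + len(tok)))
        pos = tok_start + len(tok)

    def is_filler(bounds: tuple[int, int]) -> bool:
        return text[bounds[0]:bounds[1]].lower() in EDGE_FILLERS

    # Drop fillers from the left edge, then from the right edge.
    while tokens and is_filler(tokens[0]):
        tokens.pop(0)
    while tokens and is_filler(tokens[-1]):
        tokens.pop()

    if not tokens:
        return start, start  # nothing but fillers: return an empty span
    return tokens[0][0], tokens[-1][1]


text = "у фундаменті"
new_start, new_end = trim_span(text, 0, len(text))
trimmed = text[new_start:new_end]
```

Only edge tokens are removed, so a preposition inside a longer span is left untouched.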
boundaries-3
Keep canonical multi-word terms whole
Terms with a fixed collocation (e.g. ДБН + code, class + number) never split across annotations.
ДБН В.1.1-12:2014 is annotated as one span.
Splitting ДБН В.1.1-12:2014 across separate annotations is incorrect.
Nested entities
When one entity appears inside another.
nested-1
Annotate the outer span
If a longer material name contains a numeric class, mark the full material span; the measurement stays inside and is not separately annotated.
бетон класу C25/30: one material span; C25/30 is part of it.
бетон класу C25/30: splitting it into separate annotations creates ambiguity.
nested-2
Split at natural boundaries
When two distinct entities abut, give each its own span.
фундаментна плита під колоною
Dialectal and colloquial terms
Handling regional vocabulary and informal speech.
dialect-1
Annotate the surface form
Use the exact dialectal word as it appears. Canonical mapping is a separate metadata field, not a substitution.
ґіпсокартон: Western variant; keep "ґ".
шпахлівка: dialectal; map to "шпаклівка" in metadata.
dialect-2
Tag dialectal origin
For speech samples, attach region + dialect group (Northern / Southwestern / Southeastern — the standard three-group Ukrainian taxonomy) to the utterance, not to individual spans.
Region tag is an utterance-level field in the speech subset schema.
Recording protocol
How to capture speech samples so they pass the corpus-quality bar. Server-side checks measure SNR and peak for WAV uploads; the in-browser recorder enforces format. Full contributor spec: docs/recording-spec.md.
record-1
Device & permissions
Use a dedicated mic if possible (USB condenser, lavalier, decent phone). Wear headphones to prevent speaker bleed. Grant microphone permission to the browser when prompted — the platform stops the audio track when recording ends, so the indicator should go away.
AGC / noise suppression / echo cancellation are turned off by the in-browser recorder. If you use an external DAW, disable those too.
record-2
Room conditions & SNR
Record in a quiet room (no HVAC blast, no traffic). Take 3–5 seconds of silence first and check the noise floor on your meter. Corpus target is SNR ≥ 20 dB; the upload response includes a warning when a WAV measures below.
Indoor office, doors closed, fan off → typical SNR 30–40 dB.
Construction site without a windscreen → may fall below 15 dB; note the noise level in source_ref.
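A rough local SNR estimate can be made from the leading silence and a speech segment. This sketch assumes a plain RMS ratio; the server's actual measurement method is not described here, so results may differ. The function names are illustrative.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(noise, speech):
    """SNR in dB from a silence-only segment and a speech segment."""
    return 20 * math.log10(rms(speech) / rms(noise))


def below_target(snr, target_db=20.0):
    """True when the take falls under the corpus target of 20 dB."""
    return snr < target_db


# Toy data: the speech amplitude is ten times the noise-floor amplitude.
noise = [0.01, -0.01] * 100
speech = [0.1, -0.1] * 100
estimate = snr_db(noise, speech)
```

Because the speech segment also contains the room noise, this estimate is slightly optimistic for noisy takes.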
record-3
Levels — peak −12 to −3 dBFS
Loud enough to be clearly above noise, quiet enough not to clip. Aim for peaks between −12 and −3 dBFS. The server rejects nothing on level alone but stores peak_dbfs so clipped takes can be filtered at export.
Peak ≥ −0.1 dBFS → clipped, distortion baked in.
Peak ≤ −40 dBFS → too quiet, low effective SNR.
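The level bands above can be checked on 16-bit samples with a few lines of Python. This is a sketch: the function names and the "usable" label for takes that are neither flagged nor on target are assumptions, and only `peak_dbfs` as a stored value is mentioned by the guidelines.

```python
import math


def peak_dbfs(samples, full_scale=32768):
    """Peak level in dBFS for integer PCM samples (16-bit by default)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(peak / full_scale)


def classify_peak(db):
    """Map a peak level to the bands described in the guidelines."""
    if db >= -0.1:
        return "clipped"
    if db <= -40:
        return "too quiet"
    if -12 <= db <= -3:
        return "on target"
    return "usable"  # hypothetical label: outside the target, not flagged


level = peak_dbfs([120, -16384, 900])  # half of full scale, about -6 dBFS
verdict = classify_peak(level)
```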
record-4
Format & duration
Prefer mono 48 kHz 16-bit PCM WAV for external-DAW uploads. The in-browser recorder produces 48 kHz mono WebM/Opus at 64 kbit/s (Safari falls back to MP4/AAC). One utterance per file, 30 s – 3 min. Max 50 MB.
A 90-second mono WAV at 48 kHz of one speaker explaining a concrete pour.
A 12-minute stereo file with three speakers and background TV.
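For WAV uploads, the preferred format can be checked locally with Python's standard `wave` module before sending the file. This is a sketch: `check_wav` and `make_wav` are hypothetical helpers, and treating "50 MB" as 50 MiB is an assumption. It reports deviations from the preferred format rather than deciding acceptance.

```python
import io
import wave

MAX_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" is read as 50 MiB


def check_wav(data: bytes) -> list[str]:
    """Return a list of deviations from the preferred WAV format."""
    problems = []
    if len(data) > MAX_BYTES:
        problems.append("file larger than 50 MB")
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("not mono")
        if wav.getframerate() != 48000:
            problems.append("sample rate is not 48 kHz")
        if wav.getsampwidth() != 2:
            problems.append("not 16-bit PCM")
        duration = wav.getnframes() / wav.getframerate()
    if not 30 <= duration <= 180:
        problems.append("duration outside 30 s to 3 min")
    return problems


def make_wav(seconds: float, channels: int = 1, rate: int = 48000) -> bytes:
    """Build a silent 16-bit WAV in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate) * channels)
    return buf.getvalue()


good = check_wav(make_wav(30))
bad = check_wav(make_wav(1, channels=2, rate=44100))
```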
record-5
One speaker per file
Each recording should have a single speaker. If you are interviewing someone, split into one file per speaker. If a true dialogue is captured, note it in source_ref ("2 speakers, foreman + worker").
speaker_id is auto-assigned per contributor; mixing speakers inside one file breaks that assumption.
record-6
PII & content
No full names, addresses, phone numbers, license plates, or personal medical info. Dialectal vocabulary is welcome; tag your region so dialect researchers can find it. Construction-domain only.
"…шпахлівку накладали в три проходи з шліфуванням між ними."
"Я, Іван Петренко, проживаю за адресою…" contains PII; either reject the recording or re-record without identifying details.
Brand vs material
Distinguishing product brands from generic substance names.
brand-1
Brand stays with the material
When a brand name qualifies a material, annotate the full span as material and store the brand in a sub-field.
склопакет REHAU 70: material with brand-qualified subtype.
склопакет + REHAU 70: do not split into two entities.
brand-2
Standalone brand → skip
A brand name without a construction-material context is not annotated.
REHAU: alone in the text, with no material context.
Numeric values
Dimensions, classes, percentages, and compound measurements.
numeric-1
Always include the unit
A number without a unit is not a measurement span. Include the unit (mm, cm, m, МПа, °C, %) or the class-code suffix in the annotation.
30 мм
25 МПа
30: a bare number, not a measurement span.
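A simple shape check can catch bare numbers before a measurement span is submitted. The regular expressions below are an illustrative sketch, not the project's validator: the unit list is limited to the units named in this rule, and the class-code pattern (Latin letters only) is an assumption based on the examples C25/30, XC3, and B400C. The check looks at the shape of the string only and cannot tell a measurement class from a property class.

```python
import re

UNIT = r"(?:мм|см|м|МПа|°C|%)"
NUMBER = r"±?\d+(?:[.,]\d+)?"

# A number (or a product such as 400×400) followed by a unit.
WITH_UNIT = re.compile(rf"^{NUMBER}(?:\s*[×x]\s*{NUMBER})*\s*{UNIT}$")

# A class code: one or two Latin letters, digits, optional "/digits", optional letter.
CLASS_CODE = re.compile(r"^[A-Z]{1,2}\d+(?:/\d+)?[A-Z]?$")


def is_measurement(span: str) -> bool:
    """True when the span text has the shape of a measurement."""
    s = span.strip()
    return bool(WITH_UNIT.match(s) or CLASS_CODE.match(s))


examples = ["30 мм", "25 МПа", "30", "400×400 мм", "±5 мм", "C25/30"]
results = {e: is_measurement(e) for e in examples}
```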
numeric-2
Dimension products
For multi-axis dimensions (400×400 mm), keep the entire product in one span. Same for tolerance ranges.
400×400 мм
±5 мм
numeric-3
Class codes
Class labels (C25/30, XC3, B400C) are measurement spans; include the prefix letter and slash.
класу C25/30
клас експозиції XC3
Register labels
Classifying the stylistic register of each source fragment.
register-1
Normative
ДБН / ДСТУ / СОУ text and other standards — formal legal register with explicit numeric thresholds and "must/shall" modals.
register-2
Specification
Віконний блок ПВХ серії REHAU 70, розміри 1400×1600 мм.
register-3
Estimate
Cost estimates, work-item lines, unit-rate tables. Recognized by work-type verb phrases and per-unit coefficients.
Улаштування покриття з металочерепиці, коефіцієнт 1.08.
register-4
Field
On-site speech and informal field notes — dialectal vocabulary, approximate numbers, discourse fillers.
Стяжку лили в чотири, ну п’ять з хвостиком.
Ambiguity
What to do when a span could fit more than one label.
ambiguity-1
Prefer narrow label
If a term could be a material OR a structure, choose whichever is narrower in the current sentence.
"Стіна" → structure (the element) in both "бетонна стіна" and "виготовлення стін".
ambiguity-2
Flag for review
Tasks where two annotators disagree after the first pass go to the admin-review queue with a "needs adjudication" flag.
The agreement metadata records per-word agreement; items below the threshold are surfaced in the admin dashboard. See the Agreement & merging section.
Agreement & merging
How several annotators’ work becomes one dataset entry, and how the published agreement numbers are computed. Methodology version 2 (June 2026); entries merged earlier carry v1 metadata and are marked as such in exports.
agreement-1
Majority-vote consensus
Each task is annotated independently by N annotators (default 3). Spans are projected onto characters; a character keeps an entity label when at least ⌈N/2⌉ annotators assigned it. Consensus spans whose average per-character support falls below 0.6 are dropped from the merged entry — they stay visible in the per-word agreement table.
2 of 3 annotators mark "ДСТУ EN 81-70" as regulation → the span survives with agreement 0.67.
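The merge rule can be sketched as a character-level vote followed by a support filter. This is an illustration under stated assumptions, not the platform's implementation: the function name, the `(start, end, label)` tuple format, and the tie-breaking (first label counted wins) are all hypothetical, and each annotator's spans are assumed not to overlap.

```python
import math
from collections import Counter


def consensus(text_len, annotations, min_support=0.6):
    """annotations: one list of (start, end, label) spans per annotator."""
    n = len(annotations)
    threshold = math.ceil(n / 2)

    # Project every annotator's spans onto character positions.
    votes = [Counter() for _ in range(text_len)]
    for spans in annotations:
        for start, end, label in spans:
            for i in range(start, end):
                votes[i][label] += 1

    # Keep the top label per character when it reaches the vote threshold.
    labels, support = [], []
    for counter in votes:
        label, count = counter.most_common(1)[0] if counter else (None, 0)
        keep = count >= threshold
        labels.append(label if keep else None)
        support.append(count / n if keep else 0.0)

    # Group runs of equal labels into spans; drop those with low mean support.
    merged, i = [], 0
    while i < text_len:
        if labels[i] is None:
            i += 1
            continue
        j = i
        while j < text_len and labels[j] == labels[i]:
            j += 1
        mean = sum(support[i:j]) / (j - i)
        if mean >= min_support:
            merged.append((i, j, labels[i], round(mean, 2)))
        i = j
    return merged


text = "ДСТУ EN 81-70"
full = (0, len(text), "regulation")
merged = consensus(len(text), [[full], [full], []])
```

With three annotators every surviving character already has support of at least 0.67, so the 0.6 filter only removes spans when N is even (for example 2 of 4 votes gives 0.5).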
agreement-2
Fleiss’ κ over the entity window
The headline coefficient is Fleiss’ κ computed only over the entity window — character positions where at least one annotator marked an entity. Unannotated text does not enter the calculation, so the value is comparable between a two-sentence fragment and a multi-page clause: the same labeling pattern yields the same κ regardless of text length.
Full-text character-level κ (v1) is still stored as fleissKappa for reference, but it is length-sensitive: the dominant "outside" class inflates expected agreement on long texts and makes κ unstable on short ones.
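A sketch of the entity-window computation, using the standard Fleiss' κ formula restricted to positions where at least one annotator marked an entity. The function names, the "O" outside label, and the span tuple format are assumptions; returning 1.0 for an empty window follows the convention stated under agreement-5.

```python
from collections import Counter


def char_labels(text_len, spans):
    """Per-character labels for one annotator; "O" marks unannotated text."""
    labels = ["O"] * text_len
    for start, end, label in spans:
        labels[start:end] = [label] * (end - start)
    return labels


def fleiss_kappa_entity_window(text_len, annotations):
    """annotations: one list of (start, end, label) per annotator (at least 2)."""
    n = len(annotations)
    per_rater = [char_labels(text_len, spans) for spans in annotations]

    # Entity window: positions where at least one rater marked an entity.
    window = [i for i in range(text_len)
              if any(r[i] != "O" for r in per_rater)]
    if not window:
        return 1.0  # everyone agreed there are no entities

    totals = Counter()
    p_bar = 0.0
    for i in window:
        counts = Counter(r[i] for r in per_rater)
        totals.update(counts)
        # Observed agreement for this position.
        p_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    p_bar /= len(window)

    # Expected agreement from the category proportions inside the window.
    p_e = sum((c / (len(window) * n)) ** 2 for c in totals.values())
    if p_e == 1.0:
        return 1.0  # a single category was used throughout the window
    return (p_bar - p_e) / (1 - p_e)


pattern = [[(0, 4, "material")], [(0, 4, "material")], [(0, 2, "material")]]
short = fleiss_kappa_entity_window(10, pattern)
long_ = fleiss_kappa_entity_window(1000, pattern)
```

The same labeling pattern gives the same value for a 10-character and a 1000-character text, because positions outside the window never enter the sums.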
agreement-3
Span-level F1
For every annotator pair, spans match when they share a label and overlap with IoU ≥ 0.5. The mean pairwise F1 measures whether annotators find the same entities — complementing κ, which measures positional consistency.
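The pairwise matching can be sketched as follows. Greedy one-to-one matching and a score of 1.0 for two empty annotations are assumptions made for this illustration; the guidelines only fix the label-equality and IoU ≥ 0.5 criteria.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over union of two (start, end, label) character spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def pair_f1(spans_a, spans_b, min_iou=0.5):
    """F1 between two annotators using greedy one-to-one span matching."""
    if not spans_a and not spans_b:
        return 1.0  # assumption: two empty annotations agree perfectly
    unmatched = list(spans_b)
    matches = 0
    for a in spans_a:
        for b in unmatched:
            if a[2] == b[2] and iou(a, b) >= min_iou:
                unmatched.remove(b)
                matches += 1
                break
    # F1 = 2 * TP / (|A| + |B|), symmetric in the two annotators.
    return 2 * matches / (len(spans_a) + len(spans_b))


def mean_pairwise_f1(annotations):
    """Average F1 over every pair of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(pair_f1(a, b) for a, b in pairs) / len(pairs)


a = [(0, 10, "material")]
b = [(0, 8, "material")]   # IoU with a is 0.8, same label: match
c = [(0, 10, "tool")]      # same extent, different label: no match
score = mean_pairwise_f1([a, b, c])
```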
agreement-4
Composite validation score
validation_score = 0.35 · max(0, κ_entity) + 0.65 · F1. Span detection is weighted higher because it is what NER training quality depends on. Entries are colour-coded: ≥ 0.7 high, 0.4–0.7 medium, < 0.4 low (queued for adjudication).
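The formula and colour bands translate directly into code. The function names are illustrative; treating a score of exactly 0.4 as medium is an assumption read off the "< 0.4 low" bound.

```python
def validation_score(kappa_entity: float, f1: float) -> float:
    """Composite score: negative kappa is clamped to zero before weighting."""
    return 0.35 * max(0.0, kappa_entity) + 0.65 * f1


def band(score: float) -> str:
    """Colour band for an entry; low entries are queued for adjudication."""
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"


# Negative kappa contributes nothing, so only F1 counts here.
weak = validation_score(-0.2, 1 / 3)
solid = validation_score(0.8, 0.9)
```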
agreement-5
Interpreting κ
κ ≤ 0: no better than chance; 0–0.2 slight; 0.2–0.4 fair; 0.4–0.6 moderate; 0.6–0.8 substantial; > 0.8 almost perfect. κ = 1 with an empty entity window means all annotators agreed the fragment contains no entities.
References
External documents that formalize the domain conventions above.
The eight-label set adapts an OntoNotes-style named-entity taxonomy to the construction domain. The full rationale is in the project paper (v1.0).
ref-3
Cite this document
When citing BUDOVA guidelines: "BUDOVA Annotation Guidelines v0.2 (2026)". Deep-link individual rules with the anchor, e.g. /guidelines#entities-material.