Annotation protocol

Guidelines

The public rulebook for annotating BUDOVA. Every rule is citable by anchor — deep-link individual conventions in papers and discussions.

Version: v0.2 · Sections: 12 · Rules: 43 · CC-BY 4.0

Scope

Guidelines for annotating Ukrainian construction-domain text with named-entity, register, and dialect metadata.

scope-1

What the corpus covers

BUDOVA covers construction-domain Ukrainian across four registers: normative documents (ДБН / ДСТУ), specifications, estimates, and field speech. Out-of-domain text is rejected before annotation.

Монолітні залізобетонні конструкції з класом бетону C25/30.
Учорашній матч чемпіонату України. (Out of domain; reject at intake.)
scope-2

What is annotated

Every source fragment receives NER labels (8 entity types), a register label, and — for speech samples — dialect region and speaker ID.

Text + NER spans + register=normative + source=ДБН В.1.1-12:2014
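
The record shape described above can be sketched as a small data structure. This is only an illustration: the class names, field names, and lowercase register values are assumptions, not the published BUDOVA schema, and the source value is copied from the schematic line above as a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The eight entity types and four registers named in these guidelines.
ENTITY_TYPES = {"material", "tool", "process", "measurement",
                "structure", "safety", "regulation", "property"}
REGISTERS = {"normative", "specification", "estimate", "field"}


@dataclass
class Span:
    start: int   # inclusive character offset (hypothetical convention)
    end: int     # exclusive character offset
    label: str


@dataclass
class Fragment:
    text: str
    register: str
    source: str
    spans: List[Span] = field(default_factory=list)
    # Only filled for speech samples.
    dialect_region: Optional[str] = None
    speaker_id: Optional[str] = None


def validate(fragment: Fragment) -> List[str]:
    """Return a list of human-readable problems; empty means the record is well-formed."""
    problems = []
    if fragment.register not in REGISTERS:
        problems.append(f"unknown register: {fragment.register}")
    for span in fragment.spans:
        if span.label not in ENTITY_TYPES:
            problems.append(f"unknown label: {span.label}")
        if not (0 <= span.start < span.end <= len(fragment.text)):
            problems.append(f"bad offsets: {span.start}-{span.end}")
    return problems


text = "Монолітні залізобетонні конструкції з класом бетону C25/30."
code_start = text.index("C25/30")
example = Fragment(
    text=text,
    register="normative",
    source="ДБН В.1.1-12:2014",  # placeholder; not a claim that the sentence occurs there
    spans=[Span(code_start, code_start + len("C25/30"), "measurement")],
)
```

Calling validate(example) returns an empty list for a well-formed record, so it could serve as a cheap pre-submission check.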
scope-3

What is not annotated

Figure captions, formulas, table headers, and numeric-only fragments without surrounding context are excluded. Annotate sentences that carry domain information, not typography.

Табл. 3 (Skip table markers.)
ΣM_x = 0 (Skip isolated formulas.)

Entity types

Eight NER labels cover the BUDOVA corpus. The definitions below are the ground truth for every annotator.

entities-material

material

Construction substances, composites, and named products — concrete, rebar, insulation, window units, drywall, cement. Multi-word canonical names are kept whole.

монолітний залізобетон
двокамерний склопакет
бетон (A bare noun without a qualifier is borderline; accept it if it is canonical in context.)
entities-tool

tool

Machinery, equipment, and instruments used on-site — excavator, crane, drill, trowel, scaffolding, formwork.

баштовий кран
зварювальний апарат
entities-process

process

Construction activities — casting, welding, installation, plastering, drilling, waterproofing. Prefer verbal-noun forms.

бетонування при температурі нижче +5 °C
роблять бетон (Colloquial verb form; normalize to "бетонування".)
entities-measurement

measurement

Numeric quantities with units, class codes, dimensions, and concentration percentages. Always mark the unit as part of the span.

переріз 400×400 мм
клас бетону C25/30
400 (A bare number without a unit is not a measurement.)
entities-structure

structure

Structural and architectural elements — foundations, columns, beams, slabs, roofs, stairs, ramps.

несуча стіна підвалу
фундаментна плита
entities-safety

safety

Safety equipment, procedures, and protective measures — PPE, helmets, harnesses, fire extinguishers, evacuation routes, protective railings. Standard citations go to regulation; numeric protection classes go to property.

засоби індивідуального захисту
аварійне освітлення шляхів евакуації
ДБН А.3.1-5:2016 (A standard citation is regulation, not safety.)
entities-regulation

regulation

Normative references — ДБН, ДСТУ, СОУ, ISO and EN standards, Eurocodes, technical specifications (ТУ). Keep the full code with the year as one span (see boundaries-3).

ДБН В.2.2-40:2018
ДСТУ EN 81-70
entities-property

property

Named technical properties and performance classes — fire-resistance class, thermal conductivity, sound-insulation index, load-bearing capacity, strength grade. A numeric value with its unit is a measurement; the property name itself is property.

клас вогнестійкості R120
несуча здатність
0.045 Вт/(м·К) (A value with its unit is measurement, not property.)

Span boundaries

Where the annotation starts and ends — the single most common source of disagreement.

boundaries-1

Include modifiers

When a domain adjective or quantifier changes the meaning of the head noun, include it in the span.

монолітний залізобетон ("monolithic" is domain-salient; dropping the modifier changes the sense.)
boundaries-2

Exclude articles and fillers

Do not include bare articles, prepositions, or discourse fillers at the edges of a span.

у фундаменті (Annotate "фундаменті" only; the preposition "у" stays outside the span.)
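
As a rough illustration of this rule, span edges can be trimmed automatically before submission. The sketch below assumes character offsets with an exclusive end; the EDGE_WORDS list is a short illustrative sample, not an official stop list, and trim_span is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Illustrative sample of prepositions and discourse fillers (assumption, not exhaustive).
EDGE_WORDS = {"у", "в", "на", "з", "із", "під", "до", "для", "за", "ну", "от"}


def trim_span(text: str, start: int, end: int) -> Optional[Tuple[int, int]]:
    """Shrink [start, end) so it neither begins nor ends with an edge word.

    Words inside the span are left untouched. Returns None when nothing is left.
    """
    tokens = [
        (start + m.start(), start + m.end(), m.group().lower())
        for m in re.finditer(r"\S+", text[start:end])
    ]
    while tokens and tokens[0][2] in EDGE_WORDS:
        tokens.pop(0)
    while tokens and tokens[-1][2] in EDGE_WORDS:
        tokens.pop()
    if not tokens:
        return None
    return tokens[0][0], tokens[-1][1]


sentence = "арматура у фундаменті"
raw_start = sentence.index("у фундаменті")
trimmed = trim_span(sentence, raw_start, len(sentence))
```

Here trimmed covers only "фундаменті". A preposition in the middle of a span, as in "плита під колоною", is not removed, because only the edges are inspected.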
boundaries-3

Keep canonical multi-word terms whole

Fixed collocations (e.g. ДБН + code, class + number) are never split across annotations.

ДБН В.1.1-12:2014 (One regulation span; the code and the year are not split.)

Nested entities

When one entity appears inside another.

nested-1

Annotate the outer span

If a longer material name contains a numeric class, mark the full material span; the measurement stays inside and is not separately annotated.

бетон класу C25/30 (One material span; C25/30 is part of it. Splitting the annotation creates ambiguity.)
nested-2

Split at natural boundaries

When two distinct entities abut, give each its own span.

фундаментна плита під колоною

Dialectal and colloquial terms

Handling regional vocabulary and informal speech.

dialect-1

Annotate the surface form

Use the exact dialectal word as it appears. Canonical mapping is a separate metadata field, not a substitution.

ґіпсокартон (Western variant; keep "ґ".)
шпахлівка (Dialectal; map to "шпаклівка" in metadata.)
dialect-2

Tag dialectal origin

For speech samples, attach region + dialect group (Northern / Southwestern / Southeastern — the standard three-group Ukrainian taxonomy) to the utterance, not to individual spans.

Region tag is an utterance-level field in the speech subset schema.

Recording protocol

How to capture speech samples so they pass the corpus-quality bar. Server-side checks measure SNR and peak for WAV uploads; the in-browser recorder enforces format. Full contributor spec: docs/recording-spec.md.

record-1

Device & permissions

Use a dedicated mic if possible (USB condenser, lavalier, decent phone). Wear headphones to prevent speaker bleed. Grant microphone permission to the browser when prompted — the platform stops the audio track when recording ends, so the indicator should go away.

AGC / noise suppression / echo cancellation are turned off by the in-browser recorder. If you use an external DAW, disable those too.
record-2

Room conditions & SNR

Record in a quiet room (no HVAC blast, no traffic). Take 3–5 seconds of silence first and check the noise floor on your meter. Corpus target is SNR ≥ 20 dB; the upload response includes a warning when a WAV upload measures below that target.

Indoor office, doors closed, fan off → typical SNR 30–40 dB.
Construction site without a windscreen → may fall below 15 dB; note the noise level in source_ref.
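
A contributor can estimate SNR locally before uploading. The sketch below is one simple way to do it, assuming the leading silence is used as the noise reference and SNR is taken as the RMS ratio of the speech segment to that noise segment; the server-side measurement may differ, and the function names are hypothetical.

```python
import math
from typing import Sequence

TARGET_SNR_DB = 20.0  # corpus target stated in this rule


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in dB, using the silent lead-in as the noise estimate."""
    noise_level = rms(noise)
    if noise_level == 0:
        return float("inf")  # digital silence: no measurable noise floor
    return 20 * math.log10(rms(speech) / noise_level)


def meets_target(speech: Sequence[float], noise: Sequence[float]) -> bool:
    return snr_db(speech, noise) >= TARGET_SNR_DB


# Toy data: speech at amplitude 1000, room noise at amplitude 10.
demo_snr = snr_db([1000, -1000, 1000, -1000], [10, -10, 10, -10])
```

With these toy amplitudes the ratio is 100, which corresponds to 40 dB.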
record-3

Levels — peak −12 to −3 dBFS

Loud enough to be clearly above noise, quiet enough not to clip. Aim for peaks between −12 and −3 dBFS. The server rejects nothing on level alone but stores peak_dbfs so clipped takes can be filtered at export.

Peak ≥ −0.1 dBFS → clipped, distortion baked in.
Peak ≤ −40 dBFS → too quiet, low effective SNR.
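
The level bands above can be checked with a few lines of code. This is a sketch for 16-bit PCM samples; the function names and the "off target" label for levels between the named bands are assumptions, and the server's own peak_dbfs computation is not documented here.

```python
import math
from typing import Sequence


def peak_dbfs(samples: Sequence[int], bits: int = 16) -> float:
    """Peak level relative to full scale for signed integer PCM samples."""
    full_scale = 2 ** (bits - 1)
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak / full_scale)


def classify_peak(db: float) -> str:
    """Map a peak level to the bands named in this rule."""
    if db >= -0.1:
        return "clipped"
    if db <= -40:
        return "too quiet"
    if -12 <= db <= -3:
        return "on target"
    return "off target"  # usable, but outside the recommended window


half_scale = peak_dbfs([16384, -12000, 500])
```

A peak at half of full scale is about −6 dBFS, which sits inside the recommended window.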
record-4

Format & duration

Prefer mono 48 kHz 16-bit PCM WAV for external-DAW uploads. The in-browser recorder produces 48 kHz mono WebM/Opus at 64 kbit/s (Safari falls back to MP4/AAC). One utterance per file, 30 s – 3 min. Max 50 MB.

A 90-second mono WAV at 48 kHz of one speaker explaining a concrete pour.
A 12-minute stereo file with three speakers and background TV.
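
For external-DAW uploads, the preferred WAV format can be verified with Python's standard wave module before sending the file. This is a sketch only: check_wav is a hypothetical helper, it treats every deviation as a warning rather than a rejection, and it assumes "50 MB" means 50 × 1024 × 1024 bytes.

```python
import io
import wave
from typing import List

MAX_BYTES = 50 * 1024 * 1024   # assumption: binary megabytes
MIN_SECONDS, MAX_SECONDS = 30, 180


def check_wav(data: bytes) -> List[str]:
    """Return warnings for a PCM WAV file given as bytes; empty list means it fits the spec.

    Raises wave.Error if the data is not a readable PCM WAV file.
    """
    warnings = []
    if len(data) > MAX_BYTES:
        warnings.append("file is larger than 50 MB")
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getnchannels() != 1:
            warnings.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getframerate() != 48000:
            warnings.append(f"expected 48 kHz, got {wav.getframerate()} Hz")
        if wav.getsampwidth() != 2:
            warnings.append(f"expected 16-bit, got {8 * wav.getsampwidth()}-bit")
        duration = wav.getnframes() / wav.getframerate()
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        warnings.append(f"duration {duration:.1f} s is outside 30 s to 3 min")
    return warnings
```

A 3-minute mono 48 kHz 16-bit file is roughly 17 MB, so the size limit only comes into play for stereo or over-long takes.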
record-5

One speaker per file

Each recording should have a single speaker. If you are interviewing someone, split into one file per speaker. If a true dialogue is captured, note it in source_ref ("2 speakers, foreman + worker").

speaker_id is auto-assigned per contributor; mixing speakers inside one file breaks that assumption.
record-6

PII & content

No full names, addresses, phone numbers, license plates, or personal medical info. Dialectal vocabulary is welcome; tag your region so dialect researchers can find it. Construction-domain only.

"…шпахлівку накладали в три проходи з шліфуванням між ними."
"Я, Іван Петренко, проживаю за адресою…" (PII; either reject the recording or re-record without identifying details.)

Brand vs material

Distinguishing product brands from generic substance names.

brand-1

Brand stays with the material

When a brand name qualifies a material, annotate the full span as material and store the brand in a sub-field.

склопакет REHAU 70 (Material with brand-qualified subtype; do not split into two entities.)
brand-2

Standalone brand → skip

A brand name without a construction-material context is not annotated.

REHAU (Alone in the text; no material context.)

Numeric values

Dimensions, classes, percentages, and compound measurements.

numeric-1

Always include the unit

A number without unit is not a measurement span. Include mm/cm/m/МПа/°C/%/class-code suffix in the annotation.

30 мм
25 МПа
30 (A bare number; not a measurement span.)
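
A quick automated check for this rule might look like the sketch below. It only tests whether a candidate span contains a number followed by one of the units listed above, or a class code written in Latin letters; the unit tuple, the regular expressions, and the function name are illustrative assumptions rather than the validator used by the platform.

```python
import re

# Units named in this rule; longer units first so "мм" is tried before "м".
UNITS = ("МПа", "мм", "см", "°C", "м", "%")
_UNIT_ALT = "|".join(re.escape(u) for u in UNITS)

# A digit, optional space, then a unit that is not the start of a longer word.
_NUMBER_WITH_UNIT = re.compile(rf"\d\s?(?:{_UNIT_ALT})(?!\w)")
# Class codes such as C25/30, XC3, B400C (Latin letters assumed).
_CLASS_CODE = re.compile(r"(?<!\w)[A-Z]{1,2}\d+(?:/\d+)?[A-Z]?(?!\w)")


def has_unit_or_class_code(span_text: str) -> bool:
    """True when the span carries a unit or class code, not just a bare number."""
    return bool(_NUMBER_WITH_UNIT.search(span_text) or _CLASS_CODE.search(span_text))
```

For example, has_unit_or_class_code("30 мм") holds while has_unit_or_class_code("30") does not. The check says nothing about whether measurement is the right label, only whether the span is more than a bare number.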
numeric-2

Dimension products

For multi-axis dimensions (400×400 mm), keep the entire product in one span. Same for tolerance ranges.

400×400 мм
±5 мм
numeric-3

Class codes

Class labels (C25/30, XC3, B400C) are measurement spans; include the prefix letter and slash.

класу C25/30
клас експозиції XC3

Register labels

Classifying the stylistic register of each source fragment.

register-1

Normative

ДБН / ДСТУ / СОУ text and other standards — formal legal register with explicit numeric thresholds and "must/shall" modals.

Конструкції повинні проєктуватися з урахуванням…
register-2

Specification

Product datasheets, technical specs — numeric-heavy, brand references common, imperative and descriptive sentences.

Віконний блок ПВХ серії REHAU 70, розміри 1400×1600 мм.
register-3

Estimate

Cost estimates, work-item lines, unit-rate tables. Recognized by work-type verb phrases and per-unit coefficients.

Улаштування покриття з металочерепиці, коефіцієнт 1.08.
register-4

Field

On-site speech and informal field notes — dialectal vocabulary, approximate numbers, discourse fillers.

Стяжку лили в чотири, ну п’ять з хвостиком.

Ambiguity

What to do when a span could fit more than one label.

ambiguity-1

Prefer narrow label

If a term could be a material OR a structure, choose whichever is narrower in the current sentence.

"Стіна" is labeled structure (the element) both in "бетонна стіна" and in "виготовлення стін".
ambiguity-2

Flag for review

Tasks where two annotators disagree after the first pass go to the admin-review queue with a "needs adjudication" flag.

The agreement metadata records per-word agreement; items below the threshold are surfaced in the admin dashboard. See the Agreement & merging section.

Agreement & merging

How several annotators’ work becomes one dataset entry, and how the published agreement numbers are computed. Methodology version 2 (June 2026); entries merged earlier carry v1 metadata and are marked as such in exports.

agreement-1

Majority-vote consensus

Each task is annotated independently by N annotators (default 3). Spans are projected onto characters; a character keeps an entity label when at least ⌈N/2⌉ annotators assigned it. Consensus spans whose average per-character support falls below 0.6 are dropped from the merged entry — they stay visible in the per-word agreement table.

2 of 3 annotators mark "ДСТУ EN 81-70" as regulation → the span survives with agreement 0.67.
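
The merging rule can be sketched in a few lines. This is an illustration under stated assumptions, not the production merger: spans are (start, end, label) tuples with an exclusive end, a later span overwrites an earlier overlapping one from the same annotator, and ties between labels are broken by first occurrence.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]


def char_labels(length: int, spans: List[Span]) -> List[Optional[str]]:
    """Project one annotator's spans onto character positions."""
    labels: List[Optional[str]] = [None] * length
    for start, end, label in spans:
        for i in range(start, end):
            labels[i] = label
    return labels


def merge(length: int, annotations: List[List[Span]], min_support: float = 0.6):
    """Majority vote per character, then drop consensus spans with weak support."""
    n = len(annotations)
    needed = math.ceil(n / 2)
    grids = [char_labels(length, spans) for spans in annotations]

    consensus = []  # (label or None, share of annotators) for every character
    for i in range(length):
        votes = Counter(g[i] for g in grids if g[i] is not None)
        label, count = votes.most_common(1)[0] if votes else (None, 0)
        consensus.append((label, count / n) if count >= needed else (None, 0.0))

    merged = []
    i = 0
    while i < length:
        label = consensus[i][0]
        if label is None:
            i += 1
            continue
        j = i
        while j < length and consensus[j][0] == label:
            j += 1
        support = sum(consensus[k][1] for k in range(i, j)) / (j - i)
        if support >= min_support:
            merged.append((i, j, label, round(support, 2)))
        i = j
    return merged


text = "ДСТУ EN 81-70"
full = (0, len(text), "regulation")
result = merge(len(text), [[full], [full], []])
```

For the example above, result holds a single regulation span over the whole string with support 0.67, matching the 2-of-3 case.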
agreement-2

Fleiss’ κ over the entity window

The headline coefficient is Fleiss’ κ computed only over the entity window — character positions where at least one annotator marked an entity. Unannotated text does not enter the calculation, so the value is comparable between a two-sentence fragment and a multi-page clause: the same labeling pattern yields the same κ regardless of text length.

Full-text character-level κ (v1) is still stored as fleissKappa for reference, but it is length-sensitive: the dominant "outside" class inflates expected agreement on long texts and makes κ unstable on short ones.
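
To make the entity-window idea concrete, here is a minimal sketch of Fleiss' κ restricted to positions where at least one annotator marked an entity. It assumes each annotator is given as a list of per-character labels with None for unannotated text, and that "outside" counts as a category inside the window; the function name is hypothetical and the published numbers may be computed with additional details.

```python
import math
from collections import Counter
from typing import List, Optional

OUTSIDE = "O"  # category for "no entity" votes inside the window (assumption)


def entity_window_kappa(grids: List[List[Optional[str]]]) -> float:
    """Fleiss' kappa over characters that at least one annotator labeled."""
    n = len(grids)
    if n < 2:
        raise ValueError("need at least two annotators")
    length = len(grids[0])
    window = [i for i in range(length) if any(g[i] is not None for g in grids)]
    if not window:
        return 1.0  # everyone agreed there are no entities

    totals: Counter = Counter()
    agreement_sum = 0.0
    for i in window:
        counts = Counter(g[i] if g[i] is not None else OUTSIDE for g in grids)
        totals.update(counts)
        # Share of annotator pairs that agree on this character.
        agreement_sum += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))

    observed = agreement_sum / len(window)
    expected = sum((c / (len(window) * n)) ** 2 for c in totals.values())
    if math.isclose(expected, 1.0):
        return 1.0  # only one category was ever used: perfect agreement
    return (observed - expected) / (1 - expected)


short = [["m", "m"], ["m", "m"], ["m", None]]
padded = [g + [None] * 50 for g in short]  # same labels inside a longer text
```

Both short and padded give the same value, which illustrates the length-invariance claim: padding with unannotated characters does not change the window.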
agreement-3

Span-level F1

For every annotator pair, spans match when they share a label and overlap with IoU ≥ 0.5. The mean pairwise F1 measures whether annotators find the same entities — complementing κ, which measures positional consistency.
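
The matching rule can be illustrated as follows. The sketch assumes (start, end, label) spans with an exclusive end, greedy one-to-one matching, and an F1 of 1.0 when both annotators marked nothing; those choices are assumptions, since the rule above does not specify them.

```python
from itertools import combinations
from typing import List, Tuple

Span = Tuple[int, int, str]


def iou(a: Span, b: Span) -> float:
    """Intersection over union of two character ranges."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def pair_f1(x: List[Span], y: List[Span], threshold: float = 0.5) -> float:
    """F1 between two annotators; each span can be matched at most once."""
    if not x and not y:
        return 1.0  # assumption: agreeing on "no entities" counts as perfect
    used = set()
    matched = 0
    for a in x:
        for j, b in enumerate(y):
            if j not in used and a[2] == b[2] and iou(a, b) >= threshold:
                used.add(j)
                matched += 1
                break
    return 2 * matched / (len(x) + len(y))


def mean_pairwise_f1(annotations: List[List[Span]]) -> float:
    """Average F1 over all unordered annotator pairs."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(pair_f1(x, y) for x, y in pairs) / len(pairs)


a = [(0, 13, "regulation")]
b = [(0, 13, "regulation")]
c = [(5, 13, "regulation")]  # shorter span, IoU 8/13 with the others
```

With a, b, and c as above, every pair matches because 8/13 is above the 0.5 threshold, so the mean pairwise F1 is 1.0 even though the boundaries differ.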

agreement-4

Composite validation score

validation_score = 0.35 · max(0, κ_entity) + 0.65 · F1. Span detection is weighted higher because it is what NER training quality depends on. Entries are colour-coded: ≥ 0.7 high, 0.4–0.7 medium, < 0.4 low (queued for adjudication).
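
The formula and colour bands translate directly into code. The function and band names below are illustrative; only the weights and thresholds come from the rule above.

```python
def validation_score(kappa_entity: float, f1: float) -> float:
    """0.35 * max(0, kappa) + 0.65 * F1, as defined in this rule."""
    return 0.35 * max(0.0, kappa_entity) + 0.65 * f1


def band(score: float) -> str:
    """Colour band for a composite score."""
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"  # queued for adjudication


# Negative kappa is clamped to zero, so only the F1 term contributes here.
clamped = validation_score(-0.2, 1.0)
```

Because κ is clamped at zero, an entry with perfect span agreement but negative κ still scores 0.65 and lands in the medium band.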

agreement-5

Interpreting κ

κ ≤ 0: no better than chance; 0–0.2 slight; 0.2–0.4 fair; 0.4–0.6 moderate; 0.6–0.8 substantial; > 0.8 almost perfect. κ = 1 with an empty entity window means all annotators agreed the fragment contains no entities.
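
For dashboards or export scripts, the interpretation bands can be wrapped in a small lookup. The sketch assumes boundary values belong to the lower band (0.4 counts as fair, 0.8 as substantial) and labels values between 0 and 0.2 as "slight", following the usual Landis and Koch naming; both choices are assumptions.

```python
def interpret_kappa(kappa: float) -> str:
    """Verbal label for a kappa value; upper bounds are inclusive."""
    if kappa <= 0:
        return "no better than chance"
    if kappa <= 0.2:
        return "slight"
    if kappa <= 0.4:
        return "fair"
    if kappa <= 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "almost perfect"
```

For instance, interpret_kappa(0.67) gives "substantial".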

References

External documents that formalize the domain conventions above.

ref-1

Primary standards

ДБН В.1.1-12:2014 (seismic design), ДБН В.2.6-98:2009 (concrete structures), ДБН А.3.1-5:2016 (organization of construction).

ref-2

NER label taxonomy

The eight-label set adapts the OntoNotes-style named-entity taxonomy to the construction domain. Full rationale in the project paper (v1.0).

ref-3

Cite this document

When citing BUDOVA guidelines: "BUDOVA Annotation Guidelines v0.2 (2026)". Deep-link individual rules with the anchor, e.g. /guidelines#entities-material.

Supported by
Microsoft AI for Good Lab