
Ukrainian Construction Terminology: Bridging Language and Technology

Domain Challenges

Ukrainian construction terminology presents unique challenges for natural language processing. Unlike general-purpose Ukrainian text, construction documents mix formal literary Ukrainian with technical jargon, Soviet-era terminology inherited from Russian-language standards (СНиП), and modern international terms adapted from English. A single concept, such as "reinforced concrete", may appear as залізобетон, ж/б, or армований бетон depending on context and register.

Existing Ukrainian dictionaries and terminology databases cover general vocabulary well but have significant gaps in construction-specific terms. Many specialized terms have no standardized Ukrainian form, so practitioners often fall back on Russian calques or English borrowings. This inconsistency creates concrete problems for NLP systems: tokenizers split compound terms incorrectly, NER models fail to recognize domain entities, and machine translation produces inaccurate or ambiguous output.

BUDOVA addresses these gaps by building a curated, structured lexicon from primary sources: official Ukrainian building codes (ДБН) and university textbooks. Entries are then validated through expert review by practising construction engineers and terminologists.

Lexicon Structure

The BUDOVA construction lexicon is designed as a machine-readable, linguistically rich resource. Each entry contains the Ukrainian term, its English equivalent, morphological information (gender, declension pattern, stress), a definition in Ukrainian, and usage examples drawn from the text corpus. Terms are organized into a hierarchical domain taxonomy with top-level categories: Materials, Structural Elements, Processes, Equipment, Safety, and Regulatory.

The data format is structured JSON optimized for integration with NLP pipelines:

  • term_uk: canonical Ukrainian form with stress marking
  • term_en: English translation
  • pos: part of speech (noun, verb, adjective)
  • morphology: gender, number, declension class
  • definition_uk: Ukrainian-language definition
  • examples: array of attested usage examples with source references
  • domain: hierarchical category path (e.g., Materials > Concrete > Admixtures)
  • synonyms: variant forms, abbreviations, colloquial equivalents
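As a rough illustration of how such an entry might look in code, the sketch below models the fields listed above as a Python dataclass, serializes one entry to JSON, and builds a lookup from variant forms to the canonical term. The field names come from the list; everything else (the class name LexiconEntry, the helper names, the concrete field values, the use of a combining acute accent for stress, and the representation of the domain path as a list) is an assumption made for the example, not the project's actual schema.

```python
import json
from dataclasses import dataclass, field, asdict

STRESS_MARK = "\u0301"  # combining acute accent; assumed stress notation


@dataclass
class LexiconEntry:
    """Hypothetical in-memory form of one lexicon entry."""
    term_uk: str
    term_en: str
    pos: str
    morphology: dict
    definition_uk: str
    examples: list = field(default_factory=list)
    domain: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)


def normalize(form: str) -> str:
    """Drop the stress mark and case differences so variants compare equal."""
    return form.replace(STRESS_MARK, "").casefold()


def build_variant_index(entries):
    """Map every normalized surface form to its canonical term_uk."""
    index = {}
    for e in entries:
        for form in [e.term_uk, *e.synonyms]:
            index[normalize(form)] = e.term_uk
    return index


# Illustrative values only; not copied from the real lexicon.
entry = LexiconEntry(
    term_uk="залізобето\u0301н",
    term_en="reinforced concrete",
    pos="noun",
    morphology={"gender": "masculine", "number": "singular", "declension": "II"},
    definition_uk="(placeholder definition)",
    domain=["Materials", "Concrete"],
    synonyms=["ж/б", "армований бетон"],
)

as_json = json.dumps(asdict(entry), ensure_ascii=False, indent=2)
index = build_variant_index([entry])
print(index[normalize("Ж/Б")] == entry.term_uk)
```

Keeping the stress mark in the canonical form but stripping it for lookup lets the same entry serve both linguistic uses (where stress matters) and text matching (where source documents rarely mark it).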

The lexicon currently contains approximately 3,200 validated entries and is projected to reach 7,000–10,000 by the end of the data collection phase. All entries undergo expert review by at least two domain specialists.

NER Annotation Schema

The lexicon directly informs BUDOVA’s named entity recognition annotation schema. We define four primary construction-domain entity types, each with subtypes that capture the granularity needed for practical NLP applications:

  • MATERIAL: construction materials and composites (e.g., портландцемент, арматура А500С, екструдований пінополістирол)
  • ELEMENT: structural and architectural elements (e.g., несуча стіна, монолітне перекриття, фундаментна плита)
  • PROCESS: construction activities and methods (e.g., бетонування, монтаж опалубки, гідроізоляція)
  • PROPERTY: measurable characteristics and standards (e.g., міцність на стиск, клас вогнестійкості, морозостійкість F150)

Annotations follow the IOB2 tagging scheme and are stored in CoNLL format alongside the tokenized text. Each annotated document includes provenance metadata linking entities back to lexicon entries, enabling automatic consistency checks. The annotation guidelines were developed in consultation with EPFL and ETH Zürich partners to ensure compatibility with multilingual NER benchmarks.
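To make the IOB2 scheme concrete, here is a minimal sketch that converts token-level entity spans into IOB2 tags and writes them in a two-column, tab-separated layout. The two-column layout is an assumption (CoNLL-style files vary in their columns), the function names are hypothetical, and the example sentence is made up for illustration. Provenance metadata is not modelled here.

```python
def spans_to_iob2(n_tokens, spans):
    """Turn (start, end, label) spans with exclusive end into IOB2 tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        # In IOB2 every entity starts with B-, including single-token ones.
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def to_conll(tokens, tags):
    """Serialize one sentence as token<TAB>tag lines."""
    return "\n".join(f"{tok}\t{tag}" for tok, tag in zip(tokens, tags)) + "\n"


# Made-up sentence: "Concreting of the foundation slab has been completed."
tokens = ["Виконано", "бетонування", "фундаментної", "плити", "."]
spans = [(1, 2, "PROCESS"), (2, 4, "ELEMENT")]

tags = spans_to_iob2(len(tokens), spans)
print(to_conll(tokens, tags))
```

Because IOB2 marks the first token of every entity with B-, two adjacent entities (here a PROCESS followed directly by an ELEMENT) remain distinguishable without any extra separator.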

To date, over 15,000 entity mentions have been annotated across 2,400 documents, with an inter-annotator agreement (span-level F1) of 0.91. The annotation team consists of six linguists with construction domain training, supervised by two senior terminologists.
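For readers unfamiliar with span-level F1 as an agreement measure, the sketch below shows one common way to compute it from two annotators' IOB2 tag sequences. It assumes exact matching on span boundaries and label, which the text above does not specify, so treat it as one possible reading rather than the project's actual evaluation procedure. The function names and the toy tag sequences are hypothetical.

```python
def iob2_to_spans(tags):
    """Extract (start, end, label) spans, end exclusive, from IOB2 tags.

    I- tags that do not continue an open span of the same label are ignored.
    """
    spans = set()
    start = label = None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        # Close the open span unless this token continues it.
        if start is not None and tag != f"I-{label}":
            spans.add((start, i, label))
            start = label = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(tags_a, tags_b):
    """Exact-match span F1 between two annotations of the same tokens."""
    a, b = iob2_to_spans(tags_a), iob2_to_spans(tags_b)
    if not a and not b:
        return 1.0  # both annotators marked nothing: full agreement
    return 2 * len(a & b) / (len(a) + len(b))


# Toy example: the annotators agree on the PROCESS span
# but disagree on where the ELEMENT span begins.
annotator_a = ["O", "B-PROCESS", "B-ELEMENT", "I-ELEMENT", "O"]
annotator_b = ["O", "B-PROCESS", "O", "B-ELEMENT", "O"]

print(span_f1(annotator_a, annotator_b))  # 0.5
```

The measure is symmetric in the two annotators, since swapping them exchanges precision and recall without changing their harmonic mean.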