Documentation

Technical documentation for the BUDOVA datasets, models, and tools.

Overview

BUDOVA (Building Ukrainian Domain-Specific, Open Voice & Text Archives) is a collection of Ukrainian construction-domain language resources: a speech dataset, a text corpus, and a bilingual lexicon, together with related models and tools. This documentation covers the structure, usage, and technical details of all project resources.

All datasets are hosted on Hugging Face and Zenodo under open licenses.

Speech Dataset

The speech dataset targets 100+ hours of annotated construction speech recordings from across Ukraine, covering the Northern (Polesian), Southwestern (Volynian, Galician, Podilian, Transcarpathian, Bukovynian, Hutsul, Boiko, Lemko), and Southeastern (Slobozhanian, Steppe) dialect groups. The corpus is still being collected; its current size is exposed at /api/coverage.

Accepted upload formats: WAV, MP3, OGG, FLAC, WebM, M4A (≤ 50 MB per file). WAV (16-bit mono PCM at 48 kHz) is the preferred external-DAW export — see docs/recording-spec.md.
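A client can check a file against these limits before uploading. The following is a minimal sketch, not the server's actual validation: the function name check_upload is hypothetical, and it assumes "50 MB" means 50 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Extensions taken from the accepted-format list above.
ACCEPTED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".webm", ".m4a"}
# Assumption: "50 MB" is binary megabytes; the server may count 50 * 10**6 instead.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(no extension)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

The check looks only at the file extension and size; it does not inspect the audio container itself.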

Browser recorder: 48 kHz mono WebM/Opus (or MP4/AAC on Safari), AGC / noise suppression / echo cancellation off.

Quality target: SNR ≥ 20 dB. WAV uploads are analyzed server-side (percentile-energy VAD) and the SNR is stored on the row; below-threshold uploads generate a qualityWarning but are not blocked.
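To illustrate what a percentile-energy estimate can look like, here is a rough sketch. It is not the server's implementation: the frame length, the choice of the 10th and 90th percentiles, and the function names are all assumptions made for this example.

```python
import math


def estimate_snr_db(samples, frame_len=480, noise_pct=10, speech_pct=90):
    """Estimate SNR in dB from per-frame energies.

    Assumptions (not from the BUDOVA server): 480-sample frames (10 ms at
    48 kHz), with the 10th-percentile frame energy standing in for noise
    and the 90th-percentile frame energy standing in for speech.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    if not energies:
        raise ValueError("recording is shorter than one frame")
    energies.sort()

    def percentile(pct):
        # Nearest-rank style lookup on the sorted frame energies.
        idx = min(len(energies) - 1, int(len(energies) * pct / 100))
        return energies[idx]

    noise = max(percentile(noise_pct), 1e-12)   # floor avoids log(0) on digital silence
    speech = max(percentile(speech_pct), 1e-12)
    return 10 * math.log10(speech / noise)


def quality_warning(snr_db, threshold_db=20.0):
    """Flag a recording below the target; the upload itself is never rejected."""
    return snr_db < threshold_db
```

Percentiles are used instead of the minimum and maximum so that a single click or a single silent frame does not dominate the estimate.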

Content types:

On-site discussions and briefings
Safety protocol readings
Technical consultations
Construction planning meetings

Annotations: recordings are transcribed manually in the in-app editor. Speaker IDs are auto-assigned pseudonyms (SPK-NNN). The region tag records the speaker's dialect origin, not the recording location.
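Downstream scripts may want to generate or validate IDs in the SPK-NNN shape. A small sketch follows; it assumes NNN means exactly three decimal digits, which the text above does not confirm, and both helper names are hypothetical.

```python
import re

# Assumption: NNN is exactly three ASCII digits, e.g. SPK-007.
SPEAKER_ID_PATTERN = re.compile(r"SPK-[0-9]{3}")


def format_speaker_id(index: int) -> str:
    """Format a sequential speaker index as a zero-padded SPK-NNN pseudonym."""
    if not 0 <= index <= 999:
        raise ValueError("index does not fit the three-digit SPK-NNN pattern")
    return f"SPK-{index:03d}"


def is_speaker_id(value: str) -> bool:
    """Check that a string matches the SPK-NNN shape and nothing else."""
    return SPEAKER_ID_PATTERN.fullmatch(value) is not None
```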

Text Corpus

The text corpus contains 100M+ tokens of construction documentation with NER annotations.

Document types:

DBN building codes (ДБН)
Technical specifications
Safety protocols
Project documentation
Construction contracts

Format: JSON-lines with the following structure per document:

{
  "id": "doc_001",
  "text": "...",
  "source": "dbn",
  "entities": [
    {"start": 0, "end": 15, "label": "MATERIAL", "text": "..."}
  ],
  "metadata": {"year": 2024, "category": "safety"}
}
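A file in this format can be read line by line with the standard library. The sketch below also checks that each entity's text matches its offsets; it assumes offsets are character-based with an exclusive end, which the structure above does not state. The sample record is made up for illustration and is not taken from the corpus.

```python
import json


def iter_documents(lines):
    """Yield one dict per non-empty JSON-lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def misaligned_entities(doc):
    """Return entities whose text does not equal doc["text"][start:end].

    Assumption: offsets are character-based, start-inclusive, end-exclusive.
    """
    text = doc["text"]
    return [
        ent for ent in doc.get("entities", [])
        if text[ent["start"]:ent["end"]] != ent["text"]
    ]


# Hypothetical record, invented for this example.
sample = json.dumps({
    "id": "doc_example",
    "text": "Арматура класу A500C",
    "source": "dbn",
    "entities": [{"start": 0, "end": 8, "label": "MATERIAL", "text": "Арматура"}],
    "metadata": {"year": 2024, "category": "safety"},
}, ensure_ascii=False)

for doc in iter_documents([sample]):
    print(doc["id"], misaligned_entities(doc))
```

With a real file, pass an open file handle (opened with encoding="utf-8") to iter_documents instead of the one-element list.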

Construction Lexicon

A bilingual (Ukrainian–English) construction terminology dictionary containing 5,000–10,000 domain terms.

Coverage:

Building materials and composites
Construction techniques and methods
Safety standards and equipment
Regulatory and legal terminology
Tools, machinery, and instrumentation

Format: Structured JSON with Ukrainian term, English translation, definition, usage examples, and domain category.
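As an illustration of how such entries might be handled in code, the sketch below models one entry and builds a lookup by Ukrainian term. The JSON key names (term_uk, term_en, and so on) are hypothetical, since the published keys are not listed here, and the sample entry is invented.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    # Field names are hypothetical; the published JSON keys may differ.
    term_uk: str
    term_en: str
    definition: str
    category: str
    examples: list = field(default_factory=list)


def build_index(entries):
    """Map case-folded Ukrainian terms to entries for case-insensitive lookup."""
    return {entry.term_uk.casefold(): entry for entry in entries}


# Invented sample entry, not taken from the lexicon.
raw = json.loads("""
[{"term_uk": "Арматура", "term_en": "rebar",
  "definition": "Steel bars used to reinforce concrete.",
  "category": "materials",
  "examples": ["Застосування арматури класу A500C"]}]
""")
index = build_index(LexiconEntry(**item) for item in raw)
print(index["арматура"].term_en)
```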

Download Instructions

You can access BUDOVA datasets using the Hugging Face datasets library:

pip install datasets

Load the speech dataset:

from datasets import load_dataset

speech = load_dataset("budova/speech-uk-construction")
print(speech["train"][0])

Load the text corpus:

from datasets import load_dataset

corpus = load_dataset("budova/text-uk-construction")
print(corpus["train"].features)

Load the lexicon:

from datasets import load_dataset

lexicon = load_dataset("budova/construction-lexicon")
print(lexicon["train"][0])

Python Usage Examples

Load a Whisper checkpoint and the BUDOVA speech data as the starting point for fine-tuning:

from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Pretrained checkpoint to fine-tune; the processor bundles the feature
# extractor and the tokenizer.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

dataset = load_dataset("budova/speech-uk-construction")
# Preprocessing and training steps are omitted here.
# See the full examples in the GitHub repository.

Run NER on construction text:

from transformers import pipeline

ner = pipeline(
    "ner",
    model="budova/ner-uk-construction",
    aggregation_strategy="simple"
)

text = "Застосування арматури класу A500C"
entities = ner(text)
print(entities)
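With aggregation_strategy="simple", the pipeline returns a list of dicts with the keys entity_group, score, word, start, and end. The sketch below groups such results by label and drops low-confidence spans. The helper name, the 0.5 threshold, the STANDARD label, and the scores are assumptions made for illustration; the example list is hand-written and is not real model output.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.5):
    """Collect entity surface forms per label, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output; labels and scores are invented.
example_output = [
    {"entity_group": "MATERIAL", "score": 0.97, "word": "арматури", "start": 13, "end": 21},
    {"entity_group": "STANDARD", "score": 0.41, "word": "A500C", "start": 28, "end": 33},
]
print(group_by_label(example_output))
# {'MATERIAL': ['арматури']}
```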

License

BUDOVA resources are released under the following licenses:

Datasets: CC-BY 4.0
Models: Apache 2.0
Code: MIT License

When using BUDOVA resources, please cite the project. See the Open Data section for the BibTeX citation.