Data Collection Methodology: Building a Construction Speech Corpus

Published: February 1, 2026

Recording Protocol

Capturing authentic construction speech requires carefully designed on-site recording protocols. Our field teams use professional-grade portable recorders (Zoom F6, Sound Devices MixPre-6 II) paired with directional lavalier microphones to isolate speech from heavy machinery noise. Each recording session follows a standardized workflow: environment assessment, equipment calibration, consent documentation, and then both structured and free-form speech elicitation.

Sessions are conducted at active construction sites during various phases — foundation work, structural framing, finishing — to capture the full range of domain vocabulary. We record both planned speech (reading technical documents aloud) and spontaneous speech (on-site discussions, safety briefings, coordination calls) to ensure the corpus reflects real communicative patterns.

All recordings are tagged with metadata including:

Site type: residential, commercial, infrastructure
Construction phase: excavation, structural, MEP, finishing
Noise level: measured in dBA at the recording position
Recording conditions: indoor, outdoor, semi-enclosed
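
The tags above can be expressed as a small typed record. The following is a minimal sketch, not the project's actual schema: the class and field names (RecordingMetadata, noise_level_dba, and so on) are hypothetical, and only the allowed values are taken from the list above.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class SiteType(str, Enum):
    RESIDENTIAL = "residential"
    COMMERCIAL = "commercial"
    INFRASTRUCTURE = "infrastructure"


class Phase(str, Enum):
    EXCAVATION = "excavation"
    STRUCTURAL = "structural"
    MEP = "MEP"
    FINISHING = "finishing"


class Conditions(str, Enum):
    INDOOR = "indoor"
    OUTDOOR = "outdoor"
    SEMI_ENCLOSED = "semi-enclosed"


@dataclass(frozen=True)
class RecordingMetadata:
    """Hypothetical per-recording tag set mirroring the four fields above."""
    site_type: SiteType
    construction_phase: Phase
    noise_level_dba: float  # measured at the recording position
    recording_conditions: Conditions

    def to_dict(self) -> dict:
        # Flatten enum members to their plain string values for export.
        return {
            key: (value.value if isinstance(value, Enum) else value)
            for key, value in asdict(self).items()
        }


# Building the enums from raw strings rejects values outside the lists above
# (for example SiteType("industrial") raises ValueError).
example = RecordingMetadata(
    site_type=SiteType("commercial"),
    construction_phase=Phase("MEP"),
    noise_level_dba=78.5,
    recording_conditions=Conditions("semi-enclosed"),
)
print(example.to_dict())
```

Closed value sets keep typos such as "semi enclosed" out of the corpus at tagging time, which makes later filtering by site type or phase reliable.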

Speaker Recruitment

Speaker diversity is essential for building a robust ASR training corpus. BUDOVA recruits participants across three axes: regional dialect (Northern: Polesian; Southwestern: Volynian, Galician, Podilian, Transcarpathian, Bukovynian, Hutsul, Boiko, Lemko; Southeastern: Slobozhanian, Steppe), professional role (engineers, site supervisors, skilled tradespeople, safety officers), and demographics (age, gender). Our goal is a minimum of 200 unique speakers with balanced representation across all dialect groups.

Recruitment is conducted through partnerships with construction companies, trade unions, and vocational training centres in Kyiv, Lviv, Odesa, Dnipro, Kharkiv, Uzhhorod, Chernivtsi, Ivano-Frankivsk, Lutsk, and other cities. Each participant signs an informed consent form (available in Ukrainian) compliant with GDPR and Ukrainian data protection law, allowing their anonymized speech data to be released under CC BY 4.0.

Speakers receive compensation for their time, and we conduct brief sociolinguistic questionnaires to capture self-reported dialect features, years of construction experience, and primary specialization. This metadata enables downstream analysis of dialect-specific ASR performance.
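
As an illustration of the dialect-specific analysis this metadata makes possible, the sketch below groups word error rate (WER) by self-reported dialect. It is an assumption about how such an analysis could look, not a description of BUDOVA tooling: the function names and the (dialect, reference, hypothesis) input format are hypothetical.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1], len(ref)


def wer_by_dialect(results):
    """results: iterable of (dialect, reference, hypothesis) triples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for dialect, reference, hypothesis in results:
        e, n = word_errors(reference, hypothesis)
        errors[dialect] += e
        words[dialect] += n
    # Pool errors over all utterances of a dialect before dividing, so long
    # and short utterances are weighted by their word counts.
    return {d: errors[d] / words[d] for d in words if words[d] > 0}
```

Pooling errors per dialect, instead of averaging per-utterance WER, avoids giving one-word utterances the same weight as long safety briefings.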

Quality Assurance

What ships today. WAV uploads run through an automated audio probe (server/utils/audio-quality.js) that decodes PCM, computes per-frame RMS with percentile-energy VAD, and reports SNR, peak level (dBFS), noise floor, bit depth, and channel count. The corpus target is SNR ≥ 20 dB. Uploads below that threshold are still stored, but the upload response carries a qualityWarning so they can be filtered at export time rather than silently dropped. The same pipeline runs in CI on every deploy against a synthetic 38.7 dB fixture (tests/fixtures/audio/test-recording.wav), so regressions in the analyzer itself are caught.
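
To make the idea behind the probe concrete, here is a rough sketch of a percentile-energy SNR estimate. It is written in Python for illustration only and is not the shipped JavaScript module. The 20 ms frame length, the 10th and 90th percentile cut-offs, and every name except qualityWarning are assumptions, and the input is taken to be already-decoded mono samples scaled to [-1, 1].

```python
import math


def frame_rms(samples, frame_len):
    """RMS of each non-overlapping frame; a trailing partial frame is skipped."""
    out = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        out.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return out


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (pct in 0..100)."""
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[min(len(sorted_values) - 1, max(0, rank - 1))]


def to_db(amplitude, floor=1e-10):
    """Amplitude ratio to decibels, clamped so silence does not hit log(0)."""
    return 20 * math.log10(max(amplitude, floor))


def analyze(samples, sample_rate, frame_ms=20, noise_pct=10, speech_pct=90,
            snr_target_db=20.0):
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    rms = sorted(frame_rms(samples, frame_len))
    if not rms:
        raise ValueError("recording is shorter than one frame")
    # Percentile-energy VAD: quiet frames stand in for the noise floor,
    # loud frames stand in for speech.
    noise = percentile(rms, noise_pct)
    speech = percentile(rms, speech_pct)
    report = {
        "snrDb": to_db(speech) - to_db(noise),
        "peakDbfs": to_db(max(abs(x) for x in samples)),
        "noiseFloorDbfs": to_db(noise),
    }
    if report["snrDb"] < snr_target_db:
        # Flag the upload instead of rejecting it.
        report["qualityWarning"] = f"SNR below {snr_target_db} dB target"
    return report


def tone(amplitude, n, sample_rate=8000, freq=100):
    """Synthetic sine used only to exercise the analyzer."""
    return [amplitude * math.sin(2 * math.pi * freq * i / sample_rate)
            for i in range(n)]


# Half a second of low-level "noise" followed by half a second of "speech".
print(analyze(tone(0.001, 4000) + tone(0.1, 4000), sample_rate=8000))
```

Because the warning is attached to the report and nothing is discarded, the same data supports both a strict export (SNR ≥ 20 dB only) and a noise-robust one that includes flagged recordings.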

Transcription happens in the in-app editor on every speech entry and is currently single-pass, so no inter-transcriber agreement is measured yet. For text-mode NER tasks, multi-annotator agreement is computed automatically (character-level Fleiss’ κ plus pairwise span F1) and surfaced in the admin dashboard, with merged labels written to dataset_entries.
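
For readers unfamiliar with the two agreement measures, the sketch below shows one way to compute them. It is an illustration under stated assumptions, not the dashboard's code: spans are assumed to be (start, end_exclusive, label) tuples, unlabelled characters get a hypothetical "O" category, span F1 uses exact matching of boundaries and label, and all function names are made up for this example.

```python
from collections import Counter
from itertools import combinations


def char_labels(text_len, spans, outside="O"):
    """Expand (start, end_exclusive, label) spans into one label per character."""
    labels = [outside] * text_len
    for start, end, label in spans:
        for i in range(start, end):
            labels[i] = label
    return labels


def fleiss_kappa(ratings):
    """ratings: one list per item, holding one category per rater."""
    n_items, n_raters = len(ratings), len(ratings[0])
    if n_raters < 2:
        raise ValueError("Fleiss' kappa needs at least two raters")
    totals = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of rater pairs that agree on this item.
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar /= n_items
    # Agreement expected by chance from the overall category proportions.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1.0:
        return 1.0  # only one category was ever used, so agreement is trivial
    return (p_bar - p_e) / (1 - p_e)


def span_f1(spans_a, spans_b):
    """Exact-match F1 between two annotators' span sets."""
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def agreement(text, annotations):
    """annotations: one list of spans per annotator (at least two annotators)."""
    per_rater = [char_labels(len(text), spans) for spans in annotations]
    per_char = [list(column) for column in zip(*per_rater)]
    pairs = list(combinations(annotations, 2))
    return {
        "fleiss_kappa_char": fleiss_kappa(per_char),
        "pairwise_span_f1": sum(span_f1(a, b) for a, b in pairs) / len(pairs),
    }
```

The two numbers complement each other: character-level κ gives partial credit when annotators disagree only on a span boundary, while exact-match span F1 counts any boundary mismatch as a miss.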

Roadmap. Quality controls planned for future releases (currently not implemented):

SNR analysis for browser-recorded WebM/Opus (needs an ffmpeg decode hop)
Two-pass transcription with independent verification in Praat / ELAN workflows
Forced alignment validation flagging audio-text divergence
Speaker diarization audit for multi-speaker recordings
Voice anonymization (x-vector conversion + pitch shifting)

Anything described in this section as “planned” is on the roadmap, not in production. See docs/audio-collection.md for the source-of-truth on shipped capabilities.
