Collaboration Framework
The BUDOVA project has established a formal research partnership with École polytechnique fédérale de Lausanne (EPFL) and ETH Zürich, two of Europe’s top-ranked technical universities. This collaboration operates under the broader LINGUA Open Call framework funded by the Microsoft AI for Good Lab, connecting Ukrainian language technology research with cutting-edge European NLP infrastructure.
The partnership is structured around quarterly research sprints, joint supervision of graduate researchers, and shared milestone deliverables. Each institution brings complementary expertise: the EPFL-affiliated Idiap Research Institute contributes deep experience in speech processing and speaker verification, while ETH’s Language Technology Group provides state-of-the-art NER and low-resource NLP methods.
A joint steering committee meets monthly to align priorities, review progress, and coordinate publication timelines. All research outputs are committed to open access under the LINGUA program’s open-science mandate.
Technical Contributions
EPFL researchers contribute pre-trained multilingual speech encoder checkpoints that serve as the initialization for BUDOVA’s Ukrainian ASR models. These encoders, trained on more than 50 languages, several of them Slavic, provide a strong starting point that significantly reduces the amount of Ukrainian-specific data needed to reach competitive word error rates.
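Word error rate, the metric referred to above, is the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. A minimal Python sketch follows, assuming whitespace tokenization and no text normalization; the function name is illustrative and is not taken from the project's codebase.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are already normalized (case, punctuation)
    and are tokenized on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("pour concrete into formwork",
                          "pour concrete in formwork"))  # 0.25
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.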
ETH Zürich’s team focuses on cross-lingual transfer learning for named entity recognition. Their approach fine-tunes multilingual transformers (XLM-R, mBERT) on high-resource Slavic NER datasets (Czech, Polish) before adapting to Ukrainian construction text, achieving up to 12% F1 improvement over training from scratch on Ukrainian data alone.
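NER results of this kind are commonly reported as entity-level F1, where a predicted entity counts only if its span and type both match the gold annotation exactly. The sketch below shows one way to compute it from BIO tag sequences. It assumes a BIO scheme and uses hypothetical labels such as MAT and ORG; the project's actual tag set and scoring script are not specified here.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of typed entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # Simplification: an I- tag with no matching open entity is ignored.
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1 for one tagged sequence."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["B-MAT", "I-MAT", "O", "B-ORG"]
    pred = ["B-MAT", "I-MAT", "O", "O"]
    # Precision 1.0, recall 0.5.
    print(round(entity_f1(gold, pred), 4))  # 0.6667
```

When comparing a transfer-learned model against a Ukrainian-only baseline, it helps to state whether a gain is in absolute F1 points or relative to the baseline score, since the two readings differ substantially.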
Joint technical contributions include:
- Shared annotation guidelines: harmonized NER tag sets compatible with Universal Dependencies
- Evaluation benchmarks: standardized test sets for comparing ASR and NER models across LINGUA partner languages
- Data augmentation pipelines: synthetic data generation using back-translation and terminology-aware paraphrasing
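
The terminology-aware part of the augmentation step can be sketched as a filter around back-translation: a sentence is translated into a pivot language and back, and the result is kept only if the domain terms present in the source survive the round trip. This is a minimal sketch under stated assumptions. The two translation callables are hypothetical placeholders for whatever MT system is used, and the substring check is a simplification that ignores inflection, which matters for Ukrainian.

```python
from typing import Callable, Iterable, Optional


def back_translate(
    sentence: str,
    to_pivot: Callable[[str], str],    # hypothetical: source -> pivot language
    from_pivot: Callable[[str], str],  # hypothetical: pivot -> source language
    protected_terms: Iterable[str],
) -> Optional[str]:
    """Return a back-translated paraphrase, or None if it should be discarded."""
    candidate = from_pivot(to_pivot(sentence))
    source_lower = sentence.lower()
    candidate_lower = candidate.lower()

    # Discard paraphrases that lost a domain term present in the source.
    for term in protected_terms:
        term_lower = term.lower()
        if term_lower in source_lower and term_lower not in candidate_lower:
            return None

    # Discard round trips that add no new variation.
    if candidate_lower == source_lower:
        return None
    return candidate


if __name__ == "__main__":
    # Toy stand-ins for real translation models.
    to_pivot = lambda s: s.upper()
    from_pivot = lambda s: s.lower().replace("place", "put")
    print(back_translate("Place rebar in formwork", to_pivot, from_pivot, ["rebar"]))
    # put rebar in formwork
```

A production pipeline would likely match terms on lemmas rather than raw substrings, so that inflected forms of the same term are still recognized.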