Research Partnership with EPFL & ETH Zürich

Published: February 10, 2026

Collaboration Framework

The BUDOVA project has established a formal research partnership with École polytechnique fédérale de Lausanne (EPFL) and ETH Zürich, two of Europe’s top-ranked technical universities. The collaboration operates under the broader LINGUA Open Call framework, funded by the Microsoft AI for Good Lab, and connects Ukrainian language technology research with European NLP infrastructure.

The partnership is structured around quarterly research sprints, joint supervision of graduate researchers, and shared milestone deliverables. Each institution brings complementary expertise: the EPFL-affiliated Idiap Research Institute contributes experience in speech processing and speaker verification, while ETH’s Language Technology Group provides methods for named entity recognition (NER) and low-resource NLP.

A joint steering committee meets monthly to align priorities, review progress, and coordinate publication timelines. All research outputs are committed to open access under the LINGUA program’s open-science mandate.

Technical Contributions

EPFL researchers contribute pre-trained multilingual speech encoder checkpoints that serve as initialization for BUDOVA’s Ukrainian automatic speech recognition (ASR) models. These encoders, trained on more than 50 languages, several of them Slavic, provide a starting point that reduces the amount of Ukrainian-specific data needed to reach competitive word error rates.
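
Word error rate (WER), the metric mentioned above, is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration in plain Python; the function name and the sample sentences are assumptions made for this example, not part of BUDOVA’s evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one deleted word and one substituted word out of five.
reference = "залити бетон у фундамент сьогодні"
hypothesis = "залити бетон фундамент завтра"
print(word_error_rate(reference, hypothesis))  # 0.4
```

Real evaluations usually normalize casing and punctuation before scoring; that step is omitted here.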

ETH Zürich’s team focuses on cross-lingual transfer learning for named entity recognition. Their approach fine-tunes multilingual transformers (XLM-R, mBERT) on high-resource Slavic NER datasets (Czech, Polish) before adapting them to Ukrainian construction text, which yields an F1 improvement of up to 12% over training from scratch on Ukrainian data alone.
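
For readers unfamiliar with the metric, NER F1 is commonly computed at the entity level: a predicted entity counts as correct only if its span and label both match the gold annotation. The following is a minimal sketch assuming BIO-tagged sequences; the function names and the tag labels (ORG, LOC) are illustrative assumptions, and the post does not say which scoring variant the team uses.

```python
def extract_entities(tags):
    """Return a set of (start, end, label) spans from a BIO tag sequence."""
    entities = set()
    start, label = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


def entity_f1(gold_sequences, predicted_sequences):
    """Micro-averaged entity-level F1 over paired tag sequences."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold_sequences, predicted_sequences):
        gold = extract_entities(gold_tags)
        pred = extract_entities(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["B-ORG", "I-ORG", "O", "B-LOC"]]
pred = [["B-ORG", "I-ORG", "O", "O"]]
print(round(entity_f1(gold, pred), 3))  # 0.667 (precision 1.0, recall 0.5)
```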

Joint technical contributions include:

Shared annotation guidelines: harmonized NER tag sets compatible with Universal Dependencies
Evaluation benchmarks: standardized test sets for comparing ASR and NER models across LINGUA partner languages
Data augmentation pipelines: synthetic data generation using back-translation and terminology-aware paraphrasing
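
As an illustration of what tag-set harmonization can look like in practice, the sketch below maps corpus-specific BIO labels onto one shared label set. Everything in it is hypothetical: the corpus names, the source labels, and the shared tags are placeholders chosen for the example, not the guidelines the partners actually agreed on.

```python
# Hypothetical mappings from corpus-specific labels to a shared tag set.
TAG_MAPS = {
    "corpus_a": {"person": "PER", "institution": "ORG", "geo": "LOC"},
    "corpus_b": {"persName": "PER", "orgName": "ORG", "placeName": "LOC"},
}


def harmonize(tags, corpus):
    """Rewrite BIO tags from one corpus into the shared tag set."""
    mapping = TAG_MAPS[corpus]
    harmonized = []
    for tag in tags:
        if tag == "O":
            harmonized.append("O")
            continue
        prefix, _, label = tag.partition("-")
        shared = mapping.get(label)
        # Labels without a shared counterpart are treated as non-entities.
        harmonized.append(f"{prefix}-{shared}" if shared else "O")
    return harmonized


print(harmonize(["B-persName", "I-persName", "O", "B-date"], "corpus_b"))
# ['B-PER', 'I-PER', 'O', 'O']
```

Dropping unmapped labels to "O" is one possible design choice; an alternative is to raise an error so that gaps in the mapping are caught during annotation review.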

Shared Resources

A key benefit of the LINGUA consortium is access to shared computational resources. BUDOVA uses Microsoft Azure AI infrastructure provided through the AI for Good program, including GPU clusters (A100 80 GB nodes) for model training and large-scale data processing. These resources are pooled across LINGUA partners, enabling experiments at a scale that would be prohibitively expensive for any single team.

Beyond compute, the partnership facilitates data sharing under strict privacy protocols. Anonymized intermediate representations (embeddings, attention maps) can be exchanged between partners without transferring raw personal data, enabling collaborative model development while maintaining GDPR compliance.
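
One way such an exchange could be organized is sketched below: the export keeps only the embedding, replaces the speaker identifier with a keyed hash, and leaves the transcript behind. The record fields, the function name, and the use of HMAC-SHA256 are assumptions made for this example rather than the partnership’s actual protocol. Note that keyed hashing is pseudonymization, and embeddings can still carry information about the speaker, so the sketch shows the data flow only and is not a compliance mechanism.

```python
import hashlib
import hmac


def to_shareable(record, secret_key):
    """Build an export record that contains no raw text or raw identifiers."""
    digest = hmac.new(
        secret_key, record["speaker_id"].encode("utf-8"), hashlib.sha256
    )
    return {
        "speaker": digest.hexdigest(),            # keyed pseudonym, not the raw ID
        "embedding": list(record["embedding"]),  # copied intermediate representation
    }


# Hypothetical local record; only the fields above leave the institution.
local_record = {
    "speaker_id": "worker-0042",
    "transcript": "залити бетон у фундамент",
    "embedding": [0.12, -0.53, 0.98],
}
shared = to_shareable(local_record, secret_key=b"example-key-kept-on-site")
print(sorted(shared))  # ['embedding', 'speaker']
```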

The shared resource portfolio also includes:

Model registry: a private Hugging Face organization where partners share checkpoints and training configs
Annotation platform: a jointly maintained Label Studio instance with custom construction-domain interfaces
Documentation wiki: internal knowledge base for experimental protocols, negative results, and best practices
