Semantic space

Corpus embeddings

A 2D projection of BUDOVA sentence embeddings. Registers cluster cleanly because the domain vocabulary is so distinctive.

Preview · synthetic layout until v1.0
Normative · Estimates · Field speech · Corporate · Academic

The current layout is a seeded synthetic scatter whose cluster centroids match the observed register separability. A real UMAP projection will replace it after v1.0, once the trained encoder has run over the full corpus.
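As a rough illustration of what a seeded synthetic scatter can look like, here is a minimal sketch in Python. It is not the site's actual layout code: the function name `synthetic_layout`, the circular placement of centroids, and the `radius`, `spread`, and `seed` values are all assumptions chosen for the example. Only the five register names come from the legend.

```python
import math
import random

# Register names taken from the legend on this page.
REGISTERS = ["Normative", "Estimates", "Field speech", "Corporate", "Academic"]


def synthetic_layout(points_per_register=40, radius=3.0, spread=0.35, seed=42):
    """Return a list of (x, y, register) tuples for a placeholder scatter.

    Assumption: centroids are spaced evenly on a circle. The real preview
    may place them differently to reflect measured separability.
    """
    rng = random.Random(seed)  # fixed seed, so the layout is reproducible
    points = []
    for i, name in enumerate(REGISTERS):
        # Hypothetical centroid for this register.
        angle = 2.0 * math.pi * i / len(REGISTERS)
        cx = radius * math.cos(angle)
        cy = radius * math.sin(angle)
        # Scatter points around the centroid with Gaussian noise.
        for _ in range(points_per_register):
            x = cx + rng.gauss(0.0, spread)
            y = cy + rng.gauss(0.0, spread)
            points.append((x, y, name))
    return points


layout = synthetic_layout()
```

Because the generator is seeded, calling `synthetic_layout()` again yields the same coordinates, which keeps a preview stable between page loads. A small `spread` relative to `radius` is what makes the clusters look cleanly separated.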

Collaboration

Join BUDOVA

We are looking for researchers, construction professionals, and language specialists to participate in the project.

Supported by
Microsoft AI for Good Lab