
A new AI translation system for headphones clones multiple voices simultaneously
Rhiannon Williams
created: May 9, 2025, 9 a.m. | updated: May 14, 2025, 1:13 p.m.
Spatial Speech Translation consists of two AI models, the first of which divides the space surrounding the person wearing the headphones into small regions and uses a neural network to search for potential speakers and pinpoint their direction.
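The direction-finding step can be illustrated with a classic signal-processing baseline: estimating a speaker's azimuth from the arrival-time difference between two microphones, then mapping it onto one of the discrete regions the system searches. The sketch below is a minimal numpy stand-in for that idea, not the paper's actual neural model; the microphone spacing, sample rate, and region width are assumed values.

```python
import numpy as np

def estimate_azimuth(left, right, sr, mic_distance=0.18, speed_of_sound=343.0):
    """Estimate a source's azimuth (degrees) from a two-microphone recording
    via the time difference of arrival (TDOA) -- a toy stand-in for the
    neural speaker-localization model described in the article."""
    n = len(right)
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (n - 1)          # inter-channel delay in samples
    tdoa = lag / sr                          # delay in seconds
    # sin(theta) = tdoa * c / d, clipped to a valid arcsin argument
    s = np.clip(tdoa * speed_of_sound / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))

def azimuth_region(angle, region_width=10.0):
    """Map an azimuth in [-90, 90] degrees to a discrete region index,
    mimicking the 'divide space into small regions' search."""
    return int((angle + 90.0) // region_width)

# demo: synthesize a source whose sound reaches the right mic 4 samples late
rng = np.random.default_rng(0)
left = rng.standard_normal(16000)
right = np.roll(left, 4)
angle = estimate_azimuth(left, right, sr=16000)
region = azimuth_region(angle)
```

With an 18 cm mic spacing at 16 kHz, a 4-sample inter-channel delay corresponds to an azimuth of about 28 degrees off center (negative under this sign convention), which the second helper then maps onto one of eighteen 10-degree regions.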
The second model then translates the speakers’ words from French, German, or Spanish into English text using publicly available data sets.
The same model extracts the unique characteristics and emotional tone of each speaker’s voice, such as pitch and amplitude, and applies those properties to the text, essentially creating a “cloned” voice.
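The two voice properties the article names, pitch and amplitude, can be estimated with standard signal processing. Below is a small numpy sketch using the autocorrelation method for pitch and RMS energy for amplitude; it stands in for (and is far simpler than) the model's learned voice encoding.

```python
import numpy as np

def extract_voice_features(signal, sr):
    """Estimate two voice properties named in the article: amplitude
    (RMS energy) and pitch (fundamental frequency in Hz, found via
    autocorrelation, restricted to the 50-400 Hz speech range)."""
    rms = float(np.sqrt(np.mean(signal ** 2)))
    # one-sided autocorrelation; its first strong peak marks the pitch period
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    min_lag, max_lag = sr // 400, sr // 50   # plausible pitch periods
    period = min_lag + int(np.argmax(corr[min_lag:max_lag]))
    pitch = sr / period
    return pitch, rms

# demo: a pure 220 Hz tone, roughly in an adult speaking-pitch range
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
pitch, rms = extract_voice_features(tone, sr)
```

A real cloning pipeline would condition a speech synthesizer on features like these (and many more) so that the translated text is spoken in something close to the original voice.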
Doing all of this in real time remains a major challenge, because the speed at which an AI system can translate one language into another depends on the languages’ structure.
Reducing the latency could make the translations less accurate, he warns: “The longer you wait [before translating], the more context you have, and the better the translation will be.”
MIT Technology Review