Google Unveils Audio-Visual AI Model Capable of Isolating One Voice from Many

When you find yourself in a noisy environment with many speakers and other sound sources, it’s easy to focus on the person you’re talking to while mentally tuning out all the other voices. This ability, known as the cocktail party effect, comes naturally to humans. For machines, however, separating a specific audio signal from a complex mixture has remained a challenge despite more than half a century of intense research in the field. Now Google has a surprisingly straightforward solution.

Recently, researchers at the company developed a deep learning audio-visual model that can isolate an individual’s voice from a mix of sounds, including other voices and background noise. A newspaper mentioned that the team was able to produce videos in which the voice of a chosen individual is enhanced while all other sounds are toned down.

The method, which works on ordinary videos with a single audio track, lets a viewer select the face of the person in the video they wish to hear, or lets an algorithm select that person based on context.

Google’s new method uses both the audio and the video signal to separate the speech. The key to the technique is visual cues such as mouth movements. The model matches the movement of a person’s mouth with the sounds produced to work out which part of the audio signal corresponds to that specific person. “A unique aspect of our technique is in combining both the auditory and visual signals of an input video to separate the speech,” the researchers said.
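The idea of matching mouth movement to sound can be illustrated with a toy sketch. This is not Google’s model, which is a trained deep network; it only assumes that a per-frame "mouth openness" value for one face and a per-frame loudness value for each candidate audio track are already available, and picks the track whose loudness correlates best with the mouth movement. The function name `match_face_to_track` and all of the signals are hypothetical.

```python
import numpy as np

def match_face_to_track(mouth_openness, track_energies):
    """Return the index of the audio track that best follows the mouth movement.

    Toy heuristic (an assumption for illustration, not the published method):
    score each track by the Pearson correlation between its per-frame energy
    and the per-frame mouth openness of the selected face.
    """
    scores = [np.corrcoef(mouth_openness, energy)[0, 1] for energy in track_energies]
    return int(np.argmax(scores)), scores

# Synthetic data: 100 video frames of one face opening and closing its mouth.
rng = np.random.default_rng(0)
mouth = np.abs(np.sin(np.linspace(0, 6 * np.pi, 100)))

# Track 0 gets louder when the mouth opens; track 1 is unrelated to this face.
track_0 = mouth + 0.05 * rng.standard_normal(100)
track_1 = np.abs(rng.standard_normal(100))

best, scores = match_face_to_track(mouth, [track_0, track_1])
print("best matching track:", best)
```

A learned model can pick up far subtler audio-visual correspondences than this single correlation, but the sketch shows why a visible face gives the system a handle on which sounds belong to whom.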

While developing the speech-separation model, the researchers at Google trained it on thousands of high-quality lectures and talks from YouTube, extracting video clips with clean speech and a single speaker. In this way, the group collected 2,000 hours of video clips, each with a single speaker on camera and no background noise. Using this data, the researchers then created mixtures of face videos and the corresponding speech from different video sources, along with background noise.
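The audio side of this mixture-building step can be sketched roughly as follows. This is a minimal illustration and not the researchers’ pipeline: the function `make_mixture`, the `noise_gain` parameter, and the sine waves standing in for clean speech are all assumptions. The point is that, because the clean tracks are known before mixing, every synthetic mixture comes with the correct separated answer for training.

```python
import numpy as np

def make_mixture(clean_a, clean_b, noise, noise_gain=0.3):
    """Mix two clean single-speaker tracks with background noise.

    Returns the mixed signal (what a model would hear) and the stacked clean
    tracks (what it should learn to recover). noise_gain is a hypothetical
    knob controlling how loud the added noise is.
    """
    n = min(len(clean_a), len(clean_b), len(noise))  # trim to a common length
    targets = np.stack([clean_a[:n], clean_b[:n]])
    mixture = targets.sum(axis=0) + noise_gain * noise[:n]
    return mixture, targets

# Stand-in signals: one second of audio at an assumed 16 kHz sample rate.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
speaker_a = np.sin(2 * np.pi * 220 * t)  # placeholder for speaker A's clean speech
speaker_b = np.sin(2 * np.pi * 330 * t)  # placeholder for speaker B's clean speech
noise = np.random.default_rng(0).standard_normal(sample_rate)

mixture, targets = make_mixture(speaker_a, speaker_b, noise)
print(mixture.shape, targets.shape)
```

In the setup the article describes, each clean track would also be paired with the face video it came from, so the model sees both the mixed audio and the speakers’ faces.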

According to the team, the technique has a range of potential applications, including speech enhancement and recognition in videos, video conferencing, improved hearing aids, and other situations in which multiple people are speaking. Google is currently looking into where the method can be used in its products.

Besides audio-visual speech separation, AI has numerous applications in fields such as gaming, natural language processing, expert systems, vision systems, handwriting recognition, and intelligent robots. First conceptualized around the 1950s, the technology has played a critical role in various industries for decades. It has also become a weapon that tech giants use to compete against one another. Companies such as Facebook and Google use AI widely for tasks such as suggesting searches, identifying the fastest route to drive, and recognizing someone in a photo.