Audio-Visual Speech Separation and Personalized Keyphrase Detection in Noisy Environments

Audio-Visual Speech Separation and Personalized Keyphrase Detection in Noisy Environments is a project inspired by the human brain's ability to focus on a single voice amid overlapping conversations, known as the "cocktail party effect". It targets settings such as conferences, public events, and crowded spaces, where traditional audio processing falls short. The system combines visual cues, such as detected lip movements, with audio filtering to remove background noise and improve transcription accuracy. By mapping each audio stream to the corresponding visual element, it keeps audio and video synchronized and clear, which makes it applicable in domains like security, media production, and assistive technologies.

Key Features

Speaker Isolation: Isolates each speaker's voice, even in overlapping conversations.
Audio Sync: Synchronizes separated audio with corresponding lip movements.
Captions: Generates individual captions for each speaker.
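
As a rough illustration of the audio sync idea, the sketch below assigns each separated audio track to the face whose mouth movement follows it most closely. It is not the project's actual implementation. It assumes, hypothetically, that an upstream stage has already produced a per-frame energy envelope for each separated track and a per-frame mouth-opening measure for each detected face. The names pearson, match_tracks_to_faces, track_energy, and mouth_opening are made up for this example, and the numbers are toy data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def match_tracks_to_faces(track_energy, mouth_opening):
    """Map each separated track to the face whose mouth movement correlates best.

    Both arguments are dicts of name -> per-frame values (hypothetical inputs).
    The assignment is greedy per track and does not enforce a one-to-one mapping.
    """
    return {
        track: max(mouth_opening, key=lambda face: pearson(energy, mouth_opening[face]))
        for track, energy in track_energy.items()
    }


# Toy data: six video frames, two separated tracks, two detected faces.
track_energy = {
    "track_0": [0.9, 0.8, 0.1, 0.0, 0.7, 0.9],
    "track_1": [0.0, 0.1, 0.8, 0.9, 0.1, 0.0],
}
mouth_opening = {
    "face_A": [0.1, 0.0, 0.9, 0.8, 0.0, 0.1],
    "face_B": [0.8, 0.9, 0.0, 0.1, 0.8, 0.7],
}

assignment = match_tracks_to_faces(track_energy, mouth_opening)
print(assignment)
```

With this toy data, track_0 is paired with face_B and track_1 with face_A, and each pairing could then be used to label per-speaker captions. A real system would likely use learned audio-visual embeddings and a one-to-one assignment instead of a plain correlation.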