Audio-Visual Speech Separation and Personalized Keyphrase Detection in Noisy Environments

Humans possess an exceptional ability to focus on a specific audio source amid noise, a phenomenon known as the cocktail party effect. Drawing inspiration from this, our project builds a system that isolates audio from individual speakers in a multi-speaker environment using attention models. By computing RMS energy and Pearson correlation coefficients, the system relates variations in each audio signal to the corresponding lip-movement energy, ensuring that each separated audio stream is synchronized with the correct speaker. Additionally, we generate personalized captions for each speaker, further improving the accessibility and clarity of multi-speaker content.
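The RMS-to-lip-energy matching described above can be sketched in a few lines. The following is a minimal illustration, assuming NumPy arrays for the mono waveform and the mouth-region video frames; the function names (`frame_rms`, `lip_motion_energy`, `assign_track`) are hypothetical and not taken from the project code:

```python
import numpy as np

def frame_rms(audio, frame_len=640, hop=320):
    """Frame-wise RMS energy of a mono waveform (e.g. 40 ms frames at 16 kHz)."""
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def lip_motion_energy(mouth_frames):
    """Lip-movement energy: mean absolute difference between consecutive
    mouth-ROI frames, one value per frame transition."""
    diffs = np.abs(np.diff(mouth_frames.astype(float), axis=0))
    return diffs.reshape(len(diffs), -1).mean(axis=1)

def pearson(a, b):
    """Pearson correlation between two 1-D series, truncated to equal length."""
    n = min(len(a), len(b))
    a, b = a[:n] - a[:n].mean(), b[:n] - b[:n].mean()
    return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def assign_track(audio_rms, lip_energies):
    """Assign a separated audio track to the speaker whose lip-movement
    energy correlates best with the track's RMS envelope."""
    scores = [pearson(audio_rms, le) for le in lip_energies]
    return int(np.argmax(scores)), scores
```

In practice the audio frame rate must be aligned with the video frame rate before correlating, and the highest-scoring speaker receives the track; the snippet above omits that resampling step for brevity.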
Features
Isolates each speaker's voice, even in overlapping conversations.
Synchronizes separated audio with corresponding lip movements.
Generates individual captions for each speaker in multi-speaker scenarios.