Audio-Visual Speech Separation and Personalized Keyphrase Detection in Noisy Environments

hero-image

Audio-Visual Speech Separation and Personalized Keyphrase Detection in Noisy Environments is a project inspired by the human brain's ability to focus on a single voice amid overlapping conversations, known as the "cocktail party effect". It targets scenarios such as conferences, public events, and crowded spaces, where audio-only processing falls short. The system combines facial movement detection, in particular lip movements, with audio source separation to suppress background noise and improve transcription accuracy. By mapping each separated audio track to the corresponding visual speaker, it keeps audio and video synchronized, which makes it useful in domains such as security, media production, and assistive technologies.

Key Features

Speaker Isolation

Isolates each speaker's voice, even in overlapping conversations.
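
The sketch below shows one way this isolation step can be run with the pretrained speechbrain/sepformer-wsj02mix model listed under Technologies Used. The input and output file names are placeholders, and on older SpeechBrain releases the same class is imported from speechbrain.pretrained instead.

```python
import torchaudio
# On older SpeechBrain releases: from speechbrain.pretrained import SepformerSeparation
from speechbrain.inference.separation import SepformerSeparation as separator

# Download and cache the pretrained SepFormer trained on WSJ0-2mix (two-speaker mixtures)
model = separator.from_hparams(
    source="speechbrain/sepformer-wsj02mix",
    savedir="pretrained_models/sepformer-wsj02mix",
)

# separate_file returns a tensor of shape [batch, time, n_sources]
est_sources = model.separate_file(path="mixture.wav")

# Write one track per estimated speaker (this model operates on 8 kHz audio)
torchaudio.save("speaker1.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("speaker2.wav", est_sources[:, :, 1].detach().cpu(), 8000)
```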

Audio Sync

Synchronizes separated audio with corresponding lip movements.
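
A minimal sketch of one way to match a separated track to the right on-screen face, using the dlib landmarks, Euclidean distance, Savitzky-Golay filter, and Pearson correlation listed under Technologies Used. The 68-point predictor file, the inner-lip landmark indices (62/66), the use of librosa for the audio envelope, and all file names are illustrative assumptions, not necessarily the project's exact choices.

```python
import cv2
import dlib
import librosa
import numpy as np
from scipy.signal import savgol_filter
from scipy.stats import pearsonr

detector = dlib.get_frontal_face_detector()
# Assumes the standard 68-point landmark model file is available locally
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_opening_signal(video_path):
    """Per-frame mouth opening: Euclidean distance between inner-lip landmarks 62 and 66."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    openings = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if faces:
            pts = predictor(gray, faces[0])
            top = np.array([pts.part(62).x, pts.part(62).y])
            bottom = np.array([pts.part(66).x, pts.part(66).y])
            openings.append(np.linalg.norm(top - bottom))
        else:
            openings.append(0.0)
    cap.release()
    # Savitzky-Golay smoothing removes frame-to-frame landmark jitter
    return savgol_filter(np.array(openings), window_length=7, polyorder=2), fps

def audio_envelope(audio_path, fps, n_frames):
    """RMS energy of a separated track, resampled to one value per video frame."""
    y, sr = librosa.load(audio_path, sr=None)
    hop = int(sr / fps)
    rms = librosa.feature.rms(y=y, frame_length=2 * hop, hop_length=hop)[0]
    return rms[:n_frames]

lips, fps = lip_opening_signal("speaker_face.mp4")
env = audio_envelope("speaker1.wav", fps, len(lips))
n = min(len(lips), len(env))
# A higher Pearson correlation suggests this track belongs to this face
corr, _ = pearsonr(lips[:n], env[:n])
print(f"Pearson correlation: {corr:.3f}")
```

Repeating this comparison for every (track, face) pair and picking the highest-correlation assignment is one straightforward way to keep each separated voice aligned with its speaker.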

Captions

Generates individual captions for each speaker.
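
A minimal sketch of per-speaker captioning with openai/whisper; the model size ("base") and the track names carried over from the separation sketch above are placeholders.

```python
import whisper

# Load a Whisper checkpoint; larger sizes ("small", "medium") trade speed for accuracy
model = whisper.load_model("base")

# Transcribe each isolated track so every speaker gets an individual caption file
for track in ["speaker1.wav", "speaker2.wav"]:
    result = model.transcribe(track)
    text = result["text"].strip()
    with open(track.replace(".wav", ".txt"), "w") as f:
        f.write(text)
    print(f"{track}: {text}")
```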

Example 1

Input Video

Output Video 1

Transcription 1

Output Video 2

Transcription 2

Example 2

Input Video

Output Video 1

Transcription 1

Output Video 2

Transcription 2

Technologies Used

speechbrain/sepformer-wsj02mix
dlib
pearson-correlation
euclidean-distance
jupyter-notebook
openai/whisper
savitzky-golay-filter
python