Audio-Visual Speech Separation and Personalized Keyphrase Detection in Noisy Environments

hero-image

Humans possess an exceptional ability to focus on a single audio source amid noise, a phenomenon known as the cocktail party effect. Inspired by this, our project builds a system that isolates the audio of individual speakers in a multi-speaker environment using advanced attention models. By correlating the RMS energy of each separated audio signal with the corresponding lip-movement energy via the Pearson correlation coefficient, the system maps each audio stream to the correct speaker, ensuring accurate synchronization of audio and video. It also generates personalized captions for each speaker, further improving the accessibility and clarity of multi-speaker content.
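The audio-to-speaker mapping described above can be sketched in plain Python: compute a frame-wise RMS envelope for each separated stream, then assign each stream to the lip-energy track it correlates with most strongly. This is a minimal illustration, not the project's actual implementation; frame length, hop size, and any smoothing are assumptions.

```python
import math

def frame_rms(samples, frame_len):
    """Frame-wise RMS energy of an audio signal (list of floats)."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def match_streams(audio_rms, lip_energy):
    """Assign each separated audio stream (by RMS envelope) to the
    lip-energy track with which it correlates most strongly."""
    return [
        max(range(len(lip_energy)), key=lambda j: pearson(rms, lip_energy[j]))
        for rms in audio_rms
    ]
```

In practice the envelopes would come from the SepFormer outputs and the lip-energy tracks from per-frame facial landmarks, resampled to a common rate before correlation.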

Features

1. Isolates each speaker's voice, even in overlapping conversations.
2. Synchronizes separated audio with corresponding lip movements.
3. Generates individual captions for each speaker in multi-speaker scenarios.
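One way the lip-movement energy used for synchronization could be derived: take the Euclidean distance between the inner-lip midpoints of dlib's 68-point facial landmark model (indices 62 and 66) as the mouth opening, and use its frame-to-frame change as the per-frame energy. A hedged sketch, assuming landmarks are already extracted as (x, y) tuples; the exact energy definition in the project may differ.

```python
import math

# Inner-lip midpoint indices in dlib's 68-point facial landmark model.
UPPER_INNER_LIP = 62
LOWER_INNER_LIP = 66

def mouth_opening(landmarks):
    """Euclidean distance between the inner upper and lower lip points.
    `landmarks` is a sequence of (x, y) tuples, one per landmark index."""
    x1, y1 = landmarks[UPPER_INNER_LIP]
    x2, y2 = landmarks[LOWER_INNER_LIP]
    return math.hypot(x2 - x1, y2 - y1)

def lip_energy(frames):
    """Per-frame lip-movement energy: absolute frame-to-frame change in
    mouth opening across a sequence of landmark frames."""
    openings = [mouth_opening(f) for f in frames]
    return [abs(b - a) for a, b in zip(openings, openings[1:])]
```

The resulting energy series can then be smoothed (e.g. with a Savitzky-Golay filter) before being correlated with the audio RMS envelopes.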

Example 1

Input Video

Output Video 1

Transcription 1

Output Video 2

Transcription 2

Example 2

Input Video

Output Video 1

Transcription 1

Output Video 2

Transcription 2

Technologies

speechbrain/sepformer-wsj02mix
dlib
pearson-correlation
euclidean-distance
jupyter-notebook
openai/whisper
savitzky-golay-filter
python