Audio-Visual Speech Separation and Personalized Keyphrase Detection in Noisy Environments

hero-image

Audio-Visual Speech Separation and Personalized Keyphrase Detection in Noisy Environments is a project inspired by the human brain's ability to focus on a single voice amid overlapping conversations, known as the "cocktail party effect". It targets scenarios such as conferences, public events, and crowded spaces, where audio-only processing falls short. The system combines facial movement detection, in particular lip movements, with audio source separation to suppress background noise and improve transcription accuracy. By mapping each separated audio track to the corresponding visual speaker, it keeps audio and video synchronized, which makes it useful in domains such as security, media production, and assistive technologies.

Key Features

Speaker Isolation

Isolates each speaker's voice, even in overlapping conversations.
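
The sketch below shows one way this isolation step can be run with the pretrained speechbrain/sepformer-wsj02mix model listed under Technologies Used. The input and output file names are placeholders, and on older SpeechBrain releases the same class is imported from speechbrain.pretrained instead.

```python
import torchaudio
# On older SpeechBrain releases: from speechbrain.pretrained import SepformerSeparation
from speechbrain.inference.separation import SepformerSeparation as separator

# Download and cache the pretrained SepFormer trained on WSJ0-2mix (two-speaker mixtures)
model = separator.from_hparams(
    source="speechbrain/sepformer-wsj02mix",
    savedir="pretrained_models/sepformer-wsj02mix",
)

# separate_file returns a tensor of shape [batch, time, n_sources]
est_sources = model.separate_file(path="mixture.wav")

# Write one track per estimated speaker (this model operates on 8 kHz audio)
torchaudio.save("speaker1.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("speaker2.wav", est_sources[:, :, 1].detach().cpu(), 8000)
```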

Audio Sync

Synchronizes separated audio with corresponding lip movements.
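
A minimal sketch of one way to match a separated track to the right on-screen face, using the dlib landmarks, Euclidean distance, Savitzky-Golay filter, and Pearson correlation listed under Technologies Used. The 68-point predictor file, the inner-lip landmark indices (62/66), the use of librosa for the audio envelope, and all file names are illustrative assumptions, not necessarily the project's exact choices.

```python
import cv2
import dlib
import librosa
import numpy as np
from scipy.signal import savgol_filter
from scipy.stats import pearsonr

detector = dlib.get_frontal_face_detector()
# Assumes the standard 68-point landmark model file is available locally
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_opening_signal(video_path):
    """Per-frame mouth opening: Euclidean distance between inner-lip landmarks 62 and 66."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    openings = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if faces:
            pts = predictor(gray, faces[0])
            top = np.array([pts.part(62).x, pts.part(62).y])
            bottom = np.array([pts.part(66).x, pts.part(66).y])
            openings.append(np.linalg.norm(top - bottom))
        else:
            openings.append(0.0)
    cap.release()
    # Savitzky-Golay smoothing removes frame-to-frame landmark jitter
    return savgol_filter(np.array(openings), window_length=7, polyorder=2), fps

def audio_envelope(audio_path, fps, n_frames):
    """RMS energy of a separated track, resampled to one value per video frame."""
    y, sr = librosa.load(audio_path, sr=None)
    hop = int(sr / fps)
    rms = librosa.feature.rms(y=y, frame_length=2 * hop, hop_length=hop)[0]
    return rms[:n_frames]

lips, fps = lip_opening_signal("speaker_face.mp4")
env = audio_envelope("speaker1.wav", fps, len(lips))
n = min(len(lips), len(env))
# A higher Pearson correlation suggests this track belongs to this face
corr, _ = pearsonr(lips[:n], env[:n])
print(f"Pearson correlation: {corr:.3f}")
```

Repeating this comparison for every (track, face) pair and picking the highest-correlation assignment is one straightforward way to keep each separated voice aligned with its speaker.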

Captions

Generates individual captions for each speaker.
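
A minimal sketch of per-speaker captioning with openai/whisper; the model size ("base") and the track names carried over from the separation sketch above are placeholders.

```python
import whisper

# Load a Whisper checkpoint; larger sizes ("small", "medium") trade speed for accuracy
model = whisper.load_model("base")

# Transcribe each isolated track so every speaker gets an individual caption file
for track in ["speaker1.wav", "speaker2.wav"]:
    result = model.transcribe(track)
    text = result["text"].strip()
    with open(track.replace(".wav", ".txt"), "w") as f:
        f.write(text)
    print(f"{track}: {text}")
```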

Example 1

Input Video

Output Video 1

Transcription 1

Output Video 2

Transcription 2

Example 2

Input Video

Output Video 1

Transcription 1

Output Video 2

Transcription 2

Technologies Used

speechbrain/sepformer-wsj02mix
dlib
pearson-correlation
euclidean-distance
jupyter-notebook
openai/whisper
savitzky-golay-filter
python