
mayurnewase / looking-to-listen-at-cocktail-party

Licence: other
Looking to listen at cocktail party

Programming Languages

Jupyter Notebook
11667 projects

Projects that are alternatives of or similar to looking-to-listen-at-cocktail-party

video-audio-tools
To process/edit video and audio with Python + FFmpeg. [Simple and practical] Video and audio processing/editing based on Python + FFmpeg.
Stars: ✭ 164 (+396.97%)
Mutual labels:  video-processing, audio-processing
Auto Editor
Auto-Editor: Effort free video editing!
Stars: ✭ 382 (+1057.58%)
Mutual labels:  video-processing, audio-processing
DuME
A fast, versatile, easy-to-use and cross-platform Media Encoder based on FFmpeg
Stars: ✭ 66 (+100%)
Mutual labels:  video-processing, audio-processing
eloquent-ffmpeg
High-level API for FFmpeg's Command Line Tools
Stars: ✭ 71 (+115.15%)
Mutual labels:  video-processing, audio-processing
Avdemo
Demo projects for iOS Audio & Video development.
Stars: ✭ 136 (+312.12%)
Mutual labels:  video-processing, audio-processing
lecture-demos
Demonstrations for the interactive exploration of selected core concepts of audio, image and video processing as well as related topics
Stars: ✭ 12 (-63.64%)
Mutual labels:  video-processing, audio-processing
Vectorhub
Vector Hub - Library for easy discovery, and consumption of State-of-the-art models to turn data into vectors. (text2vec, image2vec, video2vec, graph2vec, bert, inception, etc)
Stars: ✭ 317 (+860.61%)
Mutual labels:  video-processing, audio-processing
ion-avp
Audio/Video Processing Service
Stars: ✭ 55 (+66.67%)
Mutual labels:  video-processing, audio-processing
Video2description
Video to Text: Generates description in natural language for given video (Video Captioning)
Stars: ✭ 107 (+224.24%)
Mutual labels:  video-processing, audio-processing
Arcan
Arcan - [Display Server, Multimedia Framework, Game Engine] -> "Desktop Engine"
Stars: ✭ 885 (+2581.82%)
Mutual labels:  video-processing, audio-processing
Mlt
MLT Multimedia Framework
Stars: ✭ 836 (+2433.33%)
Mutual labels:  video-processing, audio-processing
Unsilence
Console Interface and Library to remove silent parts of a media file 🔈
Stars: ✭ 197 (+496.97%)
Mutual labels:  video-processing, audio-processing
Mediapipe
Cross-platform, customizable ML solutions for live and streaming media.
Stars: ✭ 15,338 (+46378.79%)
Mutual labels:  video-processing, audio-processing
ffcvt
ffmpeg convert wrapper tool
Stars: ✭ 32 (-3.03%)
Mutual labels:  video-processing, audio-processing
laav
Asynchronous Audio / Video Library for H264 / MJPEG / OPUS / AAC / MP2 encoding, transcoding, recording and streaming from live sources
Stars: ✭ 50 (+51.52%)
Mutual labels:  video-processing
RTspice
A real-time netlist based audio circuit plugin
Stars: ✭ 51 (+54.55%)
Mutual labels:  audio-processing
dspjargon
All the jargon you need to understand the world of Digital Signal Processing.
Stars: ✭ 37 (+12.12%)
Mutual labels:  audio-processing
RS-MET
Codebase for RS-MET products (Robin Schmidt's Music Engineering Tools)
Stars: ✭ 32 (-3.03%)
Mutual labels:  audio-processing
tsunami
A simple but powerful audio editor
Stars: ✭ 41 (+24.24%)
Mutual labels:  audio-processing
pyanime4k
An easy way to use anime4k in python
Stars: ✭ 80 (+142.42%)
Mutual labels:  video-processing


This is a Keras + TensorFlow implementation of the paper "Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation" by Ephrat et al. from Google Research. The project also uses ideas from the paper "Seeing Through Noise: Visually Driven Speaker Separation and Enhancement".

Compatibility

The code was tested with TensorFlow 1.13.1 on Ubuntu 18.04 with Python 3.6.

News

Date        Update
26-06-2019  Ready-made datasets removed from the Kaggle server due to a storage issue; please make your own with the scripts.
08-06-2019  Notebook added for the full pipeline with a pretrained model.
25-05-2019  Datasets added for mixed-speaker videos.
23-04-2019  Added automated scripts for creating the database structure.

External Dependencies

This repo uses code from facenet and face_recognition to track faces in videos and extract features from them.

Usage

Database structure

The layout below stores the audio and video datasets efficiently, with minimal duplication.

|--speaker_background_spectrograms/
|  |--per speaker part 1/
|  |  |--speaker_clean.pkl
|  |  |--speaker_chatter_i.pkl
|  |--per speaker part 2/
|  |  |--speaker_clean.pkl
|  |  |--speaker_chatter_i.pkl
|--two_speakers_mix_spectrograms/
|  |--per speaker/
|  |  |--clean.pkl
|  |  |--mix_with_other_i.pkl
|--speaker_video_spectrograms/
|  |--per_speaker part 1/
|  |  |--clean.pkl
|  |--per_speaker part 2/
|  |  |--clean.pkl
|--chatter audios/
|  |--part1/
|  |--part2/
|  |--part3/
|--clean audios/
|  |--videos/
|  |--frames/
|  |--pretrained_model/
|  |  |--facenet_model.h5

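The .pkl files above hold pickled spectrogram arrays. As a minimal sketch of reading one back (the array shape and dtype here are assumptions for illustration, not documented by the repo; the real files are produced by the data-preparation scripts):

```python
import pickle
import numpy as np

# For a self-contained demo, write a dummy spectrogram first.
# The shape (time, freq, 2) for real/imaginary parts is an assumption.
dummy = np.zeros((298, 257, 2), dtype=np.float32)
with open("speaker_clean.pkl", "wb") as f:
    pickle.dump(dummy, f)

# Loading a spectrogram, e.g. from
# speaker_background_spectrograms/per speaker part 1/speaker_clean.pkl
with open("speaker_clean.pkl", "rb") as f:
    spectrogram = pickle.load(f)
print(spectrogram.shape)  # (298, 257, 2)
```
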
Getting started

1. Install all dependencies

pip install -r requirements.txt

2. Run the prepare_directory script

./data/prepare_directory.sh

3. Download the AVSpeech train and test CSV files and put them in data/

4. Run the background-chatter downloader and slicer to download and slice the chatter files. This will download chatter files with the tag "/m/07rkbfh" from AudioSet

python data/chatter_download.py
python data/chatter_slicer.py
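
The downloader selects AudioSet segments carrying the "/m/07rkbfh" label. A rough sketch of that filtering step, assuming the standard AudioSet segment-list CSV format (YTID, start_seconds, end_seconds, positive_labels) — the helper below is illustrative, not the repo's actual script:

```python
import csv
import io

# Tiny stand-in for an AudioSet segment list such as balanced_train_segments.csv
SAMPLE = '''\
abc123, 30.000, 40.000, "/m/09x0r,/m/07rkbfh"
def456, 10.000, 20.000, "/m/04rlf"
ghi789, 0.000, 10.000, "/m/07rkbfh"
'''

CHATTER = "/m/07rkbfh"  # label id used by chatter_download.py

def chatter_segments(text):
    """Yield (ytid, start, end) for rows whose labels include the chatter tag."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for ytid, start, end, labels in reader:
        if CHATTER in labels.split(","):
            yield ytid, float(start), float(end)

segments = list(chatter_segments(SAMPLE))
print(segments)  # [('abc123', 30.0, 40.0), ('ghi789', 0.0, 10.0)]
```
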

5. Start downloading the AVSpeech dataset and process it as configured by the arguments.

python data/data_data_download.py --from_id=0 --to_id=1000 --type_of_dataset=audio_dataset

Arguments available

from_id -> start downloading YouTube clips from train.csv at this id

to_id -> download YouTube clips from train.csv up to this id

type_of_dataset -> type of dataset to prepare.
  audio_dataset -> create audio spectrograms mixed with background chatter
  audio_video_dataset -> create audio spectrograms, video embeddings, and spectrograms of the speaker mixed with other speakers' audio

low_memory -> clear unnecessary intermediate data to save memory

chatter_part -> use different slots of chatter files to be mixed with the clean speaker audio

sample_rate, duration, fps, mono, window, stride, fft_length, amp_norm, chatter_norm -> arguments for the STFT and audio processing

face_extraction_model -> select which model to use for facial embedding extraction
  hog -> faster on CPU but less accurate
  cnn -> slower on CPU, faster on an NVIDIA GPU, more accurate
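
The STFT arguments above (window, stride, fft_length, etc.) control how waveforms become spectrograms. A naive numpy sketch of that step, with illustrative parameter values (the repo's actual defaults and windowing may differ):

```python
import numpy as np

def stft(signal, window=400, stride=160, fft_length=512):
    """Naive STFT: slide a Hann window over the signal and FFT each frame."""
    win = np.hanning(window)
    n_frames = 1 + (len(signal) - window) // stride
    frames = np.stack([signal[i * stride : i * stride + window] * win
                       for i in range(n_frames)])
    # rfft keeps the fft_length // 2 + 1 non-redundant frequency bins
    return np.fft.rfft(frames, n=fft_length, axis=1)

sample_rate, duration = 16000, 3          # illustrative values
wave = np.random.randn(sample_rate * duration)
spec = stft(wave)
print(spec.shape)  # (298, 257) == (n_frames, fft_length // 2 + 1)
```
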

Datasets

  1. The video-mixed dataset is available on my Kaggle page in 10 parts (created using the default parameters above).
Go to my Kaggle profile (https://www.kaggle.com/mayurnewase)
Click on Datasets
Sort by new
The datasets are named mix_speakers_ultimate_*
All 10 parts are available.

To do

Check here

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].