philipperemy / Deep Speaker

Licence: mit

Deep Speaker: an End-to-End Neural Speaker Embedding System.

Unofficial Keras implementation of Deep Speaker | Paper | Pretrained Models.

Sample Results

Models were trained on clean speech data, so expect lower performance on noisy data. Remove silence and background noise before computing the embeddings (for example, with SoX).

Model name	Testing dataset	Num speakers	F	TPR	ACC	EER	Training Logs	Download model
ResCNN Softmax trained	LibriSpeech all(*)	2484	0.789	0.733	0.996	0.043	Click	Click
ResCNN Softmax+Triplet trained	LibriSpeech all(*)	2484	0.843	0.825	0.997	0.025	Click	Click

(*) all includes: dev-clean, dev-other, test-clean, test-other, train-clean-100, train-clean-360, train-other-500.
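The table columns are F (F-measure), TPR (true positive rate), ACC (accuracy) and EER (equal error rate). As a rough illustration of what these numbers mean for speaker verification, the sketch below computes them from cosine similarity scores of utterance pairs. The function names, the toy scores and the 0.5 threshold are assumptions made for this example; this is not necessarily how the repository's test code computes its figures.

```python
import numpy as np


def metrics_at_threshold(scores, labels, threshold):
    """F-measure, TPR and accuracy when pairs scoring >= threshold are accepted."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)  # True = same speaker
    accepted = scores >= threshold
    tp = int(np.sum(accepted & labels))
    fp = int(np.sum(accepted & ~labels))
    fn = int(np.sum(~accepted & labels))
    tn = int(np.sum(~accepted & ~labels))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    acc = (tp + tn) / labels.size
    return {'f': f, 'tpr': tpr, 'acc': acc}


def equal_error_rate(scores, labels):
    """Error rate at the threshold where false accepts and false rejects balance.

    Assumes both same-speaker and different-speaker pairs are present.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        accepted = scores >= threshold
        far = np.mean(accepted[~labels])  # different-speaker pairs wrongly accepted
        frr = np.mean(~accepted[labels])  # same-speaker pairs wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)


# Toy data (hypothetical): four same-speaker pairs, four different-speaker pairs.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(metrics_at_threshold(toy_scores, toy_labels, threshold=0.5))
print(equal_error_rate(toy_scores, toy_labels))
```

A lower EER means the same-speaker and different-speaker score distributions overlap less, which is why the Softmax+Triplet model (EER 0.025) is the stronger of the two.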

Overview

Deep Speaker is a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings generated by Deep Speaker can be used for many tasks, including speaker identification, verification, and clustering.
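To make the hypersphere idea concrete, here is a minimal NumPy sketch of cosine similarity between embeddings. The helper names and the 4-dimensional toy vectors are assumptions for illustration only; the repository ships its own batch_cosine_similarity function.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit length, i.e. project it onto the hypersphere."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def cosine_similarity(a, b):
    """Row-wise cosine similarity between two batches of embeddings."""
    return np.sum(l2_normalize(a) * l2_normalize(b), axis=-1)


# Toy 4-dimensional vectors standing in for real embeddings.
emb_a = np.array([[1.0, 2.0, 3.0, 4.0]])
emb_b = np.array([[2.0, 4.0, 6.0, 8.0]])    # same direction as emb_a
emb_c = np.array([[4.0, -3.0, 2.0, -1.0]])  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # approximately 1: same direction
print(cosine_similarity(emb_a, emb_c))  # approximately 0: unrelated directions
```

Because only the direction of an embedding matters, scaling a vector (as with emb_b) does not change its similarity to others.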

Getting started

Install dependencies

Requirements

tensorflow>=2.0
keras>=2.3.1
python>=3.6

pip install -r requirements.txt

If you see the error "libsndfile not found", run: sudo apt-get install libsndfile-dev

Training

The code for training is available in this repository. Training both models takes a bit less than a week on a GTX1070.

System requirements for a complete training run:

At least 300GB of free disk space on a fast SSD (250GB just for all the uncompressed + processed data).
32GB of memory and at least 32GB of swap (swap can be created on the SSD).
An NVIDIA GPU such as the 1080Ti.
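Before starting the multi-day pipeline, it may be worth confirming the disk requirement. The following is a small sketch using only the standard library; the function names are hypothetical and the choice of the home directory as the target is an assumption, so point it at whichever disk will hold the data.

```python
import shutil
from pathlib import Path

REQUIRED_FREE_GB = 300  # from the requirements listed above


def free_gb(path):
    """Free space, in gigabytes, on the filesystem containing path."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path, required_gb=REQUIRED_FREE_GB):
    """True if the filesystem containing path has at least required_gb free."""
    return free_gb(path) >= required_gb


# Assumption: the data will live somewhere under the home directory.
target = Path.home()
status = 'OK' if has_enough_space(target) else 'NOT ENOUGH SPACE'
print(f'{target}: {free_gb(target):.0f} GB free ({status})')
```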

pip uninstall -y tensorflow && pip install tensorflow-gpu
./deep-speaker download_librispeech    # if the download is too slow, consider replacing [wget] by [axel -n 10 -a] in download_librispeech.sh.
./deep-speaker build_mfcc              # will build MFCC for softmax pre-training and triplet training.
./deep-speaker build_model_inputs      # will build inputs for softmax pre-training.
./deep-speaker train_softmax           # takes ~3 days.
./deep-speaker train_triplet           # takes ~3 days.

NOTE: If you want to use your own dataset, make sure you follow the directory structure of LibriSpeech. Audio files have to be in .flac format. If you have .wav files, you can use ffmpeg to convert them. Both formats are lossless (FLAC is losslessly compressed audio), so the conversion does not degrade the signal.
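The conversion can be scripted. The sketch below walks a directory tree and calls ffmpeg in its simplest form (ffmpeg -i input.wav output.flac) for every .wav file, writing the .flac file next to the original. The function names are hypothetical, and ffmpeg is assumed to be installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def flac_path_for(wav_path):
    """Path of the .flac file written next to the given .wav file."""
    return Path(wav_path).with_suffix('.flac')


def ffmpeg_command(wav_path):
    """Argument list for converting one file; ffmpeg picks FLAC from the extension."""
    return ['ffmpeg', '-i', str(wav_path), str(flac_path_for(wav_path))]


def convert_tree(root):
    """Convert every .wav under root, skipping files that already have a .flac."""
    if shutil.which('ffmpeg') is None:
        raise RuntimeError('ffmpeg not found on PATH')
    for wav in sorted(Path(root).rglob('*.wav')):
        if flac_path_for(wav).exists():
            continue
        subprocess.run(ffmpeg_command(wav), check=True)


# Example (not executed here): convert_tree('my_dataset')
```

The original .wav files are left in place, so they can be removed once the converted dataset has been checked.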

Testing with a pretrained model

Download the trained models

Model name	Training dataset	Num speakers	Model Link
ResCNN Softmax trained	LibriSpeech train-clean-360	921	Click
ResCNN Softmax+Triplet trained	LibriSpeech all	2484	Click

Run with pretrained model

import random

import numpy as np

from audio import read_mfcc
from batcher import sample_from_mfcc
from constants import SAMPLE_RATE, NUM_FRAMES
from conv_models import DeepSpeakerModel
from test import batch_cosine_similarity

# Reproducible results.
np.random.seed(123)
random.seed(123)

# Define the model here.
model = DeepSpeakerModel()

# Load the checkpoint. https://drive.google.com/file/d/1F9NvdrarWZNktdX9KlRYWWHDwRkip_aP.
model.m.load_weights('ResCNN_triplet_training_checkpoint_265.h5', by_name=True)

# Sample some inputs for WAV/FLAC files for the same speaker.
# To have reproducible results every time you call this function, set the seed every time before calling it.
# np.random.seed(123)
# random.seed(123)
mfcc_001 = sample_from_mfcc(read_mfcc('samples/PhilippeRemy/PhilippeRemy_001.wav', SAMPLE_RATE), NUM_FRAMES)
mfcc_002 = sample_from_mfcc(read_mfcc('samples/PhilippeRemy/PhilippeRemy_002.wav', SAMPLE_RATE), NUM_FRAMES)

# Call the model to get the embeddings of shape (1, 512) for each file.
predict_001 = model.m.predict(np.expand_dims(mfcc_001, axis=0))
predict_002 = model.m.predict(np.expand_dims(mfcc_002, axis=0))

# Do it again with a different speaker.
mfcc_003 = sample_from_mfcc(read_mfcc('samples/1255-90413-0001.flac', SAMPLE_RATE), NUM_FRAMES)
predict_003 = model.m.predict(np.expand_dims(mfcc_003, axis=0))

# Compute the cosine similarity and check that it is higher for the same speaker.
print('SAME SPEAKER', batch_cosine_similarity(predict_001, predict_002)) # SAME SPEAKER [0.81564593]
print('DIFF SPEAKER', batch_cosine_similarity(predict_001, predict_003)) # DIFF SPEAKER [0.1419204]
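Building on the comparison above, embeddings can also be used for speaker identification: compare a query embedding against one reference embedding per enrolled speaker and pick the closest. The sketch below uses 3-dimensional toy vectors in place of real (1, 512) model outputs. The identify function, the enrolled names and the 0.5 acceptance threshold are all assumptions for illustration; a real threshold should be tuned on held-out data.

```python
import numpy as np


def unit(x):
    """Flatten an embedding (e.g. shape (1, 512)) and scale it to unit length."""
    x = np.asarray(x, dtype=np.float64).reshape(-1)
    return x / np.linalg.norm(x)


def identify(query, enrolled, threshold=0.5):
    """Return (name, score) of the closest enrolled speaker.

    The name is None when even the best cosine similarity is below threshold,
    meaning the query is treated as an unknown speaker.
    """
    q = unit(query)
    scores = {name: float(np.dot(q, unit(emb))) for name, emb in enrolled.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


# Toy reference embeddings (hypothetical speakers).
enrolled = {
    'alice': [1.0, 0.0, 0.0],
    'bob': [0.0, 1.0, 0.0],
}

print(identify([0.9, 0.1, 0.0], enrolled))  # closest to 'alice'
print(identify([0.0, 0.0, 1.0], enrolled))  # matches nobody: name is None
```

With the real model, the values in enrolled would be outputs of model.m.predict, for example averaged over several utterances per speaker.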

Commands to reproduce the test results after the training

$ export CUDA_VISIBLE_DEVICES=0; python cli.py test-model --working_dir ~/.deep-speaker-wd/triplet-training/ --checkpoint_file checkpoints-softmax/ResCNN_checkpoint_102.h5
f-measure = 0.789, true positive rate = 0.733, accuracy = 0.996, equal error rate = 0.043

$ export CUDA_VISIBLE_DEVICES=0; python cli.py test-model --working_dir ~/.deep-speaker-wd/triplet-training/ --checkpoint_file checkpoints-triplets/ResCNN_checkpoint_265.h5
f-measure = 0.849, true positive rate = 0.798, accuracy = 0.997, equal error rate = 0.025
