wq2012 / Awesome Diarization
Licence: apache-2.0
A curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.
Awesome Speaker Diarization
Overview
This is a curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.
The purpose of this repo is to organize the world’s resources for speaker diarization, and make them universally accessible and useful.
To add items to this page, send a pull request (see the contributing guide).
Publications
Special topics
Review & survey papers
- A Review of Speaker Diarization: Recent Advances with Deep Learning, 2021
- A review on speaker diarization systems and approaches, 2012
- Speaker diarization: A review of recent research, 2010
Supervised diarization
- Supervised online diarization with sample mean loss for multi-domain data, 2019
- Discriminative Neural Clustering for Speaker Diarisation, 2019
- End-to-End Neural Speaker Diarization with Permutation-Free Objectives, 2019
- End-to-End Neural Speaker Diarization with Self-attention, 2019
- Fully Supervised Speaker Diarization, 2018
Joint diarization and ASR
- Joint Speech Recognition and Speaker Diarization via Sequence Transduction, 2019
- Says who? Deep learning models for joint speech recognition, segmentation and diarization, 2018
Challenges
- Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge, 2018
- ODESSA at Albayzin Speaker Diarization Challenge 2018, 2018
- Joint Discriminative Embedding Learning, Speech Activity and Overlap Detection for the DIHARD Challenge, 2018
Other
2020
- Online Speaker Diarization with Relation Network
- An End-to-End Speaker Diarization Service for improving Multimedia Content Access
- Spot the conversation: speaker diarisation in the wild
- Speaker Diarization with Region Proposal Network
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
2019
- Overlap-aware diarization: resegmentation using neural end-to-end overlapped speech detection
- Speaker diarization using latent space clustering in generative adversarial network
- A study of semi-supervised speaker diarization system using gan mixture model
- Learning deep representations by multilayer bootstrap networks for speaker diarization
- Enhancements for Audio-only Diarization Systems
- LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization
- Meeting Transcription Using Virtual Microphone Arrays
- Speaker diarisation using 2D self-attentive combination of embeddings
- Speaker Diarization with Lexical Information
2018
- Neural speech turn segmentation and affinity propagation for speaker diarization
- Multimodal Speaker Segmentation and Diarization using Lexical and Acoustic Cues via Sequence to Sequence Neural Networks
- Joint Speaker Diarization and Recognition Using Convolutional and Recurrent Neural Networks
2017
- Speaker Diarization with LSTM
- Speaker diarization using deep neural network embeddings
- Speaker diarization using convolutional neural network for statistics accumulation refinement
- pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems
- Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks
- Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings
2016
2015
2014
- A study of the cosine distance-based mean shift for telephone speech diarization
- Speaker diarization with PLDA i-vector scoring and unsupervised calibration
- Artificial neural network features for speaker diarization
2013
2011
- PLDA-based Clustering for Speaker Diarization of Broadcast Streams
- Speaker diarization of meetings based on speaker role n-gram models
2009
2008
2006
- An overview of automatic speaker diarization systems
- A spectral clustering approach to speaker diarization
Software
Framework
Link | Language | Description |
---|---|---|
SpeechBrain | Python & PyTorch | SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch. |
SIDEKIT for diarization (s4d) | Python | An open source package extension of SIDEKIT for Speaker diarization. |
pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
AaltoASR | Python & Perl | Speaker diarization scripts, based on AaltoASR. |
LIUM SpkDiarization | Java | LIUM_SpkDiarization is software dedicated to speaker diarization (i.e. speaker segmentation and clustering). It is written in Java, and includes the most recent developments in the domain (as of 2013). |
kaldi-asr | Bash | Example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. |
Alize LIA_SpkSeg | C++ | ALIZÉ is an open-source platform for speaker recognition. LIA_SpkSeg is its set of tools for speaker diarization. |
pyannote-audio | Python | Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding. |
pyBK | Python | Speaker diarization using binary key speaker modelling. Computationally light solution that does not require external training data. |
Speaker-Diarization | Python | Speaker diarization using uis-rnn and GhostVLAD. An easier way to support open-set speakers. |
EEND | Python & Bash & Perl | End-to-End Neural Diarization. |
VBDiarization | Python | Speaker diarization based on Kaldi x-vectors using pretrained model trained in Kaldi (kaldi-asr/kaldi) and converted to ONNX format (onnx/onnx) running in ONNXRuntime (Microsoft/onnxruntime). |
RE-VERB | Python & JavaScript | RE: VERB is a speaker diarization system that lets the user send or record audio of a conversation and receive timestamps of who spoke when. |
Evaluation
Link | Language | Description |
---|---|---|
pyannote-metrics | Python | A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. |
SimpleDER | Python | A lightweight library to compute Diarization Error Rate (DER). |
NIST md-eval | Perl | (1) modified md-eval.pl from Mary Tai Knox; (2) md-eval-v21.pl from jitendra; (3) md-eval-22.pl from nryant |
dscore | Python & Perl | Diarization scoring tools. |
Sequence Match Accuracy | Python | Match the accuracy of two sequences with Hungarian algorithm. |
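Speaker labels produced by a diarization system are arbitrary, so scoring tools such as those above first have to find the best one-to-one mapping between hypothesis and reference labels. The following is a minimal sketch of that idea, not the code of any listed library. The function name `sequence_match_accuracy` is hypothetical, and the sketch replaces the Hungarian algorithm with a brute-force search over permutations, which is only practical for a handful of speakers.

```python
from itertools import permutations


def sequence_match_accuracy(reference, hypothesis):
    """Best-case agreement between two equal-length label sequences.

    Hypothesis labels are remapped one-to-one onto reference labels in the
    way that maximises the number of matching positions.
    Assumption: None is never used as a label.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    if not reference:
        return 1.0

    # Unique labels, in order of first appearance.
    ref_labels = list(dict.fromkeys(reference))
    hyp_labels = list(dict.fromkeys(hypothesis))

    # Pad the reference labels with None so that every hypothesis label
    # can be assigned something, even when it has no real counterpart.
    size = max(len(ref_labels), len(hyp_labels))
    ref_padded = ref_labels + [None] * (size - len(ref_labels))

    best = 0
    # Brute force over all one-to-one mappings (stand-in for Hungarian).
    for perm in permutations(ref_padded, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        matches = sum(mapping[h] == r for r, h in zip(reference, hypothesis))
        best = max(best, matches)
    return best / len(reference)


if __name__ == "__main__":
    ref = ["A", "A", "B", "B", "C"]
    hyp = ["x", "x", "y", "y", "y"]
    print(sequence_match_accuracy(ref, hyp))  # 0.8
```

For more than a few speakers, an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual replacement for the permutation loop.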
Clustering
Link | Language | Description |
---|---|---|
uis-rnn | Python & PyTorch | Google's Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised. |
uis-rnn-sml | Python & PyTorch | A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data. |
DNC | Python & ESPnet | Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised. |
SpectralCluster | Python | Spectral clustering with affinity matrix refinement operations. |
sklearn.cluster | Python | scikit-learn clustering algorithms. |
PLDA | Python | Probabilistic Linear Discriminant Analysis & classification, written in Python. |
PLDA | C++ | Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis). |
Auto-Tuning Spectral Clustering | Python | Auto-tuning Spectral Clustering method that does not need development set or supervised tuning. |
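In an unsupervised diarization pipeline, the clustering step groups per-segment speaker embeddings so that each cluster corresponds to one speaker. The sketch below illustrates that step with plain average-linkage agglomerative clustering on cosine similarity. It is an illustration only: it is not spectral clustering, UIS-RNN, or DNC, and the function names and the `threshold` default are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def agglomerative_cluster(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering on cosine similarity.

    Returns one cluster id per embedding. Merging stops when no pair of
    clusters has an average similarity of at least `threshold`.
    """
    # Start with every embedding in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        sims = [cosine_similarity(embeddings[i], embeddings[j])
                for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_pair, best_sim = None, threshold
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = linkage(clusters[a], clusters[b])
                if sim >= best_sim:
                    best_pair, best_sim = (a, b), sim
        if best_pair is None:
            break  # no remaining pair is similar enough to merge
        a, b = best_pair
        clusters[a].extend(clusters[b])
        del clusters[b]

    labels = [0] * len(embeddings)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


if __name__ == "__main__":
    # Toy 2-D "embeddings": two segments per speaker.
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(agglomerative_cluster(toy))
```

The pairwise search makes this sketch slow for long recordings; it is meant to show the shape of the problem that the libraries in the table solve more robustly.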
Speaker embedding
Link | Method | Language | Description |
---|---|---|---|
resemble-ai/Resemblyzer | d-vector | Python & PyTorch | PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization. |
Speaker_Verification | d-vector | Python & TensorFlow | Tensorflow implementation of generalized end-to-end loss for speaker verification. |
PyTorch_Speaker_Verification | d-vector | Python & PyTorch | PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al. With UIS-RNN integration. |
Real-Time Voice Cloning | d-vector | Python & PyTorch | Implementation of "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" (SV2TTS) with a vocoder that works in real-time. |
deep-speaker | d-vector | Python & Keras | Third party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System. |
x-vector-kaldi-tf | x-vector | Python & TensorFlow & Perl | Tensorflow implementation of x-vector topology on top of Kaldi recipe. |
kaldi-ivector | i-vector | C++ & Perl | Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure. |
voxceleb-ivector | i-vector | Perl | Voxceleb1 i-vector based speaker recognition system. |
pytorch_xvectors | x-vector | Python & PyTorch | PyTorch implementation of Voxceleb x-vectors. Additionally, includes meta-learning architectures for embedding training. Evaluated with speaker diarization and speaker verification. |
Speaker change detection
Link | Language | Description |
---|---|---|
change_detection | Python & Keras | Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks. |
Audio feature extraction
Link | Language | Description |
---|---|---|
LibROSA | Python | Python library for audio and music analysis. https://librosa.github.io/ |
python_speech_features | Python | This library provides common speech features for ASR including MFCCs and filterbank energies. https://python-speech-features.readthedocs.io/en/latest/ |
pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
Audio data augmentation
Link | Language | Description |
---|---|---|
pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios. https://pyroomacoustics.readthedocs.io |
gpuRIR | Python | Python library for Room Impulse Response (RIR) simulation with GPU acceleration |
rir_simulator_python | Python | Room impulse response simulator using python |
Other software
Link | Language | Description |
---|---|---|
VB Diarization | Python | VB Diarization with Eigenvoice and HMM Priors. |
Datasets
Diarization datasets
Audio | Diarization ground truth | Language | Pricing | Additional information |
---|---|---|---|---|
2000 NIST Speaker Recognition Evaluation | Disk-6 (Switchboard), Disk-8 (CALLHOME) | Multiple | $2400.00 | Evaluation Plan |
2003 NIST Rich Transcription Evaluation Data | Together with audios | en, ar, zh | $2000.00 | telephone speech, broadcast news |
CALLHOME American English Speech | CALLHOME American English Transcripts | en | $1500.00 + $1000.00 | CH109 whitelist |
The ICSI Meeting Corpus | Together with audios | en | Free | License |
The AMI Meeting Corpus | Together with audios (need to be processed) | Multiple | Free | License |
Fisher English Training Speech Part 1 Speech | Fisher English Training Speech Part 1 Transcripts | en | $7000.00 + $1000.00 | |
Fisher English Training Part 2, Speech | Fisher English Training Part 2, Transcripts | en | $7000.00 + $1000.00 | |
VoxConverse | TBD | TBD | Free | VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos |
Speaker embedding training sets
Name | Utterances | Speakers | Language | Pricing | Additional information |
---|---|---|---|---|---|
TIMIT | 6K+ | 630 | en | $250.00 | Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
VCTK | 43K+ | 109 | en | Free | Most sentences were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. |
LibriSpeech | 292K | 2K+ | en | Free | Large-scale (1000 hours) corpus of read English speech. |
Multilingual LibriSpeech (MLS) | ? | ? | en, de, nl, es, fr, it, pt, pl | Free | Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. |
LibriVox | 180K | 9K+ | Multiple | Free | Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long. |
VoxCeleb 1&2 | 1M+ | 7K | Multiple | Free | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
The Spoken Wikipedia Corpora | 5K | 879 | en, de, nl | Free | Volunteer readers reading Wikipedia articles. |
CN-Celeb | 130K+ | 1K | zh | Free | A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University. |
BookTubeSpeech | 8K | 8K | en | Free | Audio samples extracted from BookTube videos - videos where people share their opinions on books - from YouTube. The dataset can be downloaded using BookTubeSpeech-download. |
DeepMine | 540K | 1850 | fa, en | Unknown | A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems. |
NISP-Dataset | ? | 345 | hi, kn, ml, ta, te (all Indian languages) | Free | This dataset contains speech recordings along with speaker physical parameters (height, weight, ...) as well as regional information and linguistic information. |
Augmentation noise sources
Name | Utterances | Pricing | Additional information |
---|---|---|---|
AudioSet | 2M | Free | A large-scale dataset of manually annotated audio events. |
MUSAN | N/A | Free | MUSAN is a corpus of music, speech, and noise recordings. |
Conferences
Conference/Workshop | Frequency | Page Limit | Organization | Blind Review |
---|---|---|---|---|
ICASSP | Annual | 4 + 1 (ref) | IEEE | No |
InterSpeech | Annual | 4 + 1 (ref) | ISCA | No |
Speaker Odyssey | Biennial | 8 + 2 (ref) | ISCA | No |
SLT | Biennial | 6 + 2 (ref) | IEEE | Yes |
ASRU | Biennial | 6 + 2 (ref) | IEEE | Yes |
WASPAA | Biennial | 4 + 1 (ref) | IEEE | No |
Other learning materials
Books
- Voice Identity Techniques: From core algorithms to engineering practice (Chinese) by Quan Wang, 2020
Tech blogs
- Literature Review For Speaker Change Detection by Halil Erdoğan
- Speaker Diarization: Separation of Multiple Speakers in an Audio File by Jaspreet Singh
- Speaker Diarization with Kaldi by Yoav Ramon
- Who spoke when! How to Build your own Speaker Diarization Module by Rahul Saxena
Video tutorials
- pyannote audio: neural building blocks for speaker diarization by Hervé Bredin
- Google's Diarization System: Speaker Diarization with LSTM by Google
- Fully Supervised Speaker Diarization: Say Goodbye to clustering by Google
- Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings by Microsoft Research
- Robust Speaker Diarization for Meetings: the ICSI system by Microsoft Research
- 【机器之心&博文视点】入门声纹技术|第二讲:声纹分割聚类与其他应用 (Introduction to Voiceprint Technology, Lecture 2: Speaker Diarization and Other Applications) by Quan Wang
Products
Company | Product |
---|---|
Google | Google Cloud Speech-to-Text API |
Amazon | Amazon Transcribe |
IBM | Watson Speech To Text API |
DeepAffects | Speaker Diarization API |