syhw / Wer_are_we

An attempt at tracking the state of the art and recent results (bibliography) on speech recognition.

wer_are_we

WER are we? An attempt at tracking states of the art(s) and recent results on speech recognition. Feel free to correct! (Inspired by Are we there yet?)

WER

LibriSpeech

(Possibly trained on more data than LibriSpeech.)

| WER test-clean | WER test-other | Paper | Published | Notes |
| --- | --- | --- | --- | --- |
| 5.83% | 12.69% | Humans (Deep Speech 2: End-to-End Speech Recognition in English and Mandarin) | December 2015 | Humans |
| 1.8% | 2.9% | HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | June 2021 | CNN-Transformer + Transformer LM (Self-Supervised, Libri-light-60K Unlabeled Data) |
| 1.9% | 3.9% | Conformer: Convolution-augmented Transformer for Speech Recognition | May 2020 | Convolution-augmented-Transformer(Conformer) + 3-layer LSTM LM (data augmentation:SpecAugment) |
| 1.9% | 4.1% | ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context | May 2020 | CNN-RNN-Transducer(ContextNet) + 3-layer LSTM LM (data augmentation:SpecAugment) |
| 2.0% | 4.1% | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | November 2019 | Conv+Transformer AM (10k word pieces) with ConvLM decoding and Transformer rescoring + 60k hours unlabeled |
| 2.3% | 4.9% | Transformer-based Acoustic Modeling for Hybrid Speech Recognition | October 2019 | Transformer AM (chenones) + 4-gram LM + Neural LM rescore (data augmentation:Speed perturbation and SpecAugment) |
| 2.3% | 5.0% | RWTH ASR Systems for LibriSpeech: Hybrid vs Attention | September 2019, Interspeech | HMM-DNN + lattice-based sMBR + LSTM LM + Transformer LM rescoring (no data augmentation) |
| 2.3% | 5.2% | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | November 2019 | Conv+Transformer AM (10k word pieces) with ConvLM decoding and Transformer rescoring |
| 2.2% | 5.8% | State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions | October 2019 | Multi-stream self-attention in hybrid ASR + 4-gram LM + Neural LM rescore (no data augmentation) |
| 2.5% | 5.8% | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | April 2019 | Listen Attend Spell |
| 3.2% | 7.6% | From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition | October 2019 | LC-BLSTM AM (chenones) + 4-gram LM (data augmentation:Speed perturbation and SpecAugment) |
| 3.19% | 7.64% | The CAPIO 2017 Conversational Speech Recognition System | April 2018 | TDNN + TDNN-LSTM + CNN-bLSTM + Dense TDNN-LSTM across two kinds of trees + N-gram LM + Neural LM rescore |
| 2.44% | 8.29% | Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System | September 2019, Interspeech | encoder-attention-decoder + Transformer LM |
| 3.80% | 8.76% | Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks | Interspeech, Sept 2018 | Kaldi recipe, 17-layer TDNN-F + iVectors |
| 2.8% | 9.3% | RWTH ASR Systems for LibriSpeech: Hybrid vs Attention | September 2019, Interspeech | encoder-attention-decoder + BPE + Transformer LM (no data augmentation) |
| 3.26% | 10.47% | Fully Convolutional Speech Recognition | December 2018 | End-to-end CNN on the waveform + conv LM |
| 3.82% | 12.76% | Improved training of end-to-end attention models for speech recognition | Interspeech, Sept 2018 | encoder-attention-decoder end-to-end model |
| 4.28% | | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations |
| 4.83% | | A time delay neural network architecture for efficient modeling of long temporal contexts | 2015 | HMM-TDNN + iVectors |
| 5.15% | 12.73% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 100M parameters trained on 11940h |
| 5.51% | 13.97% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | 2015 | HMM-DNN + pNorm* |
| 4.8% | 14.5% | Letter-Based Speech Recognition with Gated ConvNets | December 2017 | (Gated) ConvNet for AM going to letters + 4-gram LM |
| 8.01% | 22.49% | same, Kaldi | 2015 | HMM-(SAT)GMM |
| | 12.51% | Audio Augmentation for Speech Recognition | 2015 | TDNN + pNorm + speed up/down speech |

WSJ

(Possibly trained on more data than WSJ.)

| WER eval'92 | WER eval'93 | Paper | Published | Notes |
| --- | --- | --- | --- | --- |
| 5.03% | 8.08% | Humans (Deep Speech 2: End-to-End Speech Recognition in English and Mandarin) | December 2015 | Humans |
| 2.9% | | End-to-end Speech Recognition Using Lattice-Free MMI | September 2018 | HMM-DNN LF-MMI trained (biphone) |
| 3.10% | | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 100M parameters |
| 3.47% | | Deep Recurrent Neural Networks for Acoustic Modelling | April 2015 | TC-DNN-BLSTM-DNN |
| 3.5% | 6.8% | Fully Convolutional Speech Recognition | December 2018 | End-to-end CNN on the waveform + conv LM |
| 3.63% | 5.66% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | 2015 | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* |
| 4.1% | | End-to-end Speech Recognition Using Lattice-Free MMI | September 2018 | HMM-DNN E2E LF-MMI trained (word n-gram) |
| 5.6% | | Convolutional Neural Networks-based Continuous Speech Recognition using Raw Speech Signal | 2014 | CNN over RAW speech (wav) |
| 5.7% | 8.7% | End-to-end Speech Recognition from the Raw Waveform | June 2018 | End-to-end CNN on the waveform |

Hub5'00 Evaluation (Switchboard / CallHome)

(Possibly trained on more data than SWB, but test set = full Hub5'00.)

| WER (SWB) | WER (CH) | Paper | Published | Notes |
| --- | --- | --- | --- | --- |
| 4.9% | 9.5% | An investigation of phone-based subword units for end-to-end speech recognition | April 2020 | 2 CNN + 24 layers Transformer encoder and 12 layers Transformer decoder model with char BPE and phoneme BPE units. |
| 5.0% | 9.1% | The CAPIO 2017 Conversational Speech Recognition System | December 2017 | 2 Dense LSTMs + 3 CNN-bLSTMs across 3 phonesets from previous Capio paper & AM adaptation using parameter averaging (5.6% SWB / 10.5% CH single systems) |
| 5.1% | 9.9% | Language Modeling with Highway LSTM | September 2017 | HW-LSTM LM trained with Switchboard+Fisher+Gigaword+Broadcast News+Conversations, AM from previous IBM paper |
| 5.1% | | The Microsoft 2017 Conversational Speech Recognition System | August 2017 | ~2016 system + character-based dialog session aware (turns of speech) LSTM LM |
| 5.3% | 10.1% | Deep Learning-based Telephony Speech Recognition in the Wild | August 2017 | Ensemble of 3 CNN-bLSTM (5.7% SWB / 11.3% CH single systems) |
| 5.5% | 10.3% | English Conversational Telephone Speech Recognition by Humans and Machines | March 2017 | ResNet + BiLSTMs acoustic model, with 40d FMLLR + i-Vector inputs, trained on SWB+Fisher+CH, n-gram + model-M + LSTM + Strided (à trous) convs-based LM trained on Switchboard+Fisher+Gigaword+Broadcast |
| 6.3% | 11.9% | The Microsoft 2016 Conversational Speech Recognition System | September 2016 | VGG/Resnet/LACE/BiLSTM acoustic model trained on SWB+Fisher+CH, N-gram + RNNLM language model trained on Switchboard+Fisher+Gigaword+Broadcast |
| 6.3% | 13.3% | An investigation of phone-based subword units for end-to-end speech recognition | April 2020 | 2 CNN + 24 layers Transformer encoder and 12 layers Transformer decoder model with char BPE and phoneme BPE units. Trained only on SWBD 300 hours. |
| 6.6% | 12.2% | The IBM 2016 English Conversational Telephone Speech Recognition System | June 2016 | RNN + VGG + LSTM acoustic model trained on SWB+Fisher+CH, N-gram + "model M" + NNLM language model |
| 6.8% | 14.1% | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | April 2019 | Listen Attend Spell |
| 8.5% | 13% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-BLSTM trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher |
| 9.2% | 13.3% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + Fisher (10% / 15.1% respectively trained on SWBD only) |
| 12.6% | 16% | Deep Speech: Scaling up end-to-end speech recognition | December 2014 | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB |
| 11% | 17.1% | A time delay neural network architecture for efficient modeling of long temporal contexts | 2015 | HMM-TDNN + iVectors |
| 12.6% | 18.4% | Sequence-discriminative training of deep neural networks | 2013 | HMM-DNN +sMBR |
| 12.9% | 19.3% | Audio Augmentation for Speech Recognition | 2015 | HMM-TDNN + pNorm + speed up/down speech |
| 15% | 19.1% | Building DNN Acoustic Models for Large Vocabulary Speech Recognition | June 2014 | DNN + Dropout |
| 10.4% | | Joint Training of Convolutional and Non-Convolutional Neural Networks | 2014 | CNN on MFSC/fbanks + 1 non-conv layer for FMLLR/I-Vectors concatenated in a DNN |
| 11.5% | | Deep Convolutional Neural Networks for LVCSR | 2013 | CNN |
| 12.2% | | Very Deep Multilingual Convolutional Neural Networks for LVCSR | September 2015 | Deep CNN (10 conv, 4 FC layers), multi-scale feature maps |
| 11.8% | 25.7% | Improved training of end-to-end attention models for speech recognition | Interspeech, Sept 2018 | encoder-attention-decoder end-to-end model, trained on 300h SWB |

Rich Transcriptions

WER RT-02 WER RT-03 WER RT-04 Paper Published Notes
8.1% 8.0% The CAPIO 2017 Conversational Speech Recognition System April 2018 2 Dense LSTMs + 3 CNN-bLSTMs across 3 phonesets from previous Capio paper & AM adaptation using parameter averaging
8.2% 8.1% 7.7% Language Modeling with Highway LSTM September 2017 HW-LSTM LM trained with Switchboard+Fisher+Gigaword+Broadcast News+Conversations, AM from previous IBM paper
8.3% 8.0% 7.7% English Conversational Telephone Speech Recognition by Humans and Machines March 2017 ResNet + BiLSTMs acoustic model, with 40d FMLLR + i-Vector inputs, trained on SWB+Fisher+CH, n-gram + model-M + LSTM + Strided (à trous) convs-based LM trained on Switchboard+Fisher+Gigaword+Broadcast

Fisher (RT03S FSH)

| WER | Paper | Published | Notes |
| --- | --- | --- | --- |
| 9.6% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-BLSTM trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + SWBD |
| 9.8% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations + SWBD |

TED-LIUM

| WER Test | Paper | Published | Notes |
| --- | --- | --- | --- |
| 5.6% | The RWTH ASR System for TED-LIUM release 2: Improving Hybrid HMM with SpecAugment | April 2020 | HMM-BLSTM + iVectors + SpecAugment + sMBR + Transformer LM |
| 6.5% | The CAPIO 2017 Conversational Speech Recognition System | April 2018 | TDNN + TDNN-LSTM + CNN-bLSTM + Dense TDNN-LSTM across two kinds of trees |
| 11.2% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with LF-MMI + data augmentation (speed perturbation) + iVectors + 3 regularizations |
| 15.3% | TED-LIUM: an Automatic Speech Recognition dedicated corpus | May 2014 | Multi-layer perceptron (MLP) with bottle-neck feature extraction |

CHiME6 (multiarray noisy speech)

| WER (fixed LM) | WER (unlimited LM) | Paper | Published | Notes |
| --- | --- | --- | --- | --- |
| 31.0% | 30.5% | The USTC-NELSLIP Systems for CHiME-6 Challenge | May 2020 | WPE + SSA + GSS + Data Augment (Speed, Volume) + SpecAugment + 8 AMs fusion (2 Single-feature AM + 6 Multi-feature AM) |
| 35.1% | 34.5% | The IOA Systems for CHiME-6 Challenge | May 2020 | WPE + multi-stage GSS + SpecAugment + Data Augment (Noise, Reverberation, Speed) + 3 AMs fusion (CNN-TDNNF / CNN-TDNN-BLSTM / CNN-BLSTM) |
| 35.8% | 33.9% | The STC System for the CHiME-6 Challenge | May 2020 | WPE + GSS + SpecAugment + 3 AMs fusion (2 TDNN-F / CNN-TDNNF + stats + SpecAugment + self-attention + sMBR) + MBR Decoding |
| 51.3% | 51.3% | CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings | May 2020 | WPE + GSS + Data Augment (Noise, Reverberation, Speed) + TDNNF |

CHiME (noisy speech)

| clean | real | sim | Paper | Published | Notes |
| --- | --- | --- | --- | --- | --- |
| 3.34% | 21.79% | 45.05% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | 9-layer model w/ 2 layers of 2D-invariant convolution & 7 recurrent layers, w/ 68M parameters |
| 6.30% | 67.94% | 80.27% | Deep Speech: Scaling up end-to-end speech recognition | December 2014 | CNN + Bi-RNN + CTC (speech to letters) |

TODO

PER

TIMIT

(So far, all results trained on TIMIT and tested on the core test set.)

| PER | Paper | Published | Notes |
| --- | --- | --- | --- |
| 12.9% | Instantaneous Frequency Filter-Bank Features for Low Resource Speech Recognition Using Deep Recurrent Architectures | September 2021 | Li-GRU with FMLLR + IFFB + FBANK + IFFB-FMLLR features |
| 13.8% | The Pytorch-Kaldi Speech Recognition Toolkit | February 2019 | MLP+Li-GRU+MLP on MFCC+FBANK+fMLLR. Silence phones are removed from reference and hypothesis transcripts! |
| 14.9% | Light Gated Recurrent Units for Speech Recognition | March 2018 | Removing the reset gate in GRU, using ReLU activation instead of tanh and batch normalization |
| 16.5% | Phone recognition with hierarchical convolutional deep maxout networks | September 2015 | Hierarchical maxout CNN + Dropout |
| 16.5% | A Regularization Post Layer: An Additional Way how to Make Deep Neural Networks Robust | 2017 | DBN with last layer regularization |
| 16.7% | Combining Time- and Frequency-Domain Convolution in Convolutional Neural Network-Based Phone Recognition | 2014 | CNN in time and frequency + dropout, 17.6% w/o dropout |
| 16.8% | An investigation into instantaneous frequency estimation methods for improved speech recognition features | November 2017 | DNN-HMM with MFCC + IFCC features |
| 17.3% | Segmental Recurrent Neural Networks for End-to-end Speech Recognition | March 2016 | RNN-CRF on 24(x3) MFSC |
| 17.6% | Attention-Based Models for Speech Recognition | June 2015 | Bi-RNN + Attention |
| 17.7% | Speech Recognition with Deep Recurrent Neural Networks | March 2013 | Bi-LSTM + skip connections w/ RNN transducer (18.4% with CTC only) |
| 18.0% | Learning Filterbanks from Raw Speech for Phone Recognition | October 2017 | Complex ConvNets on raw speech w/ mel-fbanks init |
| 18.8% | Wavenet: A Generative Model For Raw Audio | September 2016 | Wavenet architecture with mean pooling layer after residual block + few non-causal conv layers |
| 23% | Deep Belief Networks for Phone Recognition | 2009 | (first, modern) HMM-DBN |

LM

TODO

Noise-robust ASR

TODO

BigCorp™®-specific dataset

TODO?

Lexicon

  • WER: word error rate
  • PER: phone error rate
  • LM: language model
  • HMM: hidden Markov model
  • GMM: Gaussian mixture model
  • DNN: deep neural network
  • CNN: convolutional neural network
  • DBN: deep belief network (RBM-based DNN)
  • TDNN-F: a factored form of time delay neural networks (TDNN)
  • RNN: recurrent neural network
  • LSTM: long short-term memory
  • CTC: connectionist temporal classification
  • MMI: maximum mutual information
  • MPE: minimum phone error
  • sMBR: state-level minimum Bayes risk
  • SAT: speaker adaptive training
  • MLLR: maximum likelihood linear regression
  • FMLLR: feature-space maximum likelihood linear regression
  • LDA: (in this context) linear discriminant analysis
  • MFCC: Mel frequency cepstral coefficients
  • FB/FBANKS/MFSC: Mel frequency spectral coefficients
  • IFCC: instantaneous frequency cosine coefficients (https://github.com/siplabiith/IFCC-Feature-Extraction)
  • IFFB: instantaneous frequency filter-bank features
  • VGG: very deep convolutional neural networks from the Visual Geometry Group; the VGG architecture is a block of 2 {3x3 convolutions} followed by 1 pooling layer, repeated
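
For readers new to the metrics in the lexicon above, here is a minimal sketch of how WER and PER are commonly computed: the edit distance (substitutions + deletions + insertions) between reference and hypothesis tokens, divided by the number of reference tokens. The function names (`edit_distance`, `error_rate`, `wer`) are illustrative assumptions, not taken from any paper or toolkit listed here, and the papers above may apply their own text normalization or scoring tools before counting errors.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] = distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning ref[:i] with an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length (WER on words, PER on phones)."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate for two whitespace-tokenized transcripts (hypothetical helper)."""
    return error_rate(reference.split(), hypothesis.split())


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.2%}")  # 33.33%
```

Passing phone sequences to `error_rate` gives PER in the same way. Reported results are usually corpus-level: errors are summed over all utterances and divided by the total number of reference tokens, rather than averaging per-utterance rates.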