All Projects → cvqluu → Factorized Tdnn

cvqluu / Factorized Tdnn

Licence: mit
PyTorch implementation of the Factorized TDNN (TDNN-F) from "Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks" and Kaldi

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Factorized Tdnn

Deepspeech
DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Stars: ✭ 18,680 (+18961.22%)
Mutual labels:  neural-networks, speech-recognition
Awesome Kaldi
This is a list of features, scripts, blogs and resources for better using Kaldi ( http://kaldi-asr.org/ )
Stars: ✭ 393 (+301.02%)
Mutual labels:  speech-recognition, kaldi
Brevitas
Brevitas: quantization-aware training in PyTorch
Stars: ✭ 343 (+250%)
Mutual labels:  neural-networks, speech-recognition
speech-to-text
mixlingual speech recognition system; hybrid (GMM+NNet) model; Kaldi + Keras
Stars: ✭ 61 (-37.76%)
Mutual labels:  speech-recognition, kaldi
Espresso
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit
Stars: ✭ 808 (+724.49%)
Mutual labels:  speech-recognition, kaldi
Vosk Android Demo
Offline speech recognition for Android with Vosk library.
Stars: ✭ 271 (+176.53%)
Mutual labels:  speech-recognition, kaldi
Zamia Speech
Open tools and data for cloudless automatic speech recognition
Stars: ✭ 374 (+281.63%)
Mutual labels:  speech-recognition, kaldi
rustfst
Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.
Stars: ✭ 104 (+6.12%)
Mutual labels:  speech-recognition, kaldi
Sincnet
SincNet is a neural architecture for efficiently processing raw audio samples.
Stars: ✭ 764 (+679.59%)
Mutual labels:  neural-networks, speech-recognition
Pykaldi
A Python wrapper for Kaldi
Stars: ✭ 756 (+671.43%)
Mutual labels:  speech-recognition, kaldi
vosk-model-ru-adaptation
No description or website provided.
Stars: ✭ 19 (-80.61%)
Mutual labels:  speech-recognition, kaldi
Dragonfire
the open-source virtual assistant for Ubuntu based Linux distributions
Stars: ✭ 1,120 (+1042.86%)
Mutual labels:  speech-recognition, kaldi
srvk-eesen-offline-transcriber
Top level code to transcribe English audio/video files into text/subtitles
Stars: ✭ 22 (-77.55%)
Mutual labels:  speech-recognition, kaldi
Vosk Server
WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries
Stars: ✭ 277 (+182.65%)
Mutual labels:  speech-recognition, kaldi
kaldi-long-audio-alignment
Long audio alignment using Kaldi
Stars: ✭ 21 (-78.57%)
Mutual labels:  speech-recognition, kaldi
Espnet
End-to-End Speech Processing Toolkit
Stars: ✭ 4,533 (+4525.51%)
Mutual labels:  speech-recognition, kaldi
Speechbrain.github.io
The SpeechBrain project aims to build a novel speech toolkit fully based on PyTorch. With SpeechBrain users can easily create speech processing systems, ranging from speech recognition (both HMM/DNN and end-to-end), speaker recognition, speech enhancement, speech separation, multi-microphone speech processing, and many others.
Stars: ✭ 242 (+146.94%)
Mutual labels:  neural-networks, speech-recognition
kaldi ag training
Docker image and scripts for training finetuned or completely personal Kaldi speech models. Particularly for use with kaldi-active-grammar.
Stars: ✭ 14 (-85.71%)
Mutual labels:  speech-recognition, kaldi
Eesen
The official repository of the Eesen project
Stars: ✭ 738 (+653.06%)
Mutual labels:  speech-recognition, kaldi
Kur
Descriptive Deep Learning
Stars: ✭ 811 (+727.55%)
Mutual labels:  neural-networks, speech-recognition

Factorized-TDNN

PyTorch implementation of the Factorized TDNN (TDNN-F) from "Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks"[1]. This is also known as TDNN-F in nnet3 of Kaldi.

model_fig Taken from [1]

A TDNN-F layer is implemented in the class FTDNNLayer of models.py. To be specific to the description in [1], it is an implementation of the "3-stage splicing" implementation, in which three convolutions are used in sequence, with the first two being constrained to be semi-orthogonal. These convolutions are followed by a ReLU and then BatchNorm layer. The semi-orthogonal constraint is the "floating case" in [1]. (TODO: implement the scaled case like in Kaldi)

Usage

FTDNNLayer

This FTDNNLayer of models.py is used as follows:

import torch
from models import FTDNNLayer, SOrthConv

tdnn_f = FTDNNLayer(1280, 512, 256, context_size=2, dilations=[2,2,2], paddings=[1,1,1])
# This is a sequence of three 2x1 convolutions
# dimensions go from 1280 -> 256 -> 256 -> 512
# dilations and paddings handles how much to dilate and pad each convolution
# Having these configurable is to ensure the sequence length stays the same

test_input = torch.rand(5, 100, 1280)
# inputs to the FTDNNLayer must be (batch_size, seq_len, in_dim)

tdnn_f(test_input).shape # returns (5, 100, 512)

tdnn_f.step_semi_orth() # The key method to constrain the first two convolutions, perform after every SGD step

tdnn_f.orth_error() # This returns the orth error of the constrained convs, useful for debugging

SOrthConv

The components of FTDNNLayer which have the semi-orthogonal constraint are based around the class SOrthConv, which is essentially a nn.Conv1d with a .step_semi_orth() method to perform the semi-orthogonal update as in [1].

sorth_conv = SOrthConv(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, padding_mode='zeros')

The implementation of the .step_semi_orth() method has been made to be as close to ConstrainOrthonormalInternal from nnet-utils.cc in Kaldi's nnet3 module.

Extras

Also included in this repo in models.py is the following:

  • FTDNN: Factorized TDNN x-vector architecture (FTDNN) up to the embedding layer seen in "State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations"[2]. (This is not EXACTLY the same, but should be close enough).
  • SharedDimScaleDropout: The shared dimension scaled dropout described in [1] and in Kaldi:
    • Instead of randomly setting inputs to 0, use a continuous dropout scale.
    • For a dropout 'strength' alpha, multiply inputs inputs by a mask sampled from the uniform distribution on the interval [1 - 2 * alpha, 1 + 2 * alpha].
    • Share dropout masks along a dimension, such as time. From [1]: "If, for instance, a dimension is zeroed on a particular frame it will be zeroed on all frames of that sequence".

model_fig The FTDNN x-vector architecture description taken from [2]. Up until layer 12 is implemented in FTDNN in models.py.

Demo [WIP]

An demonstration of the FTDNN model being trained can be seen in the following output log (code not included, TODO: basic experiment demo):

exp/sp_ftdnn_bl: Wed Nov 20 14:21:15 2019: [10/120000]   C-Loss:21.9116, AvgLoss:21.6991, lr: 0.2, bs: 400
Orth error: 22.44341427081963
exp/sp_ftdnn_bl: Wed Nov 20 14:21:29 2019: [20/120000]   C-Loss:21.6260, AvgLoss:21.7459, lr: 0.2, bs: 400
Orth error: 8.235212338215206
exp/sp_ftdnn_bl: Wed Nov 20 14:21:43 2019: [30/120000]   C-Loss:21.7663, AvgLoss:21.7525, lr: 0.2, bs: 400
Orth error: 1.2611256236341433
exp/sp_ftdnn_bl: Wed Nov 20 14:21:56 2019: [40/120000]   C-Loss:21.6153, AvgLoss:21.6527, lr: 0.2, bs: 400
Orth error: 0.005309408872562926
exp/sp_ftdnn_bl: Wed Nov 20 14:22:14 2019: [50/120000]   C-Loss:21.0997, AvgLoss:21.5722, lr: 0.2, bs: 400
Orth error: 0.005543942232179688
exp/sp_ftdnn_bl: Wed Nov 20 14:22:26 2019: [60/120000]   C-Loss:21.2629, AvgLoss:21.5222, lr: 0.2, bs: 400
Orth error: 0.004769200691953301
exp/sp_ftdnn_bl: Wed Nov 20 14:22:40 2019: [70/120000]   C-Loss:20.9551, AvgLoss:21.4158, lr: 0.2, bs: 400
Orth error: 0.006055477493646322
exp/sp_ftdnn_bl: Wed Nov 20 14:22:56 2019: [80/120000]   C-Loss:20.4425, AvgLoss:21.3274, lr: 0.2, bs: 400
Orth error: 0.009634702852054033
exp/sp_ftdnn_bl: Wed Nov 20 14:23:09 2019: [90/120000]   C-Loss:21.0025, AvgLoss:21.2727, lr: 0.2, bs: 400
Orth error: 0.00611297079740325
exp/sp_ftdnn_bl: Wed Nov 20 14:23:25 2019: [100/120000]          C-Loss:20.6145, AvgLoss:21.1736, lr: 0.2, bs: 400
Orth error: 0.008151484609697945
exp/sp_ftdnn_bl: Wed Nov 20 14:23:38 2019: [110/120000]          C-Loss:20.1985, AvgLoss:21.0890, lr: 0.2, bs: 400
Orth error: 0.0072971017434610985
exp/sp_ftdnn_bl: Wed Nov 20 14:23:53 2019: [120/120000]          C-Loss:20.5698, AvgLoss:21.0300, lr: 0.2, bs: 400
Orth error: 0.00629939052669215
exp/sp_ftdnn_bl: Wed Nov 20 14:24:08 2019: [130/120000]          C-Loss:20.2024, AvgLoss:20.9425, lr: 0.2, bs: 400
Orth error: 0.008707787481398555
exp/sp_ftdnn_bl: Wed Nov 20 14:24:21 2019: [140/120000]          C-Loss:19.7034, AvgLoss:20.8641, lr: 0.2, bs: 400
Orth error: 0.010941843771433923
exp/sp_ftdnn_bl: Wed Nov 20 14:24:37 2019: [150/120000]          C-Loss:19.9718, AvgLoss:20.8035, lr: 0.2, bs: 400
Orth error: 0.00768740743296803

The FTDNN x-vector architecture seems to train successfully, and most importantly the Orth error is minimized.

TODOs

  • Implement 'scaled' case of semi-orthogonal constraint
  • Refactor so that seq_len is final dim (or not?)
  • Simple experiment/toy demo

References

[1]
@inproceedings{Povey2018,
  author={Daniel Povey and Gaofeng Cheng and Yiming Wang and Ke Li and Hainan Xu and Mahsa Yarmohammadi and Sanjeev Khudanpur},
  title={Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3743--3747},
  doi={10.21437/Interspeech.2018-1417},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1417}
}
[2]
@article{VILLALBA2020101026,
    title = "State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations",
    journal = "Computer Speech & Language",
    volume = "60",
    pages = "101026",
    year = "2020",
    issn = "0885-2308",
    doi = "https://doi.org/10.1016/j.csl.2019.101026",
    url = "http://www.sciencedirect.com/science/article/pii/S0885230819302700",
    author = "Jesús Villalba and Nanxin Chen and David Snyder and Daniel Garcia-Romero and Alan McCree and Gregory Sell and Jonas Borgstrom and Leibny Paola García-Perera and Fred Richardson and Réda Dehak and Pedro A. Torres-Carrasquillo and Najim Dehak"
}
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].