alumae / Gst Kaldi Nnet2 Online

Licence: apache-2.0
GStreamer plugin around Kaldi's online neural network decoder

Projects that are alternatives to or similar to Gst Kaldi Nnet2 Online

Allosaurus
Allosaurus is a pretrained universal phone recognizer for more than 2000 languages
Stars: ✭ 135 (-21.05%)
Mutual labels:  speech-recognition
Swiftspeech
A speech recognition framework designed for SwiftUI.
Stars: ✭ 149 (-12.87%)
Mutual labels:  speech-recognition
Kaldiio
A pure python module for reading and writing kaldi ark files
Stars: ✭ 160 (-6.43%)
Mutual labels:  speech-recognition
Aimybox Android Assistant
Embeddable custom voice assistant for Android applications
Stars: ✭ 139 (-18.71%)
Mutual labels:  speech-recognition
Speech Recognition Neural Network
This is the end-to-end Speech Recognition neural network, deployed in Keras. This was my final project for Artificial Intelligence Nanodegree @Udacity.
Stars: ✭ 148 (-13.45%)
Mutual labels:  speech-recognition
Clovacall
ClovaCall dataset and Pytorch LAS baseline code (Interspeech 2020)
Stars: ✭ 151 (-11.7%)
Mutual labels:  speech-recognition
Voice activity detection
Voice Activity Detection based on Deep Learning & TensorFlow
Stars: ✭ 132 (-22.81%)
Mutual labels:  speech-recognition
Pytorch Kaldi
pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.
Stars: ✭ 2,097 (+1126.32%)
Mutual labels:  speech-recognition
Awesome Speech Recognition Speech Synthesis Papers
Automatic Speech Recognition (ASR), Speaker Verification, Speech Synthesis, Text-to-Speech (TTS), Language Modelling, Singing Voice Synthesis (SVS), Voice Conversion (VC)
Stars: ✭ 2,085 (+1119.3%)
Mutual labels:  speech-recognition
Interspeech2019 Tutorial
INTERSPEECH 2019 Tutorial Materials
Stars: ✭ 160 (-6.43%)
Mutual labels:  speech-recognition
Dla
Deep learning for audio processing
Stars: ✭ 142 (-16.96%)
Mutual labels:  speech-recognition
Zzz Retired openstt
RETIRED - OpenSTT is now retired. If you would like more information on Mycroft AI's open source STT projects, please visit:
Stars: ✭ 146 (-14.62%)
Mutual labels:  speech-recognition
Py Kaldi Asr
Some simple wrappers around kaldi-asr intended to make using kaldi's (online) decoders as convenient as possible.
Stars: ✭ 156 (-8.77%)
Mutual labels:  speech-recognition
Go Astideepspeech
Golang bindings for Mozilla's DeepSpeech speech-to-text library
Stars: ✭ 137 (-19.88%)
Mutual labels:  speech-recognition
Hey Jetson
Deep Learning based Automatic Speech Recognition with attention for the Nvidia Jetson.
Stars: ✭ 161 (-5.85%)
Mutual labels:  speech-recognition
Awesome Ai Services
An overview of the AI-as-a-service landscape
Stars: ✭ 133 (-22.22%)
Mutual labels:  speech-recognition
Speech To Text Russian
A project for Russian-language speech recognition based on pykaldi.
Stars: ✭ 151 (-11.7%)
Mutual labels:  speech-recognition
Naomi
The Naomi Project is an open source, technology agnostic platform for developing always-on, voice-controlled applications!
Stars: ✭ 171 (+0%)
Mutual labels:  speech-recognition
Tacotron asr
Speech Recognition Using Tacotron
Stars: ✭ 165 (-3.51%)
Mutual labels:  speech-recognition
Rnnt Speech Recognition
End-to-end speech recognition using RNN Transducers in Tensorflow 2.0
Stars: ✭ 158 (-7.6%)
Mutual labels:  speech-recognition

WHAT IT IS

GStreamer plugin that wraps Kaldi's SingleUtteranceNnet2Decoder. It requires iVector-adapted DNN acoustic models. The iVectors are adapted to the current audio stream automatically.
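
As a rough illustration, the element might be loaded and configured from Python as sketched below; model and word-syms are property names that appear elsewhere in this README, while fst and all file paths are assumptions to verify with gst-inspect-1.0:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# The compiled plugin must be on GST_PLUGIN_PATH
# (see "HOW TO COMPILE IT" below).
asr = Gst.ElementFactory.make("kaldinnet2onlinedecoder", "asr")

# Point the decoder at an iVector-adapted DNN model.
asr.set_property("model", "models/final.mdl")      # acoustic model
asr.set_property("fst", "models/HCLG.fst")         # decode graph (assumed property name)
asr.set_property("word-syms", "models/words.txt")  # word symbol table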

CHANGELOG

2019-10-08: Added online CMVN functionality; requires Kaldi from September 7, 2019 or later. Also refactored N-best list, word alignment, and confidence handling.

2019-08-25: Updated for Kaldi's changed stream-processing API; make sure Kaldi is at least revision 6c816e5.

2017-08-09: Also implemented phone confidence scores, which are included with phone alignments. Note that phone confidence scores are very inaccurate.

2017-08-09: Implemented word confidence scores. They are included with word alignments, so set word-boundary-file=dir/word_boundary.int, since that is what enables word alignment generation.

2017-06-05: Now works with Kaldi 5.2 (the latest at the time of writing) and also requires it. The decoder now also sets test mode for dropout and batch-norm layers, and uses a more memory-efficient FST representation for reading the decode graph (HCLG.fst).

2017-02-23: Now works with (and requires) a very recent Kaldi. This also means that nnet3 LSTM models (including chain models) are supported, although this is not tested.

2016-10-14: Support for nnet3 (including 'chain') models, thanks to @yifan! Not tested very carefully. Set the nnet-mode property to 3 to use nnet3 models. Note that this property must be set before the other properties.

2016-06-23: Got rid of the GStreamer property warnings at startup, thanks to @MathieuDuponchelle.

2015-11-05: Minor changes for compatibility with Ubuntu 12.04 (and its older version of Jansson). Also uses ReadDecodeGraph to read the FST file, supporting the use of const (mapped) FSTs.

2015-06-03: Full results can now include word alignment information for the best hypothesis. To activate this, set the word-boundary-file property to point to the word_boundary.int file typically found in the data/lang/phones/ subdirectory of a trained system.

2015-06-02: Added functionality to output structured recognition results that optionally include n-best results, phone alignment information, and possibly other things (word alignments, sausages, lattices, sentence and word confidence scores) in the future. The structured results are pushed using the signal full-final-result and are formatted using JSON encoding. See below for a full description. Also removed the individual signals that were used for pushing out phone alignments.

2015-04-30: Added functionality to change models (FST, acoustic model, big language model) after initial initialization. Added functionality to output phone alignment information (see properties do-phone-alignment and phone-syms). Both additions by Ricard Marxer (@rikrd).

2015-04-28: Endpointing and interim recognition results now work correctly when using the threaded decoder. Note that endpointing does not work exactly the same as with the unthreaded decoder, and there might be some differences between individual decoder runs, but for all practical purposes it should be OK. Also, introduced a new property traceback-period-in-secs that specifies how often intermediate results are sent to the client (default is 0.5). NB: this update requires Kaldi revision 5036 or later.

2015-03-05: Threaded decoder can now be selected at configuration time, using the use-threaded-decoder property. NB: this property should be set before other properties. Endpointing and partial results might still not work as expected with the threaded decoder.

2015-03-01: One can now optionally use the threaded online decoder. Citing Daniel Povey, who developed the code in Kaldi: "This should make it possible to decode in real-time with larger models and graphs than before, because the decoding and the nnet evaluation are in separate threads and can be done in parallel." Use make CPPFLAGS=-DTHREADED_DECODER to compile it. Note that endpointing and partial results might not work as expected with the threaded decoder; a fix is in progress.

2015-01-09: Added language model rescoring functionality. To use it, you have to specify two properties: lm-fst and big-lm-const-arpa. The lm-fst property gives the location of the original LM (the one that was used for compiling the HCLG.fst used during decoding). The big-lm-const-arpa property gives the location of the big LM that is used to rescore the final lattices. The big LM must be in the 'ConstArpaLm' format; use Kaldi's utils/build_const_arpa_lm.sh script to produce it from the ARPA format.

2014-11-11: The plugin now saves the adaptation state between silence-segmented utterances and between multiple decoding sessions of the same plugin instance. That is, if you start decoding a new stream, the adaptation state of the previous stream is used (unless it's the first stream, in which case a global mean is used). Use the adaptation-state plugin property to get and set the adaptation state; setting it to an empty string resets the adaptation state to the default. This functionality requires Kaldi revision 4582 or later.
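
For illustration, a minimal Python sketch of moving the adaptation state between decoder instances, assuming only the adaptation-state property described above (the helper names are hypothetical):

def transfer_adaptation_state(old_decoder, new_decoder):
    # Copy the speaker/environment adaptation accumulated by one
    # decoder instance into a fresh one, so a new stream starts from
    # the previous stream's state rather than the global mean.
    state = old_decoder.get_property("adaptation-state")
    new_decoder.set_property("adaptation-state", state)

def reset_adaptation_state(decoder):
    # Setting an empty string resets the state to the default.
    decoder.set_property("adaptation-state", "")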

HOW TO COMPILE IT

The following works on Linux (I'm using Debian 'testing'; Ubuntu 12.04 also works).

Compile Kaldi trunk (make sure you are using the latest version) with the shared configuration. In Kaldi's 'src' directory:

./configure --shared
make depend
make

Install gstreamer-1.0:

sudo apt-get install gstreamer1.0-plugins-bad gstreamer1.0-plugins-base gstreamer1.0-plugins-good gstreamer1.0-pulseaudio gstreamer1.0-plugins-ugly gstreamer1.0-tools libgstreamer1.0-dev

On Ubuntu 12.04, you'll need to use a backport PPA to get GStreamer 1.0 before running the sudo apt-get install command above:

sudo add-apt-repository ppa:gstreamer-developers/ppa
sudo apt-get update

Install the Jansson library development package (version 2.7 or newer), used for encoding results as JSON:

sudo apt-get install libjansson-dev

Now we can compile this plugin. Change to this project's src directory:

cd src

Compile, specifying Kaldi's root directory:

KALDI_ROOT=/path/of/kaldi-trunk make depend
KALDI_ROOT=/path/of/kaldi-trunk make

This should result in 'libgstkaldionline2.so'.

Test if GStreamer can access the plugin:

GST_PLUGIN_PATH=. gst-inspect-1.0 kaldinnet2onlinedecoder

The output should list all plugin properties with their default values:

Factory Details:
  Rank                     none (0)
  Long-name                KaldiNNet2OnlineDecoder
  Klass                    Speech/Audio
  Description              Convert speech to text
[...]
  name                : The name of the object
                        flags: readable, writable
                        String. Default: "kaldinnet2onlinedecoder0"
  parent              : The parent of the object
                        flags: readable, writable
                        Object of type "GstObject"
  silent              : Silence the decoder
                        flags: readable, writable
                        Boolean. Default: false
  model               : Filename of the acoustic model
                        flags: readable, writable
                        String. Default: "models/final.mdl"
[...]
  max-nnet-batch-size : Maximum batch size we use in neural-network decodable object, in cases where we are not constrained by currently available frames (this will rarely make a difference)
                        flags: readable, writable
                        Integer. Range: -2147483648 - 2147483647 Default: 256

Element Signals:
  "partial-result" :  void user_function (GstElement* object,
                                          gchararray arg0,
                                          gpointer user_data);
  "final-result" :  void user_function (GstElement* object,
                                        gchararray arg0,
                                        gpointer user_data);

HOW TO USE IT

Command-line usage is demonstrated in demo/.

Usage through GStreamer's Python bindings is demonstrated in demo/gui-demo.py and in the project https://github.com/alumae/kaldi-gstreamer-server, in the file kaldigstserver/decoder2.py.
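
For orientation, here is a hedged Python sketch, modelled loosely on those demos, that decodes a WAV file and prints interim and final results. The pipeline layout, the fst property name, and all file paths are assumptions; consult demo/ and kaldigstserver/decoder2.py for working code.

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# decodebin/audioconvert/audioresample convert arbitrary input audio
# into the raw format the decoder expects.
pipeline = Gst.parse_launch(
    "filesrc location=test.wav ! decodebin ! audioconvert ! audioresample"
    " ! kaldinnet2onlinedecoder name=asr ! fakesink")
asr = pipeline.get_by_name("asr")
asr.set_property("model", "models/final.mdl")
asr.set_property("fst", "models/HCLG.fst")         # assumed property name
asr.set_property("word-syms", "models/words.txt")

# partial-result delivers interim hypotheses, final-result one
# transcript per utterance (see the gst-inspect output above).
asr.connect("partial-result", lambda elem, text: print("partial:", text))
asr.connect("final-result", lambda elem, text: print("final:", text))

loop = GLib.MainLoop()
bus = pipeline.get_bus()
bus.add_signal_watch()
bus.connect("message::eos", lambda b, m: loop.quit())

pipeline.set_state(Gst.State.PLAYING)
loop.run()
pipeline.set_state(Gst.State.NULL)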

STRUCTURED RESULTS

Below is a sample of JSON-encoded full recognition results, pushed out using the full-final-result signal. This sample was generated using do-phone-alignment=true and num-nbest=10 (although due to pruning it includes only two n-best hypotheses). Note that the words in the file specified by word-syms and the phones in the file specified by phone-syms must be encoded in UTF-8; otherwise the output won't be valid JSON.

{
  "status": 0,
  "result": {
    "hypotheses": [
      {
        "transcript": "we're not ready for the next epidemic",
        "likelihood": 128.174,
        "phone-alignment": [
          {
            "start": 0,
            "phone": "SIL",
            "length": 0.71
          },
          {
            "start": 0.71,
            "phone": "W_B",
            "length": 0.18
          },
          {
            "start": 0.89,
            "phone": "ER_E",
            "length": 0.06
          },
          {
            "start": 0.95,
            "phone": "N_B",
            "length": 0.06
          },
          {
            "start": 1.01,
            "phone": "AA_I",
            "length": 0.19
          },
          {
            "start": 1.2,
            "phone": "T_E",
            "length": 0.11
          },
          {
            "start": 1.31,
            "phone": "R_B",
            "length": 0.07
          },
          {
            "start": 1.38,
            "phone": "EH_I",
            "length": 0.1
          },
          {
            "start": 1.48,
            "phone": "D_I",
            "length": 0.05
          },
          {
            "start": 1.53,
            "phone": "IY_E",
            "length": 0.22
          },
          {
            "start": 1.75,
            "phone": "SIL",
            "length": 0.46
          },
          {
            "start": 2.21,
            "phone": "F_B",
            "length": 0.1
          },
          {
            "start": 2.31,
            "phone": "ER_E",
            "length": 0.05
          },
          {
            "start": 2.36,
            "phone": "DH_B",
            "length": 0.05
          },
          {
            "start": 2.41,
            "phone": "AH_E",
            "length": 0.05
          },
          {
            "start": 2.46,
            "phone": "N_B",
            "length": 0.06
          },
          {
            "start": 2.52,
            "phone": "EH_I",
            "length": 0.11
          },
          {
            "start": 2.63,
            "phone": "K_I",
            "length": 0.08
          },
          {
            "start": 2.71,
            "phone": "S_I",
            "length": 0.05
          },
          {
            "start": 2.76,
            "phone": "T_E",
            "length": 0.07
          },
          {
            "start": 2.83,
            "phone": "EH_B",
            "length": 0.08
          },
          {
            "start": 2.91,
            "phone": "P_I",
            "length": 0.09
          },
          {
            "start": 3,
            "phone": "AH_I",
            "length": 0.04
          },
          {
            "start": 3.04,
            "phone": "D_I",
            "length": 0.08
          },
          {
            "start": 3.12,
            "phone": "EH_I",
            "length": 0.1
          },
          {
            "start": 3.22,
            "phone": "M_I",
            "length": 0.08
          },
          {
            "start": 3.3,
            "phone": "IH_I",
            "length": 0.08
          },
          {
            "start": 3.38,
            "phone": "K_E",
            "length": 0.18
          },
          {
            "start": 3.56,
            "phone": "SIL",
            "length": 0.13
          }
        ],
        "word-alignment": [
          {
            "word": "we're",
            "start": 0.71,
            "length": 0.24
          },
          {
            "word": "not",
            "start": 0.95,
            "length": 0.36
          },
          {
            "word": "ready",
            "start": 1.31,
            "length": 0.44
          },
          {
            "word": "for",
            "start": 2.21,
            "length": 0.15
          },
          {
            "word": "the",
            "start": 2.36,
            "length": 0.1
          },
          {
            "word": "next",
            "start": 2.46,
            "length": 0.37
          },
          {
            "word": "epidemic",
            "start": 2.83,
            "length": 0.73
          }
        ]
      },
      {
        "transcript": "were not ready for the next epidemic",
        "likelihood": 125.323
      }
    ]
  },
  "segment-start": 58.25,
  "segment-length": 3.69,
  "total-length": 61.94
}
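
To show how this structure might be consumed, here is a hedged Python handler for the full-final-result signal; the signal name and JSON fields come from this document, while the handler itself is only illustrative:

import json

def on_full_final_result(element, json_text):
    # The signal delivers the JSON document shown above as a string.
    result = json.loads(json_text)
    if result["status"] != 0:
        return
    # Hypotheses appear best-first; word alignments are attached when
    # word-boundary-file is set. Times are in seconds, relative to
    # segment-start.
    best = result["result"]["hypotheses"][0]
    print("transcript: %s (likelihood %.1f)"
          % (best["transcript"], best["likelihood"]))
    for w in best.get("word-alignment", []):
        print("  %-10s start=%.2f length=%.2f"
              % (w["word"], w["start"], w["length"]))

# Registration on a decoder element would look like:
# asr.connect("full-final-result", on_full_final_result)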

CITING

If you use this software for research, you can cite the following paper (available here: http://ebooks.iospress.nl/volumearticle/37996):

@inproceedings{alumae2014,
  author={Tanel Alum\"{a}e},
  title={Full-duplex Speech-to-text System for {Estonian}},
  booktitle={Baltic HLT 2014},
  year=2014,
  address={Kaunas, Lithuania}
}

Of course, you should also acknowledge Kaldi, which does all the hard work.
