alumae / Gst Kaldi Nnet2 Online

Licence: apache-2.0
GStreamer plugin around Kaldi's online neural network decoder

Projects that are alternatives to or similar to Gst Kaldi Nnet2 Online

Allosaurus
Allosaurus is a pretrained universal phone recognizer for more than 2000 languages
Stars: ✭ 135 (-21.05%)
Mutual labels:  speech-recognition
Swiftspeech
A speech recognition framework designed for SwiftUI.
Stars: ✭ 149 (-12.87%)
Mutual labels:  speech-recognition
Kaldiio
A pure python module for reading and writing kaldi ark files
Stars: ✭ 160 (-6.43%)
Mutual labels:  speech-recognition
Aimybox Android Assistant
Embeddable custom voice assistant for Android applications
Stars: ✭ 139 (-18.71%)
Mutual labels:  speech-recognition
Speech Recognition Neural Network
This is the end-to-end Speech Recognition neural network, deployed in Keras. This was my final project for Artificial Intelligence Nanodegree @Udacity.
Stars: ✭ 148 (-13.45%)
Mutual labels:  speech-recognition
Clovacall
ClovaCall dataset and Pytorch LAS baseline code (Interspeech 2020)
Stars: ✭ 151 (-11.7%)
Mutual labels:  speech-recognition
Voice activity detection
Voice Activity Detection based on Deep Learning & TensorFlow
Stars: ✭ 132 (-22.81%)
Mutual labels:  speech-recognition
Pytorch Kaldi
pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.
Stars: ✭ 2,097 (+1126.32%)
Mutual labels:  speech-recognition
Awesome Speech Recognition Speech Synthesis Papers
Automatic Speech Recognition (ASR), Speaker Verification, Speech Synthesis, Text-to-Speech (TTS), Language Modelling, Singing Voice Synthesis (SVS), Voice Conversion (VC)
Stars: ✭ 2,085 (+1119.3%)
Mutual labels:  speech-recognition
Interspeech2019 Tutorial
INTERSPEECH 2019 Tutorial Materials
Stars: ✭ 160 (-6.43%)
Mutual labels:  speech-recognition
Dla
Deep learning for audio processing
Stars: ✭ 142 (-16.96%)
Mutual labels:  speech-recognition
Zzz Retired openstt
RETIRED - OpenSTT is now retired. If you would like more information on Mycroft AI's open source STT projects, please visit:
Stars: ✭ 146 (-14.62%)
Mutual labels:  speech-recognition
Py Kaldi Asr
Some simple wrappers around kaldi-asr intended to make using kaldi's (online) decoders as convenient as possible.
Stars: ✭ 156 (-8.77%)
Mutual labels:  speech-recognition
Go Astideepspeech
Golang bindings for Mozilla's DeepSpeech speech-to-text library
Stars: ✭ 137 (-19.88%)
Mutual labels:  speech-recognition
Hey Jetson
Deep Learning based Automatic Speech Recognition with attention for the Nvidia Jetson.
Stars: ✭ 161 (-5.85%)
Mutual labels:  speech-recognition
Awesome Ai Services
An overview of the AI-as-a-service landscape
Stars: ✭ 133 (-22.22%)
Mutual labels:  speech-recognition
Speech To Text Russian
A project for Russian-language speech recognition based on pykaldi.
Stars: ✭ 151 (-11.7%)
Mutual labels:  speech-recognition
Naomi
The Naomi Project is an open source, technology agnostic platform for developing always-on, voice-controlled applications!
Stars: ✭ 171 (+0%)
Mutual labels:  speech-recognition
Tacotron asr
Speech Recognition Using Tacotron
Stars: ✭ 165 (-3.51%)
Mutual labels:  speech-recognition
Rnnt Speech Recognition
End-to-end speech recognition using RNN Transducers in Tensorflow 2.0
Stars: ✭ 158 (-7.6%)
Mutual labels:  speech-recognition

WHAT IT IS

GStreamer plugin that wraps Kaldi's SingleUtteranceNnet2Decoder. It requires iVector-adapted DNN acoustic models. The iVectors are adapted to the current audio stream automatically.
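
As a rough illustration, the element might be loaded and configured from Python as sketched below; model and word-syms are property names that appear elsewhere in this README, while fst and all file paths are assumptions to verify with gst-inspect-1.0:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# The compiled plugin must be on GST_PLUGIN_PATH
# (see "HOW TO COMPILE IT" below).
asr = Gst.ElementFactory.make("kaldinnet2onlinedecoder", "asr")

# Point the decoder at an iVector-adapted DNN model.
asr.set_property("model", "models/final.mdl")      # acoustic model
asr.set_property("fst", "models/HCLG.fst")         # decode graph (assumed property name)
asr.set_property("word-syms", "models/words.txt")  # word symbol table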

CHANGELOG

2019-10-08: Added online CMVN functionality; requires Kaldi from September 7, 2019 or later. Also refactored N-best list, word alignment, and confidence handling.

2019-08-25: Updated for Kaldi's changed stream-processing API; make sure Kaldi is at least revision 6c816e5.

2017-08-09: Also implemented phone confidence scores, which are included with phone alignments. Note that phone confidence scores are very inaccurate.

2017-08-09: Implemented word confidence scores. They are included with word alignments, so set word-boundary-file=dir/word_boundary.int, since that is what enables word alignment generation.

2017-06-05: Now works with Kaldi 5.2 (the latest at the time of writing) and also requires it. The decoder now also sets test mode for dropout and batch-norm layers, and uses a more memory-efficient FST representation for reading the decode graph (HCLG.fst).

2017-02-23: Now works with (and requires) a very recent Kaldi. This also means that nnet3 LSTM models (including chain models) are supported, although this is not tested.

2016-10-14: Support for nnet3 (including 'chain') models, thanks to @yifan! Not tested very carefully. Set the nnet-mode property to 3 to use nnet3 models. Note that this property must be set before the other properties.

2016-06-23: Got rid of the GStreamer property warnings at startup, thanks to @MathieuDuponchelle.

2015-11-05: Minor changes for compatibility with Ubuntu 12.04 (and its older version of Jansson). Also uses ReadDecodeGraph to read the FST file, supporting the use of const (mapped) FSTs.

2015-06-03: Full results can now include word alignment information for the best hypothesis. To activate this, set the word-boundary-file property to point to the word_boundary.int file typically found in the data/lang/phones/ subdirectory of a trained system.

2015-06-02: Added functionality to output structured recognition results that optionally include n-best results, phone alignment information, and possibly other things (word alignments, sausages, lattices, sentence and word confidence scores) in the future. The structured results are pushed using the signal full-final-result and are formatted using JSON encoding. See below for a full description. Also removed the individual signals that were used for pushing out phone alignments.

2015-04-30: Added functionality to change models (FST, acoustic model, big language model) after initial initialization. Added functionality to output phone alignment information (see properties do-phone-alignment and phone-syms). Both additions by Ricard Marxer (@rikrd).

2015-04-28: Endpointing and interim recognition results now work correctly when using the threaded decoder. Note that endpointing does not work exactly the same as with the unthreaded decoder, and there might be some differences between individual decoder runs, but for all practical purposes it should be OK. Also, introduced a new property traceback-period-in-secs that specifies how often intermediate results are sent to the client (default is 0.5). NB: this update requires Kaldi revision 5036 or later.

2015-03-05: Threaded decoder can now be selected at configuration time, using the use-threaded-decoder property. NB: this property should be set before other properties. Endpointing and partial results might still not work as expected with the threaded decoder.

2015-03-01: One can now optionally use the threaded online decoder. Citing Daniel Povey, who developed the code in Kaldi: "This should make it possible to decode in real-time with larger models and graphs than before, because the decoding and the nnet evaluation are in separate threads and can be done in parallel." Use make CPPFLAGS=-DTHREADED_DECODER to compile it. Note that endpointing and partial results might not work as expected with the threaded decoder; a fix is in progress.

2015-01-09: Added language model rescoring functionality. To use it, you have to specify two properties: lm-fst and big-lm-const-arpa. The lm-fst property gives the location of the original LM (the one that was used for compiling the HCLG.fst used during decoding). The big-lm-const-arpa property gives the location of the big LM that is used to rescore the final lattices. The big LM must be in the 'ConstArpaLm' format; use Kaldi's utils/build_const_arpa_lm.sh script to produce it from the ARPA format.

2014-11-11: The plugin now saves the adaptation state between silence-segmented utterances and between multiple decoding sessions of the same plugin instance. That is, if you start decoding a new stream, the adaptation state of the previous stream is used (unless it's the first stream, in which case a global mean is used). Use the adaptation-state plugin property to get and set the adaptation state; setting it to an empty string resets the adaptation state to the default. This functionality requires Kaldi revision 4582 or later.
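
For illustration, a minimal Python sketch of moving the adaptation state between decoder instances, assuming only the adaptation-state property described above (the helper names are hypothetical):

def transfer_adaptation_state(old_decoder, new_decoder):
    # Copy the speaker/environment adaptation accumulated by one
    # decoder instance into a fresh one, so a new stream starts from
    # the previous stream's state rather than the global mean.
    state = old_decoder.get_property("adaptation-state")
    new_decoder.set_property("adaptation-state", state)

def reset_adaptation_state(decoder):
    # Setting an empty string resets the state to the default.
    decoder.set_property("adaptation-state", "")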

HOW TO COMPILE IT

The following works on Linux (I'm using Debian 'testing'; Ubuntu 12.04 also works).

Compile Kaldi trunk (make sure you are using the latest version) with the shared configuration. In Kaldi's 'src' directory:

./configure --shared
make depend
make

Install gstreamer-1.0:

sudo apt-get install gstreamer1.0-plugins-bad gstreamer1.0-plugins-base gstreamer1.0-plugins-good gstreamer1.0-pulseaudio gstreamer1.0-plugins-ugly gstreamer1.0-tools libgstreamer1.0-dev

On Ubuntu 12.04, you'll need to use a backport PPA to get GStreamer 1.0 before running the sudo apt-get install command above:

sudo add-apt-repository ppa:gstreamer-developers/ppa
sudo apt-get update

Install the Jansson library development package (version 2.7 or newer), used for encoding results as JSON:

sudo apt-get install libjansson-dev

Now we can compile this plugin. Change to this project's src directory:

cd src

Compile, specifying Kaldi's root directory:

KALDI_ROOT=/path/of/kaldi-trunk make depend
KALDI_ROOT=/path/of/kaldi-trunk make

This should result in 'libgstkaldionline2.so'.

Test if GStreamer can access the plugin:

GST_PLUGIN_PATH=. gst-inspect-1.0 kaldinnet2onlinedecoder

The output should list all plugin properties with their default values:

Factory Details:
  Rank                     none (0)
  Long-name                KaldiNNet2OnlineDecoder
  Klass                    Speech/Audio
  Description              Convert speech to text
[...]
  name                : The name of the object
                        flags: readable, writable
                        String. Default: "kaldinnet2onlinedecoder0"
  parent              : The parent of the object
                        flags: readable, writable
                        Object of type "GstObject"
  silent              : Silence the decoder
                        flags: readable, writable
                        Boolean. Default: false
  model               : Filename of the acoustic model
                        flags: readable, writable
                        String. Default: "models/final.mdl"
[...]
  max-nnet-batch-size : Maximum batch size we use in neural-network decodable object, in cases where we are not constrained by currently available frames (this will rarely make a difference)
                        flags: readable, writable
                        Integer. Range: -2147483648 - 2147483647 Default: 256

Element Signals:
  "partial-result" :  void user_function (GstElement* object,
                                          gchararray arg0,
                                          gpointer user_data);
  "final-result" :  void user_function (GstElement* object,
                                        gchararray arg0,
                                        gpointer user_data);

HOW TO USE IT

Command-line usage is demonstrated in demo/.

Usage through GStreamer's Python bindings is demonstrated in demo/gui-demo.py and in the project https://github.com/alumae/kaldi-gstreamer-server, in the file kaldigstserver/decoder2.py.
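
For orientation, here is a hedged Python sketch, modelled loosely on those demos, that decodes a WAV file and prints interim and final results. The pipeline layout, the fst property name, and all file paths are assumptions; consult demo/ and kaldigstserver/decoder2.py for working code.

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# decodebin/audioconvert/audioresample convert arbitrary input audio
# into the raw format the decoder expects.
pipeline = Gst.parse_launch(
    "filesrc location=test.wav ! decodebin ! audioconvert ! audioresample"
    " ! kaldinnet2onlinedecoder name=asr ! fakesink")
asr = pipeline.get_by_name("asr")
asr.set_property("model", "models/final.mdl")
asr.set_property("fst", "models/HCLG.fst")         # assumed property name
asr.set_property("word-syms", "models/words.txt")

# partial-result delivers interim hypotheses, final-result one
# transcript per utterance (see the gst-inspect output above).
asr.connect("partial-result", lambda elem, text: print("partial:", text))
asr.connect("final-result", lambda elem, text: print("final:", text))

loop = GLib.MainLoop()
bus = pipeline.get_bus()
bus.add_signal_watch()
bus.connect("message::eos", lambda b, m: loop.quit())

pipeline.set_state(Gst.State.PLAYING)
loop.run()
pipeline.set_state(Gst.State.NULL)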

STRUCTURED RESULTS

Below is a sample of JSON-encoded full recognition results, pushed out using the full-final-result signal. This sample was generated using do-phone-alignment=true and num-nbest=10 (although due to pruning it includes only two n-best hypotheses). Note that the words in the file specified by word-syms and the phones in the file specified by phone-syms must be encoded in UTF-8; otherwise the output won't be valid JSON.

{
  "status": 0,
  "result": {
    "hypotheses": [
      {
        "transcript": "we're not ready for the next epidemic",
        "likelihood": 128.174,
        "phone-alignment": [
          {
            "start": 0,
            "phone": "SIL",
            "length": 0.71
          },
          {
            "start": 0.71,
            "phone": "W_B",
            "length": 0.18
          },
          {
            "start": 0.89,
            "phone": "ER_E",
            "length": 0.06
          },
          {
            "start": 0.95,
            "phone": "N_B",
            "length": 0.06
          },
          {
            "start": 1.01,
            "phone": "AA_I",
            "length": 0.19
          },
          {
            "start": 1.2,
            "phone": "T_E",
            "length": 0.11
          },
          {
            "start": 1.31,
            "phone": "R_B",
            "length": 0.07
          },
          {
            "start": 1.38,
            "phone": "EH_I",
            "length": 0.1
          },
          {
            "start": 1.48,
            "phone": "D_I",
            "length": 0.05
          },
          {
            "start": 1.53,
            "phone": "IY_E",
            "length": 0.22
          },
          {
            "start": 1.75,
            "phone": "SIL",
            "length": 0.46
          },
          {
            "start": 2.21,
            "phone": "F_B",
            "length": 0.1
          },
          {
            "start": 2.31,
            "phone": "ER_E",
            "length": 0.05
          },
          {
            "start": 2.36,
            "phone": "DH_B",
            "length": 0.05
          },
          {
            "start": 2.41,
            "phone": "AH_E",
            "length": 0.05
          },
          {
            "start": 2.46,
            "phone": "N_B",
            "length": 0.06
          },
          {
            "start": 2.52,
            "phone": "EH_I",
            "length": 0.11
          },
          {
            "start": 2.63,
            "phone": "K_I",
            "length": 0.08
          },
          {
            "start": 2.71,
            "phone": "S_I",
            "length": 0.05
          },
          {
            "start": 2.76,
            "phone": "T_E",
            "length": 0.07
          },
          {
            "start": 2.83,
            "phone": "EH_B",
            "length": 0.08
          },
          {
            "start": 2.91,
            "phone": "P_I",
            "length": 0.09
          },
          {
            "start": 3,
            "phone": "AH_I",
            "length": 0.04
          },
          {
            "start": 3.04,
            "phone": "D_I",
            "length": 0.08
          },
          {
            "start": 3.12,
            "phone": "EH_I",
            "length": 0.1
          },
          {
            "start": 3.22,
            "phone": "M_I",
            "length": 0.08
          },
          {
            "start": 3.3,
            "phone": "IH_I",
            "length": 0.08
          },
          {
            "start": 3.38,
            "phone": "K_E",
            "length": 0.18
          },
          {
            "start": 3.56,
            "phone": "SIL",
            "length": 0.13
          }
        ],
        "word-alignment": [
          {
            "word": "we're",
            "start": 0.71,
            "length": 0.24
          },
          {
            "word": "not",
            "start": 0.95,
            "length": 0.36
          },
          {
            "word": "ready",
            "start": 1.31,
            "length": 0.44
          },
          {
            "word": "for",
            "start": 2.21,
            "length": 0.15
          },
          {
            "word": "the",
            "start": 2.36,
            "length": 0.1
          },
          {
            "word": "next",
            "start": 2.46,
            "length": 0.37
          },
          {
            "word": "epidemic",
            "start": 2.83,
            "length": 0.73
          }
        ]
      },
      {
        "transcript": "were not ready for the next epidemic",
        "likelihood": 125.323
      }
    ]
  },
  "segment-start": 58.25,
  "segment-length": 3.69,
  "total-length": 61.94
}
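
To show how this structure might be consumed, here is a hedged Python handler for the full-final-result signal; the signal name and JSON fields come from this document, while the handler itself is only illustrative:

import json

def on_full_final_result(element, json_text):
    # The signal delivers the JSON document shown above as a string.
    result = json.loads(json_text)
    if result["status"] != 0:
        return
    # Hypotheses appear best-first; word alignments are attached when
    # word-boundary-file is set. Times are in seconds, relative to
    # segment-start.
    best = result["result"]["hypotheses"][0]
    print("transcript: %s (likelihood %.1f)"
          % (best["transcript"], best["likelihood"]))
    for w in best.get("word-alignment", []):
        print("  %-10s start=%.2f length=%.2f"
              % (w["word"], w["start"], w["length"]))

# Registration on a decoder element would look like:
# asr.connect("full-final-result", on_full_final_result)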

CITING

If you use this software for research, you can cite the following paper (available here: http://ebooks.iospress.nl/volumearticle/37996):

@inproceedings{alumae2014,
  author={Tanel Alum\"{a}e},
  title={Full-duplex Speech-to-text System for {Estonian}},
  booktitle={Baltic HLT 2014},
  year=2014,
  address={Kaunas, Lithuania}
}

Of course, you should also acknowledge Kaldi, which does all the hard work.
