
Kaldi AG Training Setup


Docker image and scripts for training finetuned or completely personal Kaldi speech models. Particularly for use with kaldi-active-grammar.

Usage

All commands are run in the Docker container as follows. Training on the CPU should work, just much more slowly; to do so, remove the --runtime=nvidia flag and use the image daanzu/kaldi_ag_training:2020-11-28 instead of the GPU image. You can run Docker directly with the following parameter structure, or, as a shortcut, use the run_docker.sh script (editing it to suit your needs and configuration).

docker run -it --rm -v $(pwd):/mnt/input -w /mnt/input --user "$(id -u):$(id -g)" \
    --runtime=nvidia daanzu/kaldi_ag_training_gpu:2020-11-28 \
    [command and args...]

Example commands:

# Download and prepare base model (needed for either finetuning or personal model training)
wget https://github.com/daanzu/kaldi_ag_training/releases/download/v0.1.0/kaldi_model_daanzu_20200905_1ep-mediumlm-base.zip
unzip kaldi_model_daanzu_20200905_1ep-mediumlm-base.zip

# Prepare training dataset files
python3 convert_tsv_to_scp.py yourdata.tsv [optional output directory]

# Pick only one of the following:
# Run finetune training, with default settings
bash run_docker.sh bash run.finetune.sh kaldi_model_daanzu_20200905_1ep-mediumlm-base dataset
# Run completely personal training, with default settings
bash run_docker.sh bash run.personal.sh kaldi_model_daanzu_20200905_1ep-mediumlm-base dataset

# When training completes, export trained model
python3 export_trained_model.py {finetune,personal} [optional output directory]
# Finally, run the following in your kaldi-active-grammar python environment (this can take up to an hour and several GB of RAM)
python3 -m kaldi_active_grammar compile_agf_dictation_graph -v -m [model_dir]

# Test a new or old model
python3 test_model.py testdata.tsv [model_dir]
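As a rough sketch of what the TSV-to-SCP conversion involves (a hypothetical helper, not the actual convert_tsv_to_scp.py; the utterance-ID scheme and single-speaker handling are assumptions), each TSV row of the form wav_filename ignored ignored ignored text_transcript maps to entries in the standard Kaldi data files wav.scp, text, and utt2spk:

```python
import csv
import io

def tsv_to_scp(tsv_text, speaker="speaker1"):
    """Parse KaldiAG TSV rows (wav_filename, 3 ignored fields, transcript)
    and return the contents of the standard Kaldi data files as line lists.

    Hypothetical sketch only: the real convert_tsv_to_scp.py may differ in
    details (utterance IDs, speaker assignment, extra output files)."""
    wav_scp, text, utt2spk = [], [], []
    for i, row in enumerate(csv.reader(io.StringIO(tsv_text), delimiter="\t")):
        if len(row) < 5:
            continue  # skip malformed rows
        utt_id = f"{speaker}-{i:06d}"        # one unique ID per utterance
        wav_scp.append(f"{utt_id} {row[0]}")  # utterance ID -> wav path
        text.append(f"{utt_id} {row[4]}")     # utterance ID -> transcript
        utt2spk.append(f"{utt_id} {speaker}") # utterance ID -> speaker
    return {"wav.scp": wav_scp, "text": text, "utt2spk": utt2spk}
```

The real script writes these files into the output directory (e.g. dataset/); the sketch just shows how the five TSV columns map onto Kaldi's multi-file layout.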

Notes

  • To run either type of training, you must have a base model to use as a template. (For finetuning, the base model is also the starting point of the model; for personal training, it is only a source of basic info.) You can download a base model from this project's releases page: download the zip file and extract it into the root directory of this repo, so that the directory kaldi_model_daanzu_20200905_1ep-mediumlm-base is here.

  • Kaldi requires the training data metadata to be in the SCP format, which is an annoying multi-file format. To convert the standard KaldiAG TSV format to SCP, run python3 convert_tsv_to_scp.py yourdata.tsv dataset to output SCP-format files in a new directory dataset. These commands can be run within the Docker container, or directly in your own python environment.

    • Even better, run python3 convert_tsv_to_scp.py -l kaldi_model_daanzu_20200905_1ep-mediumlm-base/dict/lexicon.txt yourdata.tsv dataset to filter out utterances containing out-of-vocabulary (OOV) words; OOV words are not currently well supported by these training scripts.
  • The audio data should be 16-bit signed integer PCM, 1-channel (mono), 16 kHz WAV files. Note that the audio must be accessible within the Docker container, so it can't be behind a symlink that points outside this repo directory, which is what is shared with the container.

  • There are some directory names you should avoid using in this repo directory, because the scripts will create & use them during training. Avoid: conf, data, exp, extractor, mfcc, steps, tree_sp, utils.

  • Training may use a lot of storage. You may want to locate this directory somewhere with ample free space.

  • The training commands (run.*.sh) accept many optional parameters (more documentation to come), including:

    • --stage n : Skip to given stage.
    • --num-utts-subset 3000 : You may need this parameter to prevent an error at the beginning of nnet training if your training data contains many short (command-like) utterances. (3000 is a perhaps overly careful suggestion; 300 is the default value.)
  • I decided to treat the Docker image as evergreen, and to keep the things liable to change frequently, like the scripts, in the git repo instead.

  • The training dataset input .tsv file consists of tab-separated fields, one utterance per line, as follows: wav_filename ignored ignored ignored text_transcript
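The audio format requirement above (16-bit PCM, mono, 16 kHz WAV) can be checked with Python's standard wave module. This helper is not part of the repo, just a quick validation sketch:

```python
import wave

def check_wav(path):
    """Return True if the file is a 16-bit PCM, 1-channel, 16 kHz WAV,
    matching the audio format these training scripts expect."""
    with wave.open(path, "rb") as w:
        return (w.getsampwidth() == 2       # 2 bytes/sample = 16-bit
                and w.getnchannels() == 1   # mono
                and w.getframerate() == 16000)  # 16 kHz sample rate
```

Running this over your dataset before training can save a failed run; files that fail the check can be resampled with a tool like sox or ffmpeg.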

Related Repositories

  • daanzu/speech-training-recorder: Simple GUI application to help record audio dictated from given text prompts, for use with training speech recognition or speech synthesis.
  • daanzu/kaldi-active-grammar: Python Kaldi speech recognition with grammars that can be set active/inactive dynamically at decode-time.

License

This project is licensed under the GNU Affero General Public License v3 (AGPL-3.0-or-later). See the LICENSE file for details. If this license is problematic for you, please contact me.
