Mixlingual Speech Recognition

From the team:

As Chinese students studying in the States, we found our speaking habits morphing: English words and phrases easily slip into Chinese sentences. We strongly feel the need for messaging apps that can handle multilingual speech-to-text. So in this task we are going to build that function: a model using deep learning architectures (DNN, CNN, LSTM) that correctly transcribes multilingual audio (Chinese and English in the same sentence) into text.

- Video Demo

Table of Contents:

Directory Description

codeswitch:

Contains scripts to build our system

description:

LDC2015S04, our dataset description

notes:

Our study notes on Kaldi-related recipes, including timit and librispeech

Resources to Build the System

Data Source:

Baseline Model Paper:

Other Code-switching related Paper:

Feature Improvement related Paper:

Interesting Python Kaldi Wrapper to be examined:

Kaldi recommended recipe to be examined:

Kaldi resources:

Data Preparation:

|               | filename              | pattern                                | format                  | path                  | source       |
|---------------|-----------------------|----------------------------------------|-------------------------|-----------------------|--------------|
| acoustic data | spk2gender            | <speakerID><gender>                    | plain text              | data/train, data/test | handmade     |
| acoustic data | utt2spk               | <utteranceID><speakerID>               | plain text              | data/train, data/test | handmade     |
| acoustic data | wav.scp               | <utteranceID><full_path_to_audio_file> | .scp: kaldi script file | data/train, data/test | handmade     |
| acoustic data | text                  | <utteranceID><transcription>           | plain text              | data/train, data/test | exists       |
| language data | lexicon.txt           | <word> <phone 1><phone 2> ...          | plain text              | data/local/dict       | egs/voxforge |
| language data | nonsilence_phones.txt | <phone>                                | plain text              | data/local/dict       | unknown      |
| language data | silence_phones.txt    | <phone>                                | plain text              | data/local/dict       | unknown      |
| language data | optional_silence.txt  | <phone>                                | plain text              | data/local/dict       | unknown      |

Tools (borrowed from existing recipes):

- utils: kaldi/egs/wsj/s5
- steps: kaldi/egs/wsj/s5
- score.sh: kaldi/egs/voxforge/s5/local
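The handmade files are plain text, one entry per line, and Kaldi expects them sorted. As a sketch (not from this repo), assuming a hypothetical naming convention <speakerID>_<uttNum>.wav where the speaker ID ends in 'f' or 'm' for gender, they could be generated like this:

# Sketch: generate the handmade Kaldi data files from a folder of WAVs.
# The naming convention and paths below are hypothetical placeholders.
import os

wav_dir, out_dir = "corpus/train", "data/train"   # placeholder paths
os.makedirs(out_dir, exist_ok=True)

entries = []
for fname in sorted(os.listdir(wav_dir)):         # sorted, as Kaldi expects
    if not fname.endswith(".wav"):
        continue
    utt_id = fname[:-4]                           # e.g. spk001f_0001
    spk_id = utt_id.split("_")[0]                 # e.g. spk001f
    path = os.path.abspath(os.path.join(wav_dir, fname))
    entries.append((utt_id, spk_id, path))

with open(os.path.join(out_dir, "wav.scp"), "w") as f:     # <utteranceID> <full_path>
    f.writelines("{} {}\n".format(u, p) for u, s, p in entries)
with open(os.path.join(out_dir, "utt2spk"), "w") as f:     # <utteranceID> <speakerID>
    f.writelines("{} {}\n".format(u, s) for u, s, p in entries)
with open(os.path.join(out_dir, "spk2gender"), "w") as f:  # <speakerID> <gender>
    f.writelines("{} {}\n".format(s, s[-1]) for s in sorted({s for _, s, _ in entries}))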

Language Model:

What our language model is:
3-grams trained from the transcripts of THCHS30 + LDC2015S04
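As a toy illustration of what "3-gram" means here (the transcript lines below are made up; the real model was built from the full THCHS30 + LDC2015S04 transcripts with an LM toolkit):

# Toy 3-gram counting over mixed-language transcripts (made-up lines).
from collections import Counter

transcripts = [u"我 今天 有 一个 meeting", u"我 今天 去 gym"]
trigrams = Counter()
for line in transcripts:
    words = ["<s>", "<s>"] + line.split() + ["</s>"]
    for i in range(len(words) - 2):
        trigrams[tuple(words[i:i + 3])] += 1
print(trigrams.most_common(3))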

directory structure taken from egs/timit/s5:

/data
  /local
    /nist_lm
      /lm_phone_bg.arpa.gz

How to build a language model:

Kaldi script utils/prepare_lang.sh

usage: utils/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>
e.g.: utils/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang
options:
     --num-sil-states <number of states>             # default: 5, #states in silence models.
     --num-nonsil-states <number of states>          # default: 3, #states in non-silence models.
     --position-dependent-phones (true|false)        # default: true; if true, use _B, _E, _S & _I
                                                     # markers on phones to indicate word-internal positions.
     --share-silence-phones (true|false)             # default: false; if true, share pdfs of
                                                     # all non-silence phones.
     --sil-prob <probability of silence>             # default: 0.5 [must have 0 < silprob < 1]

Turning the --share-silence-phones option to TRUE was extremely helpful for the Cantonese data of IARPA's BABEL project, where the data is very messy and has long untranscribed portions that the Kaldi developers tried to align to a special phone designated for that purpose. The --sil-prob option might be another potentially important one.

Preparation

  • lexicon.txt
    • The pronunciation dictionary, where every line is a word with its phonemic pronunciation. It only contains words, and their pronunciations, that are present in the corpus; a few sample entries follow this list.
    • ENG: CMU dictionary
  • nonsilence_phones.txt
  • optional_silence.txt
  • silence_phones.txt
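Sample lexicon.txt entries for illustration only: the English lines follow the CMU dictionary, the Chinese line uses made-up pinyin-style phones, and !SIL / <SPOKEN_NOISE> are the usual Kaldi silence and noise entries (their phones, sil and spn, would go in silence_phones.txt):

!SIL sil
<SPOKEN_NOISE> spn
HELLO HH AH0 L OW1
MEETING M IY1 T IH0 NG
你好 n i3 h ao3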

MFCC Feature Extraction:

   echo
   echo "===== FEATURES EXTRACTION ====="
   echo
 
   # Making feats.scp files
   mfccdir=mfcc
   # Uncomment and modify arguments in scripts below if you have any problems with data sorting
   # utils/validate_data_dir.sh data/train     # script for checking prepared data - here: for data/train directory
   # utils/fix_data_dir.sh data/train          # tool for data proper sorting if needed - here: for data/train directory
   steps/make_mfcc.sh --nj $nj --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
   steps/make_mfcc.sh --nj $nj --cmd "$train_cmd" data/test exp/make_mfcc/test $mfccdir
  
   # Making cmvn.scp files
   steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
   steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test $mfccdir

MFCC-related documents

HMM - GMM

Reference

a: the transition probability from state i to state j
b: the emission probability from state j to observation x

The forward-backward algorithm fine-tunes a.

The GMM provides b.

An HMM solves the following three problems (a toy sketch of problems 1 and 3 follows the list):

  1. overall likelihood (forward algorithm): determine the likelihood of an observation sequence X = (x1, x2, ..., xT) being generated by an HMM
  2. training (forward-backward algorithm, EM): given an observation sequence, learn the best model parameters lambda = (a, b)
  3. decoding (Viterbi algorithm): given an observation sequence, determine the most probable hidden state sequence
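The sketch below uses a hypothetical 2-state HMM with made-up parameters (nothing from this repo) to show the forward and Viterbi recursions:

import numpy as np

# Hypothetical 2-state HMM: a[i, j] = transition prob from state i to j,
# b[j, t] = emission prob of the observation at time t under state j,
# pi = initial state distribution.
a = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi = np.array([0.6, 0.4])
b = np.array([[0.9, 0.2, 0.1],    # state 0's likelihood of each observation
              [0.1, 0.8, 0.7]])   # state 1's likelihood of each observation
T = b.shape[1]

# Problem 1 -- overall likelihood, via the forward algorithm.
alpha = pi * b[:, 0]
for t in range(1, T):
    alpha = (alpha @ a) * b[:, t]
print("P(X | lambda) =", alpha.sum())

# Problem 3 -- most probable state sequence, via Viterbi.
delta = pi * b[:, 0]
backptr = np.zeros((T, 2), dtype=int)
for t in range(1, T):
    scores = delta[:, None] * a          # scores[i, j] = delta[i] * a[i, j]
    backptr[t] = scores.argmax(axis=0)   # best previous state for each j
    delta = scores.max(axis=0) * b[:, t]
path = [int(delta.argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(backptr[t, path[-1]]))
print("best state sequence:", path[::-1])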

CNN and MFSC features

In order to train a CNN, we need to extract MFSC features from the acoustic data instead of MFCC features, because the Discrete Cosine Transformation (DCT) step in MFCC destroys locality. MFSC features are also called filter banks (fbanks). In Kaldi, the scripts look something like the following:

steps/make_fbank.sh --nj 3 \
  $trainDir/train_clean_fbank exp/make_fbank/train_clean_fbank feat/fbank/ || exit 1;
steps/compute_cmvn_stats.sh $trainDir/train_clean_fbank exp/make_fbank/train_clean_fbank feat/fbank/ || exit 1;

Note that fbanks don't work well with GMMs: fbank features are highly correlated, and a GMM modeled with diagonal covariance matrices assumes independent feature streams. fbank/MFSC features are fine for a DNN and best for a CNN.
Why MFSC+GMM produces a high WER: see the Kaldi discussion.
Why DCT destroys locality: see this post.
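To make the fbank/MFCC relationship concrete, here is a short sketch (assuming librosa and SciPy are available, which are not in the required packages below; the file path is a placeholder): MFCCs are just a DCT over the log-mel filter-bank axis, which is exactly the step that mixes all bands into each coefficient and destroys locality.

import librosa
import numpy as np
from scipy.fftpack import dct

y, sr = librosa.load("some_utterance.wav", sr=16000)  # placeholder path

# MFSC ("fbank"): log-mel filter bank energies. Neighboring rows are
# neighboring frequency bands, so locality is preserved.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
fbank = np.log(mel + 1e-10)

# MFCC: DCT across the filter-bank axis decorrelates the features (good
# for diagonal-covariance GMMs) but mixes all bands into each coefficient.
mfcc = dct(fbank, type=2, axis=0, norm="ortho")[:13]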

Required Packages

tensorflow == 1.1.0
theano == 0.9.0.dev-c697eeab84e5b8a74908da654b66ec9eca4f1291
keras == 1.2
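A quick sanity check (a sketch) that the pinned versions are the ones Python actually imports:

import tensorflow, theano, keras
print(tensorflow.__version__)  # expect 1.1.0
print(theano.__version__)      # expect 0.9.0.dev-...
print(keras.__version__)       # expect 1.2.x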

Run Kaldi on single GPU

This doesn't require Sun GridEngine. Simply download the [CUDA toolkit](https://developer.nvidia.com/cuda-downloads) and install it with

sudo sh cuda_8.0.61_375.26_linux.run

and then, under kaldi/src, execute

./configure

to check that it detects CUDA; you should then find CUDA = true in kaldi/src/kaldi.mk. Recompile Kaldi with

make depend -j 8 # 8 for an 8-core CPU
make -j 8 # 8 for an 8-core CPU

Note that GMM-based training and decoding are not supported on the GPU; only the nnet training is. source

** If you are using AWS g2.2xlarge and launched the instance before 2017-04-18 (when this note was written), its NVIDIA GPU may need a legacy 367.x driver; the default (latest) driver bundled with CUDA 8 (cuda_8.0.61_375.26_linux.run) will fail. To check the current version of the driver installed on the instance, type

apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s'

To install a version of your choice from the list, type

sudo apt-get install nvidia-367

You can also download a specific version from the web, for example NVIDIA-Linux-x86_64-367.18.run, and install it with

sudo sh NVIDIA-Linux-x86_64-367.18.run

Then, when installing cuda_8.0.61_375.26_linux.run, it will ask whether to install NVIDIA driver 375; make sure you choose no.

Install tensorflow-gpu

Required:

  1. install CUDA toolkit 8.0 (as of 04-18-2017)
  2. install cuDNN v5; as of 04-18-2017, TensorFlow performs best with cuDNN 5.x

Follow the commands carefully from the TensorFlow website. After installation, you can test whether TensorFlow can detect your GPU by typing the following:
# make sure you are out of the tensorflow git repo
python
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

A working TensorFlow setup will output something like:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:04.0
Total memory: 11.17GiB
Free memory: 11.11GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0
I tensorflow/core/common_runtime/direct_session.cc:257] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0

  3. During testing, if you run into an error like:
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcudnn.so.5. LD_LIBRARY_PATH: /usr/local/cuda/lib64
I tensorflow/stream_executor/cuda/cuda_dnn.cc:3517] Unable to load cuDNN DSO

In the writer's experience, this means you didn't set the right LD_LIBRARY_PATH in the ~/.profile file. Find where libcudnn.so.5 is located and move it to the expected location, most likely /usr/local/cuda. Also make sure you run source ~/.profile to activate the change after you modify the file.

  4. If you are testing in a python shell and you meet the following error:
ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory

very likely you are inside the actual tensorflow git repo (source); make sure you move out of it before testing.

Install Theano GPU

Keras-Kaldi's LSTM training script breaks under the current TensorFlow (TensorFlow went through a series of API changes during the preceding months), so we need to install Theano with GPU support and switch to the Theano backend for running run_kt_LSTM.sh.
After installing Theano-gpu using miniconda, you can create .theanorc to modify Theano's configuration with the following command:

echo -e "\n[global]\nfloatX=float32\n" >> ~/.theanorc
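Then add device=gpu under the [global] section, so the final ~/.theanorc reads as follows (a sketch; device=gpu is the old-backend flag used by Theano 0.9, while newer Theano uses device=cuda):

[global]
floatX=float32
device=gpu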

If Theano can't detect NVCC, giving you the following error:

ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.

(but you are sure that you installed CUDA), you can solve it by adding the following lines to ~/.profile:

export PATH=/usr/local/cuda-8.0/bin/:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH

Don't forget to source ~/.profile to enable the change.

To change the Keras backend from tensorflow to theano, modify:

vim $HOME/.keras/keras.json
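For Keras 1.x, a keras.json switched to the Theano backend looks roughly like this ("th" image_dim_ordering is the Theano convention):

{
    "image_dim_ordering": "th",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano"
}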

To test whether Theano is indeed using the GPU, execute the following file:

from theano import function, config, shared, tensor
import numpy
import time
vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

Kaldi script to train nnet

  1. 3-4 hours to train, 3 hours to decode on a GPU:
    local/online/run_nnet2_baseline.sh

Chinese CER (Character Error Rate)

  1. egs/hkust/s5/local/ext/score.sh
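For intuition, CER is just the Levenshtein (edit) distance computed over characters, divided by the reference length. A minimal sketch follows; the real scoring is done by the hkust script above:

def cer(ref, hyp):
    """Character error rate = edit distance / reference length."""
    prev = list(range(len(hyp) + 1))                 # DP row for the previous ref char
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rc != hc)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

# made-up mixed-language example with one substituted character
print(cer(u"我 有 一个 meeting", u"我 有 衣个 meeting"))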

Keras-Kaldi

dspavankumar/keras-kaldi github repo
Up to the time we ran this code, the environment was still Keras 1.2.0. Make sure the Keras version is the same across machines. To downgrade Keras from 2.0.3 to an older version, type

$ sudo pip3 install keras==1.2
or 
$ conda install keras==1.2.2 # if you are using conda

If there is a version inconsistency (e.g., training a model with 1.2.0 but decoding with 2.0.3), you will run into a problem when loading an existing model:

  File "steps_kt/nnet-forward.py", line 33, in <module>
    m = keras.models.load_model (model)
  File "/usr/local/lib/python3.5/dist-packages/keras/models.py", line 281, in load_model
    Error: “Optimizer weight shape (1024, ) not compatible with provided weight shape (429,1024)”

source
