
uiuc-sst / asr24

License: GPL-3.0
24-hour Automatic Speech Recognition

Programming Languages

C++, Python, Shell, Ruby, Perl, JavaScript, Makefile

Projects that are alternatives of or similar to asr24

kaldi-long-audio-alignment
Long audio alignment using Kaldi
Stars: ✭ 21 (-22.22%)
Mutual labels:  kaldi, transcription, asr
Zamia Speech
Open tools and data for cloudless automatic speech recognition
Stars: ✭ 374 (+1285.19%)
Mutual labels:  kaldi, language-model, asr
Zeroth
Kaldi-based Korean ASR (한국어 음성인식) open-source project
Stars: ✭ 248 (+818.52%)
Mutual labels:  kaldi, language-model, asr
Pykaldi
A Python wrapper for Kaldi
Stars: ✭ 756 (+2700%)
Mutual labels:  kaldi, language-model, asr
Vosk Android Demo
Offline speech recognition for Android with Vosk library.
Stars: ✭ 271 (+903.7%)
Mutual labels:  kaldi, asr
Docker Kaldi Gstreamer Server
Dockerfile for kaldi-gstreamer-server.
Stars: ✭ 266 (+885.19%)
Mutual labels:  kaldi, asr
leopard
On-device speech-to-text engine powered by deep learning
Stars: ✭ 354 (+1211.11%)
Mutual labels:  transcription, asr
Asr theory
Speech recognition theory, papers, and slides
Stars: ✭ 344 (+1174.07%)
Mutual labels:  kaldi, asr
kaldi helpers
🙊 A set of scripts to use in preparing a corpus for speech-to-text processing with the Kaldi Automatic Speech Recognition Library.
Stars: ✭ 13 (-51.85%)
Mutual labels:  kaldi, transcription
Vosk Server
WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries
Stars: ✭ 277 (+925.93%)
Mutual labels:  kaldi, asr
Eesen
The official repository of the Eesen project
Stars: ✭ 738 (+2633.33%)
Mutual labels:  kaldi, asr
kaldi-alligner
Scripts to align a given wave file to its transcription, using models trained with Kaldi
Stars: ✭ 24 (-11.11%)
Mutual labels:  kaldi, asr
vosk-model-ru-adaptation
No description or website provided.
Stars: ✭ 19 (-29.63%)
Mutual labels:  kaldi, asr
Pytorch Kaldi
pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.
Stars: ✭ 2,097 (+7666.67%)
Mutual labels:  kaldi, asr
opensnips
Open source projects related to Snips https://snips.ai/.
Stars: ✭ 50 (+85.19%)
Mutual labels:  kaldi, asr
Espresso
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit
Stars: ✭ 808 (+2892.59%)
Mutual labels:  kaldi, asr
Vosk Api
Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Stars: ✭ 1,357 (+4925.93%)
Mutual labels:  kaldi, asr
torchain
WIP: pytorch FFI wrapper for Kaldi chain loss (a.k.a. Lattice Free MMI)
Stars: ✭ 20 (-25.93%)
Mutual labels:  kaldi, asr
Speech To Text Russian
A project for Russian speech recognition, based on pykaldi.
Stars: ✭ 151 (+459.26%)
Mutual labels:  kaldi, asr
Pytorch Asr
ASR with PyTorch
Stars: ✭ 124 (+359.26%)
Mutual labels:  kaldi, asr

Well within 24 hours, transcribe 40 hours of recorded speech in a surprise language.

Build an ASR for a surprise language L from a pre-trained acoustic model, an L pronunciation dictionary, and an L language model. This approach converts phones directly to L words. It is less noisy than using multiple cross-trained ASRs to produce English words, from which phone strings are extracted, merged by PTgen, and reconstituted into L words.

A full description with performance measurements is on arXiv, and in:
M. Hasegawa-Johnson, L. Rolston, C. Goudeseune, G.-A. Levow, and K. Kirchhoff,
"Grapheme-to-Phoneme Transduction for Cross-Language ASR," Statistical Language and Speech Processing, pp. 3–19, 2020.

Install software:

Kaldi

If you don't already have a version of Kaldi newer than 2016 Sep 30, get and build it following the instructions in its INSTALL files.

    git clone https://github.com/kaldi-asr/kaldi
    cd kaldi/tools; make -j $(nproc)
    cd ../src; ./configure --shared && make depend -j $(nproc) && make -j $(nproc)
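
As a quick sanity check, you can confirm that the build produced the online decoder used later in this README; Kaldi binaries print their usage when given --help:

    # Run from kaldi/src, where the previous commands left off.
    online2bin/online2-wav-nnet3-latgen-faster --help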

brno-phnrec

Put Brno U. of Technology's phoneme recognizer next to the usual s5 directory.

    sudo apt-get install libopenblas-dev libopenblas-base
    cd kaldi/egs/aspire
    git clone https://github.com/uiuc-sst/brno-phnrec.git
    cd brno-phnrec/PhnRec
    make

This repo

Put this next to the usual s5 directory.
(The package nodejs is for ./sampa2ipa.js.)

    sudo apt-get install nodejs
    cd kaldi/egs/aspire
    git clone https://github.com/uiuc-sst/asr24.git
    cd asr24

Extension of ASpIRE

    cd kaldi/egs/aspire/asr24
    wget -qO- http://dl.kaldi-asr.org/models/0001_aspire_chain_model.tar.gz | tar xz
    steps/online/nnet3/prepare_online_decoding.sh \
      --mfcc-config conf/mfcc_hires.conf \
      data/lang_chain exp/nnet3/extractor \
      exp/chain/tdnn_7b exp/tdnn_7b_chain_online
    utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \
      exp/tdnn_7b_chain_online exp/tdnn_7b_chain_online/graph_pp

This builds the subdirectories data and exp; within exp/tdnn_7b_chain_online it puts the files phones.txt, tree, final.mdl, conf/, etc.
The last command, mkgraph.sh, can take 45 minutes (30 for CVTE Mandarin) and a lot of memory, because it calls fstdeterminizestar on a large language model, as Dan Povey explains.
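
If memory is a worry, one way to watch mkgraph.sh's peak usage is GNU time (a monitoring sketch only, not part of the recipe; read "Maximum resident set size" in its report):

    # -v makes GNU time print a detailed resource report on stderr.
    /usr/bin/time -v utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \
      exp/tdnn_7b_chain_online exp/tdnn_7b_chain_online/graph_pp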

  • Verify that it can transcribe English, in mono 16-bit 8 kHz .wav format. Either use the provided 8khz.wav, or sox MySpeech.wav -r 8000 8khz.wav, or ffmpeg -i MySpeech.wav -acodec pcm_s16le -ac 1 -ar 8000 8khz.wav.
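
Before decoding, soxi (installed with sox) can confirm that the file's header really says mono, 16-bit, 8 kHz:

    soxi 8khz.wav    # expect Channels: 1, Sample Rate: 8000, Precision: 16-bit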

(The scripts cmd.sh and path.sh say where to find kaldi/src/online2bin/online2-wav-nnet3-latgen-faster.)

    . cmd.sh && . path.sh
    online2-wav-nnet3-latgen-faster \
      --online=false  --do-endpointing=false \
      --frame-subsampling-factor=3 \
      --config=exp/tdnn_7b_chain_online/conf/online.conf \
      --max-active=7000 \
      --beam=15.0  --lattice-beam=6.0  --acoustic-scale=1.0 \
      --word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt \
      exp/tdnn_7b_chain_online/final.mdl \
      exp/tdnn_7b_chain_online/graph_pp/HCLG.fst \
      'ark:echo utterance-id1 utterance-id1|' \
      'scp:echo utterance-id1 8khz.wav|' \
      'ark:/dev/null'
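
The lattice itself goes to /dev/null here; the decoded words appear on stderr, prefixed by the utterance id. So one minimal way to capture just the transcript from the command above:

    # Same command as above (arguments elided), keeping only the hypothesis line.
    online2-wav-nnet3-latgen-faster ... 2>&1 | grep '^utterance-id1 '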

CVTE Mandarin

  • Get the Mandarin chain model (3.4 GB, about 10 minutes). This makes a subdir cvte/s5, containing a words.txt, HCLG.fst, and final.mdl.
    wget -qO- http://kaldi-asr.org/models/0002_cvte_chain_model.tar.gz | tar xz
    steps/online/nnet3/prepare_online_decoding.sh \
      --mfcc-config conf/mfcc_hires.conf \
      data/lang_chain exp/nnet3/extractor \
      exp/chain/tdnn_7b cvte/s5/exp/chain/tdnn
    utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \
      cvte/s5/exp/chain/tdnn cvte/s5/exp/chain/tdnn/graph_pp

For each language L, build an ASR:

Get raw text.

  • Into $L/train_all/text put word strings in L (scraped from wherever), roughly 10 words per line, at most 500k lines. These may be quite noisy; they'll be cleaned up anyway.
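
For example, a minimal sketch of getting raw scraped text into that shape (scraped.txt is a hypothetical input file; real cleanup needs are language-specific):

    mkdir -p $L/train_all
    # Emit one word at a time, breaking the line after every 10th word.
    awk '{ for (i = 1; i <= NF; i++) printf "%s%s", $i, (++n % 10 ? " " : "\n") }' \
      scraped.txt | head -n 500000 > $L/train_all/text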

Get a G2P.

  • Into $L/train_all/g2aspire.txt put a G2P, a few hundred lines each containing grapheme(s), whitespace, and space-delimited Aspire-style phones.
    If it has CR line terminators, convert them to newlines in vi with %s/^M/\r/g, typing control-V before the ^M.
    If it starts with a BOM, remove it: vi -b g2aspire.txt, and just x that character away. (A non-interactive equivalent is sketched after this list.)

  • If you need to build the G2P, ./g2ipa2asr.py $L_wikipedia_symboltable.txt aspire2ipa.txt phoibletable.csv > $L/train_all/g2aspire.txt.
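
A non-interactive equivalent of the vi cleanup above, as a sketch (assumes GNU sed):

    # Turn bare CRs into newlines, and strip a leading UTF-8 BOM if present.
    sed -i -e 's/\r/\n/g' -e '1s/^\xef\xbb\xbf//' $L/train_all/g2aspire.txt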

Build an ASR.

  • ./run.sh $L makes an L-customized HCLG.fst.
  • To instead use a prebuilt LM, ./run_from_wordlist.sh $L. See that script for usage.
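
Either way, OpenFst's fstinfo (on the path once you . path.sh) gives a quick sanity check of the generated graph; point it at wherever the new HCLG.fst landed:

    . path.sh
    fstinfo path/to/HCLG.fst | head    # fst type, arc type, state and arc counts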

Transcribe speech:

Get recordings.

On ifp-serv-03.ifp.illinois.edu, get LDC speech and convert it to a flat dir of 8 kHz .wav files. First cd to the corpus, one of:

    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Russian/LDC2016E111/RUS_20160930
    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Tamil/TAM_EVAL_20170601/TAM_EVAL_20170601
    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Uzbek/LDC2016E66/UZB_20160711

    mkdir /tmp/8k
    for f in */AUDIO/*.flac; do sox "$f" -r 8000 -c 1 /tmp/8k/$(basename ${f%.*}.wav); done
    tar cf /workspace/ifp-53_1-data/eval/8k.tar -C /tmp 8k
    rm -rf /tmp/8k

For BABEL .sph files:

    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Assamese/LDC2016E02/conversational/training/audio
    tar cf /tmp/foo.tar BABEL*.sph
    scp /tmp/foo.tar ifp-53:/tmp

On ifp-53,

    mkdir ~/kaldi/egs/aspire/asr24/$L-8khz
    cd myTmpSphDir
    tar xf /tmp/foo.tar
    for f in *.sph; do ~/kaldi/tools/sph2pipe_v2.5/sph2pipe -p -f rif "$f" /tmp/a.wav; \
        sox /tmp/a.wav -r 8000 -c 1 ~/kaldi/egs/aspire/asr24/$L-8khz/$(basename ${f%.*}.wav); done

On the host that will run the transcription, e.g. ifp-53:

    cd kaldi/egs/aspire/asr24
    wget -qO- http://www.ifp.illinois.edu/~camilleg/e/8k.tar | tar xf -
    mv 8k $L-8khz
  • ./mkscp.rb $L-8khz $(nproc) $L splits the ASR tasks into one job per CPU core, each job getting roughly the same total audio duration (the balancing idea is sketched after this list).
    It reads $L-8khz, the dir of 8 kHz speech files.
    It makes $L-submit.sh.
  • ./$L-submit.sh launches these jobs in parallel.
  • After those jobs complete, collect the transcriptions with
    grep -h -e '^TAM_EVAL' $L/lat/*.log | sort > $L-scrips.txt (or ...^RUS_, ^BABEL_, etc.).
  • To sftp transcriptions to Jon May as elisa.tam-eng.eval-asr-uiuc.y3r1.v8.xml.gz, with timestamp June 11 and version 8,
    grep -h -e '^TAM_EVAL' tamil/lat/*.log | sort | sed -e 's/ /\t/' | ./hyp2jonmay.rb /tmp/jon-tam tam 20180611 8
    (If UTF-8 errors occur, simplify letters by appending to the sed command args such as -e 's/Ñ/N/g'.)
  • Collect each .wav file's n best transcriptions with
    cat $L/lat/*.ascii | sort > $L-nbest.txt.
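
For reference, mkscp.rb's duration balancing can be approximated in plain shell (a sketch only, not the repo's script): sort the files longest-first by soxi duration, then deal them round-robin into $(nproc) lists, which roughly equalizes each job's total audio:

    cd $L-8khz
    # soxi -D prints each file's duration in seconds.
    for f in *.wav; do printf '%s %s\n' "$(soxi -D "$f")" "$f"; done |
      sort -rn | awk -v n=$(nproc) '{ print $2 >> ("job" NR % n ".list") }'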

Special postprocessing.

If your transcriptions used nonsense English words, convert them to phones and then, via a trie or longest common substring, into L-words:

  • ./trie-$L.rb < trie1-scrips.txt > $L-trie-scrips.txt.
  • make multicore-$L; wait; grep ... > $L-lcs-scrips.txt.

Typical results.

RUS_20160930 was transcribed in 67 minutes, 13 MB/min, 12x faster than real time.

A 3.1 GB subset of Assamese LDC2016E02 was transcribed in 440 minutes, 7 MB/min, 6.5x real time. (This may have been slower because it exhausted ifp-53's memory.)

Arabic/NEMLAR_speech/NMBCN7AR, 2.2 GB (40 hours), was transcribed in 147 minutes, 14 MB/min, 16x real time. (This may have been faster because it was a few long (half-hour) files instead of many brief ones.)

TAM_EVAL_20170601 was transcribed in 45 minutes, 21 MB/min, 19x real time.

Generating lattices $L/lat/* took 1.04x as long for Russian, 0.93x as long(!) for Arabic, and 1.7x as long for Tamil.
