
Self-supervised speech recognition with a limited amount of labeled data

This is a wrapper around the wav2vec 2.0 framework, which attempts to build accurate speech recognition models with a small amount of transcribed data (e.g. 1 hour).

Transfer learning is still the main technique:

  • Transfer from self-supervised models (pretrained on unlabeled data)
  • Transfer from multilingual models (pretrained on multilingual data)

Required resources

1. Labeled data, i.e. pairs of (audio, transcript)

The more you have, the better the model will be. If you have a large amount of unlabeled data, prepare at least 1 hour of labeled data. Otherwise, at least 50 hours is recommended.

2. Text data for building language models.

This should include both well-written text and conversational text, which can easily be collected from news and forum websites. At least 1 GB of text is recommended.

3. Unlabeled data (audio without transcriptions) in your own language.

This is optional but highly valuable. A good amount of unlabeled audio (e.g. 500 hours) will significantly reduce the amount of labeled data needed and also boost model performance. YouTube and podcasts are great places to collect such data for your language.

Installation instructions

Please follow this instruction.

Steps to build an accurate speech recognition model for your language

1. Train a self-supervised model on unlabeled data (Pretrain)

1.1 Prepare unlabeled audios

Collect unlabeled audio files and put them all in a single directory. Audio format requirements:

  • Format: wav, PCM 16 bit, single channel
  • Sampling rate: 16000 Hz
  • Length: 5 to 30 seconds
  • Content: silence should be removed from the audio, and each audio should contain only one person speaking

Please look at the examples/unlabel_audio directory for reference. A preparation sketch follows.
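A minimal preparation sketch (not part of this repo), assuming the third-party pydub package (which requires ffmpeg) is installed; directory names are placeholders:

import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

SRC_DIR, DST_DIR = 'raw_audio', 'unlabel_audio'
os.makedirs(DST_DIR, exist_ok=True)

for name in os.listdir(SRC_DIR):
    seg = AudioSegment.from_file(os.path.join(SRC_DIR, name))
    # convert to the required format: 16 kHz, mono, 16-bit PCM (sample width 2)
    seg = seg.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    # split on silence so each clip contains continuous speech
    chunks = split_on_silence(seg, min_silence_len=500,
                              silence_thresh=seg.dBFS - 16, keep_silence=100)
    base = os.path.splitext(name)[0]
    for i, chunk in enumerate(chunks):
        if 5000 <= len(chunk) <= 30000:  # keep 5-30 second clips only
            chunk.export(os.path.join(DST_DIR, f'{base}_{i}.wav'), format='wav')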

1.2 Download an initial model

Instead of training from scratch, we download and use the English wav2vec model for weight initialization. This practice can be applied to all languages.

wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt

1.3 Run Pre-training

python3 pretrain.py --fairseq_path path/to/libs/fairseq --audio_path path/to/audio_directory --init_model path/to/wav2vec_small.pt

Where:

  • fairseq_path: path to the installed fairseq library (see the installation instructions)
  • audio_path: path to the unlabeled audio directory
  • init_model: the model downloaded in step 1.2

Logs and checkpoints will be stored in the outputs directory.
Log file path: outputs/date_time/exp_id/hydra_train.log. Check the loss value to decide when to stop training; a monitoring sketch follows.
Best checkpoint path: outputs/date_time/exp_id/checkpoints/checkpoint_best.pt
In my case, it took ~4 days for the model to converge, training on 100 hours of data using 2 NVIDIA Tesla V100 GPUs.
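A minimal monitoring sketch (hypothetical, not part of this repo) for pulling loss values out of the log. fairseq's log format varies by version, so the regex is an assumption and may need adjusting:

import re

losses = []
with open('outputs/date_time/exp_id/hydra_train.log') as f:
    for line in f:
        m = re.search(r'"loss":\s*"?([0-9.]+)"?', line)  # assumed log format
        if m:
            losses.append(float(m.group(1)))

print(f'{len(losses)} loss values, latest: {losses[-1] if losses else None}')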

2. Finetune the self-supervised model on the labeled data

2.1 Prepare labeled data

--- Transcript file ---
One training sample per line, in the format "audio_absolute_path<tab>transcript".
Example of a transcript file:

/path/to/1.wav AND IT WAS A MATTER OF COURSE THAT IN THE MIDDLE AGES WHEN THE CRAFTSMEN
/path/to/2.wav AND WAS IN FACT THE KIND OF LETTER USED IN THE MANY SPLENDID MISSALS PSALTERS PRODUCED BY PRINTING IN THE FIFTEENTH CENTURY
/path/to/3.wav JOHN OF SPIRES AND HIS BROTHER VINDELIN FOLLOWED BY NICHOLAS JENSON BEGAN TO PRINT IN THAT CITY
/path/to/4.wav BEING THIN TOUGH AND OPAQUE

Some notes on transcript file:

  • One sample per line
  • Upper case
  • All numbers should be transformed into verbal form
  • All special characters (e.g. punctuation) should be removed; the final text should contain words only
  • Words in a sentence must be separated by a whitespace character
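A minimal validation sketch (not part of this repo) that checks a transcript file against these rules; the file name is a placeholder, and the allowed character set should be adapted to your language:

import os
import re

with open('transcript.txt') as f:
    for n, line in enumerate(f, 1):
        parts = line.rstrip('\n').split('\t')
        if len(parts) != 2:
            print(f'line {n}: expected exactly one tab separator')
            continue
        audio_path, text = parts
        if not os.path.isfile(audio_path):
            print(f'line {n}: missing audio file {audio_path}')
        if text != text.upper():
            print(f'line {n}: transcript is not upper case')
        if re.search(r'[^A-Z ]', text):  # adjust the charset for your language
            print(f'line {n}: contains non-letter characters')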

--- Labeled audio files ---
Format: wav, PCM 16 bit, single channel, sampling rate 16000 Hz.
Silence should be removed from the audio, and each audio should contain only one person speaking.

2.2 Generate dictionary file

python3 gen_dict.py --transcript_file path/to/transcript.txt --save_dir path/to/save_dir

The dictionary file will be stored at save_dir/dict.ltr.txt. Use the file for fine-tuning and inference.
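For reference, fairseq letter dictionaries are plain text files with one token and its frequency per line, where | marks a word boundary. An illustrative sketch of the idea (use gen_dict.py itself for real training):

from collections import Counter

counts = Counter()
with open('transcript.txt') as f:
    for line in f:
        text = line.rstrip('\n').split('\t')[1]
        for word in text.split():
            counts.update(word)  # one count per letter
            counts['|'] += 1     # '|' terminates each word

with open('dict.ltr.txt', 'w') as f:
    for token, count in counts.most_common():
        f.write(f'{token} {count}\n')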

2.3 Run fine-tuning on the pretrained model

python3 finetune.py --transcript_file path/to/transcript.txt --pretrain_model path/to/pretrain_checkpoint_best.pt --dict_file path/to/dict.ltr.txt

Where:

  • transcript_file: path to the transcript file from step 2.1
  • pretrain_model: path to the best pretraining checkpoint from step 1.3
  • dict_file: the dictionary file generated in step 2.2

Logs and checkpoints will be stored in the outputs directory.
Log file path: outputs/date_time/exp_id/hydra_train.log. Check the loss value to decide when to stop training.
Best checkpoint path: outputs/date_time/exp_id/checkpoints/checkpoint_best.pt
In my case, it took ~12 hours for the model to converge, training on 100 hours of data using 2 NVIDIA Tesla V100 GPUs.

3. Train a language model

3.1 Prepare text corpus

Collect all texts and put them all together in a single file.
Text file format:

  • One sentence per line
  • Upper case
  • All numbers should be transformed into verbal form (see the normalization sketch after the examples below)
  • All special characters (e.g. punctuation) should be removed; the final text should contain words only
  • Words in a sentence must be separated by a whitespace character

Example of a text corpus file for the English case:

AND IT WAS A MATTER OF COURSE THAT IN THE MIDDLE AGES WHEN THE CRAFTSMEN
AND WAS IN FACT THE KIND OF LETTER USED IN THE MANY SPLENDID MISSALS PSALTERS PRODUCED BY PRINTING IN THE FIFTEENTH CENTURY
JOHN OF SPIRES AND HIS BROTHER VINDELIN FOLLOWED BY NICHOLAS JENSON BEGAN TO PRINT IN THAT CITY
BEING THIN TOUGH AND OPAQUE
...

Example of a text corpus file for the Chinese case:

每 个 人 都 有 他 的 作 战 策 略 直 到 脸 上 中 了 一 拳
这 是 我 年 轻 时 候 住 的 房 子 。
这 首 歌 使 我 想 起 了 我 年 轻 的 时 候 。
...
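A minimal normalization sketch (not part of this repo) for the rules above, assuming the third-party num2words package (pip install num2words); the character set and language code should be adapted to your corpus:

import re
from num2words import num2words

def normalize(line, lang='en'):
    # spell out digits in verbal form, e.g. '42' -> 'forty-two'
    line = re.sub(r'\d+', lambda m: num2words(int(m.group()), lang=lang), line)
    line = line.upper()
    line = re.sub(r'[^A-Z ]', ' ', line)  # extend the charset for your alphabet
    return ' '.join(line.split())         # collapse repeated whitespace

print(normalize('He bought 3 apples.'))   # -> HE BOUGHT THREE APPLES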

3.2 Train the language model

python3 train_lm.py --kenlm_path path/to/libs/kenlm --transcript_file path/to/transcript.txt --additional_file path/to/text_corpus.txt --ngram 3 --output_path ./lm

Where:

  • kenlm_path: path to the installed kenlm library (see the installation instructions)
  • transcript_file: path to transcript file from step 2.1
  • additional_file: path to text corpus file from step 3.1

The LM and the lexicon file will be stored in output_path.
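As a quick sanity check of the trained LM, you can score a normalized sentence with it (a sketch, assuming the kenlm Python binding that ships with the KenLM library is installed):

import kenlm

model = kenlm.Model('lm/lm.bin')
# higher (less negative) log10 scores mean the LM finds the sentence more likely
print(model.score('THIS IS A SENTENCE', bos=True, eos=True))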

4. Make predictions on multiple audio files programmatically

from stt import Transcriber
transcriber = Transcriber(pretrain_model = 'path/to/pretrain.pt', finetune_model = 'path/to/finetune.pt', 
                          dictionary = 'path/to/dict.ltr.txt',
                          lm_type = 'kenlm',
                          lm_lexicon = 'path/to/lm/lexicon.txt', lm_model = 'path/to/lm/lm.bin',
                          lm_weight = 1.5, word_score = -1, beam_size = 50)
hypos = transcriber.transcribe(['path/to/wavs/0_1.wav','path/to/wavs/0_2.wav'])
print(hypos)

Where:

  • pretrain_model: path to the best pretraining checkpoint from step 1.3
  • finetune_model: path to the best fine-tuned checkpoint from step 2.3
  • dictionary: dictionary file generated from step 2.2
  • lm_lexicon and lm_model: generated from step 3.2

Note: if you are running inference in a Jupyter notebook, add these lines above the inference script (the notebook's own command-line arguments would otherwise be picked up by fairseq's argument parser):

import sys
sys.argv = ['']

Pre-trained models (Pretrain + Fine-tune + LM)

An older version for Vietnamese speech recognition:

https://github.com/mailong25/self-supervised-speech-recognition/tree/vietnamese

References:

Paper: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (https://arxiv.org/abs/2006.11477)
Source code: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md

License

MIT
