ryanleary / Patter

License: MIT
Speech-to-text in PyTorch

Programming Languages

python

Projects that are alternatives of or similar to Patter

Rnn ctc
Recurrent Neural Network and Long Short-Term Memory (LSTM) with Connectionist Temporal Classification implemented in Theano. Includes a toy training example.
Stars: ✭ 220 (+209.86%)
Mutual labels:  speech-recognition, rnn, speech-to-text, ocr
Adapt
Adapt Intent Parser
Stars: ✭ 690 (+871.83%)
Mutual labels:  speech-recognition, speech-to-text
Dragonfire
The open-source virtual assistant for Ubuntu-based Linux distributions
Stars: ✭ 1,120 (+1477.46%)
Mutual labels:  speech-recognition, speech-to-text
Stephanie Va
Stephanie is an open-source platform built specifically for voice-controlled applications as well as to automate daily tasks, imitating much of a virtual assistant's work.
Stars: ✭ 772 (+987.32%)
Mutual labels:  speech-recognition, speech-to-text
Silero Models
Silero Models: pre-trained STT models and benchmarks made embarrassingly simple
Stars: ✭ 522 (+635.21%)
Mutual labels:  speech-recognition, speech-to-text
Sonus
💬 /so.nus/ STT (speech to text) for Node with offline hotword detection
Stars: ✭ 532 (+649.3%)
Mutual labels:  speech-recognition, speech-to-text
Eesen
The official repository of the Eesen project
Stars: ✭ 738 (+939.44%)
Mutual labels:  speech-recognition, speech-to-text
Voice Overlay Ios
🗣 An overlay that gets your user’s voice permission and input as text in a customizable UI
Stars: ✭ 440 (+519.72%)
Mutual labels:  speech-recognition, speech-to-text
Discordspeechbot
A speech-to-text bot for Discord with music commands and more, built with NodeJS. Ideal for controlling your Discord server using voice commands; it can also be useful for hearing-impaired people.
Stars: ✭ 35 (-50.7%)
Mutual labels:  speech-recognition, speech-to-text
Artyom.js
A voice control, voice command, speech recognition, and speech synthesis JavaScript library. Create your own Siri, Google Now, or Cortana with Google Chrome within your website.
Stars: ✭ 1,011 (+1323.94%)
Mutual labels:  speech-recognition, speech-to-text
Openasr
A PyTorch-based end-to-end speech recognition system.
Stars: ✭ 69 (-2.82%)
Mutual labels:  speech-recognition, speech-to-text
Java Speech Api
The J.A.R.V.I.S. Speech API is designed to be simple and efficient, using the speech engines created by Google to provide functionality for parts of the API. Essentially, it is an API written in Java, including a recognizer, synthesizer, and a microphone capture utility. The project uses Google services for the synthesizer and recognizer. While this requires an Internet connection, it provides a complete, modern, and fully functional speech API in Java.
Stars: ✭ 490 (+590.14%)
Mutual labels:  speech-recognition, speech-to-text
Speech To Text Benchmark
speech to text benchmark framework
Stars: ✭ 481 (+577.46%)
Mutual labels:  speech-recognition, speech-to-text
Speech recognition
Speech recognition module for Python, supporting several engines and APIs, online and offline.
Stars: ✭ 5,999 (+8349.3%)
Mutual labels:  speech-recognition, speech-to-text
Speech Demo
Speech API examples
Stars: ✭ 454 (+539.44%)
Mutual labels:  speech-recognition, speech-to-text
Annyang
💬 Speech recognition for your site
Stars: ✭ 6,216 (+8654.93%)
Mutual labels:  speech-recognition, speech-to-text
Audio Pretrained Model
A collection of Audio and Speech pre-trained models.
Stars: ✭ 61 (-14.08%)
Mutual labels:  speech-recognition, speech-to-text
Rhino
On-device speech-to-intent engine powered by deep learning
Stars: ✭ 406 (+471.83%)
Mutual labels:  speech-recognition, speech-to-text
Asrt speechrecognition
A deep-learning-based Chinese speech recognition system
Stars: ✭ 4,943 (+6861.97%)
Mutual labels:  speech-recognition, speech-to-text
Kur
Descriptive Deep Learning
Stars: ✭ 811 (+1042.25%)
Mutual labels:  speech-recognition, speech-to-text

patter

A speech-to-text framework in PyTorch with initial support for the DeepSpeech2 architecture and its variants.

Features

  • File-based configuration of corpus definitions, model architecture, and training settings for repeatability
  • DeepSpeech model is highly configurable
    • Various RNN types (RNN, LSTM, GRU) and sizes (layers/hidden units)
    • Various activation functions (Clipped ReLU, Swish)
    • Forward-only RNN with Lookahead (for streaming) or Bidirectional RNN
    • Configurable CNN frontend
    • Optional batchnorm
    • Optional RNN weight noise
  • Beam decoder with KenLM support
  • Dataset augmentation with support for:
    • speed perturbations
    • gain perturbations
    • shift (in time) perturbations
    • noise addition (at random SNR)
    • impulse response perturbations
  • Tensorboard integration
  • gRPC-based model server

Installation

Manual installation of two dependencies is required:

Once these dependencies are installed, patter can be installed by simply running python setup.py install. For debugging and development purposes, patter can instead be installed with python setup.py develop.

Dataset Definitions

Datasets for patter are defined using JSON-lines files: newline-separated JSON objects. Each line contains a JSON object that defines an utterance's audio path, transcription path, and duration in seconds.

{"audio_filepath": "/path/to/utterance1.wav", "text_filepath": "/path/to/utterance1.txt", "duration": 23.147}
{"audio_filepath": "/path/to/utterance2.wav", "text_filepath": "/path/to/utterance2.txt", "duration": 18.251}

Training

Patter includes a top-level trainer script which calls into the underlying library methods for training. To use the built-in command-line trainer, three files must be defined: a corpus configuration, a model configuration, and a training configuration. Examples of each are provided below.

Corpus Configuration

A corpus configuration file specifies the training and validation sets in the corpus as well as any augmentation that should be applied to the audio. See the example configuration below for further documentation of the options.

# Filter the audio configured in the `datasets` below to be within the min and max duration. Remove min or max (or both)
# to disable filtering
min_duration = 1.0
max_duration = 17.0

# Link to manifest files (as described above) for the training and validation sets. A future release will allow multiple
# files to be specified for merging corpora on the fly. If `augment` is true, each audio file will be passed through the
# augmentation pipeline specified below. Valid names for the datasets are in the set ["train", "val"]
[[dataset]]
name = "train"
manifest = "/path/to/corpora/train.json"
augment = true

[[dataset]]
name = "val"
manifest = "/path/to/corpora/val.json"
augment = false


# Optional augmentation pipeline. If specified, audio from a dataset with the augment flag set to true will be passed
# through each augmentation, in order. Each augmentation must minimally specify a type and a probability: the
# probability with which the augmentation is applied to a given audio file.

# The noise augmentation mixes audio from a dataset of noise files with a random SNR drawn from within the range specified.
[[augmentation]]
type = "noise"
prob = 0.0
[augmentation.config]
manifest = "/path/to/noise_manifest.json"
min_snr_db = 3
max_snr_db = 35

# The impulse augmentation applies a random impulse response drawn from the manifest to the audio 
[[augmentation]]
type = "impulse"
prob = 0.0
[augmentation.config]
manifest = "/path/to/impulse_manifest.json"

# The speed augmentation applies a random speed perturbation without altering pitch
[[augmentation]]
type = "speed"
prob = 1.0
[augmentation.config]
min_speed_rate = 0.95
max_speed_rate = 1.05

# The shift augmentation simply adds a random amount of silence to the audio or removes some of the initial audio
[[augmentation]]
type = "shift"
prob = 1.0
[augmentation.config]
min_shift_ms = -5
max_shift_ms = 5

# The gain augmentation modifies the gain of the audio by a fixed amount randomly chosen within the specified range
[[augmentation]]
type = "gain"
prob = 1.0
[augmentation.config]
min_gain_dbfs = -10
max_gain_dbfs = 10
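
As a rough illustration of what the noise augmentation does, the sketch below mixes a noise clip into an utterance at a random SNR drawn from the configured range. It assumes both signals are already loaded as float NumPy arrays at the same sample rate; it is not patter's implementation.

import numpy as np

def add_noise(speech, noise, min_snr_db=3.0, max_snr_db=35.0, rng=np.random):
    """Mix `noise` into `speech` at a random signal-to-noise ratio (in dB)."""
    snr_db = rng.uniform(min_snr_db, max_snr_db)
    # Tile or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale the noise so that speech_power / (scale**2 * noise_power) equals 10**(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise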

Model Configuration

At this time, patter supports only variants of the DeepSpeech 2 and DeepSpeech 3 architectures (DS3 being the same as DS2, but with weight noise in place of BatchNorm). Additional architectures, including novel ones, may be added in future releases. To configure the architecture and hyperparameters, define the model in a configuration TOML file. See the example:

# model class - only DeepSpeechOptim currently
model = "DeepSpeechOptim"

# define input features/windowing. Currently only STFT features are supported, but the window is configurable.
[input]
type = "stft"
normalize = true
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hamming"

# Define layers of [2d CNN -> Activation -> Optional BatchNorm] as a frontend
[[cnn]]
filters = 32
kernel = [41, 11]
stride = [2, 2]
padding = [0, 10]
batch_norm = true
activation = "hardtanh"
activation_params = [0, 20]

[[cnn]]
filters = 32
kernel = [21, 11]
stride = [2, 1]
padding = [0, 2]
batch_norm = true
activation = "hardtanh"
activation_params = [0, 20]

# Configure the RNN. Currently LSTM, GRU, and RNN are supported. QRNN will be added for forward-only models in a future release
[rnn]
type = "lstm"
bidirectional = true
size = 512
layers = 4
batch_norm = true

# DS3 suggests using weight noise instead of batch norm; only set this when rnn batch_norm = false
#[rnn.noise]
#mean=0.0
#std=0.001

# only used/necessary when rnn bidirectional = false
#[context]
#context = 20
#activation = "swish"

# Set of labels for the model to predict. Specifying a label for the CTC 'blank' symbol is not required; it is handled automatically
[labels]
labels = [
  "'", "A", "B", "C", "D", "E", "F", "G", "H",
  "I", "J", "K", "L", "M", "N", "O", "P", "Q",
  "R", "S", "T", "U", "V", "W", "X", "Y", "Z", " ",
]
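
As a sanity check on the configuration above: the 0.02 s window at 16 kHz in [input] corresponds to 320 samples and therefore 161 frequency bins per STFT frame, the 0.01 s stride yields roughly 100 frames per second, and the 28 labels plus the automatic CTC blank give the network 29 output units. A small sketch of that arithmetic (illustrative only, not patter code):

sample_rate = 16000            # Hz, from [input]
window_size = 0.02             # seconds per STFT window
window_stride = 0.01           # seconds between successive windows
n_labels = 28                  # characters listed under [labels]

n_fft = int(sample_rate * window_size)      # 320 samples per window
n_freq_bins = n_fft // 2 + 1                # 161 frequency bins per frame
frames_per_sec = 1.0 / window_stride        # ~100 frames per second of audio
n_outputs = n_labels + 1                    # 29 outputs, including the CTC blank

print(n_fft, n_freq_bins, frames_per_sec, n_outputs)   # 320 161 100.0 29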

Trainer Configuration

The trainer configuration file includes metadata about the model to be created and where to store models, logs, and TensorBoard output, in addition to the NN trainer configuration.

# give the trained model a name
id = "expt-name"
cuda = true

[output]
model_path = "/path/to/best/model.pt"
log_path = "/path/to/tensorboard/logs"

[trainer]
epochs = 20
batch_size = 32
num_workers = 4
max_norm = 400

[trainer.optimizer]
# Currently SGD and Adam are supported
optimizer = "sgd"
lr = 3e-4
momentum = 0.9
anneal = 0.85
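
A minimal sketch of how this optimizer block maps onto PyTorch, assuming (as in SeanNaren's deepspeech.pytorch, on which patter builds) that anneal is a multiplicative factor applied to the learning rate after every epoch and max_norm is a gradient clipping threshold:

import torch

model = torch.nn.Linear(161, 29)  # stand-in for the real DeepSpeech model
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.85)  # anneal

for epoch in range(20):  # trainer.epochs
    # ... forward/backward passes over the training set go here ...
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=400)  # trainer.max_norm
    optimizer.step()
    scheduler.step()  # multiply the learning rate by 0.85 after each epoch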

Testing

A patter-test script is provided for evaluating a trained model. It takes a testing configuration file and a trained model as arguments. An example testing configuration:

cuda = true
batch_size = 10
num_workers = 4

[[dataset]]
name = "test"
manifest = "/path/to/manifests/test.jl"
augment = false

[decoder]
algorithm = "greedy" # or "beam"
workers = 4

# If `beam` is specified as the decoder type, the settings below are used to initialize the beam decoder
[decoder.beam]
beam_width = 30
cutoff_top_n = 40
cutoff_prob = 1.0

# If "beam" is specified and you want to use a language model, configure the ARPA or KenLM format LM and alpha/beta weights
[decoder.beam.lm]
lm_path = "/path/to/language/model.arpa"
alpha = 2.15
beta = 0.35
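
For reference, the "greedy" algorithm simply takes the highest-probability label at every frame and applies the standard CTC collapse (merge repeated labels, drop blanks), whereas the beam decoder searches over many candidate transcripts and can rescore them with the KenLM language model. A minimal greedy-decode sketch (illustrative, not patter's decoder):

import numpy as np

LABELS = ["'", "A", "B", "C", " "]   # truncated label set, for illustration only
BLANK = len(LABELS)                  # the CTC blank occupies the extra, final index

def greedy_decode(log_probs):
    """Collapse a (time, n_labels + 1) matrix of per-frame log-probabilities into text."""
    best = np.argmax(log_probs, axis=1)
    chars, prev = [], BLANK
    for idx in best:
        if idx != prev and idx != BLANK:   # merge repeats, then drop blanks
            chars.append(LABELS[idx])
        prev = idx
    return "".join(chars)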

Acknowledgements

Huge thanks to SeanNaren, whose work on deepspeech.pytorch is leveraged heavily in this project.
