Picovoice / Speech To Text Benchmark

Licence: apache-2.0
speech to text benchmark framework

Speech-to-Text Benchmark

Made in Vancouver, Canada by Picovoice

This is a minimalist and extensible framework for benchmarking different speech-to-text engines. It has been developed and tested on Ubuntu 18.04 (x86_64) using Python 3.6.

Background

This framework has been developed by Picovoice as part of the Cheetah project. Cheetah is Picovoice's streaming speech-to-text engine, designed to run efficiently on the edge (offline). Deep learning has been the main driver of recent improvements in speech recognition, but because of the stringent compute and storage limitations of IoT platforms, it has mostly benefited cloud-based engines. Picovoice's proprietary deep learning technology brings these improvements to IoT platforms with a significantly lower CPU and memory footprint.

Data

The LibriSpeech dataset is used for benchmarking. We use the test-clean portion.

Metrics

This benchmark considers three metrics: word error rate, real-time factor, and model size.

Word Error Rate

Word error rate (WER) is defined as the ratio of the Levenshtein distance between the words in a reference transcript and the words in the output of the speech-to-text engine, to the number of words in the reference transcript.
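The definition above can be illustrated with a minimal sketch. The function name `word_error_rate` is hypothetical and this is not necessarily how benchmark.py computes the metric; in particular, it assumes whitespace tokenization and no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row[j] = min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            )
        previous_row = current_row

    return previous_row[-1] / len(ref_words)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deleted word against six reference words. Note that WER can exceed 1.0 when the engine inserts many extra words.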

Real Time Factor

Real time factor (RTF) is measured as the ratio of CPU (processing) time to the length of the input speech file. A speech-to-text engine with lower RTF is more computationally efficient. We omit this metric for cloud-based engines.

Model Size

The aggregate size of models (acoustic and language), in MB. We omit this metric for cloud-based engines.
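As a rough sketch of how such an aggregate could be computed, the snippet below sums the sizes of a list of model files. The function name `aggregate_model_size_mb` is hypothetical, and the snippet assumes 1 MB = 1,000,000 bytes, which the text above does not specify.

```python
import os


def aggregate_model_size_mb(paths, getsize=os.path.getsize):
    """Sum the sizes of the given model files and return the total in MB.

    Assumption: 1 MB is taken as 1,000,000 bytes. The getsize parameter
    exists only so the function can be exercised without real files.
    """
    total_bytes = sum(getsize(path) for path in paths)
    return total_bytes / 1_000_000


# Example call with the Cheetah model paths used elsewhere in this README:
# aggregate_model_size_mb([
#     "resources/cheetah/acoustic_model.pv",
#     "resources/cheetah/language_model.pv",
# ])
```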

Speech-to-Text Engines

Amazon Transcribe

Amazon Transcribe is a cloud-based speech recognition engine, offered by AWS. Find more information here.

CMU PocketSphinx

PocketSphinx works offline and can run on embedded platforms such as Raspberry Pi.

Google Speech-to-Text

A cloud-based speech recognition engine offered by Google Cloud Platform. Find more information here.

Mozilla DeepSpeech

Mozilla DeepSpeech is an open-source implementation of Baidu's DeepSpeech by Mozilla.

Picovoice Cheetah

Cheetah is a streaming speech-to-text engine developed using Picovoice's proprietary deep learning technology. It works offline and is supported on a growing number of platforms including Android, iOS, and Raspberry Pi.

Picovoice Leopard

Leopard is a speech-to-text engine developed using Picovoice's proprietary deep learning technology. It works offline and is supported on a growing number of platforms including Android, iOS, and Raspberry Pi.

Usage

Below is information on how to use this framework to benchmark the speech-to-text engines.

  1. Make sure that you have installed DeepSpeech and PocketSphinx on your machine by following the instructions on their official pages.
  2. Unpack DeepSpeech's models under resources/deepspeech.
  3. Download the test-clean portion of LibriSpeech and unpack it under resources/data.
  4. To run Google Speech-to-Text and Amazon Transcribe, you need to sign up with the respective cloud provider and set up permissions / credentials according to their documentation. Running these services may incur fees.

Word Error Rate Measurement

Word Error Rate can be measured by running the following command from the root of the repository:

python benchmark.py --engine_type AN_ENGINE_TYPE

The valid options for the engine_type parameter are: AMAZON_TRANSCRIBE, CMU_POCKET_SPHINX, GOOGLE_SPEECH_TO_TEXT, MOZILLA_DEEP_SPEECH, PICOVOICE_CHEETAH, PICOVOICE_CHEETAH_LIBRISPEECH_LM, PICOVOICE_LEOPARD, and PICOVOICE_LEOPARD_LIBRISPEECH_LM.

PICOVOICE_CHEETAH_LIBRISPEECH_LM is the same as PICOVOICE_CHEETAH except that its language model is trained on LibriSpeech training text, similar to Mozilla DeepSpeech. The same applies to PICOVOICE_LEOPARD_LIBRISPEECH_LM and PICOVOICE_LEOPARD.
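To benchmark several engines in one go, the command above can be wrapped in a small driver script. The sketch below is an illustration only: `build_command` and `run_all` are hypothetical helpers that are not part of the repository, and the script assumes it is run from the repository root with `python` on the PATH.

```python
import subprocess

# Engine types listed in this README.
ENGINE_TYPES = [
    "AMAZON_TRANSCRIBE",
    "CMU_POCKET_SPHINX",
    "GOOGLE_SPEECH_TO_TEXT",
    "MOZILLA_DEEP_SPEECH",
    "PICOVOICE_CHEETAH",
    "PICOVOICE_CHEETAH_LIBRISPEECH_LM",
    "PICOVOICE_LEOPARD",
    "PICOVOICE_LEOPARD_LIBRISPEECH_LM",
]


def build_command(engine_type):
    """Return the argument list for one benchmark run."""
    if engine_type not in ENGINE_TYPES:
        raise ValueError("unknown engine type: " + engine_type)
    return ["python", "benchmark.py", "--engine_type", engine_type]


def run_all(engine_types=ENGINE_TYPES):
    """Run the benchmark once per engine, stopping at the first failure."""
    for engine_type in engine_types:
        subprocess.run(build_command(engine_type), check=True)


# Example: benchmark only the offline Picovoice engines.
# run_all(["PICOVOICE_CHEETAH", "PICOVOICE_LEOPARD"])
```

Keep in mind that including the cloud-based engines in such a loop may incur fees.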

Real Time Factor Measurement

The time command is used to measure the execution time of each engine on a given audio file. The CPU time is then divided by the length of the audio. To measure the execution time for Cheetah, run:

time resources/cheetah/cheetah_demo \
resources/cheetah/libpv_cheetah.so \
resources/cheetah/acoustic_model.pv \
resources/cheetah/language_model.pv \
resources/cheetah/cheetah_eval_linux.lic \
PATH_TO_WAV_FILE

The output should have the following format (values may be different):

real	0m4.961s
user	0m4.936s
sys	0m0.024s

Then, divide the user value by the length of the audio file, in seconds. The user value is the actual CPU time spent in the program.
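The division can also be scripted. The following is a minimal sketch, assuming the `user` value has the `XmY.YYYs` format shown above and the input is an uncompressed WAV file readable by Python's standard `wave` module. All function names are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def parse_time_value(value):
    """Convert a value printed by time, such as '0m4.936s', to seconds."""
    minutes, _, seconds = value.strip().rstrip("s").partition("m")
    return int(minutes) * 60 + float(seconds)


def real_time_factor(user_time, audio_seconds):
    """Divide the CPU (user) time by the audio length."""
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    return parse_time_value(user_time) / audio_seconds


# Example, with PATH_TO_WAV_FILE standing in for a real file:
# real_time_factor("0m4.936s", wav_duration_seconds("PATH_TO_WAV_FILE"))
```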

To measure the execution time for Leopard, run:

time resources/leopard/leopard_demo \
resources/leopard/libpv_leopard.so \
resources/leopard/acoustic_model.pv \
resources/leopard/language_model.pv \
resources/leopard/leopard_eval_linux.lic \
PATH_TO_WAV_FILE

For DeepSpeech:

time deepspeech \
--model resources/deepspeech/output_graph.pbmm \
--lm resources/deepspeech/lm.binary \
--trie resources/deepspeech/trie \
--audio PATH_TO_WAV_FILE

Finally, for PocketSphinx:

time pocketsphinx_continuous -infile PATH_TO_WAV_FILE
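As an alternative to reading the output of time by hand, the user CPU time of a command can be collected from Python. This is a sketch, not part of the framework: `child_user_cpu_seconds` is a hypothetical helper, and it relies on the Unix-only `resource` module, which is available on the Ubuntu setup described in this README.

```python
import resource
import subprocess
import sys


def child_user_cpu_seconds(command):
    """Run a command and return the user CPU time it consumed, in seconds.

    RUSAGE_CHILDREN accumulates the usage of child processes that have
    terminated and been waited for, so the difference between the two
    readings is the user time of the command that was just run.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    subprocess.run(
        command,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    return after - before


# Quick check that needs no engine installed:
# child_user_cpu_seconds([sys.executable, "-c", "pass"])
#
# With PocketSphinx, using the command shown above:
# child_user_cpu_seconds(["pocketsphinx_continuous", "-infile", "PATH_TO_WAV_FILE"])
```

Dividing the returned value by the audio length in seconds gives the real time factor.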

Results

The results below were obtained by following the steps above. Benchmarking was performed on a Linux machine running Ubuntu 18.04 with 64 GB of RAM and an Intel i5-6500 CPU running at 3.2 GHz. WER refers to word error rate and RTF refers to real time factor.

| Engine | WER | RTF (Desktop) | RTF (Raspberry Pi 3) | RTF (Raspberry Pi Zero) | Model Size (Acoustic and Language) |
|---|---|---|---|---|---|
| Amazon Transcribe | 8.21% | N/A | N/A | N/A | N/A |
| CMU PocketSphinx (0.1.15) | 31.82% | 0.32 | 1.87 | 2.04 | 97.8 MB |
| Google Speech-to-Text | 12.23% | N/A | N/A | N/A | N/A |
| Mozilla DeepSpeech (0.6.1) | 7.55% | 0.46 | N/A | N/A | 1146.8 MB |
| Picovoice Cheetah (v1.2.0) | 10.49% | 0.04 | 0.62 | 3.11 | 47.9 MB |
| Picovoice Cheetah LibriSpeech LM (v1.2.0) | 8.25% | 0.04 | 0.62 | 3.11 | 45.0 MB |
| Picovoice Leopard (v1.0.0) | 8.34% | 0.02 | 0.55 | 2.55 | 47.9 MB |
| Picovoice Leopard LibriSpeech LM (v1.0.0) | 6.58% | 0.02 | 0.55 | 2.55 | 45.0 MB |

The figure below compares the word error rate of speech-to-text engines. For Picovoice, we included the engine that was trained on LibriSpeech training data similar to Mozilla DeepSpeech.

The figure below compares accuracy and runtime metrics of offline speech-to-text engines. For Picovoice, we included the engines that were trained on LibriSpeech training data similar to Mozilla DeepSpeech. Leopard achieves the highest accuracy while being 23X faster (desktop RTF of 0.02 versus 0.46) and roughly 25X smaller (45.0 MB versus 1146.8 MB) than the second most accurate engine (Mozilla DeepSpeech).

License

The benchmarking framework is freely available and can be used under the Apache 2.0 license. The provided Cheetah and Leopard resources (binary, model, and license file) are the property of Picovoice. They are only to be used for evaluation purposes and their use in any commercial product is strictly prohibited.

For commercial enquiries contact us via this form.
