yc9701 / Pansori

Licence: MIT

Tools for ASR Corpus Generation from Online Video

Programming Languages

python


Labels

speech-recognition corpus

Projects that are alternatives to or similar to Pansori

opensource-voice-tools

A repo listing known open source voice tools, ordered by where they sit in the voice stack

Stars: ✭ 21 (-80.19%)

Mutual labels: corpus, speech-recognition

megs

A merged version of multiple open-source German speech datasets.

Stars: ✭ 21 (-80.19%)

Mutual labels: corpus, speech-recognition

Cross vc

Cross-lingual Voice Conversion

Stars: ✭ 91 (-14.15%)

Mutual labels: speech-recognition

Kaldi Gop

Computes the GMM-based Goodness of Pronunciation (GOP). Based on Kaldi.

Stars: ✭ 104 (-1.89%)

Mutual labels: speech-recognition

Vosk Api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

Stars: ✭ 1,357 (+1180.19%)

Mutual labels: speech-recognition

Ai Study

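A comprehensive, well-organized collection of AI learning materials, covering machine learning fundamentals (ML), deep learning fundamentals (DL), computer vision (CV), natural language processing (NLP), recommender systems, speech recognition, graph neural networks, and interview questions for algorithm engineers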

Stars: ✭ 93 (-12.26%)

Mutual labels: speech-recognition

Openseq2seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP

Stars: ✭ 1,378 (+1200%)

Mutual labels: speech-recognition

Pyclue

Python toolkit for Chinese Language Understanding(CLUE) Evaluation benchmark

Stars: ✭ 91 (-14.15%)

Mutual labels: corpus

Self Supervised Speech Recognition

speech to text with self-supervised learning based on wav2vec 2.0 framework

Stars: ✭ 106 (+0%)

Mutual labels: speech-recognition

Audiomate

Python library for handling audio datasets.

Stars: ✭ 99 (-6.6%)

Mutual labels: speech-recognition

Wav2letter.pytorch

A fully convolutional network for speech-to-text, built on PyTorch.

Stars: ✭ 104 (-1.89%)

Mutual labels: speech-recognition

Factorized Tdnn

PyTorch implementation of the Factorized TDNN (TDNN-F) from "Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks" and Kaldi

Stars: ✭ 98 (-7.55%)

Mutual labels: speech-recognition

Chi Corpus

Mr. Chi's corpus

Stars: ✭ 96 (-9.43%)

Mutual labels: corpus

Speech And Text

Speech to text (PocketSphinx, iFlytek API, Baidu API) and text to speech (pyttsx3)

Stars: ✭ 102 (-3.77%)

Mutual labels: speech-recognition

Ktspeechcrawler

Automatically constructing corpus for automatic speech recognition from YouTube videos

Stars: ✭ 92 (-13.21%)

Mutual labels: speech-recognition

Delta

DELTA is a deep learning based natural language and speech processing platform.

Stars: ✭ 1,479 (+1295.28%)

Mutual labels: speech-recognition

Deep Learning Drizzle

Drench yourself in Deep Learning, Reinforcement Learning, Machine Learning, Computer Vision, and NLP by learning from these exciting lectures!!

Stars: ✭ 9,717 (+9066.98%)

Mutual labels: speech-recognition

Lexicon Thai

Thai-language lexicon

Stars: ✭ 96 (-9.43%)

Mutual labels: corpus

Pubmed Rct

PubMed 200k RCT dataset: a large dataset for sequential sentence classification.

Stars: ✭ 101 (-4.72%)

Mutual labels: corpus

Bigcidian

Pronunciation lexicon covering both English and Chinese languages for Automatic Speech Recognition.

Stars: ✭ 99 (-6.6%)

Mutual labels: speech-recognition


Pansori

Pansori is a program for creating an automatic speech recognition (ASR) corpus from online videos with audio and subtitle data.

Overview

It consists of four pipeline stages, as shown in the diagram above: ingest, align, transform, and validate.

Ingest

Online videos consist of multiple media streams, serving different screen resolutions as well as audio-only playback; hand-transcribed subtitles can also be retrieved when available. Pansori downloads the audio and subtitle streams of an online video as mp4 and srt files, respectively.

Align

The subtitles contain segmented text together with timing information that corresponds to the audio content of the associated video. Using this timing information, the audio stream can be segmented into matching pairs of audio and text fragments for an ASR corpus.
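Given the timing information, cutting one audio fragment per subtitle cue is mostly unit conversion: milliseconds become frame indices at the file's sample rate. The sketch below assumes the audio has already been decoded to an uncompressed PCM WAV file (Pansori itself downloads mp4, so a separate decoding step would be needed), and `cut_segment` is a hypothetical helper built on Python's standard `wave` module.

```python
import io
import wave


def cut_segment(wav_bytes: bytes, start_ms: int, end_ms: int) -> bytes:
    """Return a new WAV file holding only the audio between start_ms and end_ms.

    Assumes the input is an uncompressed PCM WAV file, for example audio that
    was decoded from the downloaded mp4 with an external tool beforehand.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate, total = src.getframerate(), src.getnframes()
        # Convert milliseconds to frame indices, clamped to the file length.
        first = min(max(0, start_ms * rate // 1000), total)
        last = min(max(first, end_ms * rate // 1000), total)
        src.setpos(first)
        frames = src.readframes(last - first)

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)  # keep channels, sample width and sample rate
        dst.writeframes(frames)  # the frame count in the header is corrected automatically
    return out.getvalue()
```

For a cue from 250 ms to 500 ms in 8 kHz audio, this keeps frames 2000 to 4000, i.e. 2000 frames.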

However, this segmentation can introduce inaccuracies, because subtitle timing may be determined not only by the audio content but also by scene changes in the video. Inaccuracies can also arise when the audio stream is unintentionally sliced at word boundaries during fast speech, or when substantial ambient noise such as applause is present. To fix these inaccuracies, we used finetuneas, a GUI tool that helps find the correct alignment between audio and text. We are currently moving to a fully automated forced-alignment approach to further simplify this stage.
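Pansori corrects these boundaries with finetuneas and is moving to forced alignment. Purely to illustrate the kind of boundary adjustment involved, and not as a substitute for either tool, a simple heuristic is to widen each segment by a small margin so that words at its edges are less likely to be clipped, without letting it overlap its neighbors. The function name `pad_segments` and the 150 ms default margin are hypothetical choices.

```python
def pad_segments(segments, pad_ms=150, total_ms=None):
    """Widen sorted, non-overlapping (start_ms, end_ms) segments by pad_ms per side.

    Padding never crosses the midpoint of the gap to a neighboring segment,
    so adjacent segments may touch but never overlap. If total_ms is given,
    the last segment is also clamped to the audio length.
    """
    padded = []
    for i, (start, end) in enumerate(segments):
        lo, hi = start - pad_ms, end + pad_ms
        if i > 0:
            prev_end = segments[i - 1][1]
            lo = max(lo, (prev_end + start) // 2)  # stop at the gap midpoint
        else:
            lo = max(lo, 0)  # cannot start before the beginning of the audio
        if i + 1 < len(segments):
            next_start = segments[i + 1][0]
            hi = min(hi, (end + next_start) // 2)  # same midpoint as the neighbor uses
        elif total_ms is not None:
            hi = min(hi, total_ms)
        padded.append((lo, hi))
    return padded
```

Because both neighbors clamp to the same integer midpoint, the end of one padded segment is never later than the start of the next.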

Transform

The aligned audio stream and subtitle data are then processed with the following transformations, specific to each data type:

Audio stream: segmentation, lossless compression
Subtitle data: normalization, punctuation removal, removal of non-speech text (such as the description of audience response or ambient noise)
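The subtitle transformations listed above can be sketched with regular expressions and the `unicodedata` module. This is a minimal illustration, not Pansori's implementation: it assumes non-speech descriptions are enclosed in parentheses or square brackets (for example `(applause)` or `[laughter]`), applies Unicode NFC normalization, and treats every Unicode punctuation character as removable. A real corpus would likely also need language-specific rules, such as spelling out numbers.

```python
import re
import unicodedata

# Assumption: non-speech descriptions appear inside (...) or [...] brackets.
NON_SPEECH = re.compile(r"\([^)]*\)|\[[^\]]*\]")


def normalize_subtitle(text: str) -> str:
    """Return subtitle text reduced to spoken words only."""
    text = unicodedata.normalize("NFC", text)
    text = NON_SPEECH.sub(" ", text)
    # Drop every character in a Unicode punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # Lower-case (no effect on caseless scripts such as Hangul) and collapse whitespace.
    return " ".join(text.lower().split())
```

Removing punctuation outright joins hyphenated words (`well-known` becomes `wellknown`); replacing it with a space instead would split contractions, so the right choice depends on the target language.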

Validate

Although the audio stream and subtitle data are force-aligned with each other, inherent discrepancies between the two remain. These can stem from one or more of the following: inaccurate transcriptions, ambiguous pronunciations, and non-ideal audio conditions (such as ambient noise or poor recording quality). To increase the quality of the corpus, it must be refined by filtering out inaccurate audio and subtitle pairs.
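One common way to decide which pairs to filter out is to transcribe each audio fragment with an ASR system and measure how far the result is from the subtitle text, for example with word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below treats the subtitle as the reference and uses a hypothetical 0.2 threshold; Pansori's own acceptance criterion may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between word sequences, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # No reference words: any ASR output counts as a complete mismatch.
        return 0.0 if not hyp else 1.0
    # prev[j] = edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of an extra ASR word
                prev[j - 1] + (r != h),  # substitution (free if words match)
            )
        prev = cur
    return prev[-1] / len(ref)


def keep_pair(subtitle: str, asr_transcript: str, max_wer: float = 0.2) -> bool:
    """Keep an audio/subtitle pair only if the ASR output agrees closely enough."""
    return word_error_rate(subtitle, asr_transcript) <= max_wer
```

A stricter threshold yields a cleaner but smaller corpus; both strings should be normalized the same way before comparison.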

Previous approaches relied on custom ASR models for corpus validation and refinement; however, such models are not easily created for many languages, especially those without existing corpora. Pansori instead takes a new approach based on a cloud ASR service: we chose the Google Cloud Speech-to-Text API because it provides high-quality ASR in more than 120 languages. Using a cloud service makes corpus generation much faster and easier to develop, since we only need to set up the service rather than build custom ASR engines with acoustic and language models for each language.
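Requesting a transcript for one fragment from the cloud service amounts to sending the audio together with a recognition config. The sketch below only builds the JSON body for the Speech-to-Text v1 `speech:recognize` REST method (a `config` object with `encoding`, `sampleRateHertz` and `languageCode`, plus base64-encoded `audio.content`); actually posting it to `https://speech.googleapis.com/v1/speech:recognize` also requires Google Cloud credentials, which are omitted here. The helper name `build_recognize_request`, the FLAC encoding, the 16 kHz sample rate and the `ko-KR` language code are illustrative assumptions, not values taken from Pansori.

```python
import base64
import json


def build_recognize_request(audio_bytes: bytes, language_code: str,
                            sample_rate_hz: int = 16000,
                            encoding: str = "FLAC") -> str:
    """Return the JSON body for a Speech-to-Text v1 speech:recognize call."""
    body = {
        "config": {
            "encoding": encoding,           # must match the actual audio format
            "sampleRateHertz": sample_rate_hz,
            "languageCode": language_code,  # BCP-47 tag, e.g. "ko-KR"
        },
        # Inline audio is sent as a base64-encoded string.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }
    return json.dumps(body)
```

Switching to another language only changes `language_code`. The transcript in the response (`results[].alternatives[].transcript`) can then be compared with the normalized subtitle text.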

The program can be adapted to videos subtitled in any language supported by the Google Cloud Speech-to-Text API.

Installation

Clone repository:

$ git clone https://github.com/yc9701/pansori

Install pytube, a library for downloading YouTube videos.

$ pip install pytube

Install pysubs2, a library for editing subtitle files. *Currently, pysubs2 runs only with Python 3.6; it does not work on Python 3.7.

$ pip install pysubs2

Install pydub, a library for manipulating audio. *Only necessary if you want audio playback while validating audio.

$ pip install pydub

validate.py also requires the Google Cloud Speech API, which in turn requires a Google Cloud account.

Installation and use guide


Stars: ✭ 106
