yc9701 / Pansori

Licence: MIT
Tools for ASR Corpus Generation from Online Video

Pansori

Pansori is a program for creating an automatic speech recognition (ASR) corpus from online videos with audio and subtitle data.

Overview

[Figure: diagram of the Pansori pipeline stages]

Pansori consists of four pipeline stages, as shown in the diagram above: ingest, align, transform, and validate.

Ingest

Online videos typically carry multiple media streams, covering different screen resolutions as well as audio-only playback; hand-transcribed subtitles can also be retrieved when available. Pansori downloads the audio and subtitle streams from online videos as mp4 and srt files, respectively.

Align

The subtitles contain segmented text and timing information that correspond to the audio content of the associated video. With the timing information, the audio stream can be segmented into matching pairs of audio and text fragments for an ASR corpus.
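The segmentation idea can be sketched with the Python standard library alone. This is an illustration, not Pansori's own code: it assumes the audio has already been converted from mp4 to WAV, it parses the srt text directly instead of using pysubs2, and the names `Cue`, `parse_srt`, and `slice_wav` are hypothetical.

```python
import io
import re
import wave
from typing import List, NamedTuple


class Cue(NamedTuple):
    """One subtitle entry: start and end time in milliseconds, plus its text."""
    start_ms: int
    end_ms: int
    text: str


# SRT timing line, e.g. "00:00:01,000 --> 00:00:03,500"
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_ms(h: str, m: str, s: str, ms: str) -> int:
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(srt_text: str) -> List[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                g = match.groups()
                # Everything after the timing line is the subtitle text.
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append(Cue(_to_ms(*g[:4]), _to_ms(*g[4:]), text))
                break
    return cues


def slice_wav(wav_bytes: bytes, cue: Cue) -> bytes:
    """Return a new WAV file (as bytes) containing only the cue's time span."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Convert milliseconds to frame indices, clamped to the file length.
        start = min(cue.start_ms * rate // 1000, total)
        end = min(cue.end_ms * rate // 1000, total)
        src.setpos(start)
        frames = src.readframes(max(0, end - start))
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames)
    return out.getvalue()
```

Each `(slice_wav(audio, cue), cue.text)` pair is then one candidate audio and text fragment for the corpus.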

However, the segments can be inaccurate because subtitle timing may be determined not only by the audio content but also by scene changes in the video. Inaccuracies also arise when the audio stream is unintentionally cut at word boundaries in fast speech, and when substantial ambient noise such as applause is present. To fix these inaccuracies, we used finetuneas, a GUI tool that helps find the correct alignment between audio and text. We are currently moving to a fully automated forced-alignment approach to further simplify this stage.

Transform

The aligned audio stream and subtitle data are then processed with transformations specific to each data type:

  • Audio stream: segmentation, lossless compression
  • Subtitle data: normalization, punctuation removal, removal of non-speech text (such as the description of audience response or ambient noise)
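The subtitle transformations listed above might look roughly like the following sketch. It is not taken from Pansori's source: the function name `normalize_subtitle` is hypothetical, and the specific choices (NFKC normalization, lowercasing, treating text in parentheses or square brackets as non-speech, keeping ASCII apostrophes inside words) are assumptions made for illustration.

```python
import re
import unicodedata

# Assumed convention: non-speech annotations appear in (...) or [...],
# e.g. "(Applause)" or "[Music]".
_NON_SPEECH = re.compile(r"\([^)]*\)|\[[^\]]*\]")


def normalize_subtitle(text: str) -> str:
    """Return lowercase text with annotations and punctuation removed."""
    text = unicodedata.normalize("NFKC", text)
    text = _NON_SPEECH.sub(" ", text)
    # Replace every character in a Unicode punctuation category (P*) with a
    # space, except ASCII apostrophes inside words such as "don't".
    chars = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch).startswith("P"):
            inside_word = (
                ch == "'"
                and 0 < i < len(text) - 1
                and text[i - 1].isalnum()
                and text[i + 1].isalnum()
            )
            if not inside_word:
                ch = " "
        chars.append(ch)
    # Collapse runs of whitespace left behind by the removals.
    return " ".join("".join(chars).lower().split())
```

A line that contains only an annotation normalizes to an empty string, which is a convenient signal for dropping that segment from the corpus.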

Validate

Although the audio stream and subtitle data are force-aligned with each other, inherent discrepancies between the two remain. These can come from one or more of the following: inaccurate transcriptions, ambiguous pronunciations, and non-ideal audio conditions (such as ambient noise or poor recording quality). To increase the quality of the corpus, it is refined by filtering out inaccurate audio and subtitle pairs.

Previous approaches relied on custom ASR models for corpus validation and refinement; however, such models are not easy to create for many languages, especially those without existing corpora. In Pansori, we took a new approach based on a cloud ASR service. We chose the Google Cloud Speech-to-Text API because we judged it to offer the highest-quality ASR service, with support for more than 120 languages. A cloud service makes developing corpus generation much faster and easier, since we only need to set up the service rather than build custom ASR engines with acoustic and language models for each language.
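One way to picture the filtering step is to compare each normalized subtitle with the transcript returned by the ASR service and drop pairs that differ too much. The sketch below is an assumption about how this could be done, not a description of validate.py: the word error rate metric, the 0.2 threshold, and the names `word_error_rate` and `filter_pairs` are illustrative, and the ASR transcripts are taken as already available strings.

```python
from typing import Iterable, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def filter_pairs(
    pairs: Iterable[Tuple[str, str, str]], max_wer: float = 0.2
) -> List[Tuple[str, str]]:
    """Keep (clip_id, subtitle) entries whose ASR transcript is close enough.

    Each input item is (clip_id, subtitle_text, asr_transcript).
    The default threshold is an arbitrary illustrative value.
    """
    return [
        (clip_id, subtitle)
        for clip_id, subtitle, asr in pairs
        if word_error_rate(subtitle, asr) <= max_wer
    ]
```

A stricter threshold yields a smaller but cleaner corpus; a looser one keeps more data at the cost of more mismatched pairs.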


The program can be modified for use with videos subtitled in any language that the Google API supports.

Installation

Clone repository:

$ git clone https://github.com/yc9701/pansori

Install pytube, a library for downloading YouTube videos.

$ pip install pytube

Install pysubs2, a library for editing subtitle files. Note: currently, pysubs2 runs only on Python 3.6; it does not work on Python 3.7.

$ pip install pysubs2

Install pydub, a library for manipulating audio. Note: this is only needed if you want audio playback while validating audio.

$ pip install pydub

validate.py also requires the Google Cloud Speech-to-Text API, which needs a Google Cloud account.
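After installing, a short script like the one below can confirm which dependencies Python can find. It is a convenience sketch, not part of Pansori; the import name `google.cloud.speech` for the Google client library is an assumption, as are the helper names.

```python
import importlib.util
import sys
from typing import Dict, Iterable

# Import names assumed from the packages listed above; "google.cloud.speech"
# is the assumed import name of the Google Cloud Speech client library.
MODULES = ("pytube", "pysubs2", "pydub", "google.cloud.speech")


def is_installed(module_name: str) -> bool:
    """Report whether a module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "google") is missing.
        return False


def check_dependencies(modules: Iterable[str] = MODULES) -> Dict[str, bool]:
    return {name: is_installed(name) for name in modules}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, ok in check_dependencies().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```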
