
daemon / pytorch-pcen

License: MIT
PyTorch reimplementation of per-channel energy normalization for audio.


Projects that are alternatives of or similar to pytorch-pcen

Gcc Nmf
Real-time GCC-NMF Blind Speech Separation and Enhancement
Stars: ✭ 231 (+188.75%)
Mutual labels:  speech
browser-apis
🦄 Cool & Fun Browser Web APIs 🥳
Stars: ✭ 21 (-73.75%)
Mutual labels:  speech
Multimodal-Gesture-Recognition-with-LSTMs-and-CTC
An end-to-end system that performs temporal recognition of gesture sequences using speech and skeletal input. The model combines three networks with a CTC output layer that recognises gestures from a continuous stream.
Stars: ✭ 25 (-68.75%)
Mutual labels:  speech
Tacotron pytorch
PyTorch implementation of Tacotron speech synthesis model.
Stars: ✭ 242 (+202.5%)
Mutual labels:  speech
Voice Gender
Gender recognition by voice and speech analysis
Stars: ✭ 248 (+210%)
Mutual labels:  speech
idear
🎙️ Handsfree Audio Development Interface
Stars: ✭ 84 (+5%)
Mutual labels:  speech
Source separation
Deep learning based speech source separation using Pytorch
Stars: ✭ 226 (+182.5%)
Mutual labels:  speech
txt2speech
Convert text to speech using Google Translate API
Stars: ✭ 38 (-52.5%)
Mutual labels:  speech
lectures-all
Central repository for all lectures on deep learning at UPC ETSETB TelecomBCN.
Stars: ✭ 46 (-42.5%)
Mutual labels:  speech
IMS-Toucan
Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart. Objectives of the development are simplicity, modularity, controllability and multilinguality.
Stars: ✭ 295 (+268.75%)
Mutual labels:  speech
Kerasdeepspeech
A Keras CTC implementation of Baidu's DeepSpeech for model experimentation
Stars: ✭ 245 (+206.25%)
Mutual labels:  speech
Wavegrad
Implementation of Google Brain's WaveGrad high-fidelity vocoder (paper: https://arxiv.org/pdf/2009.00713.pdf). First implementation on GitHub.
Stars: ✭ 245 (+206.25%)
Mutual labels:  speech
VQMIVC
Official implementation of VQMIVC: One-shot (any-to-any) Voice Conversion @ Interspeech 2021 + Online playing demo!
Stars: ✭ 278 (+247.5%)
Mutual labels:  speech
Lhotse
Stars: ✭ 236 (+195%)
Mutual labels:  speech
TF-Speech-Recognition-Challenge-Solution
Source code of the model used in Tensorflow Speech Recognition Challenge (https://www.kaggle.com/c/tensorflow-speech-recognition-challenge). The solution ranked in top 5% in private leaderboard.
Stars: ✭ 58 (-27.5%)
Mutual labels:  speech
Setk
Tools for Speech Enhancement integrated with Kaldi
Stars: ✭ 227 (+183.75%)
Mutual labels:  speech
Naver-AI-Hackathon-Speech
2019 Clova AI Hackathon : Speech - Rank 12 / Team Kai.Lib
Stars: ✭ 26 (-67.5%)
Mutual labels:  speech
wav2vec2-live
Live speech recognition using Facebook's wav2vec 2.0 model.
Stars: ✭ 205 (+156.25%)
Mutual labels:  speech
anycontrol
Voice control for your websites and applications
Stars: ✭ 53 (-33.75%)
Mutual labels:  speech
react-native-speech-bubble
💬 A speech bubble dialog component for React Native.
Stars: ✭ 50 (-37.5%)
Mutual labels:  speech

PyTorch-PCEN

Efficient PyTorch reimplementation of per-channel energy normalization with Mel spectrogram features.

Overview

Robustness to loudness differences between near- and far-field conditions is critical for high-quality speech recognition. Spectrogram energies differ dramatically between, say, shouting at arm's length and whispering from a distance, and this degrades model quality: the model itself must then be robust across a wide range of input levels. The log-compression step in the popular log-Mel transform partially addresses this issue by reducing the dynamic range of the audio; however, it ignores per-channel energy differences and is static by definition.
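
To make the dynamic-range point concrete, here is a toy NumPy illustration (the specific energy values are invented for the example):

```python
import numpy as np

# Two Mel-band energies differing by 40 dB (a factor of 10^4),
# e.g. whispering from a distance vs. shouting at arm's length.
quiet, loud = 1e-4, 1.0

linear_ratio = loud / quiet  # 10000x spread for the model to absorb
# Log compression turns the multiplicative spread into a small additive offset.
log_offset = np.log(loud + 1e-6) - np.log(quiet + 1e-6)  # roughly 9.2
```

After log compression, a four-order-of-magnitude energy ratio becomes an additive shift of about 9.2 nats, which is far easier for a downstream model to handle.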

Per-channel energy normalization (PCEN) addresses these problems. It replaces the log compression with a per-channel, trainable front-end, greatly improving model robustness in keyword spotting systems, all while being resource-efficient and easy to implement.
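
Concretely, PCEN combines a per-channel automatic-gain-control stage, driven by a first-order smoother, with root compression. A minimal NumPy sketch of the recurrence from Wang et al. follows; the hyperparameter values are typical settings from the paper, not necessarily this repository's defaults:

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a (time, mel) energy matrix E.

    Smoother:  M[t] = (1 - s) * M[t-1] + s * E[t]        (per channel)
    Output:    PCEN[t] = (E[t] / (eps + M[t])**alpha + delta)**r - delta**r
    """
    M = np.empty_like(E)
    M[0] = E[0]  # initialize the smoother with the first frame
    for t in range(1, len(E)):
        M[t] = (1 - s) * M[t - 1] + s * E[t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

# Stand-in for Mel spectrogram energies: 100 frames, 40 bands.
mel_energies = np.abs(np.random.default_rng(0).normal(size=(100, 40))) + 1e-3
out = pcen(mel_energies)  # shape (100, 40), all values positive
```

Because the AGC divides each channel by its own smoothed energy, a constant gain applied to the input is largely cancelled out, which is exactly the loudness robustness the log-Mel transform lacks.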

Installation and Usage

  1. PyTorch and NumPy are required. LibROSA and matplotlib are required only for the example.
  2. To install via pip, run pip install git+https://github.com/daemon/pytorch-pcen. Otherwise, clone this repository and run python setup.py install.
  3. To run the example in the module, place a 16 kHz WAV file named yes.wav in the current directory. Then, run python -m pcen.pcen.

The following is a self-contained example for using a streaming PCEN layer:

import pcen
import torch

# 40 Mel bands; n_fft=480 and hop_length=160 give a 30 ms window and 10 ms shift at 16 kHz.
# trainable defaults to False; here we enable learning of the PCEN parameters.
transform = pcen.StreamingPCENTransform(n_mels=40, n_fft=480, hop_length=160, trainable=True)
audio = torch.empty(1, 16000).normal_(0, 0.1) # Gaussian noise

# 1600 is an arbitrary chunk size; this step is unnecessary but demonstrates the streaming nature
streaming_chunks = audio.split(1600, 1)
pcen_chunks = [transform(chunk) for chunk in streaming_chunks] # Transform each chunk
transform.reset() # Reset the persistent streaming state
pcen_ = torch.cat(pcen_chunks, 1)
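
The reset() call matters because the layer carries its per-channel smoother state across calls. The mechanism can be illustrated with a plain PyTorch first-order smoother (a conceptual sketch of stateful streaming, not this library's internals):

```python
import torch

def ema(E, s=0.025, state=None):
    """Smoother M[t] = (1 - s) * M[t-1] + s * E[t] along dim 1 of (batch, time, mel).

    `state` is the final M from the previous chunk; passing it back in lets the
    caller resume smoothing seamlessly, which is what a streaming PCEN layer does.
    """
    out = []
    M = E[:, 0] if state is None else state
    for t in range(E.size(1)):
        M = (1 - s) * M + s * E[:, t]
        out.append(M)
    return torch.stack(out, dim=1), M

E = torch.rand(1, 50, 40)   # (batch, time, mel)
full, _ = ema(E)            # process the whole signal in one shot

state = None                # process the same signal chunk by chunk
chunks = []
for chunk in E.split(10, dim=1):
    smoothed, state = ema(chunk, state=state)
    chunks.append(smoothed)
streamed = torch.cat(chunks, dim=1)

assert torch.allclose(full, streamed)  # chunked and one-shot results agree
```

Forgetting to reset the carried state between independent utterances would let one recording's energy statistics leak into the next, which is why the example above calls transform.reset() after finishing a signal.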

Citation

Wang, Yuxuan, Pascal Getreuer, Thad Hughes, Richard F. Lyon, and Rif A. Saurous. Trainable frontend for robust and far-field keyword spotting. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 5670-5674. IEEE, 2017.

@inproceedings{wang2017trainable,
  title={Trainable frontend for robust and far-field keyword spotting},
  author={Wang, Yuxuan and Getreuer, Pascal and Hughes, Thad and Lyon, Richard F and Saurous, Rif A},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on},
  pages={5670--5674},
  year={2017},
  organization={IEEE}
}