
BrightGu / SingleVC

Licence: other
Any-to-one voice conversion using a data augmentation strategy: pitch shifted, duration preserved.

Programming Languages

python

Projects that are alternatives of or similar to SingleVC

MediumVC
Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features
Stars: ✭ 46 (+84%)
Mutual labels:  speech-synthesis, voice-conversion, vc
YourTTS
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
Stars: ✭ 217 (+768%)
Mutual labels:  speech-synthesis, voice-conversion
voice-conversion
a tutorial implementation of voice conversion using PyTorch
Stars: ✭ 26 (+4%)
Mutual labels:  speech-synthesis, voice-conversion
ppg-vc
PPG-Based Voice Conversion
Stars: ✭ 154 (+516%)
Mutual labels:  speech-synthesis, voice-conversion
Espnet
End-to-End Speech Processing Toolkit
Stars: ✭ 4,533 (+18032%)
Mutual labels:  speech-synthesis, voice-conversion
DaisyXMusic
Free and open source group voice chat music player for Telegram ❤️ with button support and YouTube playback support
Stars: ✭ 204 (+716%)
Mutual labels:  vc
NanoFlow
PyTorch implementation of the paper "NanoFlow: Scalable Normalizing Flows with Sublinear Parameter Complexity." (NeurIPS 2020)
Stars: ✭ 63 (+152%)
Mutual labels:  speech-synthesis
ttsflow
tensorflow speech synthesis c++ inference for voicenet
Stars: ✭ 17 (-32%)
Mutual labels:  speech-synthesis
Wavegrad
Implementation of Google Brain's WaveGrad high-fidelity vocoder (paper: https://arxiv.org/pdf/2009.00713.pdf). First implementation on GitHub.
Stars: ✭ 245 (+880%)
Mutual labels:  speech-synthesis
Expressive-FastSpeech2
PyTorch Implementation of Non-autoregressive Expressive (emotional, conversational) TTS based on FastSpeech2, supporting English, Korean, and your own languages.
Stars: ✭ 139 (+456%)
Mutual labels:  speech-synthesis
Shifter
Pitch shifter using WSOLA and resampling implemented by Python3
Stars: ✭ 22 (-12%)
Mutual labels:  voice-conversion
sova-tts-engine
Tacotron2 based engine for the SOVA-TTS project
Stars: ✭ 63 (+152%)
Mutual labels:  speech-synthesis
idear
🎙️ Handsfree Audio Development Interface
Stars: ✭ 84 (+236%)
Mutual labels:  speech-synthesis
react-native-spokestack
Spokestack: give your React Native app a voice interface!
Stars: ✭ 53 (+112%)
Mutual labels:  speech-synthesis
voder
An emulation of the Voder Speech Synthesizer.
Stars: ✭ 19 (-24%)
Mutual labels:  speech-synthesis
audioslides.io
Use Amazon Polly, Google Slides and FFMpeg to create videos that can be updated at anytime by anyone. This project is written in Elixir.
Stars: ✭ 19 (-24%)
Mutual labels:  speech-synthesis
tacotron2
Pytorch implementation of "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions", ICASSP, 2018.
Stars: ✭ 17 (-32%)
Mutual labels:  speech-synthesis
GlottDNN
GlottDNN vocoder and tools for training DNN excitation models
Stars: ✭ 30 (+20%)
Mutual labels:  speech-synthesis
Catch-A-Waveform
Official pytorch implementation of the paper: "Catch-A-Waveform: Learning to Generate Audio from a Single Short Example" (NeurIPS 2021)
Stars: ✭ 117 (+368%)
Mutual labels:  speech-synthesis
IMS-Toucan
Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart. Objectives of the development are simplicity, modularity, controllability and multilinguality.
Stars: ✭ 295 (+1080%)
Mutual labels:  speech-synthesis

SingleVC

SingleVC performs any-to-one VC and is an important component of the MediumVC project. This is the official implementation of the paper MediumVC.

The figure below shows the overall model architecture.

[Figure: Model architecture]

For audio samples, please refer to our demo page. More details can be found in "any2one/demo_page/ConvertedSpeeches/".

Envs

You can install the dependencies with

pip install -r requirements.txt

PSDR

PSDR means scaling F0 and its correlative harmonics while keeping duration unchanged, which intuitively modifies speaker-related information while maintaining linguistic content and prosodic information. PSDR can be used as a data augmentation strategy for VC by producing a fake parallel corpus. To verify that slight pitch shifts do not affect content information, we measure the word error rate (WER) between source speeches and pitch-shifted speeches with a Wav2Vec2-based ASR system. The speeches of p249 (female) from the VCTK Corpus are selected, and pyrubberband is used to perform PSDR. The table below indicates that when S is within -6 to 4, the strategy applies to VC with acceptable WERs.

S         -7      -6      -5      0     3       4       5
WER (%)   40.51   25.79   17.25   0     17.27   25.21   48.14
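
To reproduce such shifted copies, a minimal sketch using pyrubberband (the tool named above) might look like the following; the file paths are illustrative assumptions:

import soundfile as sf
import pyrubberband as pyrb

# Load a source utterance from p249 (path is hypothetical).
wav, sr = sf.read("vctk22050/p249/p249_001.wav")

# Shift by S semitones; Rubber Band keeps the duration unchanged.
for s in range(-6, 5):  # the acceptable range found for p249
    shifted = pyrb.pitch_shift(wav, sr, n_steps=s)
    sf.write("shifted/p249_001_s%+d.wav" % s, shifted, sr)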

Vocoder

The HiFi-GAN vocoder is employed to convert log mel-spectrograms to waveforms. The model is trained on universal datasets and has 13.93M parameters. In our evaluation, it can synthesize 22.05 kHz high-fidelity speech with a MOS above 4.0, even in cross-language or noisy conditions.
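
As a rough illustration of the vocoding step, the sketch below assumes the class names and checkpoint layout of the official HiFi-GAN codebase (env.AttrDict, models.Generator, a "generator" key in the checkpoint dict); the config and checkpoint file names are placeholders, not the exact files shipped with this project.

import json
import torch
from env import AttrDict      # from the HiFi-GAN codebase
from models import Generator  # from the HiFi-GAN codebase

# Load the generator hyperparameters and pretrained weights.
with open("config.json") as f:
    h = AttrDict(json.load(f))
generator = Generator(h)
state = torch.load("g_pretrained", map_location="cpu")
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()

# Convert a log mel-spectrogram (batch, n_mels, frames) to a waveform.
with torch.no_grad():
    mel = torch.randn(1, 80, 200)   # placeholder input
    wav = generator(mel).squeeze()  # 22.05 kHz waveform in [-1, 1]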

Pretrained models

You can download the pretrained model as well as the vocoder from the link, then edit the config file any2one/infer/infer_config.yaml. The inference corpus should be organized as test22050/*.wav. You can convert a list of utterances, e.g.

python any2one/infer/infer.py

Train from scratch

Select acceptable pitch shifts

If you want to reconstruct someone's voice, you first need to determine the acceptable pitch shifts for that person. Edit "any2one/tools/wav2vec_asr.py" and set "wave_dir" to your "speech16000_dir". The ASR model provided in "any2one/tools/wav2vec_asr.py" currently only supports English speech recognition; you can replace it for other languages. In our test, the acceptable pitch shifts of p249 in the VCTK Corpus are [-6, 4].

python any2one/tools/wav2vec_asr.py
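
The following is a minimal sketch of the WER check itself, assuming the transformers and jiwer packages; the model name and file paths are illustrative, not necessarily what wav2vec_asr.py uses.

import torch
import soundfile as sf
from jiwer import wer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def transcribe(path):
    # The ASR model expects 16 kHz mono audio.
    wav, sr = sf.read(path)
    inputs = processor(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

# Treat the transcription of the source speech as the reference.
source = transcribe("speech16000/p249_001.wav")
shifted = transcribe("speech16000/p249_001_s-6.wav")
print("WER:", wer(source, shifted))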

Tips: In practice, building female voices has a higher probability of success than building male voices. Compared to male speech, the periodic patterns of female speech are more stable due to the higher frequency resolution.

The training corpus should be organized as vctk22050/p249/*.wav, then run:

python any2one/solver.py

Preprocessing

  1. The model is trained with randomly pitch-shifted speeches processed in real time. If you want to speed up training, refer to the code in "any2one/meldataset.py" to preprocess the data offline (see the sketch after this list).
  2. If you use the preprocessing method from the HiFi-GAN vocoder, training takes about one day on a TITAN Xp and the results are more robust. Using the preprocessing method from WaveRNN, training takes only about three hours.
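
As a starting point for offline preprocessing, here is a sketch under stated assumptions: pyrubberband for the shifts, common 22.05 kHz mel settings (n_fft=1024, hop_length=256, 80 mel bands), and illustrative paths; the authoritative parameters live in "any2one/meldataset.py".

import glob
import numpy as np
import soundfile as sf
import librosa
import pyrubberband as pyrb

for path in glob.glob("vctk22050/p249/*.wav"):
    wav, sr = sf.read(path)
    for s in range(-6, 5):  # acceptable shifts for p249
        shifted = pyrb.pitch_shift(wav, sr, n_steps=s)
        # Log mel-spectrogram with the assumed settings above.
        mel = librosa.feature.melspectrogram(
            y=shifted, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
        logmel = np.log(np.clip(mel, 1e-5, None))
        np.save(path.replace(".wav", "_s%+d.npy" % s), logmel)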