This repository contains scripts to reproduce a merged version of multiple open-source German speech datasets. Unlike English, which has LibriSpeech for example, German has no single large speech corpus for automatic speech recognition. This repository therefore combines multiple German speech corpora into one. Check the licenses in the list below or on the sites of the individual datasets if you want to use the data for a specific purpose.

Recreate

To recreate the same corpus as in this repository, execute the commands in the script recreate.sh. The script performs the following steps.

Downloads all corpora to data/download. Only the Common-Voice corpus has to be downloaded manually and placed inside data/download/common_voice.
Merges all corpora into a single one and creates train/dev/test subsets.
Checks whether the created corpus matches the state recorded in the repository. This is done by comparing hash values against the hash values in the file data/state.json.
If needed, the corpus can be converted to wave files only. This ensures that every utterance is in a separate wave file with a sampling rate of 16000 Hz.
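
The hash comparison in the third step can be sketched roughly as follows. This is only an illustration, not the code used by recreate.sh: it assumes that the state file is a JSON object mapping relative file paths to SHA-256 hex digests, which may not match the real layout of data/state.json. The names file_sha256 and find_mismatches are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(corpus_dir, state_path):
    """Return relative paths that are missing or whose hash differs.

    ASSUMPTION: the state file maps relative file paths to SHA-256
    hex digests. The real format of data/state.json may differ.
    """
    expected = json.loads(Path(state_path).read_text())
    root = Path(corpus_dir)
    mismatches = []
    for rel_path, expected_hash in sorted(expected.items()):
        target = root / rel_path
        if not target.is_file() or file_sha256(target) != expected_hash:
            mismatches.append(rel_path)
    return mismatches
```

Under these assumptions, an empty list from find_mismatches('data/full', 'data/state.json') would indicate that the recreated corpus matches the recorded state.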

Corpus usage

The final corpus is stored in data/full. It uses the default format of the audiomate library, which is described in the audiomate documentation under "default format".

Audiomate can also be used to read the corpus:

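import audiomate

# Load the merged corpus from its folder
corpus = audiomate.Corpus.load('data/full')
# Look up a single utterance by its id ('utt-idx' is a placeholder)
utt = corpus.utterances['utt-idx']
# Join the word-level labels into one transcript string
transcript = utt.label_lists[audiomate.corpus.LL_WORD_TRANSCRIPT].join()
# Read the audio samples at a sampling rate of 16 kHz
samples = utt.read_samples(sr=16000)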

Check out https://github.com/ynop/audiomate for more information.

Corpus Statistics

Part	Hours	Speakers
unfiltered	1021.31	unknown (M-AILabs provides no speaker information)
train	536.90	unknown (M-AILabs provides no speaker information)
dev	17.75	1151
test	18.22	2037
full_common_voice	324.19	4852
train_common_voice	10.20	552
dev_common_voice	7.04	1010
test_common_voice	7.71	1901
full_mailabs	233.66	-
train_mailabs	233.50	-
dev_mailabs	0.00	0
test_mailabs	0.00	0
full_swc	248.47	569
train_swc	238.01	527
dev_swc	4.26	26
test_swc	4.18	16
full_tuda	183.30	179
train_tuda	31.49	146
dev_tuda	2.41	16
test_tuda	2.38	17
full_voxforge	31.69	328
train_voxforge	23.70	126
dev_voxforge	4.04	99
test_voxforge	3.96	103
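
The combined rows of the table can be cross-checked against the per-corpus rows. The sketch below copies the hour values from the table and assumes that each combined row (unfiltered, train, dev, test) is the sum of the corresponding Common-Voice, M-AILabs, SWC, TuDa and VoxForge rows, with unfiltered matching the full_* rows. That relationship is an assumption drawn from the numbers, not something the scripts are confirmed to guarantee. The function name check_totals and the tolerance are hypothetical.

```python
# Hours copied from the statistics table.
# Order of parts: common_voice, mailabs, swc, tuda, voxforge.
HOURS = {
    "unfiltered": {"total": 1021.31, "parts": [324.19, 233.66, 248.47, 183.30, 31.69]},
    "train": {"total": 536.90, "parts": [10.20, 233.50, 238.01, 31.49, 23.70]},
    "dev": {"total": 17.75, "parts": [7.04, 0.00, 4.26, 2.41, 4.04]},
    "test": {"total": 18.22, "parts": [7.71, 0.00, 4.18, 2.38, 3.96]},
}


def check_totals(table, tolerance=0.02):
    """Return {subset: (reported_total, summed_parts, is_consistent)}.

    The tolerance allows for rounding of the individual table entries.
    """
    result = {}
    for subset, row in table.items():
        summed = round(sum(row["parts"]), 2)
        consistent = abs(summed - row["total"]) <= tolerance
        result[subset] = (row["total"], summed, consistent)
    return result


if __name__ == "__main__":
    for subset, (reported, summed, ok) in check_totals(HOURS).items():
        print(f"{subset}: reported={reported:.2f} summed={summed:.2f} consistent={ok}")
```

With the values above, the unfiltered, train and dev rows add up exactly. The test row sums to 18.23 h against a reported 18.22 h, a difference that is consistent with rounding of the individual entries.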

Corpus sources

Name	URL	License
Common-Voice	https://voice.mozilla.org/en/datasets	CC-0
TuDa	https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/acoustic-models.html	CC-BY
M-AILabs	https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/	See Page
VoxForge	http://www.voxforge.org/de	GPL
SWC	https://nats.gitlab.io/swc/	CC BY-SA 4.0

Create a new version

The script create.sh contains the commands to create a new version of the corpus.

Changelog

Version	Changes
v1	Initial version
v2	Smaller test sets; long utterances (> 25 s) filtered out
