Speech and Music Detection

Python framework for Speech and Music Detection using Keras.

This repository contains the experiments presented in the paper "Temporal Convolutional Networks for Speech and Music Detection in Radio Broadcast" by Quentin Lemaire and Andre Holzapfel at the 20th International Society for Music Information Retrieval Conference (ISMIR 2019). Paper

Description

This framework is designed to easily evaluate new models and configurations for the speech and music detection task using neural networks. More details about this task can be found on the description page of the MIREX 2018 Speech/Music Detection task. For comparison purposes, the evaluation implemented in this framework is the same as the one described on that page.

Several data pre-processing steps, data augmentation methods and architectures are already implemented, and it is easy to add new methods and to train on different datasets.

Installation

The SoX command line utility is required for the dataset pre-processing. (HomePage).

Installation with Homebrew:

brew install lame
brew reinstall sox --with-lame  # for mp3 compatibility

The implementation is based on several Python libraries, in particular:

  • Keras for the deep learning implementations (link).
  • TensorFlow as the Keras backend (link).
  • librosa for the pre-processing of the audio (link).
  • sed_eval for the evaluation of the models (link).
  • keras-tcn for the implementation of the TCN (link).
  • hyperas for hyper-parameter optimization on Keras with Hyperopt (link).

If you want to use prepare_dataset/mp2_to_wav.py to convert MP2 audio to WAV, you need the command line utility FFmpeg (HomePage).

Installation of the Python libraries from PyPI:

pip install -r requirements.txt

To listen to the audio while visualizing the annotations with smd.display.audio_with_events, you need the toolbox sed_vis that you can download on GitHub.

Configuration

The framework parameters that are not meant to be changed when comparing different architectures can be set in smd/config.py.

Those parameters are:

  • The pre-processing parameters, such as the sampling rate.
  • The data augmentation parameters.
  • The loss and metric used for training.
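
For illustration, the fixed values mentioned later in this README (sampling rate, chunk length, STFT and mel parameters) might be grouped in smd/config.py along these lines; the variable names and the loss/metric values below are hypothetical and the actual file may differ:

# Hypothetical sketch of smd/config.py -- variable names are illustrative only.
SAMPLING_RATE = 22050      # target sampling rate in Hz
AUDIO_MAX_LENGTH = 90      # maximum chunk length in seconds (1 min 30 s)
FFT_WINDOW_SIZE = 1024     # STFT frame length
HOP_LENGTH = 512           # STFT hop size
N_MELS = 80                # number of mel coefficients
F_MIN = 27.5               # lowest mel-filter frequency in Hz
F_MAX = 8000               # highest mel-filter frequency in Hz
LOSS = "binary_crossentropy"   # assumed loss; check the actual config
METRICS = ["accuracy"]         # assumed metric; check the actual config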

Data

Labels

The label file of an audio file can either be a text file containing a single label for the whole file (speech, music or noise), or a text file listing the events occurring in the audio. In the latter case the audio is considered "mixed" and the label file has to be formatted in this way:

t_start1 t_stop1 music/speech
t_start2 t_stop2 music/speech
...

The values on each line are separated by tabs.
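
As a minimal sketch (not part of the framework's API), a "mixed" label file in this format can be read with a few lines of Python:

# Minimal sketch: read a "mixed" label file (tab-separated onset, offset, label).
def load_events(label_path):
    events = []
    with open(label_path) as f:
        for line in f:
            if line.strip():
                t_start, t_stop, label = line.strip().split("\t")
                events.append((float(t_start), float(t_stop), label))
    return events

# e.g. load_events("show_01.txt") -> [(0.0, 12.3, 'music'), (10.5, 47.8, 'speech'), ...]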

Add a dataset

The dataset has to be separated into two folders:

  • The folder containing all the audio files and their corresponding label text files.
  • The folder describing how the data are split between the sets (train, validation or test) for each type of label (speech, music, noise or mixed). Each of these files lists the names of the corresponding audio files without extension; the possible files are mixed_train, mixed_val, mixed_test, music_train, music_val, speech_train, speech_val and noise_train.

Then, add the names of the two folders in datasets.json and run prepare_dataset/prepare_audio.py to pre-process the audio before training.
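
As an illustration, a dataset could be laid out as follows (folder names and file extensions here are hypothetical):

my_dataset_audio/          # all audio files and their label text files
    show_01.wav
    show_01.txt            # event list ("mixed") or a single label
    jingle_02.wav
    jingle_02.txt
my_dataset_split/          # repartition of the data between the sets
    mixed_train            # one audio file name per line, without extension
    mixed_val
    mixed_test
    music_train
    noise_train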

Pre-processing

Prior to the learning phase, each audio file is resampled to 22.05 kHz mono and split into chunks of at most 1 min 30 s. Then a Short-Time Fourier Transform (STFT) with a Hann window, a frame length of 1024 and a hop size of 512 is applied, and only the magnitude of the power spectrogram is kept. These matrices are then stored for the learning phase. These steps are performed in prepare_dataset/prepare_audio.py.
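
A rough sketch of this step with librosa is shown below; the file names are placeholders, and the actual code in prepare_dataset/prepare_audio.py may differ in details:

import numpy as np
import librosa

# Sketch of the offline pre-processing: resample, STFT, keep the spectrogram magnitude.
audio, sr = librosa.load("example.wav", sr=22050, mono=True)           # resample to 22.05 kHz mono
stft = librosa.stft(audio, n_fft=1024, hop_length=512, window="hann")  # frame length 1024, hop size 512
spectrogram = np.abs(stft) ** 2                                        # power spectrogram (exact exponent may differ)
np.save("example_spec.npy", spectrogram)                               # stored for the learning phase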

During the learning phase, the spectrograms are loaded, deformed by the data augmentation, and a mel filterbank with 80 coefficients between 27.5 Hz and 8000 Hz is applied. The coefficients are then put on a log scale, normalized over the training set and fed into the network. These steps are implemented in train.py and can be changed to test new configurations.
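
A corresponding sketch of this transformation is given below; the real implementation in train.py may differ, for instance in how the training-set statistics are computed:

import numpy as np
import librosa

def spectrogram_to_input(spec, mean, std, sr=22050, n_fft=1024):
    """Sketch: mel filterbank (80 bands, 27.5-8000 Hz), log scale, normalization."""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=80, fmin=27.5, fmax=8000)
    mel_spec = mel_fb.dot(spec)             # apply the mel filterbank to the spectrogram
    log_mel = np.log(mel_spec + 1e-10)      # log scale (small epsilon avoids log(0))
    return (log_mel - mean) / std           # normalize with the training-set mean and std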

Data Augmentation

Different transformations are applied to each training sample for data augmentation. To reduce computation during training, the augmentation is applied not to the audio signal but to the magnitude of the power spectrogram. The implemented transformations are:

  • Time stretching
  • Pitch shifting
  • Random loudness
  • Block mixing
  • Gaussian frequency filter

These deformations and the hyper-parameters used are based on the work of J. Schlüter and T. Grill [1].
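
For instance, the random loudness deformation can be sketched as a single random gain applied to the spectrogram; the gain range below is illustrative, not the value used in the framework:

import numpy as np

def random_loudness(spec, low=0.5, high=1.5):
    """Sketch: scale the whole magnitude spectrogram by a random gain."""
    return spec * np.random.uniform(low, high)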

Experiment configuration

You can configure the model and the parameters of the training in the file experiments.json. Different models and optimizers can be chosen for this task.

The configuration for an experiment must have the following fields:

"experiment_name": {
  "model": {
    ...
  },
  "dataset": [
    "list of the datasets to use"
  ],
  "batch_size": 32,
  "nb_epoch": 10,
  "target_seq_length": 270,
  "workers": 8,
  "use_multiprocessing": true
}

Models

New models can be added in the folder smd/models and loaded in smd/models/model_loader.py.

Here are the architectures already implemented with their configuration:

LSTM

"model": {
  "type": "lstm",
  "hidden_units": [100, 100, 100],
  "dropout": 0.2,
  "bidirectional": true,
  "optimizer": ...
}
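
As a rough illustration of how such a configuration maps onto Keras layers (the actual construction lives in smd/models/model_loader.py and may differ; the number of input features and output classes below are assumptions):

from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, Dense, TimeDistributed

# Sketch: a bidirectional LSTM stack built from the "hidden_units" list above.
def build_lstm(hidden_units=(100, 100, 100), dropout=0.2, n_features=80, n_classes=2):
    model = Sequential()
    model.add(Bidirectional(LSTM(hidden_units[0], return_sequences=True, dropout=dropout),
                            input_shape=(None, n_features)))
    for units in hidden_units[1:]:
        model.add(Bidirectional(LSTM(units, return_sequences=True, dropout=dropout)))
    model.add(TimeDistributed(Dense(n_classes, activation="sigmoid")))  # frame-wise speech/music activations
    return model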

CLDNN (Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks)

"model": {
  "type": "cldnn",
  "filters_list": [32, 64],
  "lstm_units": [25, 50],
  "fc_units": [15],
  "kernel_sizes": [3, 5],
  "dropout": 0.351789,
  "optimizer": ...
}

TCN

"model": {
  "type": "tcn",
  "list_n_filters": [32],
  "kernel_size": 4,
  "dilations": [1, 2, 4, 8],
  "nb_stacks": 3,
  "n_layers": 1,
  "activation": "norm_relu",
  "dropout_rate": 0.05,
  "use_skip_connections": true,
  "bidirectional": true,
  "optimizer": ...
}
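
A rough sketch of how these fields could feed the TCN layer of the keras-tcn package; the framework's own loader handles bidirectionality and the stacking of several TCN layers, and parameter names may vary between keras-tcn versions:

from tcn import TCN                      # keras-tcn package
from keras.models import Sequential
from keras.layers import Dense, TimeDistributed

# Sketch: a single (non-bidirectional) TCN built from the fields above.
def build_tcn(n_filters=32, kernel_size=4, dilations=(1, 2, 4, 8), nb_stacks=3,
              dropout_rate=0.05, n_features=80, n_classes=2):
    model = Sequential()
    model.add(TCN(nb_filters=n_filters, kernel_size=kernel_size, nb_stacks=nb_stacks,
                  dilations=list(dilations), use_skip_connections=True,
                  dropout_rate=dropout_rate, return_sequences=True,
                  input_shape=(None, n_features)))
    model.add(TimeDistributed(Dense(n_classes, activation="sigmoid")))
    return model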

Optimizers

New optimizers can be added in the file smd/models/model_loader.py.

Here are the already implemented optimizers with their configuration.

SGD + momentum

"optimizer": {
  "name": "SGD",
  "lr": 0.001,
  "momentum": 0.9,
  "decay": 1e-6
}

Adam

"optimizer": {
  "name": "adam",
  "lr": 0.001,
  "beta_1": 0.9,
  "beta_2": 0.999,
  "epsilon": null,
  "decay": 0.0
}
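
These fields map directly onto the arguments of the corresponding Keras optimizers. A minimal sketch of how such a configuration dictionary could be turned into an optimizer (the loader in smd/models/model_loader.py may handle more cases):

from keras import optimizers

# Sketch: build a Keras optimizer from an "optimizer" configuration dictionary.
def load_optimizer(cfg):
    name = cfg["name"].lower()
    if name == "sgd":
        return optimizers.SGD(lr=cfg["lr"], momentum=cfg["momentum"], decay=cfg["decay"])
    if name == "adam":
        return optimizers.Adam(lr=cfg["lr"], beta_1=cfg["beta_1"], beta_2=cfg["beta_2"],
                               epsilon=cfg["epsilon"], decay=cfg["decay"])
    raise ValueError("Unknown optimizer: " + cfg["name"])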

Scripts

  • train.py to start the training of an experiment; settings such as the data pre-processing and augmentation, the learning-rate scheduler and the early stopping are defined in this file.
  • evaluate.py to evaluate a previously trained model on the test set. The pre-processing of the test set is defined in this file.
  • predict.py to pass one file or folder through the network and save the output.
  • hyper_params_opt.py to perform hyper-parameter optimization of a configuration with hyperas. Almost all of the settings for the hyper-parameter optimization have to be set manually in this file.
  • vizualize_data.py to listen to the audio while visualizing the prediction and/or ground-truth with sed_vis.

Real-time analysis of the audio

To analyze in real time the audio captured by the computer's microphone, one can try this GitHub project, which was made to work with this framework.

You only need to put your trained model and the mean and std matrices of the dataset in the model folder of that project.

References

[1] "Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks" by Jan Schlüter and Thomas Grill at the 16th International Society for Music Information Retrieval Conference (ISMIR 2015).
