craffel / Midi Dataset

Code for creating a dataset of MIDI ground truth

MIDI Dataset

The goal of this project is to match and align a very large collection of MIDI files to a very large collection of audio files so that the MIDI data can be used to infer ground truth information about the audio. Viewed another way, this repository contains the code for reproducing most of the results in [1], which describes the goals, ideas, and research behind this project in much greater detail.

Notes

  • If you're looking for a high-level overview of the techniques used in this project and the results, take a look at chapter 1 of my thesis [1].

  • This repository contains code for performing the matching; if you're looking for the "Lakh MIDI Dataset" itself (the result of using this code to match a collection of 176,581 MIDI files to the Million Song Dataset), you can find that here.

  • If you just want a tutorial on potential uses of the Lakh MIDI dataset, take a look at the Tutorial.ipynb notebook.

  • Over time, this project has undergone some restructuring; if you're looking for the version of this repository used in the experiments in [2], check this tag.

Prerequisites

Before utilizing the code in this repository, you need to gather some data and software.

Data

Create a folder called data in the root of this repository. In it, you need the following subdirectories:

  • clean_midi, which should contain the "clean MIDI subset", as described in section 5.2.1 of [1]. These MIDI files should live in data/clean_midi/mid. You can obtain this collection here.
  • unique_midi, which should contain the 176,581 files of the full Lakh MIDI dataset (aka LMD-full). These MIDI files should live in data/unique_midi/mid. You can obtain this collection here.
  • uspop2002, cal10k, cal500, and msd, which should each contain audio files from each respective dataset (msd being the 7digital preview clips corresponding to the Million Song Dataset). The MP3 files should live in, e.g., data/uspop2002/mp3. Unfortunately, obtaining these MP3 files is non-trivial. If you need help tracking them down, please contact me directly.
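As a rough sketch (not part of this repository), the layout above can be checked before running anything. The helper name missing_data_dirs is hypothetical, and the mp3 subfolder for cal10k, cal500, and msd is an assumption extrapolated from the data/uspop2002/mp3 example. The code avoids Python 3-only features so that it should also run under Python 2.7.

```python
import os

# Dataset folder -> subfolder that holds the files, following the list above.
# Only data/uspop2002/mp3 is given explicitly; the other "mp3" entries are
# an assumption based on that example.
EXPECTED_DIRS = [
    ("clean_midi", "mid"),
    ("unique_midi", "mid"),
    ("uspop2002", "mp3"),
    ("cal10k", "mp3"),
    ("cal500", "mp3"),
    ("msd", "mp3"),
]


def missing_data_dirs(data_root):
    """Return the expected subdirectories that do not exist under data_root."""
    missing = []
    for dataset, subdir in EXPECTED_DIRS:
        path = os.path.join(data_root, dataset, subdir)
        if not os.path.isdir(path):
            missing.append(path)
    return missing
```

Calling missing_data_dirs("data") from the root of the repository returns an empty list once every folder is in place.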

File lists

All of the datasets in the data subdirectory (except for unique_midi) should have a corresponding file list in the file_lists subdirectory. The only one which is not included in this repository is msd.txt; you can obtain that from the MSD directly (it's distributed with the MSD as unique_tracks.txt) or you can download it here and rename it to msd.txt.

Software

All of the code in this repository is written for Python 2.7; it will likely need modification to work with Python 3.x. Here is a potentially incomplete list of the Python libraries used in this project:

  • numpy
  • scipy
  • librosa
  • pretty_midi
  • whoosh
  • joblib
  • deepdish
  • dhs
  • pse
  • msgpack
  • msgpack_numpy
  • lasagne
  • theano
  • sklearn
  • djitw
  • simple_spearmint
  • spearmint
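Since the list above may be incomplete, a quick way to see which of these libraries are importable in your environment is sketched below. This is only an illustration: the helper name find_missing is hypothetical, and it assumes each library's import name is exactly the name listed above.

```python
import importlib

# Import names assumed to be identical to the names in the list above.
LIBRARIES = [
    "numpy", "scipy", "librosa", "pretty_midi", "whoosh", "joblib",
    "deepdish", "dhs", "pse", "msgpack", "msgpack_numpy", "lasagne",
    "theano", "sklearn", "djitw", "simple_spearmint", "spearmint",
]


def find_missing(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except Exception:
            # Any import-time failure (not just ImportError) is treated
            # as "not usable in this environment".
            missing.append(name)
    return missing
```

Calling find_missing(LIBRARIES) returns the subset that still needs to be installed or fixed.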

Hardware

All of this code was designed to be run on a server with 64 GB of RAM, 12 CPU cores, an NVIDIA GTX 980 Ti GPU, and plenty of hard drive space. If your own setup has fewer resources, you may need to modify some of the scripts in various places so that they use an appropriate amount of RAM, parallel processes, etc. In any case, please note that running all of the experiments and steps from beginning to end will take at least a few weeks of compute time.

Process

The general structure of this repository is as follows: Collections of shared utilities (experiment_utils.py, feature_extraction.py, whoosh_search.py) live in the base level, one-time-use scripts for assembling data and performing the actual MIDI-to-audio matching live in the scripts directory, and experiments for evaluating the effectiveness of different matching techniques live in experiments. Any data/results generated by running these different files are written out to a results directory. To re-run all of the experiments, matching, etc., proceed as described below.

  1. Run create_whoosh_indices.py. This uses the file lists to create Whoosh indices, which allow for fuzzy text matching of metadata. We use this fuzzy text matching to create training data for different matching algorithms. The indices are written out to, e.g., data/msd/index/.
  2. Run text_match_datasets.py. This uses the Whoosh indices to match MIDI files from clean_midi (which ostensibly have reliable metadata) to entries in the different audio datasets. It also takes care to group audio files which are recordings of the same song. The results are written to results/text_matches.js.
  3. Run create_msd_cqts.py. This pre-computes constant-Q spectrograms for every entry in the Million Song Dataset, which saves time later on as we will need these for various steps throughout the process. They are written to data/msd/h5.
  4. Run align_text_matches.py. This uses dynamic time warping (specifically the approach proposed in [3]) to align each MIDI-audio pair found by metadata matching. The results are written to results/clean_midi_aligned, and include both the aligned MIDI files in results/clean_midi_aligned/mid and "diagnostics files" in results/clean_midi_aligned/h5. The diagnostics files contain information about whether each match is truly a match (an incorrect match can be caused e.g. by incorrect metadata or a bad transcription).
  5. Run split_training_data.py. This splits the matches into train, validation, development, and test collections which are used for evaluating each of the different matching approaches implemented in experiments.
  6. Run create_training_data.py. This inspects the results of align_text_matches.py to find good matches and generates training data for different matching approaches in a convenient format. It essentially produces saved constant-Q spectrograms of audio files, aligned MIDI files, unaligned MIDI files, and aligned MIDI piano rolls, in various folders in results.
  7. Run the experiments! Each subdirectory in the experiments directory corresponds to a different MIDI-audio matching technique. Each experiment contains at least a script called match_msd.py, which uses the matching technique to match each MIDI file in either the development or test set to the MSD and writes out the results. Most of the experiments also have a script called precompute.py, which precomputes any necessary features/representations of entries in the development and test sets. Finally, those experiments which are based on machine learning techniques also have a script called parameter_search.py, which trains any models necessary for performing the matching. In short, to run each of these experiments, run parameter_search.py if it exists, run precompute.py, and finally run match_msd.py. The results can be used to measure the effectiveness of each approach. There isn't a script which performs this analysis automatically, but there is a great deal of analysis in my thesis [1].
  8. To actually match the unique_midi collection to the Million Song Dataset, use the match.py script. For flexibility, this script takes a few command line arguments - first, a glob to MIDI files you want to match, and second, a path to where to write the results. To match the entire unique_midi dataset to the MSD, call it like so: python match.py ../data/unique_midi/mid/*/\*.mid output_path. This will produce (in output_path) one file for each MIDI file processed which lists potential matches in the MSD and the corresponding confidence scores.
  9. To assemble a collection of matched-and-aligned MIDI files, use the script assemble_aligned_matches.py. This will find all MIDI-audio matches produced by match.py which have a sufficiently high confidence score, re-align them, and write out the aligned MIDI file, along with the unaligned MIDI, MP3 file, and MSD H5, for convenience. In essence, this is how, at long last, each component of the Lakh MIDI dataset is produced.
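For readers unfamiliar with the dynamic time warping mentioned in step 4, here is a minimal textbook sketch in pure Python. It is not the optimized approach from [3], nor the implementation used by align_text_matches.py; the function name dtw and the list-of-lists distance matrix are illustrative assumptions. Given a matrix of pairwise distances between the frames of two sequences, it finds the lowest-cost monotonic path from the first pair of frames to the last.

```python
def dtw(dist):
    """Textbook dynamic time warping over a pairwise distance matrix.

    dist[i][j] is the distance between frame i of one sequence and frame j
    of the other. Returns (total_cost, path), where path is a list of
    (i, j) index pairs running from (0, 0) to the last pair of frames.
    """
    n, m = len(dist), len(dist[0])
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of any path ending at (i, j).
    cost = [[inf] * m for _ in range(n)]
    cost[0][0] = dist[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = dist[i][j] + best_prev

    # Backtrack from the end, always stepping to the cheapest predecessor.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path
```

In the MIDI-to-audio setting, the two sequences would be feature frames of the audio recording and of the synthesized MIDI file, and the returned path indicates how to adjust the MIDI timing to follow the audio. This quadratic-time version is meant only to show the idea.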

References

  1. Colin Raffel. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching". PhD Thesis, 2016.
  2. Colin Raffel and Daniel P. W. Ellis. "Large-Scale Content-Based Matching of MIDI and Audio Files". Proceedings of the 16th International Society for Music Information Retrieval Conference, 2015.
  3. Colin Raffel and Daniel P. W. Ellis. "Optimizing DTW-Based Audio-to-MIDI Alignment and Matching". Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
  4. Colin Raffel and Daniel P. W. Ellis. "Pruning Subsequence Search with Attention-Based Embedding". Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.