
ShichengChen / Audio-Source-Separation

License: MIT
WaveNet for the separation of audio sources



Audio Source Separation

  • Obtain the accompaniment and the vocals from mixed music

Reference

WaveNet

WaveNet structure

[Figure: WaveNet structure]

  • Encode the audio with mu-law and then quantize it to 256 possible values.
  • The input is a quantized audio array; for example, input.shape = L, where L is the length of the audio.
  • Causal Conv is an ordinary (causal) convolutional layer.

[Figure: WaveNet structure]

  • Dilated Conv is shown in the figure above.
  • The left yellow circle is a tanh function and the right yellow circle is a sigmoid.
  • The red circle denotes an element-wise multiplication operator: tanh(DilatedConv0(x)) * sigmoid(DilatedConv1(x)) (see the sketch after this list).
  • The green squares are two ordinary convolutional layers with 1 * 1 kernel size.
  • One of these layers' output goes into the residual summation, and the other layer's output goes into the skip connections.
  • K is the number of stacked residual layers.
  • Each block has a skip connection, and the red circle sums these skip connections up.
  • The sum then goes through a ReLU, a 1 * 1 conv layer, another ReLU, another 1 * 1 conv layer, and a softmax.
  • The output is a quantized audio array; for example, the output shape is 256 * L, where 256 is the number of possible quantized values and L is the length of the audio.
  • Map the output to [-1, 1] and then decode it back to a raw audio array.
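
A minimal sketch of the mu-law quantization and the gated residual block described above, assuming NumPy and PyTorch; the function and class names are illustrative and not the exact code in this repository:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def mu_law_encode(x, mu=255):
    # audio in [-1, 1] -> integers in [0, mu]
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=255):
    # integers in [0, mu] -> audio in [-1, 1]
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

class GatedResidualBlock(nn.Module):
    # tanh(DilatedConv0(x)) * sigmoid(DilatedConv1(x)), then two 1 * 1 convs:
    # one feeds the residual summation, the other feeds the skip connections
    def __init__(self, channels, skip_channels, dilation):
        super().__init__()
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.res_conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip_conv = nn.Conv1d(channels, skip_channels, kernel_size=1)

    def forward(self, x):
        # pad on the left so the convolution stays causal and keeps the length
        pad = (self.filter_conv.kernel_size[0] - 1) * self.filter_conv.dilation[0]
        y = F.pad(x, (pad, 0))
        z = torch.tanh(self.filter_conv(y)) * torch.sigmoid(self.gate_conv(y))
        return x + self.res_conv(z), self.skip_conv(z)
```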

WaveNet for Audio Source Separation

  • A is the mixed audio, B is the vocals, and C is the accompaniment.
  • DeepMind's WaveNet uses A as both the input and the label.
  • I use A as the input and B as the label.

[Figure: WaveNet structure]

  • As shown in the figure above, I slightly changed the dilated conv layers.
  • I use A[0:100] to predict B[50] instead of using A[0:50] to predict A[50] (see the sketch below).
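
A toy illustration of that input/label alignment (make_pairs is a hypothetical helper for clarity; the actual model processes whole sequences rather than explicit windows):

```python
import numpy as np

def make_pairs(mix, vocals, receptive_field=100):
    # pair each window of the mix A with the vocal sample B in the middle of
    # the window, i.e. A[t:t+100] predicts B[t+50] rather than the next sample of A
    half = receptive_field // 2
    inputs, targets = [], []
    for t in range(len(mix) - receptive_field):
        inputs.append(mix[t:t + receptive_field])
        targets.append(vocals[t + half])
    return np.stack(inputs), np.array(targets)
```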

A Universal Music Translation Network

[Figures: FacebookNet structure]

  • The encoder is a fully convolutional network.
  • The encoder has three blocks of 10 residual layers, as shown in the first figure above.
  • The NC (non-causal) Dilated Conv layer is a dilated conv layer without the causal constraint.
  • After the three blocks, there is an additional 1 * 1 conv layer.
  • An average pooling layer with a kernel size of 800 (for a 16000 Hz sample rate) follows.
  • A domain confusion loss is then applied; I re-implemented the domain confusion loss (see the Domain confusion loss section below).
  • The encoding is upsampled to the original audio rate using nearest-neighbor interpolation (see the encoder sketch after this list).

[Figure: WaveNet structure]

  • The figure above shows the newer version of WaveNet used as the decoder.
  • The audio encoding is used to condition the WaveNet decoder. The conditioning signal is passed through a 1 * 1 layer that is different for each WaveNet layer.
  • The WaveNet decoder has 4 blocks of 10 residual layers.
  • The input and output are quantized using 8-bit mu-law encoding.
  • The loss function is a softmax (cross-entropy) loss.
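
A compact sketch of that encoder path, assuming PyTorch; the dilation pattern, the channel counts, and the assumption that the input has already been projected to `channels` channels are mine, not necessarily what this repository uses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderResLayer(nn.Module):
    # non-causal dilated conv with a residual connection; "same" padding keeps the length
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              dilation=dilation, padding=dilation)
        self.proj = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x + self.proj(F.relu(self.conv(x)))

class Encoder(nn.Module):
    def __init__(self, channels=64, bottleneck=64, pool=800):
        super().__init__()
        layers = []
        for _ in range(3):                          # three blocks ...
            for d in [2 ** i for i in range(10)]:   # ... of 10 residual layers each
                layers.append(EncoderResLayer(channels, d))
        self.blocks = nn.Sequential(*layers)
        self.out = nn.Conv1d(channels, bottleneck, kernel_size=1)  # extra 1 * 1 layer
        self.pool = nn.AvgPool1d(pool)              # kernel size 800 at 16 kHz

    def forward(self, x):
        # x: (batch, channels, T), already projected from raw audio to `channels` channels
        T = x.shape[-1]
        z = self.pool(self.out(self.blocks(x)))
        # upsample back to the original audio rate with nearest-neighbor interpolation
        return F.interpolate(z, size=T, mode='nearest')
```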

Data Augmentation for FacebookNet

  • Uniformly select a segment of length between 0.25 and 0.5 seconds
  • Modulate its pitch by a random amount between -0.5 and 0.5 half-steps (see the sketch below)
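
A minimal sketch of this augmentation, assuming librosa; the helper name `augment` and the 16 kHz default are mine:

```python
import numpy as np
import librosa

def augment(audio, sr=16000):
    # uniformly pick a segment of 0.25-0.5 seconds
    seg_len = int(np.random.uniform(0.25, 0.5) * sr)
    start = np.random.randint(0, len(audio) - seg_len)
    segment = audio[start:start + seg_len]
    # shift its pitch by a random amount in [-0.5, 0.5] half-steps
    n_steps = np.random.uniform(-0.5, 0.5)
    out = audio.copy()
    out[start:start + seg_len] = librosa.effects.pitch_shift(segment, sr=sr, n_steps=n_steps)
    return out
```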

Facebook Net for Audio Source Separation

  • Structure A: I made the decoding part the same as the encoding part, and removed the downsampling, the upsampling, and the confusion loss.
  • I used the data augmentation strategy from the U-Wave-Net paper. For example, with A the mixed audio, B the vocals, and C the accompaniment: B * factor0 + C * factor1 = newA, and I used newA as the input and C * factor1 as the label. factor0 and factor1 are chosen uniformly from the interval [0.7, 1.0].
  • I used ccMixter as the dataset. ccMixter has 3 children's songs; with two songs as training data and the other as testing data, the result on the testing data is also very good, even though it is slightly worse than on the training data.
  • The model also generalizes well on three rap songs.
  • For two songs with different background music and the same lyrics (the same voice), generalization is also OK, but worse than in the two situations above.
  • With the first 45 songs for training and the last 5 songs for testing, the results are still not good.
  • If I choose 9 different types of music, the result is not good even on the training set. I am trying to solve this problem.

Some other tests

  • I added the downsampling and upsampling, added the confusion loss, and used the short-time Fourier transform to preprocess the raw audio. The results are worse than with structure A.

Domain confusion loss

  • I implemented a domain confusion loss there (a sketch follows this list).
  • My result is better than the original paper's result, but when I add it to structure A the result becomes very bad. I think this is because I need the domain information when generating the music without vocals; the original music and the accompaniment should stay the same type.
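
A minimal sketch of a domain confusion setup built around a gradient reversal layer, assuming PyTorch; the class names, the lambda weight, and the classifier shape are my assumptions rather than the exact code of this repository:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # identity in the forward pass, negated (scaled) gradient in the backward pass
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainClassifier(nn.Module):
    # small classifier on the encoder output; its cross-entropy loss on domain labels
    # is the confusion term, and the reversed gradient pushes the encoder toward
    # domain-independent encodings
    def __init__(self, channels, n_domains, lam=0.01):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(channels, n_domains, kernel_size=1),
        )

    def forward(self, encoding):
        z = GradReverse.apply(encoding, self.lam)
        return self.net(z).mean(dim=-1)  # average over time -> (batch, n_domains)
```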

TODO for facebook net

  • Try to add a decoding part to structure A. The bottleneck during inference is the autoregressive process done by the WaveNet; try to use the dedicated CUDA kernels coded by NVIDIA.

U-Wave-Net

U-Wave-Net Structure

[Figure: U-Wave-Net structure]

  • Uses LeakyReLU activations except for the final layer, which uses tanh.
  • Downsampling discards the features for every other time step to halve the time resolution.
  • Concat concatenates the current high-level features with the more local features x.
  • Since they do not pad with zeros, they need to crop before concatenating.
  • The input and output are raw audio.
  • The loss function is the mean squared error (a sketch of the downsampling and crop-and-concat steps follows this list).
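
A small sketch of the downsampling and crop-and-concat steps, assuming PyTorch; the kernel size and the LeakyReLU slope are my choices rather than values taken from this repository:

```python
import torch
import torch.nn as nn

def crop_and_concat(upsampled, shortcut):
    # the shortcut from the downsampling path is longer because no zero padding is
    # used in the convolutions, so crop it (centered) before concatenating
    diff = shortcut.shape[-1] - upsampled.shape[-1]
    cropped = shortcut[..., diff // 2 : diff // 2 + upsampled.shape[-1]]
    return torch.cat([upsampled, cropped], dim=1)

class DownBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=15):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)  # no padding: output is shorter
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        feat = self.act(self.conv(x))
        # keep every other time step to halve the resolution; also return the shortcut
        return feat[..., ::2], feat
```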

Data Augmentation for U-Wave-Net

  • A is the mixed audio, B is the vocals, and C is the accompaniment.
  • B * factor0 + C * factor1 = newA
  • newA is used as the input and C * factor1 as the label.
  • factor0 and factor1 are chosen uniformly from the interval [0.7, 1.0] (see the sketch below).
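
A minimal sketch of this remixing augmentation (the helper name remix is mine):

```python
import numpy as np

def remix(vocals, accompaniment, rng=np.random):
    # scale each source by an independent factor drawn from [0.7, 1.0] and re-mix
    factor0 = rng.uniform(0.7, 1.0)
    factor1 = rng.uniform(0.7, 1.0)
    new_mix = vocals * factor0 + accompaniment * factor1   # newA, the network input
    target = accompaniment * factor1                       # label: C * factor1
    return new_mix, target
```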

Result for the U-Wave-Net

  • The U-Wave-Net result is better than my results.

DeepConvSep
