Time-Domain Filterbanks

PyTorch implementation of Learning Filterbanks from Raw Speech for Phone Recognition (ICASSP 2018).

Time-Domain Filterbanks (TD-filterbanks) are neural network layers that operate directly on a raw audio waveform. At initialization, they approximate standard mel-filterbanks by computing first-order scattering coefficients. They can then be fine-tuned jointly with the rest of the architecture. The usual mel-filterbank options can be specified: a pre-emphasis layer, log compression of the coefficients, and mean-variance normalization.
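
The three options mentioned above are simple operations, and a minimal sketch may help make them concrete. The function names (`preemphasis`, `log_compress`, `mvn`), the `eps` offsets, and the choice to normalize each channel over time are assumptions made for illustration; they are not taken from this repository's code.

```python
import torch
import torch.nn.functional as F

def preemphasis(wav, alpha=0.97):
    """y[t] = x[t] - alpha * x[t-1], written as a 2-tap convolution.

    wav has shape (batch, 1, samples); the first sample is left unchanged.
    """
    kernel = torch.tensor([[[-alpha, 1.0]]], dtype=wav.dtype)
    # Left-pad by one sample so the output keeps the input length.
    return F.conv1d(F.pad(wav, (1, 0)), kernel)

def log_compress(coeffs, eps=1.0):
    """Log compression; the offset eps is an assumption that avoids log(0)."""
    return torch.log(eps + coeffs.abs())

def mvn(coeffs, eps=1e-8):
    """Mean-variance normalization over time, per utterance and per channel.

    coeffs has shape (batch, channels, frames).
    """
    mean = coeffs.mean(dim=-1, keepdim=True)
    std = coeffs.std(dim=-1, keepdim=True)
    return (coeffs - mean) / (std + eps)

wav = torch.tensor([[[1.0, 2.0, 3.0]]])
emphasized = preemphasis(wav, alpha=0.5)  # tensor([[[1.0, 1.5, 2.0]]])
```

Writing the pre-emphasis as a convolution is what lets it sit as a layer below the filterbank, as the `preemp` option suggests.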

Different types of TD-Filterbanks

There are four different modes for TD-filterbanks:

Fixed (mode='fixed'): Initialize the layers to match mel-filterbanks and keep their parameters fixed while training the model.
Learn-all (mode='learnall'): Initialize the layers and let the filterbank and the averaging be learned jointly with the model.
Learn-filterbank (mode='learnfbanks'): Start from the initialization and learn only the filterbank with the model, keeping the averaging fixed to a squared Hanning window.
Randinit (mode='learnall', without calling the initialization): Initialize the layers randomly and learn them with the network.
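
One way to read the four modes is as a choice of which of the two convolutions receives gradients, plus whether the mel initialization is applied. The sketch below is only an illustration of that reading: `MODE_TABLE`, `apply_mode`, and the `"randinit"` key are hypothetical names, and the layers are plain `torch.nn.Conv1d` modules rather than this repository's classes.

```python
import torch.nn as nn

# Hypothetical table: mode -> (train filterbank, train averaging, apply mel init).
MODE_TABLE = {
    "fixed":       (False, False, True),
    "learnall":    (True,  True,  True),
    "learnfbanks": (True,  False, True),
    "randinit":    (True,  True,  False),
}

def apply_mode(filterbank: nn.Module, averaging: nn.Module, mode: str) -> bool:
    """Freeze or unfreeze both layers; return whether mel init is expected."""
    train_fb, train_avg, mel_init = MODE_TABLE[mode]
    for p in filterbank.parameters():
        p.requires_grad = train_fb
    for p in averaging.parameters():
        p.requires_grad = train_avg
    return mel_init

# Stand-in layers: 40 complex filters (80 real channels) and 40 averaging filters.
fb = nn.Conv1d(1, 80, kernel_size=400, bias=False)
avg = nn.Conv1d(40, 40, kernel_size=400, groups=40, bias=False)
needs_init = apply_mode(fb, avg, "learnfbanks")
```

With `"learnfbanks"`, only the filterbank parameters remain trainable and the averaging layer stays at its initial value.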

TD-filterbanks

Time-Domain Filterbanks are a neural architecture composed of a complex-valued convolution, a modulus operator, and a grouped real-valued convolution. This structure follows the computation of first-order scattering coefficients. A layer is created by instantiating the class TDFbanks:

import melfilters
import utils
import model
# Main parameters
layer_params = dict(mode='fixed',           # type of td-fbanks (fixed, learnall, learnfbanks)
                    nfilters=40,            # number of filters
                    samplerate=16000,       # samplerate of the waveform
                    wlen=25,                # length of the window (in milliseconds)
                    wstride=10,             # stride of the window
                    compression='log',      # compression of coefficients (log or None)
                    preemp=True,            # add a pre-emphasis layer below the td-fbanks
                    mvn=True)               # perform mean-variance normalization per utterance on the coefficients

tdfbanks = model.TDFbanks(**layer_params)
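
The three stages described above (complex-valued convolution, modulus, grouped real-valued convolution) can be sketched with standard `torch.nn` layers. This is a minimal illustration, not the repository's `model.TDFbanks`: the class name `TDFbanksSketch`, the interleaved real/imaginary channel layout, the use of the squared modulus, the absence of padding, and the assumption that `wstride` is in milliseconds like `wlen` are all choices made for the example.

```python
import torch
import torch.nn as nn

class TDFbanksSketch(nn.Module):
    """Illustrative only: complex conv -> squared modulus -> grouped low-pass conv."""

    def __init__(self, nfilters=40, samplerate=16000, wlen=25, wstride=10):
        super().__init__()
        # Assumption: wlen and wstride are both given in milliseconds.
        wlen_samples = samplerate * wlen // 1000       # 400 samples at 16 kHz
        stride_samples = samplerate * wstride // 1000  # 160 samples at 16 kHz
        self.nfilters = nfilters
        # Complex convolution stored as 2 * nfilters real filters,
        # interleaved as re_0, im_0, re_1, im_1, ...
        self.complex_conv = nn.Conv1d(1, 2 * nfilters, wlen_samples, bias=False)
        # One averaging filter per channel (groups=nfilters), applied with the stride.
        self.lowpass = nn.Conv1d(nfilters, nfilters, wlen_samples,
                                 stride=stride_samples, groups=nfilters, bias=False)

    def forward(self, wav):
        # wav: (batch, 1, samples)
        x = self.complex_conv(wav)                    # (batch, 2 * nfilters, T)
        x = x.view(x.size(0), self.nfilters, 2, -1)   # group (re, im) pairs
        x = x.pow(2).sum(dim=2)                       # squared modulus per filter
        return self.lowpass(x)                        # (batch, nfilters, frames)

wav = torch.randn(2, 1, 16000)  # two one-second utterances at 16 kHz
out = TDFbanksSketch()(wav)
```

The grouped convolution applies a separate averaging filter to each channel, so the 40 filter outputs never mix; this is what makes the averaging comparable to windowing each mel band independently.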

Initialization

When a TDFbanks layer is created, the weights of its convolutional layers are initialized randomly. With mode="learnall" and no further initialization, this corresponds to the randinit type of TD-filterbanks. To initialize the layers so that they match standard mel-filterbanks instead:

# Initialization parameters
init_params = dict(min_freq=0,              # minimum frequency spanned by the filters
                   max_freq=8000,           # maximum frequency spanned by the filters
                   nfft=512,                # number of frequency bins for the mel-filterbanks to replicate
                   window_type='hamming',   # windowing function
                   normalize_energy=False,  # replicate mel-filterbanks that are normalized in energy, or that peak at 1
                   alpha=0.97)              # pre-emphasis parameter

tdfbanks.initialize(**init_params)
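
To give a sense of what `min_freq`, `max_freq`, and the number of filters control, the sketch below computes center frequencies equally spaced on the mel scale. It assumes the common formula mel = 2595 * log10(1 + f / 700); the function names are hypothetical, and the repository's `melfilters` module may use a different convention.

```python
import math

def hz_to_mel(freq_hz):
    """Common mel-scale formula (an assumption, not read from this repository)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(nfilters, min_freq, max_freq):
    """Center frequencies in Hz of nfilters bands equally spaced in mel."""
    lo, hi = hz_to_mel(min_freq), hz_to_mel(max_freq)
    step = (hi - lo) / (nfilters + 1)
    # Skip the two end points, which only bound the first and last filters.
    return [mel_to_hz(lo + step * (i + 1)) for i in range(nfilters)]

centers = mel_center_frequencies(nfilters=40, min_freq=0, max_freq=8000)
```

The centers are densely packed at low frequencies and spread out toward `max_freq`, which is the spacing the initialized filters are meant to reproduce.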

Dependencies

Python 2/3 with NumPy
PyTorch
CUDA

Installation

Simply clone the repository:

git clone https://github.com/facebookresearch/tdfbanks.git
cd tdfbanks

References

If you find this code useful, please consider citing:

Learning Filterbanks from Raw Speech for Phone Recognition - N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve, E. Dupoux

@inproceedings{zeghidour2017learning,
  title={Learning Filterbanks from Raw Speech for Phone Recognition},
  author={Zeghidour, Neil and Usunier, Nicolas and Kokkinos, Iasonas and Schatz, Thomas and Synnaeve, Gabriel and Dupoux, Emmanuel},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on},
  year={2018},
  organization={IEEE}
}

Contact: [email protected]
