huckiyang / QuantumSpeech-QCNN

Licence: other
IEEE ICASSP 21 - Quantum Convolution Neural Networks for Speech Processing and Automatic Speech Recognition

Programming Languages

  • Jupyter Notebook
  • Python

Projects that are alternatives to or similar to QuantumSpeech-QCNN

Formant Analyzer
iOS application for finding formants in spoken sounds
Stars: ✭ 43 (-39.44%)
Mutual labels:  speech-recognition, speech-processing
Speechbrain.github.io
The SpeechBrain project aims to build a novel speech toolkit fully based on PyTorch. With SpeechBrain, users can easily create speech processing systems for speech recognition (both HMM/DNN and end-to-end), speaker recognition, speech enhancement, speech separation, multi-microphone speech processing, and more.
Stars: ✭ 242 (+240.85%)
Mutual labels:  speech-recognition, speech-processing
Keras Sincnet
Keras (tensorflow) implementation of SincNet (Mirco Ravanelli, Yoshua Bengio - https://github.com/mravanelli/SincNet)
Stars: ✭ 47 (-33.8%)
Mutual labels:  speech-recognition, speech-processing
Awesome Diarization
A curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.
Stars: ✭ 673 (+847.89%)
Mutual labels:  speech-recognition, speech-processing
react-native-spokestack
Spokestack: give your React Native app a voice interface!
Stars: ✭ 53 (-25.35%)
Mutual labels:  speech-recognition, speech-processing
Sincnet
SincNet is a neural architecture for efficiently processing raw audio samples.
Stars: ✭ 764 (+976.06%)
Mutual labels:  speech-recognition, speech-processing
Zzz Retired openstt
RETIRED - OpenSTT is now retired. If you would like more information on Mycroft AI's open source STT projects, please visit:
Stars: ✭ 146 (+105.63%)
Mutual labels:  speech-recognition, speech-processing
spokestack-ios
Spokestack: give your iOS app a voice interface!
Stars: ✭ 27 (-61.97%)
Mutual labels:  speech-recognition, speech-processing
torchsubband
Pytorch implementation of subband decomposition
Stars: ✭ 63 (-11.27%)
Mutual labels:  speech-recognition, speech-processing
UHV-OTS-Speech
A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.
Stars: ✭ 94 (+32.39%)
Mutual labels:  speech-recognition, speech-processing
Uspeech
Speech recognition toolkit for the Arduino
Stars: ✭ 448 (+530.99%)
Mutual labels:  speech-recognition, speech-processing
Adaptive-Gradient-Clipping
Minimal implementation of adaptive gradient clipping (https://arxiv.org/abs/2102.06171) in TensorFlow 2.
Stars: ✭ 74 (+4.23%)
Mutual labels:  colab-notebook, tensorflow2
scim
[wip] Speech recognition toolbox written in Nim, based on Arraymancer.
Stars: ✭ 17 (-76.06%)
Mutual labels:  speech-recognition, speech-processing
Pncc
An implementation of Power Normalized Cepstral Coefficients (PNCC)
Stars: ✭ 40 (-43.66%)
Mutual labels:  speech-recognition, speech-processing
UniSpeech
UniSpeech - Large Scale Self-Supervised Learning for Speech
Stars: ✭ 224 (+215.49%)
Mutual labels:  speech-recognition, speech-processing
Nonautoreggenprogress
Tracking the progress in non-autoregressive generation (translation, transcription, etc.)
Stars: ✭ 118 (+66.2%)
Mutual labels:  speech-recognition, speech-processing
open-speech-corpora
💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies
Stars: ✭ 841 (+1084.51%)
Mutual labels:  speech-recognition, speech-processing
Speech-Backbones
This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.
Stars: ✭ 205 (+188.73%)
Mutual labels:  speech-recognition, speech-processing
TFLite-ModelMaker-EfficientDet-Colab-Hands-On
Hands-on material for object detection with TensorFlow Lite Model Maker
Stars: ✭ 15 (-78.87%)
Mutual labels:  colab-notebook, tensorflow2
awesome-keyword-spotting
This repository is a curated list of awesome Speech Keyword Spotting (Wake-Up Word Detection).
Stars: ✭ 150 (+111.27%)
Mutual labels:  speech-recognition, speech-processing

Quantum Deep Learning for Speech

Quantum Machine Learning for Automatic Spoken-Term Recognition.

  • NEW Our paper has been accepted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2021.

We would like to thank the reviewers and committee members in the Speech Processing and Quantum Signals community.

The quantum speech processing code was released in December 2020! A Colab demo is also provided. ICASSP Video | Slides

  • ICASSP 21 Paper | Arxiv "Decentralizing Feature Extraction with Quantum Convolutional Neural Network for Automatic Speech Recognition"

1. Environment

TensorFlow

  • Option 1: install via conda and pip
conda install -c anaconda tensorflow-gpu=2.0
conda install -c conda-forge scikit-learn 
conda install -c conda-forge librosa 
pip install pennylane --upgrade 
  • Option 2: create the environment from environment.yml (for a 2080 Ti with CUDA 10.0)
conda env create -f environment.yml

Originally developed with TensorFlow 2.0 and CUDA 10.0.

2. Dataset

We use the Google Speech Commands Dataset V1 for limited-vocabulary speech recognition.

mkdir ../dataset
cd ../dataset
wget http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
tar -xf speech_commands_v0.01.tar.gz

2.1. Pre-processed Features

We provide 2,000 pre-processed features in ./data_quantum, which include both Mel features and (2,2) quanvolution features, split into 1,500 training and 500 test examples. You can reach 90.6% test accuracy with the provided data.

You can use np.load to load these features and train your own quantum speech processing model as in 3.1, for example as sketched below.
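
A minimal loading sketch (the file names here are hypothetical; list ./data_quantum for the actual ones):

import numpy as np

# Hypothetical file names -- check ./data_quantum for the real ones.
x_train = np.load("data_quantum/quanv_train.npy")    # (1500, H, W, C) quanvolution features
y_train = np.load("data_quantum/train_labels.npy")
x_test  = np.load("data_quantum/quanv_test.npy")     # (500, H, W, C)
y_test  = np.load("data_quantum/test_labels.npy")
print(x_train.shape, x_test.shape)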

2.2. Audio Features Extraction (optional)

Please set the sampling rate sr and the data ratio (--port N uses 1/N of the data; --port 1 uses all of it) when extracting Mel features; a librosa sketch follows the command.

python main_qsr.py --sr 16000 --port 100 --mel 1 --quanv 1
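
For reference, a minimal sketch of the Mel step using librosa; the exact parameters live in main_qsr.py, and n_mels=60 with hop_length=128 are assumptions inferred from the "Shape 60 126" message in 2.3 (a 1 s clip at 16 kHz with hop 128 yields 126 frames):

import librosa
import numpy as np

def extract_mel(path, sr=16000, n_mels=60, hop_length=128):
    # Load one Speech Commands clip and pad/trim it to exactly 1 second.
    y, _ = librosa.load(path, sr=sr)
    y = librosa.util.fix_length(y, size=sr)
    # Log-Mel spectrogram: 16000 samples / 128 hop -> 126 frames.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    return librosa.power_to_db(mel, ref=np.max)      # shape (n_mels, 126)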

2.3. Quanvolution Encoding (optional)

If you have pre-loaded audio features from 2.2, you can set the quantum convolution kernel size in the quanv function of helper_q_tool.py. We provide an example for kernel size = 3 at line 57.

You will see a message like the one below during quanvolution encoding when running the feature-extraction command from 2.2. A PennyLane sketch of the quanvolution step follows the log.

===== Shape 60 126
Kernal =  2
Quantum pre-processing of train Speech:
2/175
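
For intuition, a minimal (2,2) quanvolution sketch in PennyLane, following the standard quanvolutional-layer recipe; the actual circuit and encoding in helper_q_tool.py may differ:

import numpy as np
import pennylane as qml

n_qubits = 4                        # a 2x2 kernel covers 4 values, one qubit each
dev = qml.device("default.qubit", wires=n_qubits)
rand_params = np.random.uniform(0, 2 * np.pi, size=(1, n_qubits))

@qml.qnode(dev)
def circuit(patch):
    # Angle-encode each value of the flattened 2x2 patch.
    for j in range(n_qubits):
        qml.RY(np.pi * patch[j], wires=j)
    # A random entangling layer, as in the quanvolution literature.
    qml.templates.RandomLayers(rand_params, wires=list(range(n_qubits)))
    # One expectation value per qubit -> 4 output channels.
    return [qml.expval(qml.PauliZ(j)) for j in range(n_qubits)]

def quanv(feature_map, kernel=2):
    # Slide the circuit over the 2-D feature map with stride = kernel size.
    h, w = feature_map.shape
    out = np.zeros((h // kernel, w // kernel, n_qubits))
    for i in range(0, h - kernel + 1, kernel):
        for j in range(0, w - kernel + 1, kernel):
            patch = feature_map[i:i + kernel, j:j + kernel].reshape(-1)
            out[i // kernel, j // kernel] = circuit(patch)
    return out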

3. Training

3.1 QCNN U-Net Bi-LSTM Attention Model

Spoken-term recognition with the additional U-Net encoder discussed in our work.

python main_qsr.py

Training runs for 25 epochs. One way to improve recognition performance is to encode more data for training; refer to 2.2 and 2.3.

1500/1500 [==============================] - 3s 2ms/sample - val_loss: 0.4408 - val_accuracy: 0.9060                              

Please set use_Unet = False in model.py:

def attrnn_Model(x_in, labels, ablation=False):
    # use a plain LSTM as the recurrent layer
    rnn_func = L.LSTM
    # toggle the additional U-Net encoder from Sec. 3.1
    use_Unet = False

3.2 Neural Saliency by Class Activation Mapping (CAM)

python cam_sp.py
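
As a rough sketch of what plain CAM computes (assuming a global-average-pooling head; the layer name below is a placeholder, and cam_sp.py may differ):

import numpy as np
import tensorflow as tf

def class_activation_map(model, x, class_idx, last_conv_name="conv2d_last"):
    # "conv2d_last" is hypothetical -- read the real name from model.summary().
    conv_out = model.get_layer(last_conv_name).output
    feat_model = tf.keras.Model(model.inputs, conv_out)
    fmaps = feat_model.predict(x[np.newaxis, ...])[0]   # (H, W, C)
    w = model.layers[-1].get_weights()[0]               # (C, n_classes); assumes GAP + Dense
    cam = fmaps @ w[:, class_idx]                       # weight feature maps by class weights
    cam = np.maximum(cam, 0)                            # keep positive evidence only
    return cam / (cam.max() + 1e-8)                     # normalize to [0, 1]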

3.3 CTC Model for Automatic Speech Recognition

We also provide a CTC model with word error rate (WER) evaluation for future studies by the community; refer to the discussion.

For example, an output "y-e--a" for the input "yes" collapses to "yea" under the CTC alignment and is counted as an incorrect word, as the sketch below shows.
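
Concretely, CTC decoding merges repeated symbols and then removes blanks ('-' here), so "y-e--a" collapses to "yea", which does not match "yes":

def ctc_collapse(path, blank="-"):
    # Merge repeats, then drop blanks, as in standard CTC best-path decoding.
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

assert ctc_collapse("y-e--a") == "yea"   # "yea" != "yes" -> one word error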

Note that this quantum ASR CTC version only supports tensorflow-gpu==2.3. Please create a new environment to run this experiment.

  • Unzip the features for ASR
cd data_quantum/asr_set
bash unzip.sh
  • Run the CTC model in ./speech_quantum_dl
python qsr_ctc_wer.py

A pre-trained weight file is provided in checkpoints/asr_ctc_demo.hdf5.

Epoch 32/50
107/107 [==============================] - 5s 49ms/step - loss: 0.1191 - val_loss: 0.7115
Epoch 33/50
107/107 [==============================] - 5s 49ms/step - loss: 0.1547 - val_loss: 0.6701
=== WER: 9.895833333333334  % 
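
The WER above follows the standard definition, word-level edit distance over the number of reference words; a self-contained sketch (qsr_ctc_wer.py may implement it differently):

def wer(ref, hyp):
    # Word-level Levenshtein distance divided by the reference length, in percent.
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 100.0 * d[-1][-1] / len(r)

print(wer("yes no up", "yea no up"))                 # 33.33 (1 error in 3 words)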

Tutorial Link.

  • For academic purposes only. Feel free to contact the author for other uses.

Reference

If this work helps your research or you use the code, please consider citing our paper. Thank you!

@inproceedings{yang2021decentralizing,
  title={Decentralizing feature extraction with quantum convolutional neural network for automatic speech recognition},
  author={Yang, Chao-Han Huck and Qi, Jun and Chen, Samuel Yen-Chi and Chen, Pin-Yu and Siniscalchi, Sabato Marco and Ma, Xiaoli and Lee, Chin-Hui},
  booktitle={2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={6523--6527},
  year={2021},
  organization={IEEE}
}

Federated Learning and Virtualization

See PySyft and PyVertical for a vertical federated learning setup. Please refer to a vertical learning example for virtualization.

Acknowledgment

We would like to thank Xanadu AI for providing PennyLane and IBM Research for providing Qiskit and quantum hardware to the community. There is no conflict of interest.

FAQ

Since the area between speech and quantum ML is still quite new, please feel free to open an issue for discussion.

Feel free to use this implementation for other speech processing or sequence modeling tasks (e.g., speaker recognition, speech separation, event detection, ...), given the quantum advantages discussed in the paper.
