WASE

Overview

Demo samples of our paper "Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments". WASE is the first to explicitly model the start/end time of speech (onset/offset cues) in the speaker extraction problem.

The source code is available in WASE_202112. If you have any questions about implementation details, feel free to ask me ([email protected]).

Model

WASE is adapted from our previously proposed framework, which consists of five modules: a voiceprint encoder, an onset/offset detector, a speech encoder, a speech decoder, and a speaker extraction module.

In this work, we focus on the onset/offset cues of speech and verify their effectiveness in the speaker extraction task. We also combine the onset/offset cues with the voiceprint cue: the onset/offset cues model the start/end time of speech, while the voiceprint cue models the voice characteristics of the target speaker. Combining the two perceptual cues brings a significant performance improvement, while the extra parameters required are negligible. Please see the figure below for the detailed model structure.
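To make the five-module decomposition concrete, here is a minimal PyTorch sketch of how the modules could fit together. The module names, layer sizes, and the concatenation-based cue fusion are illustrative assumptions, not the released WASE implementation.

# Minimal sketch of the five-module decomposition described above.
# Layer sizes and the concatenation-based cue fusion are assumptions.
import torch
import torch.nn as nn

class WASESketch(nn.Module):
    def __init__(self, n_filters=256, emb_dim=128):
        super().__init__()
        # speech encoder: 1-D conv front-end over the raw waveform
        self.speech_encoder = nn.Conv1d(1, n_filters, kernel_size=20, stride=10)
        # voiceprint encoder: maps a reference utterance to a speaker embedding
        self.voiceprint_encoder = nn.GRU(n_filters, emb_dim, batch_first=True)
        # onset/offset detector: frame-wise probability that target speech is active
        self.onset_offset_detector = nn.Sequential(
            nn.Conv1d(n_filters + emb_dim, n_filters, 1), nn.ReLU(),
            nn.Conv1d(n_filters, 1, 1), nn.Sigmoid())
        # speaker extraction module: estimates a mask for the target speaker
        self.extractor = nn.Sequential(
            nn.Conv1d(n_filters + emb_dim, n_filters, 1), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, 1), nn.Sigmoid())
        # speech decoder: transposed conv back to the waveform domain
        self.speech_decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size=20, stride=10)

    def forward(self, mixture, reference):
        mix_feat = self.speech_encoder(mixture)                 # (B, F, T)
        ref_feat = self.speech_encoder(reference)                # (B, F, T_ref)
        _, h = self.voiceprint_encoder(ref_feat.transpose(1, 2))
        voiceprint = h[-1].unsqueeze(-1).expand(-1, -1, mix_feat.size(-1))
        fused = torch.cat([mix_feat, voiceprint], dim=1)
        activity = self.onset_offset_detector(fused)             # when to attend
        mask = self.extractor(fused) * activity                  # gate the mask by activity
        return self.speech_decoder(mix_feat * mask), activity

The key design idea this sketch tries to reflect is that the onset/offset detector gates the extraction mask, so the model only attends to the target speaker while that speaker is judged to be active.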


Datasets

The training samples are generated by randomly selecting utterances of different speakers from the si_tr_s subset of WSJ0 and mixing them at various signal-to-noise ratios (SNR). The evaluation samples are generated from the fixed list ./data/wsj/mix_2_spk_voiceP_tt_WSJ.txt. You may need the command below to modify the evaluation data paths.

sed -i 's/home\/aa/YOUR PATH/g' data/wsj/mix_2_spk_voiceP_tt_WSJ.txt
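As a rough illustration of the mixing step (not the exact recipe used to build the training set), the sketch below mixes two utterances at a randomly chosen SNR. The file names, the SNR range, and the use of the soundfile library are assumptions.

# Illustrative sketch: mix two utterances at a given SNR.
# File names and the SNR range are assumptions, not the paper's exact setup.
import numpy as np
import soundfile as sf

def mix_at_snr(target_path, interferer_path, snr_db, out_path):
    target, sr = sf.read(target_path)
    interferer, _ = sf.read(interferer_path)
    length = min(len(target), len(interferer))
    target, interferer = target[:length], interferer[:length]
    # scale the interferer so the target/interferer power ratio equals snr_db
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2) + 1e-8
    gain = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    mixture = target + gain * interferer
    sf.write(out_path, mixture, sr)
    return mixture

# e.g. draw a random SNR uniformly from [-2.5, 2.5] dB
snr = np.random.uniform(-2.5, 2.5)
mix_at_snr("spk1_utt.wav", "spk2_utt.wav", snr, "mixture.wav")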

Result

Audio Sample

  • Listen to audio samples at ./assets/demo.
  • Spectrogram samples (clean/mixture/prediction).


  • Onset / Offset Visualization.

The figure above reveals several properties of the onset/offset cues.

  1. The first modulation is almost flat. We attribute this to its location near the input, where the mixture speech has undergone very little processing. It also suggests that the first modulation position may be too early for onset/offset detection.
  2. Apart from the first one, the other modulations behave as expected. In particular, the last modulation is relatively stable, with fewer spikes.
  3. The last modulation shows slight spikes between the onset and the offset. We find that these spikes coincide with regions where clean speech overlaps with interfering speech. The detector appears to have lower confidence at these positions than at positions containing only clean speech or no clean speech at all.

Metric

Method                                #Params    SDRi (dB)
SBF-MTSAL                             19.3M      7.30
SBF-MTSAL-Concat                      8.9M       8.39
SpEx                                  10.8M      14.6
SpEx+                                 13.3M      17.2
WASE (onset/offset + voiceprint)      7.5M       17.05

Citations

If you find this repo helpful, please consider citing:

@inproceedings{hao2021wase,
  title={Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments},
  author={Hao, Yunzhe and Xu, Jiaming and Zhang, Peng and Xu, Bo},
  booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={6104--6108},
  year={2021},
  organization={IEEE}
}
@inproceedings{hao2020unified,
  title={A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments},
  author={Hao, Yunzhe and Xu, Jiaming and Shi, Jing and Zhang, Peng and Qin, Lei and Xu, Bo},
  booktitle={Proc. Interspeech 2020},
  pages={1431--1435},
  year={2020}
}

For a more detailed description, you can further explore the full paper via this link.
