Licence: AGPL-3.0
Supplementary information and code for INTERSPEECH 2018 paper: Singing voice phoneme segmentation by hierarchically inferring syllable and phoneme onset positions


INTERSPEECH 2018 phoneme segmentation

Singing voice phoneme segmentation by hierarchically inferring syllable and phoneme onset positions

The code in this repository reproduces the experimental results of the conference paper.

For a demo of the proposed algorithm, please check the Jupyter notebook. You should be able to "open with" Google Colaboratory in your Google Drive, then "open in playground" to execute it block by block.

The code of the demo is in the distribute branch.

Contents

A. Paper complementary information

B. Code usage

Questions?
References

A. Paper complementary information

A.1 Results example

[Figure: results_example]

An illustration of the result for a testing singing phrase.

  • The red and black vertical lines are respectively the syllable and phoneme onset positions (1st row: ground truth, 2nd and 3rd rows: proposed method detections, 4th row: baseline method detections).
  • The blue curves in the 2nd and 3rd rows are the syllable and phoneme onset detection functions (ODFs), respectively.
  • The 4th row shows the syllable/phoneme labels on the y-axis, the emission probability matrix in the background, and the alignment path as the blue staircase line.

A.2 Annotation units for phoneme-level

  1. This table shows the annotation units used in the 'pinyin', 'dianSilence' and 'details' tiers of each Praat TextGrid.

  2. Units are given in both Chinese pinyin and X-SAMPA format.

  3. The initials b, p, d, t, k, j, q, x, zh, ch, sh, z, c, s are grouped into one representation (not a formal X-SAMPA symbol): c.

  4. v, N, J (X-SAMPA) are three special pronunciations which do not exist in pinyin.

Structure | Pinyin [X-SAMPA]
head      | initials: m[m], f[f], n[n], l[l], g[k], h[x], r[r\'], y[j], w[w],
          |   {b, p, d, t, k, j, q, x, zh, ch, sh, z, c, s} - grouped as [c],
          |   [v], [N], [J] - special pronunciations
medial    | vowels: i[i], u[u], ü[y]
belly     | simple finals: a[a"], o[O], e[7], ê[E], i[i], u[u], ü[y],
          |   i (zhi, chi, shi) [1], i (zi, ci, si) [M]
          | compound finals: ai[aI^], ei[eI^], ao[AU^], ou[oU^]
          | nasal finals: an[an], en[@n], in[in], ang[AN], eng[7N], ing[iN], ong[UN]
          | retroflexed finals: er [@][r\']
tail      | i[i], u[u], n[n], ng[N]

A.3 Baseline forced alignment details

The baseline is a 1-state monophone DNN/HSMM model. We use a monophone model because (i) our small dataset does not have enough phoneme instances to explore a context-dependent triphone model, and (ii) Brognaux and Drugman[1] and Pakoci et al.[2] argued that context-dependent models do not bring a significant alignment improvement. A 1-state model is convenient because each phoneme can then be represented by a single semi-Markovian state carrying a state occupancy time distribution. The audio preprocessing step is the same as in section 3.1 of the paper.

Discriminative acoustic model: We use a CNN with softmax outputs as the discriminative acoustic model. According to the work of Renals et al.[3], a neural network with softmax outputs trained for framewise phoneme classification estimates the posterior probability p(q|x) (q: state, x: observation), which can be used as the frame-level acoustic model if we assume equal phoneme class priors. In our previous work, we designed a one-layer CNN with multiple filter shapes and showed experimentally that this architecture can learn timbral characteristics and outperform some deeper CNN architectures in the phoneme classification task on a small jingju singing dataset[4]. We therefore use this one-layer CNN as the acoustic model for the baseline method. The same log-mel context introduced in section 3.1 is used as the model input, with its phoneme class as the target label. The model predicts the phoneme class posterior probability for each context.
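As a rough illustration, the sketch below builds a one-layer CNN with several parallel filter shapes and a softmax output over phoneme classes, in the spirit of the architecture described above. The input shape (80 log-mel bands x 21 context frames), the number of phoneme classes and the filter shapes are placeholder assumptions, not the exact configuration of [4].

```python
# Minimal sketch (not the exact architecture of [4]): a one-layer CNN with
# several filter shapes in parallel, followed by a softmax over phoneme classes.
# Input shape (80 mel bands x 21 context frames) and class count are assumptions.
from tensorflow.keras import layers, models

def build_acoustic_model(n_mels=80, n_frames=21, n_phoneme_classes=29):
    inp = layers.Input(shape=(n_mels, n_frames, 1))
    # Parallel convolutions with different filter shapes to capture
    # timbral (tall) and temporal (wide) patterns in the log-mel context.
    branches = []
    for filter_shape in [(50, 1), (70, 1), (50, 5), (70, 5)]:  # illustrative shapes
        x = layers.Conv2D(16, filter_shape, activation='relu')(inp)
        x = layers.GlobalMaxPooling2D()(x)
        branches.append(x)
    x = layers.Concatenate()(branches)
    x = layers.Dense(64, activation='relu')(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(n_phoneme_classes, activation='softmax')(x)  # p(q | x)
    model = models.Model(inp, out)
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model
```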

Coarse duration and state occupancy distribution: The baseline method receives the phoneme durations of the teacher's singing phrase as the prior input. The phoneme durations are stored in a collapsed version of the M_p array (section 3.2.1):

[Equation: collapsed M_p array]

The silences are treated separately and have their independent durations.

The state occupancy is the time duration that the student sings a certain phoneme state. It is expected to be the same duration as in the teacher's singing. We model the state occupancy distribution as a Gaussian of the same form as in section 3.2.1, where mu_n indicates in this context the duration of the teacher's nth phoneme. We set gamma empirically to 0.2, as we found that this value works well in our preliminary experiments.
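A minimal sketch of this duration model is given below. It assumes the Gaussian is parameterised with mean mu_n (the teacher's nth phoneme duration) and standard deviation gamma * mu_n, and that the distribution is discretised on the frame grid; the exact parameterisation in section 3.2.1 may differ.

```python
# Sketch of the state occupancy distribution: a Gaussian over durations whose
# mean is the teacher's n-th phoneme duration and whose spread is controlled by
# gamma (set to 0.2). The sigma = gamma * mu_n parameterisation is an assumption.
import numpy as np
from scipy.stats import norm

def occupancy_distribution(mu_n, gamma=0.2, hop_time=0.01, max_dur=None):
    """Return a discretised duration distribution for one phoneme state.

    mu_n     : teacher's duration of the n-th phoneme, in seconds
    gamma    : relative standard deviation
    hop_time : frame hop in seconds
    """
    if max_dur is None:
        max_dur = 2.0 * mu_n
    durations = np.arange(hop_time, max_dur, hop_time)         # candidate durations
    probs = norm.pdf(durations, loc=mu_n, scale=gamma * mu_n)  # Gaussian occupancy
    return durations, probs / probs.sum()                      # normalise over the grid
```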

HSMM for phoneme boundary and label inference: We construct an HSMM for phoneme segment inference. The topology is a left-to-right semi-Markov chain whose states represent, in order, the phonemes of the teacher's singing phrase. As we are dealing with forced alignment, we constrain the inference to start at the leftmost state and terminate at the rightmost state. The self-transition probabilities are set to 0 because the state occupancy is governed by the predefined duration distribution; the transition from each state to its successor is set to 1. The inference goal is to find the best state sequence, for which we use Guédon's HSMM Viterbi algorithm[5]. The implementation can be found in the path lyricsRecognizer. Finally, the segments are labeled by the alignment path, and the phoneme onsets are taken at the state transition time positions.
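For illustration, here is a simplified left-to-right HSMM (segmental) Viterbi decoder under the constraints described above: every state is visited exactly once, in order, self-transitions are forbidden, and each state carries its own duration distribution. This is a hedged sketch, not the Guédon implementation used in lyricsRecognizer; the emissions array and the dur_logp duration tables are assumed inputs.

```python
# Simplified left-to-right HSMM Viterbi for forced alignment.
#   emissions : (T, N) array of frame-level state posteriors/likelihoods
#   dur_logp  : list of N dicts mapping duration (in frames) -> log probability
import numpy as np

def hsmm_forced_alignment(emissions, dur_logp):
    T, N = emissions.shape
    log_emis = np.log(emissions + 1e-30)
    cum = np.vstack([np.zeros(N), np.cumsum(log_emis, axis=0)])  # prefix sums per state

    NEG = -np.inf
    delta = np.full((T + 1, N), NEG)  # delta[t, n]: best score with states 0..n ending at frame t
    back = np.zeros((T + 1, N), dtype=int)

    for n in range(N):
        for t in range(1, T + 1):
            for d, logp_d in dur_logp[n].items():
                start = t - d
                if start < 0:
                    continue
                seg = cum[t, n] - cum[start, n]  # sum of log emissions inside the segment
                prev = 0.0 if (n == 0 and start == 0) else (delta[start, n - 1] if n > 0 else NEG)
                score = prev + logp_d + seg
                if score > delta[t, n]:
                    delta[t, n], back[t, n] = score, d

    assert np.isfinite(delta[T, N - 1]), 'no valid alignment found'

    # Backtrack: recover one (onset, offset) frame pair per state, ending at the final frame.
    boundaries, t = [], T
    for n in range(N - 1, -1, -1):
        d = back[t, n]
        boundaries.append((t - d, t))
        t -= d
    return boundaries[::-1]
```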

A.4 Phoneme and syllable onset detection results

We trained both the proposed and the baseline models 5 times with different random seeds. The mean and the standard deviation (std) are reported.

A.4.1 Proposed method

          | Phoneme (mean, std) | Syllable (mean, std)
Precision | 75.73, 0.60         | 76.05, 0.41
Recall    | 74.77, 0.60         | 75.59, 0.40
F1        | 75.25, 0.60         | 75.82, 0.40

A.4.2 Baseline method

          | Phoneme (mean, std) | Syllable (mean, std)
Precision | 42.92, 0.89         | 41.16, 1.02
Recall    | 46.18, 0.96         | 40.91, 1.02
F1        | 44.49, 0.92         | 41.04, 1.02

B. Code usage

B.1 First thing to do

  • Use Python 2.7.*; the code has not been tested on Python 3.
  • Install the dependencies in requirements.txt.

B.2 Download 3 jingju solo singing voice datasets

part 1
part 2
part 3
If you only want to reproduce the experimental results in the paper, you only need to download part 3; parts 1 and 2 are used for training the models.

B.3 Set the paths

Once the datasets are downloaded, you need to set the paths so that the program knows where they are.

B.3.1 Set the datasets path

What you need to set in ./general/filePathShared.py (see the example sketch after B.3.2):

  • Set path_jingju_dataset to the parent path of these three datasets.
  • Set primarySchool_dataset_root_path to the path of the interspeech2018 dataset (for reproducing the experiments).
  • Set nacta_dataset_root_path to the path of the jingju dataset part1 (for training the models).
  • Set nacta2017_dataset_root_path to the path of the jingju dataset part2 (for training the models).

B.3.2 Set the training data path

In both ./general/filePathHsmm.py and ./general/filePathJoint.py:

  • Set training_data_joint_path to where the training features and labels for the proposed joint model will be stored (for training the models).
  • Set training_data_hsmm_path to where the corresponding files for the baseline HSMM emission model will be stored (for training the models).
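For orientation, a hypothetical example of these settings is shown below; the directory names are made up, only the variable names come from the scripts.

```python
# Hypothetical path settings; adapt the right-hand sides to where you put the data.
# In ./general/filePathShared.py:
path_jingju_dataset = '/data/jingju'                              # parent folder of the three datasets
primarySchool_dataset_root_path = '/data/jingju/interspeech2018'  # part 3, for reproducing the experiments
nacta_dataset_root_path = '/data/jingju/part1'                    # part 1, for training the models
nacta2017_dataset_root_path = '/data/jingju/part2'                # part 2, for training the models

# In both ./general/filePathJoint.py and ./general/filePathHsmm.py:
training_data_joint_path = '/data/training/joint'  # features and labels for the proposed joint model
training_data_hsmm_path = '/data/training/hsmm'    # features and labels for the baseline HSMM emission model
```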

B.4 How to use pre-trained models to reproduce the results?

As you may see, there is a cnnModels folder in the repo, where all the pre-trained models are stored. To use these models, run the following scripts:

  • proposed_method_pipeline.py will calculate the syllable and phoneme onset results using the proposed method, then save them to ./eval/results/joint/.
  • baseline_forced_alignment.py will calculate those results using the baseline HSMM forced alignment, then save them to ./eval/results/hsmm/.

Each model has been trained five times; to get the mean and std statistics, you need to run eval_stats.py. The final results will be put in the ./eval/hsmm/ or ./eval/joint/ folders.

  • *phoneme_onset_all.txt: phoneme onset detection results
  • *phoneme_segment_all.txt: phoneme segmentation results
  • *syllable_onset_all.txt: syllable onset detection results
  • *syllalbe_segment_all.txt: syllable segmentation results

There are two columns in each result file: the 1st column is the mean and the 2nd is the std. For the onset detection results, the 3rd row is the F1-measure computed without considering the label, with a tolerance of 0.025 s.
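If the result files are plain whitespace-separated text with one mean/std pair per row, as the description above suggests, a small helper like the following could read them; the file layout and location are assumptions.

```python
# Hedged sketch: read a result file assuming two whitespace-separated columns
# (mean, std) per row. Only the 3rd row of the onset files is documented above
# (label-agnostic F1 at 0.025 s tolerance); the other rows are not interpreted here.
import glob
import numpy as np

def read_result_file(path):
    values = np.loadtxt(path)          # shape: (n_rows, 2)
    return values[:, 0], values[:, 1]  # column 1: means, column 2: stds

# Example usage over the final joint-model onset results (path is an assumption).
for path in glob.glob('./eval/joint/*phoneme_onset_all.txt'):
    means, stds = read_result_file(path)
    print(path)
    print('F1 (no label, 0.025 s tolerance): %.2f +/- %.2f' % (means[2], stds[2]))
```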

B.5 How to get the features, labels and samples weights?

Make sure that you have downloaded all three datasets and set training_data_joint_path and training_data_hsmm_path. In the ./training_feature_collection folder, you can:

  • Run training_sample_collection_joint.py for the proposed method
  • Run training_sample_collection_hsmm.py for the baseline method

The training materials will be stored in the paths you have set.

B.6 How to train the models?

We have provided the training scripts; you can find them in the ./model_training/train_scripts folder. Before running them, you need to change the necessary paths (see B.3.2) so that they point to the training materials obtained in the previous step.

Questions?

Feel free to open an issue or email me: rong.gong<at>upf.edu

References

  • [1] S. Brognaux and T. Drugman, "HMM-Based Speech Segmentation: Improvements of Fully Automatic Approaches," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 5-15, Jan. 2016. doi: 10.1109/TASLP.2015.2456421
  • [2] Pakoci, E., Popović, B., Jakovljević, N., Pekar, D., & Yassa, F. (2016). A Phonetic Segmentation Procedure Based on Hidden Markov Models. In: Ronzhin, A., Potapova, R., Németh, G. (eds) Speech and Computer. SPECOM 2016. Lecture Notes in Computer Science, vol. 9811. Springer.
  • [3] S. Renals, N. Morgan, H. Bourlard, M. Cohen and H. Franco, "Connectionist probability estimators in HMM speech recognition," in IEEE Transactions on Speech and Audio Processing, vol. 2, no. 1, pp. 161-174, Jan. 1994. doi: 10.1109/89.260359
  • [4] Pons, J., Slizovskaia, O., Gong, R., Gómez, E., & Serra, X. (2017, August). Timbre analysis of music audio signals with convolutional neural networks. In Signal Processing Conference (EUSIPCO), 2017 25th European (pp. 2744-2748).
  • [5] Guédon, Y., 2007. Exploring the state sequence space for hidden Markov and semi-Markov chains. Computational Statistics & Data Analysis, 51(5), pp.2379-2409.