
swshon / Dialectid_e2e

End to End Dialect Identification using Convolutional Neural Network

Programming Languages

python

Projects that are alternatives to or similar to Dialectid_e2e

ttslearn
ttslearn: Library for the Japanese book "Text-to-speech with Python" (Pythonで学ぶ音声合成)
Stars: ✭ 158 (+295%)
Mutual labels:  speech, dnn
Ios 10 Sampler
Code examples for new APIs of iOS 10.
Stars: ✭ 3,341 (+8252.5%)
Mutual labels:  cnn, speech
Caffe Hrt
Heterogeneous Run Time version of Caffe. Adds heterogeneous computing capabilities to Caffe, using a heterogeneous computing infrastructure framework to speed up deep learning on Arm-based heterogeneous embedded platforms. It also retains all the features of the original Caffe architecture, so users can deploy their applications seamlessly.
Stars: ✭ 271 (+577.5%)
Mutual labels:  cnn, dnn
Lq Nets
LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks
Stars: ✭ 195 (+387.5%)
Mutual labels:  cnn, dnn
Rmdl
RMDL: Random Multimodel Deep Learning for Classification
Stars: ✭ 375 (+837.5%)
Mutual labels:  cnn, dnn
Speech Enhancement
Deep learning for audio denoising
Stars: ✭ 207 (+417.5%)
Mutual labels:  cnn, speech
Caffe Mobile
Optimized (for size and speed) Caffe lib for iOS and Android with out-of-the-box demo APP.
Stars: ✭ 316 (+690%)
Mutual labels:  cnn, dnn
3d Densenet
3D Dense Connected Convolutional Network (3D-DenseNet for action recognition)
Stars: ✭ 118 (+195%)
Mutual labels:  cnn, recognition
Cnn handwritten chinese recognition
Online handwritten Chinese character recognition with a CNN.
Stars: ✭ 365 (+812.5%)
Mutual labels:  cnn, recognition
Numpy neural network
A neural network implemented from scratch using only NumPy, including a derivation of the backpropagation formulas; fully connected, convolutional, pooling, and flatten layers built with NumPy; plus image classification and network fine-tuning examples. Continuously updated.
Stars: ✭ 339 (+747.5%)
Mutual labels:  cnn, dnn
Depression Detect
Predicting depression from acoustic features of speech using a Convolutional Neural Network.
Stars: ✭ 187 (+367.5%)
Mutual labels:  cnn, speech
Java Speech Api
The J.A.R.V.I.S. Speech API is designed to be simple and efficient, using the speech engines created by Google to provide functionality for parts of the API. Essentially, it is an API written in Java, including a recognizer, synthesizer, and a microphone capture utility. The project uses Google services for the synthesizer and recognizer. While this requires an Internet connection, it provides a complete, modern, and fully functional speech API in Java.
Stars: ✭ 490 (+1125%)
Mutual labels:  speech, recognition
Keraspp
Coding Chef's "3-Minute Deep Learning, Keras Flavor" (a Korean book on Keras)
Stars: ✭ 178 (+345%)
Mutual labels:  cnn, dnn
fade
A Simulation Framework for Auditory Discrimination Experiments
Stars: ✭ 12 (-70%)
Mutual labels:  recognition, speech
Awesome Speech Recognition Speech Synthesis Papers
Automatic Speech Recognition (ASR), Speaker Verification, Speech Synthesis, Text-to-Speech (TTS), Language Modelling, Singing Voice Synthesis (SVS), Voice Conversion (VC)
Stars: ✭ 2,085 (+5112.5%)
Mutual labels:  cnn, dnn
Android Speech
Android speech recognition and text to speech made easy
Stars: ✭ 310 (+675%)
Mutual labels:  speech, recognition
Speechtotext Websockets Javascript
SDK & Sample to do speech recognition using websockets in Javascript
Stars: ✭ 191 (+377.5%)
Mutual labels:  speech, recognition
Tf2
An Open Source Deep Learning Inference Engine Based on FPGA
Stars: ✭ 113 (+182.5%)
Mutual labels:  cnn, dnn
Php Opencv Examples
Tutorial for computer vision and machine learning in PHP 7/8 by opencv (installation + examples + documentation)
Stars: ✭ 333 (+732.5%)
Mutual labels:  recognition, dnn
Food Recipe Cnn
food image to recipe with deep convolutional neural networks.
Stars: ✭ 448 (+1020%)
Mutual labels:  cnn, recognition

End-to-end Dialect Identification (implementation on MGB-3 Arabic dialect dataset)

TensorFlow implementation of end-to-end dialect identification in Arabic. If you are familiar with language/speaker identification or verification, it can easily be adapted to other dialect, language, or even speaker identification/verification tasks.

Requirement

  • Python (tested on 2.7.6)
  • TensorFlow > v1.0
  • Python library sox (tested on 1.3.2)
  • Python library librosa (tested on 0.5.1)

Data list format

Each line of the data list consists of (location of wav file) and (label as a digit).

Example: "train.txt"

./data/wav/EGY/EGY000001.wav 0
./data/wav/EGY/EGY000002.wav 0
./data/wav/NOR/NOR000001.wav 4

Labels of Dialect:

  • Egyptian (EGY): 0
  • Gulf (GLF): 1
  • Levantine (LAV): 2
  • Modern Standard Arabic (MSA): 3
  • North African (NOR): 4
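A minimal sketch of a parser for this data list format, using the label mapping above (the helper names here are hypothetical; the repository's actual loader may differ):

```python
# Dialect labels as listed above.
DIALECTS = {0: "EGY", 1: "GLF", 2: "LAV", 3: "MSA", 4: "NOR"}

def read_datalist(path):
    """Read lines of '<wav path> <digit label>' into (path, int label) pairs."""
    entries = []
    with open(path) as f:
        for line in f:
            wav_path, label = line.split()
            entries.append((wav_path, int(label)))
    return entries
```

Each returned label can then be mapped to its dialect name via `DIALECTS`.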

Dataset Augmentation

Augmentation was done by two different methods. The first takes a random segment of the input utterance; the other perturbs the speech by modifying its speed and volume.
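The two augmentations can be sketched as below. These are illustrative NumPy helpers, not the repository's actual code; in practice speed perturbation is typically applied to the waveform with sox or librosa before feature extraction.

```python
import numpy as np

def random_segment(features, seg_len, rng=None):
    """Crop a random fixed-length segment from a (frames x dims) feature matrix.
    Utterances shorter than seg_len are returned unchanged."""
    rng = rng or np.random.default_rng()
    n_frames = features.shape[0]
    if n_frames <= seg_len:
        return features
    start = rng.integers(0, n_frames - seg_len + 1)
    return features[start:start + seg_len]

def perturb_volume(waveform, gain_range=(0.8, 1.2), rng=None):
    """Scale the waveform amplitude by a random gain (volume perturbation).
    The gain range is an illustrative choice."""
    rng = rng or np.random.default_rng()
    gain = rng.uniform(*gain_range)
    return waveform * gain
```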

Model definition

Simple description of the DNN model:

We used four 1-dimensional CNN (1d-CNN) layers (filter sizes 40x5, 500x7, 500x1, and 500x1, with strides 1-2-1-1 and 500, 500, 500, and 3000 filters respectively) and two FC layers (1500 and 600 units), connected by a global average pooling layer that averages the CNN outputs to produce a fixed output size of 3000x1.
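The frame counts through the four 1d-CNN layers can be traced as below (assuming valid convolutions; the repository's padding choices may differ). Global average pooling over the remaining frames then yields a fixed 3000-dim vector regardless of utterance length:

```python
def conv1d_out_len(n_frames, kernel, stride):
    # Output length of a valid (no-padding) 1d convolution.
    return (n_frames - kernel) // stride + 1

def trace_shapes(n_frames):
    """Return (frames, channels) after each 1d-CNN layer for an input of
    n_frames 40-dim feature vectors, per the architecture described above."""
    layers = [(5, 1, 500), (7, 2, 500), (1, 1, 500), (1, 1, 3000)]
    shapes = []
    for kernel, stride, channels in layers:
        n_frames = conv1d_out_len(n_frames, kernel, stride)
        shapes.append((n_frames, channels))
    return shapes
```

For example, a 1000-frame input ends with a (495, 3000) feature map, which global average pooling reduces to 3000x1 before the 1500- and 600-unit FC layers.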

End-to-end DID accuracy by epoch

End-to-end DID accuracy by epoch using augmented dataset

Performance comparison with and without Random Segmentation (RS)

Performance evaluation

Best performance is 73.39% accuracy. (Feb. 28, 2018)

For reference:

Conventional i-vector with SVM: 60.32%
Conventional i-vector with LDA and cosine distance: 62.60%
End-to-end model without dataset augmentation (MFCC): 65.55%
End-to-end model without dataset augmentation (FBANK): 64.81%
End-to-end model without dataset augmentation (Spectrogram): 57.57%

End-to-end model with volume perturbation (MFCC): 67.49%
End-to-end model with speed perturbation (MFCC): 70.51%

End-to-end model with speed and volume perturbation (MFCC): 70.91%
End-to-end model with speed and volume perturbation (FBANK): 71.92%
End-to-end model with speed and volume perturbation (Spectrogram): 68.83%

End-to-end model with speed and volume perturbation + random segmentation (MFCC): 71.05%
End-to-end model with speed and volume perturbation + random segmentation (FBANK): 73.39%
End-to-end model with speed and volume perturbation + random segmentation (Spectrogram): 70.17%

Offline test

Offline testing can be done with the offline_test.ipynb notebook using our pretrained model. Specify the wav file whose Arabic dialect you want to identify by modifying the FILENAME variable.

FILENAME = ['/data/test/NOR_00001.wav']

The result is shown as a bar plot of the likelihood over the 5 Arabic dialects, as below.

Image of offline result plot
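Turning raw model scores into the plotted likelihoods and a predicted dialect can be sketched as below (the scores are illustrative, not actual model output):

```python
import numpy as np

DIALECTS = ["EGY", "GLF", "LAV", "MSA", "NOR"]

def dialect_probs(logits):
    """Softmax the 5 dialect scores into probabilities and pick the
    most likely dialect."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs = exp / exp.sum()
    return DIALECTS[int(probs.argmax())], probs
```

The returned probability vector is what a bar plot over the 5 dialects would display.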

Relevant publication

[1] Suwon Shon, Ahmed Ali, James Glass,
Convolutional Neural Networks and Language Embeddings for End-to-End Dialect Recognition,
Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 98-104
https://arxiv.org/abs/1803.04567

Citing

@inproceedings{Shon2018,
  author={Suwon Shon and Ahmed Ali and James Glass},
  title={Convolutional Neural Network and Language Embeddings for End-to-End Dialect Recognition},
  year={2018},
  booktitle={Proc. Odyssey 2018 The Speaker and Language Recognition Workshop},
  pages={98--104},
  doi={10.21437/Odyssey.2018-14},
  url={http://dx.doi.org/10.21437/Odyssey.2018-14}
}