jcsilva / multilingual-g2p

Licence: other
Multilingual Grapheme to Phoneme

Programming Languages

shell

Projects that are alternatives of or similar to multilingual-g2p

voxpopuli
Python wrapper for Espeak and Mbrola, for simple local TTS
Stars: ✭ 21 (-47.5%)
Mutual labels:  espeak, phonemes
DeepPhonemizer
Grapheme to phoneme conversion with deep learning.
Stars: ✭ 152 (+280%)
Mutual labels:  phonemes, g2p
Deep-NLP-Resources
Curated list of all NLP Resources
Stars: ✭ 65 (+62.5%)
Mutual labels:  lexicon
myprosody
A Python library for measuring the acoustic features of speech (simultaneous speech, high entropy) compared to ones of native speech.
Stars: ✭ 162 (+305%)
Mutual labels:  phonemes
asr24
24-hour Automatic Speech Recognition
Stars: ✭ 27 (-32.5%)
Mutual labels:  g2p
gf-wordnet
A WordNet in GF
Stars: ✭ 15 (-62.5%)
Mutual labels:  lexicon
AffectiveTweets
A WEKA package for analyzing emotion and sentiment of tweets.
Stars: ✭ 74 (+85%)
Mutual labels:  lexicon
py-espeak-ng
Some simple wrappers around eSpeak NG intended to make using this excellent TTS for waveform and IPA generation as convenient as possible.
Stars: ✭ 27 (-32.5%)
Mutual labels:  espeak
JSpeak
A Text to Speech Reader Front-end that Reads from the Clipboard and with Exceptionable Features
Stars: ✭ 16 (-60%)
Mutual labels:  espeak
sam
Software Automatic Mouth - Tiny Speech Synthesizer
Stars: ✭ 316 (+690%)
Mutual labels:  phonemes
lexpy
Python package for lexicon; Trie and DAWG implementation.
Stars: ✭ 47 (+17.5%)
Mutual labels:  lexicon
wordhoard
This Python module can be used to obtain antonyms, synonyms, hypernyms, hyponyms, homophones and definitions.
Stars: ✭ 78 (+95%)
Mutual labels:  lexicon
OpenGNT
Open Greek New Testament Project; NA28 / NA27 Equivalent Text & Resources
Stars: ✭ 55 (+37.5%)
Mutual labels:  lexicon
myG2P
Myanmar (Burmese) Language Grapheme to Phoneme (myG2P) Conversion Dictionary for speech recognition (ASR) and speech synthesis (TTS).
Stars: ✭ 43 (+7.5%)
Mutual labels:  g2p
mlmorph
Malayalam Morphological Analyzer using Finite State Transducer
Stars: ✭ 40 (+0%)
Mutual labels:  lexicon
NRCLex
An affect generator based on TextBlob and the NRC affect lexicon. Note that lexicon license is for research purposes only.
Stars: ✭ 42 (+5%)
Mutual labels:  lexicon
Aeneas
aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
Stars: ✭ 1,942 (+4755%)
Mutual labels:  espeak
afinn
Sentiment Analysis in Javascript using the AFINN Lexicon
Stars: ✭ 26 (-35%)
Mutual labels:  lexicon
G2P
Grapheme To Phoneme
Stars: ✭ 59 (+47.5%)
Mutual labels:  g2p
memex-gate
General Architecture for Text Engineering
Stars: ✭ 47 (+17.5%)
Mutual labels:  lexicon

Multilingual Grapheme to Phoneme

Multilingual G2P based on espeak. Based on these ideas.
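
The g2p.sh script itself is not reproduced in this README. As a rough illustration of the underlying idea only (not the script's actual implementation), an espeak-based G2P can ask espeak for the phoneme transcription of each word and print it next to the word. The sketch assumes espeak's -q (no audio), -x (print phoneme mnemonics) and -v (voice) options; the function name g2p_words and the ESPEAK variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: NOT the actual g2p.sh from this repository.
# ESPEAK is a hypothetical override so another binary (e.g. espeak-ng) can be used.
ESPEAK="${ESPEAK:-espeak}"

# g2p_words LANG < wordlist
# Reads one word per line from stdin and prints "word<TAB>phonemes".
g2p_words() {
    lang="$1"
    while IFS= read -r word; do
        # Skip empty lines.
        [ -n "$word" ] || continue
        # -q: no audio, -x: write phoneme mnemonics to stdout, -v: voice.
        # stdin is redirected so espeak cannot consume the word list.
        phonemes=$("$ESPEAK" -q -x -v "$lang" "$word" </dev/null)
        # Drop any leading whitespace before the transcription.
        phonemes=$(printf '%s' "$phonemes" | sed 's/^[[:space:]]*//')
        printf '%s\t%s\n' "$word" "$phonemes"
    done
}
```

With espeak installed, `g2p_words pt < words.egs` would print one tab-separated entry per word. Words that start with a hyphen are not handled by this sketch.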

Languages

This G2P supports several languages. By default, it is configured for Brazilian Portuguese.

How to use

  • First, install espeak. On Ubuntu 14.04: sudo apt-get install espeak

  • Create a word list with one word per line. The file words.egs included in this repository is an example.

  • Execute g2p.sh:

./g2p.sh -w words.egs 

The lexicon is written to standard output (/dev/stdout).

  • You may choose a different language by setting the -l option. For example, the following command generates a French lexicon:
./g2p.sh -w words.egs -l fr

The following languages are valid:

af (Afrikaans), bs (Bosnian), ca (Catalan), cs (Czech),
da (Danish), de (German), el (Greek), en (Default English),
en-us (American English), en-sc (Scottish English),
en-n (Northern British English), en-rp (Received Pronunciation British English),
en-wm (West Midlands British English), eo (Esperanto), es (Spanish),
es-la (Spanish - Latin America), fi (Finnish), fr (French), hr (Croatian),
hu (Hungarian), it (Italian), kn (Kannada), ku (Kurdish), lv (Latvian),
nl (Dutch), pl (Polish), pt (Portuguese (Brazil)), pt-pt (Portuguese (European)),
ro (Romanian), sk (Slovak), sr (Serbian), sv (Swedish), sw (Swahili),
ta (Tamil), tr (Turkish), zh (Mandarin Chinese)
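
Because g2p.sh handles one language per run and writes to standard output, lexicons for several languages can be produced with a small loop. The following is a convenience sketch, not part of the repository: the function names is_valid_lang and make_lexicons, the G2P variable, and the lexicon-<lang>.txt output names are all hypothetical, and the language codes are copied from the list above.

```shell
#!/bin/sh
# Sketch only: helper names and output file names are hypothetical.
G2P="${G2P:-./g2p.sh}"

# Language codes copied from the list of valid languages in this README.
VALID_LANGS="af bs ca cs da de el en en-us en-sc en-n en-rp en-wm eo es es-la fi fr hr hu it kn ku lv nl pl pt pt-pt ro sk sr sv sw ta tr zh"

# is_valid_lang CODE: succeeds if CODE is one of VALID_LANGS.
is_valid_lang() {
    # Reject empty codes and codes containing whitespace.
    case "$1" in
        ''|*[[:space:]]*) return 1 ;;
    esac
    case " $VALID_LANGS " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# make_lexicons WORDLIST LANG...: writes lexicon-LANG.txt for each valid language.
make_lexicons() {
    wordlist="$1"
    shift
    for lang in "$@"; do
        if ! is_valid_lang "$lang"; then
            echo "skipping unknown language: $lang" >&2
            continue
        fi
        # g2p.sh writes the lexicon to stdout, so redirect it to a file.
        "$G2P" -w "$wordlist" -l "$lang" > "lexicon-$lang.txt"
    done
}
```

For example, `make_lexicons words.egs pt fr en-us` would create lexicon-pt.txt, lexicon-fr.txt and lexicon-en-us.txt in the current directory.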

Create a Brazilian Portuguese list of words

  1. Get the spelling dictionary. It is licensed under LGPL version 2.1.

  2. Extract the pt_BR.dic and pt_BR.aff files from the .oxt file downloaded in the previous step. This can be done with vim, since an .oxt file is a zip archive.

  3. Convert pt_BR.dic and pt_BR.aff to UTF-8:

iconv -f ISO8859-1 -t UTF-8 < pt_BR.dic > portuguese-brazilian-utf8.dic
iconv -f ISO8859-1 -t UTF-8 < pt_BR.aff > portuguese-brazilian-utf8.aff
  4. Change the first line of portuguese-brazilian-utf8.aff from SET ISO8859-1 to SET UTF-8.

  5. Install the unmunch tool:

sudo apt-get install hunspell-tools
  6. Generate a list of Brazilian Portuguese words:
unmunch portuguese-brazilian-utf8.dic portuguese-brazilian-utf8.aff > portuguese-brazilian-wordlist

The resulting portuguese-brazilian-wordlist contains more than 80 million words and is larger than 1 GB.
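
The conversion, header fix and unmunch steps above can be combined into one small script. This sketch assumes that the extracted pt_BR.dic and pt_BR.aff are in the current directory and that iconv, sed and unmunch are installed; the function names to_utf8, fix_aff_encoding and build_wordlist are hypothetical. It replaces the manual edit of the SET line with a sed substitution.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; file names follow the steps above.

# to_utf8 IN OUT: convert a Latin-1 (ISO8859-1) file to UTF-8.
to_utf8() {
    iconv -f ISO8859-1 -t UTF-8 < "$1" > "$2"
}

# fix_aff_encoding FILE: change "SET ISO8859-1" on the first line to "SET UTF-8".
fix_aff_encoding() {
    # Write to a temporary file because in-place editing (sed -i) is not portable.
    sed '1s/^SET ISO8859-1/SET UTF-8/' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# build_wordlist: convert both files, fix the .aff header, expand with unmunch.
build_wordlist() {
    to_utf8 pt_BR.dic portuguese-brazilian-utf8.dic &&
    to_utf8 pt_BR.aff portuguese-brazilian-utf8.aff &&
    fix_aff_encoding portuguese-brazilian-utf8.aff &&
    unmunch portuguese-brazilian-utf8.dic portuguese-brazilian-utf8.aff \
        > portuguese-brazilian-wordlist
}
```

Calling `build_wordlist` in the directory that holds the extracted files would produce portuguese-brazilian-wordlist. Given the size mentioned above, make sure enough disk space is available first.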
