abuccts / wikt2pron

License: BSD-2-Clause
A Python toolkit that converts pronunciations in the enwiktionary XML dump to cmudict format

Projects that are alternatives of or similar to wikt2pron

python-cmudict
A versioned python wrapper package for cmudict (https://github.com/cmusphinx/cmudict).
Stars: ✭ 41 (+57.69%)
Mutual labels:  cmudict
etymology-db
An open etymology dataset created using Wiktionary data. Contains 3.8M entries, 1.8M terms, 2900 languages, and 31 unique relationship types.
Stars: ✭ 20 (-23.08%)
Mutual labels:  wiktionary
wikdict-gen
Generation of bilingual dictionaries from Wiktionary/dbnary data for the WikDict project
Stars: ✭ 32 (+23.08%)
Mutual labels:  wiktionary
Spell4Wiki
Spell4Wiki is a mobile application to record and upload audio for Wiktionary words to Wikimedia Commons. It also acts as a Wiki-Dictionary.
Stars: ✭ 17 (-34.62%)
Mutual labels:  wiktionary
cmudict-tools
Tools for working with the CMU Pronunciation Dictionary
Stars: ✭ 29 (+11.54%)
Mutual labels:  cmudict
wiktionary-de-parser
Extract data from German Wiktionary XML files. Allows you to add your own extraction methods 🚀
Stars: ✭ 22 (-15.38%)
Mutual labels:  wiktionary
german-nouns
A list of ~100,000 German nouns and their grammatical properties compiled from WiktionaryDE as CSV file. Plus a module to look up the data and parse compound words.
Stars: ✭ 101 (+288.46%)
Mutual labels:  wiktionary
g2pK
g2pK: g2p module for Korean
Stars: ✭ 137 (+426.92%)
Mutual labels:  cmudict
quick-lookup
Simple GTK dictionary application powered by Wiktionary
Stars: ✭ 57 (+119.23%)
Mutual labels:  wiktionary

wikt2pron

A Wiktionary Pronunciation Collector

Join the chat at https://gitter.im/enwiktionary2cmudict/Lobby

Wikt2pron is a Python toolkit that converts pronunciations in the enwiktionary XML dump to cmudict format. It currently supports the IPA and X-SAMPA formats. The project was developed during GSoC 2017 with the CMU Sphinx community.

Collected pronunciation dictionaries and related example models can be downloaded from Dropbox.

Requirements

wikt2pron requires:

Installation

# download the latest version
$ git clone https://github.com/abuccts/wikt2pron.git
$ cd wikt2pron

# install and run test
$ python setup.py install
$ python setup.py -q test

# make documents
$ make -C docs html

Usage

Extract pronunciation from Wiktionary XML dump

First, create an instance of the Wiktionary class:

>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(XSAMPA=True)

Use the example XML dump in [pywiktionary/data]:

>>> dump_file = "pywiktionary/data/enwiktionary-test-pages-articles-multistream.xml"
>>> pron = wikt.extract_IPA(dump_file)

Here's the extracted result:

>>> from pprint import pprint
>>> pprint(pron)
[{'id': 16,
  'pronunciation': {'English': [{'IPA': '/ˈdɪkʃ(ə)n(ə)ɹɪ/',
                                 'X-SAMPA': '/"dIkS(@)n(@)r\\I/',
                                 'lang': 'en'},
                                {'IPA': '/ˈdɪkʃənɛɹi/',
                                 'X-SAMPA': '/"dIkS@nEr\\i/',
                                 'lang': 'en'}]},
  'title': 'dictionary'},
 {'id': 65195,
  'pronunciation': {'English': 'IPA not found.'},
  'title': 'battleship'},
 {'id': 39478,
  'pronunciation': {'English': [{'IPA': '/ˈmɜːdə(ɹ)/',
                                 'X-SAMPA': '/"m3:d@(r\\)/',
                                 'lang': 'en'},
                                {'IPA': '/ˈmɝ.dɚ/',
                                 'X-SAMPA': '/"m3`.d@`/',
                                 'lang': 'en'}]},
  'title': 'murder'},
 {'id': 80141,
  'pronunciation': {'English': [{'IPA': '/ˈdæzəl/',
                                 'X-SAMPA': '/"d{z@l/',
                                 'lang': 'en'}]},
  'title': 'dazzle'}]
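
In the result above, each page maps a language either to a list of pronunciation entries or to the string 'IPA not found.'. As a minimal sketch, assuming only that structure, the result could be flattened into a title-to-IPA mapping. The helper name `collect_ipa` is hypothetical and is not part of pywiktionary; the sample data is copied from the output above.

```python
# Sample data copied from the output above (shortened to two pages).
pron = [
    {"id": 65195,
     "pronunciation": {"English": "IPA not found."},
     "title": "battleship"},
    {"id": 80141,
     "pronunciation": {"English": [{"IPA": "/ˈdæzəl/",
                                    "X-SAMPA": '/"d{z@l/',
                                    "lang": "en"}]},
     "title": "dazzle"},
]


def collect_ipa(pages, language="English"):
    """Hypothetical helper: map each title to its list of IPA strings."""
    result = {}
    for page in pages:
        entries = page["pronunciation"].get(language)
        # Pages without a pronunciation hold a plain string, not a list,
        # so only list values are kept.
        if isinstance(entries, list):
            result[page["title"]] = [entry["IPA"] for entry in entries]
    return result


print(collect_ipa(pron))
```

Checking for a list rather than comparing against the exact 'IPA not found.' text keeps the sketch independent of the wording of that message.
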

Lookup pronunciation for a word

First, create an instance of the Wiktionary class:

>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(XSAMPA=True)

Look up a word using the lookup method:

>>> word = wikt.lookup("present")

The entry for the word "present" is at https://en.wiktionary.org/wiki/present, and here is the lookup result:

>>> from pprint import pprint
>>> pprint(word)
{'Catalan': 'IPA not found.',
 'Danish': [{'IPA': '/prɛsanɡ/', 'X-SAMPA': '/prEsang/', 'lang': 'da'},
            {'IPA': '[pʰʁ̥ɛˈsɑŋ]', 'X-SAMPA': '[p_hR_0E"sAN]', 'lang': 'da'}],
 'English': [{'IPA': '/ˈpɹɛzənt/', 'X-SAMPA': '/"pr\\Ez@nt/', 'lang': 'en'},
             {'IPA': '/pɹɪˈzɛnt/', 'X-SAMPA': '/pr\\I"zEnt/', 'lang': 'en'},
             {'IPA': '/pɹəˈzɛnt/', 'X-SAMPA': '/pr\\@"zEnt/', 'lang': 'en'}],
 'Ladin': 'IPA not found.',
 'Middle French': 'IPA not found.',
 'Old French': 'IPA not found.',
 'Swedish': [{'IPA': '/preˈsent/', 'X-SAMPA': '/pre"sent/', 'lang': 'sv'}]}

To look up a word in a specific language, specify the lang parameter:

>>> wikt = Wiktionary(lang="English", XSAMPA=True)
>>> word = wikt.lookup("read")
>>> pprint(word)
[{'IPA': '/ɹiːd/', 'X-SAMPA': '/r\\i:d/', 'lang': 'en'},
 {'IPA': '/ɹɛd/', 'X-SAMPA': '/r\\Ed/', 'lang': 'en'}]
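
The two lookup examples return different shapes: a dict keyed by language name when no lang is given, and a plain list when lang is specified. As a rough sketch, assuming only the shapes shown in the outputs above, a small wrapper could normalise both into a list of entries. The helper name `entries_for` is hypothetical and is not part of pywiktionary; the sample data is abbreviated from the outputs above.

```python
# Result shape without lang: a dict keyed by language name.
all_langs = {
    "Ladin": "IPA not found.",
    "Swedish": [{"IPA": "/preˈsent/", "X-SAMPA": '/pre"sent/', "lang": "sv"}],
}

# Result shape with lang="English": a plain list of entries.
one_lang = [
    {"IPA": "/ɹiːd/", "X-SAMPA": "/r\\i:d/", "lang": "en"},
    {"IPA": "/ɹɛd/", "X-SAMPA": "/r\\Ed/", "lang": "en"},
]


def entries_for(result, language):
    """Hypothetical helper: return the entry list for one language.

    Accepts either the per-language dict or the plain list. For a plain
    list the language argument is ignored, because the list was already
    filtered when the lookup was made.
    """
    if isinstance(result, dict):
        result = result.get(language, [])
    # A string such as 'IPA not found.' means there are no entries.
    return result if isinstance(result, list) else []


print(len(entries_for(all_langs, "Swedish")))
print(len(entries_for(one_lang, "English")))
```
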

IPA -> X-SAMPA conversion

>>> from pywiktionary import IPA
>>> IPA_text = "/t͡ʃeɪnd͡ʒ/" # en: [[change]]
>>> XSAMPA_text = IPA.IPA_to_XSAMPA(IPA_text)
>>> XSAMPA_text
"/t__SeInd__Z/"

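To illustrate the idea behind this conversion, here is a minimal sketch of a symbol-by-symbol substitution. It is not the implementation used by `IPA.IPA_to_XSAMPA`: the table `IPA_TO_XSAMPA` and the function `ipa_to_xsampa_sketch` are hypothetical names, and the table covers only a handful of symbol pairs that can be read off the example outputs in this README.

```python
# Illustrative subset of IPA -> X-SAMPA symbol pairs, read off the
# example outputs in this README. A real table is much larger.
IPA_TO_XSAMPA = {
    "ˈ": '"',    # primary stress
    "ː": ":",    # length mark
    "æ": "{",
    "ə": "@",
    "ɛ": "E",
    "ɪ": "I",
    "ɹ": "r\\",  # X-SAMPA r\ (one backslash in the actual string)
    "ʃ": "S",
}


def ipa_to_xsampa_sketch(text):
    """Replace each known IPA character; pass everything else through."""
    return "".join(IPA_TO_XSAMPA.get(ch, ch) for ch in text)


print(ipa_to_xsampa_sketch("/ˈdæzəl/"))  # prints /"d{z@l/
```

A per-character lookup like this cannot handle tie bars, combining diacritics, or other symbols that span several code points, so a full converter needs more than this table.
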
Citation

If you use wikt2pron in your research and want to cite it, please use the following BibTeX:

@misc{xiong2017wikt2pron,
  title={Wikt2pron: A Wiktionary Pronunciation Collector},
  author={Xiong, Yifan},
  howpublished={\url{https://github.com/abuccts/wikt2pron}},
  year={2017}
}