
CUNY-CL / wikipron

Licence: Apache-2.0 License
Massively multilingual pronunciation mining



WikiPron


WikiPron is a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary, as well as a database of pronunciation dictionaries mined using this tool.

If you use WikiPron in your research, please cite the following:

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman (2020). Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223-4228.

Command-line tool

Installation

WikiPron requires Python 3.6+. It is available from PyPI:

pip install wikipron

Usage

Quick Start

After installation, the terminal command wikipron will be available. As a basic example, the following command scrapes grapheme-to-phoneme (G2P) data for French:

wikipron fra

Specifying the Language

The language is indicated by a three-letter ISO 639-2 or ISO 639-3 language code, e.g., fra for French. To see which languages can be scraped, consult the complete list of languages on Wiktionary that have pronunciation entries.

Specifying the Dialect

One can optionally specify dialects to target using the --dialect flag. The dialect name appears alongside the transcription on Wiktionary; for example, in "(UK, US) IPA: /təˈmɑːtəʊ/" the dialects are UK and US. To select the union of several dialects, separate them with the pipe character '|', e.g., --dialect='General American | US'. Transcriptions that lack a dialect specification are selected regardless of the value of this flag.
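The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated semantics, not WikiPron's actual implementation; the function name matches_dialect and the comma-separated label format are assumptions made for the example.

```python
from typing import Optional


def matches_dialect(entry_dialects: Optional[str], wanted: str) -> bool:
    """Decide whether a transcription passes a --dialect style filter.

    Hypothetical helper, not part of WikiPron. `entry_dialects` is the
    comma-separated label shown next to a transcription (e.g. "UK, US"),
    or None when the transcription carries no dialect label. `wanted`
    uses the pipe-separated form accepted by --dialect.
    """
    if not entry_dialects:
        # Unlabeled transcriptions are selected regardless of the filter.
        return True
    wanted_set = {d.strip() for d in wanted.split("|")}
    entry_set = {d.strip() for d in entry_dialects.split(",")}
    # Union semantics: one requested dialect on the entry is enough.
    return bool(wanted_set & entry_set)


print(matches_dialect("UK, US", "General American | US"))  # True
print(matches_dialect("UK", "General American | US"))      # False
print(matches_dialect(None, "General American | US"))      # True
```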

Segmentation

By default, the segments library is used to split the transcription into whitespace-separated segments. The segmentation tends to keep IPA diacritics and modifiers attached to their "parent" symbol. For instance, [kʰæt] is rendered kʰ æ t. Segmentation can be disabled using the --no-segment flag.
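To make the idea of attaching diacritics to a "parent" symbol concrete, here is a rough standard-library sketch. It is not the segments library and does not reproduce its behavior in general (for example, it mishandles tie bars and stress marks); the function name naive_segment is an assumption for illustration. It simply glues Unicode combining marks and modifier letters onto the preceding character.

```python
import unicodedata


def naive_segment(transcription: str) -> str:
    """Space-separate a transcription, keeping diacritics on their base.

    Simplified illustration only: combining marks (category Mn) and
    modifier letters (category Lm) are appended to the previous segment.
    """
    segments = []
    for char in transcription:
        attaches = unicodedata.category(char) in ("Mn", "Lm")
        if attaches and segments:
            segments[-1] += char
        else:
            segments.append(char)
    return " ".join(segments)


# "k\u02b0\u00e6t" is kʰæt: k, MODIFIER LETTER SMALL H, ae, t.
print(naive_segment("k\u02b0\u00e6t"))  # kʰ æ t
```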

Parentheses

Some transcriptions contain parentheses to indicate alternative pronunciations. The parentheses (but not the content) are discarded in the scrape unless the --no-skip-parens flag is used.

Output

The scraped data is organized with each <word, pronunciation> pair on its own line, where the word and pronunciation are separated by a tab. The pronunciation is given in the International Phonetic Alphabet (IPA) and segmented by spaces. For modeling purposes, the segmentation keeps combining and modifier diacritics attached to their base symbol, e.g., we have kʰ æ t with the aspirated k instead of k ʰ æ t.

For illustration, here is a snippet of French data scraped by WikiPron:

accrémentitielle    a k ʁ e m ɑ̃ t i t j ɛ l
accrescent  a k ʁ ɛ s ɑ̃
accrétion   a k ʁ e s j ɔ̃
accrétions  a k ʁ e s j ɔ̃

By default, the scraped data appears in the terminal. To save the data in a TSV file, please redirect the standard output to a filename of your choice:

wikipron fra > fra.tsv
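A file saved this way can be loaded back with a few lines of Python. The sketch below assumes the tab-separated, space-segmented layout described above; the helper name read_pron_tsv is hypothetical, and an in-memory sample stands in for fra.tsv so the example runs without scraping. A word may have more than one pronunciation, so each word maps to a list.

```python
import io
from collections import defaultdict
from typing import Dict, List, TextIO


def read_pron_tsv(handle: TextIO) -> Dict[str, List[List[str]]]:
    """Map each word to a list of pronunciations (each a list of segments)."""
    lexicon = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # Skip blank lines.
        # Word and pronunciation are separated by a single tab.
        word, pron = line.split("\t", 1)
        # Segments within the pronunciation are separated by spaces.
        lexicon[word].append(pron.split())
    return dict(lexicon)


# In-memory stand-in for a scraped file; with real data one would use
# open("fra.tsv", encoding="utf-8") instead.
sample = io.StringIO(
    "accrescent\ta k ʁ ɛ s ɑ̃\n"
    "accrétion\ta k ʁ e s j ɔ̃\n"
)
lexicon = read_pron_tsv(sample)
print(lexicon["accrescent"])
```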

Advanced Options

The wikipron terminal command has an array of options to configure your scraping run. For a full list of the options, please run wikipron -h.

Python API

The underlying module can also be used from Python. A standard workflow looks like:

import wikipron

config = wikipron.Config(key="fra")  # French, with default options.
for word, pron in wikipron.scrape(config):
    ...
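As one possible way to fill in the loop body, the sketch below writes pairs in the same tab-separated layout that the command-line tool prints. The helper write_pairs and its limit parameter are assumptions for illustration, not part of the WikiPron API. It accepts any iterable of (word, pronunciation) pairs, so it is demonstrated here on a hard-coded pair rather than a live scrape.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_pairs(
    pairs: Iterable[Tuple[str, str]], handle: TextIO, limit: int = 0
) -> int:
    """Write (word, pronunciation) pairs as TSV; return the number written.

    Hypothetical helper. A positive `limit` stops early, which can be
    handy for a quick look at the data before a full scraping run.
    """
    written = 0
    for word, pron in pairs:
        handle.write(f"{word}\t{pron}\n")
        written += 1
        if limit and written >= limit:
            break
    return written


buffer = io.StringIO()
count = write_pairs([("accrescent", "a k ʁ ɛ s ɑ̃")], buffer)
print(count, repr(buffer.getvalue()))
```

With WikiPron installed, the same helper could be fed wikipron.scrape(config) in place of the hard-coded list, with an open file as the handle.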

Data

We also make available a database of over 3 million word/pronunciation pairs mined using WikiPron.

Models

We host grapheme-to-phoneme models and modeling software in a separate repository.

Development

Repository

The source code of WikiPron is hosted on GitHub at https://github.com/CUNY-CL/wikipron, where development also happens.

To get the latest changes not yet released through pip, or to work on the codebase yourself, you may obtain the latest source code through GitHub and git:

  1. Create a fork of the wikipron repo on your GitHub account.

  2. Locally, make sure you are in some sort of a virtual environment (venv, virtualenv, conda, etc).

  3. Download and install the library in the "editable" mode together with the core and dev dependencies within the virtual environment:

    git clone https://github.com/<your-github-username>/wikipron.git
    cd wikipron
    pip install -U pip setuptools
    pip install -r requirements.txt
    pip install --no-deps -e .

We keep track of notable changes in CHANGELOG.md.

Contribution

For questions, bug reports, and feature requests, please file an issue.

If you would like to contribute to the wikipron codebase, please see CONTRIBUTING.md.

License

WikiPron is released under an Apache 2.0 license. Please see LICENSE.txt for details.

Please note that Wiktionary data in the data/ directory has its own licensing terms.
