PyCantonese: Cantonese Linguistics and NLP in Python

.. start-raw-directive

.. raw:: html

    <img src="https://jacksonllee.com/logos/pycantonese-logo.png" width="250px">

.. end-raw-directive

Full Documentation: https://pycantonese.org

|

.. image:: https://badge.fury.io/py/pycantonese.svg
   :target: https://pypi.python.org/pypi/pycantonese
   :alt: PyPI version

.. image:: https://img.shields.io/pypi/pyversions/pycantonese.svg
   :target: https://pypi.python.org/pypi/pycantonese
   :alt: Supported Python versions

.. image:: https://circleci.com/gh/jacksonllee/pycantonese/tree/master.svg?style=svg
   :target: https://circleci.com/gh/jacksonllee/pycantonese/tree/master
   :alt: Build

|

.. start-sphinx-website-index-page

PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features (more to come!):

- Accessing and searching corpus data
- Parsing and conversion tools for Jyutping romanization
- Stop words
- Word segmentation
- Part-of-speech tagging

Quick Examples

With PyCantonese imported:

.. code-block:: python

    >>> import pycantonese

Word segmentation

.. code-block:: python

    >>> pycantonese.segment("廣東話好難學？")  # Is Cantonese difficult to learn?
    ['廣東話', '好', '難', '學', '？']
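
To give a feel for how dictionary-based segmentation can produce output like the above, here is a minimal sketch of greedy forward maximum matching. It is an illustration only and not necessarily how ``pycantonese.segment`` is implemented; the function name ``segment_longest_match`` and the two-word vocabulary are hypothetical.

```python
def segment_longest_match(text, vocabulary):
    """Greedy forward maximum matching over a set of known words.

    Illustrative sketch only; not PyCantonese's actual segmenter.
    """
    max_len = max((len(word) for word in vocabulary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest multi-character candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in vocabulary:
                words.append(candidate)
                i += length
                break
        else:
            # No known multi-character word starts here: emit one character.
            words.append(text[i])
            i += 1
    return words


toy_vocabulary = {"廣東話", "香港人"}  # hypothetical dictionary for this sketch
print(segment_longest_match("廣東話好難學？", toy_vocabulary))
# ['廣東話', '好', '難', '學', '？']
```

With a real dictionary the vocabulary would be far larger, and a practical segmenter would also need a strategy for ambiguous overlaps, which this sketch does not attempt.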

Conversion from Cantonese characters to Jyutping

.. code-block:: python

    >>> pycantonese.characters_to_jyutping('香港人講廣東話')  # Hongkongers speak Cantonese
    [("香港人", "hoeng1gong2jan4"), ("講", "gong2"), ("廣東話", "gwong2dung1waa2")]

Finding all verbs in the HKCanCor corpus

In this example, we use the regular expression '^V' to find all words whose part-of-speech tag begins with "V" in the original HKCanCor annotations:

.. code-block:: python

    >>> corpus = pycantonese.hkcancor()  # get HKCanCor
    >>> all_verbs = corpus.search(pos='^V')
    >>> len(all_verbs)  # number of all verbs
    29726
    >>> all_verbs[:10]  # print 10 results
    [Token(word='去', pos='V', jyutping='heoi3', mor=None, gra=None),
     Token(word='去', pos='V', jyutping='heoi3', mor=None, gra=None),
     Token(word='旅行', pos='VN', jyutping='leoi5hang4', mor=None, gra=None),
     Token(word='有冇', pos='V1', jyutping='jau5mou5', mor=None, gra=None),
     Token(word='要', pos='VU', jyutping='jiu3', mor=None, gra=None),
     Token(word='有得', pos='VU', jyutping='jau5dak1', mor=None, gra=None),
     Token(word='冇得', pos='VU', jyutping='mou5dak1', mor=None, gra=None),
     Token(word='去', pos='V', jyutping='heoi3', mor=None, gra=None),
     Token(word='係', pos='V', jyutping='hai6', mor=None, gra=None),
     Token(word='係', pos='V', jyutping='hai6', mor=None, gra=None)]
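
The idea behind a regular-expression search over part-of-speech tags can be sketched without the corpus itself. The snippet below is a simplified stand-in, not the library's ``search`` method: ``SimpleToken`` and ``search_by_pos`` are hypothetical names, the use of ``re.search`` is an assumption, and the tags on the two non-verb entries are made up for illustration.

```python
import re
from typing import List, NamedTuple


class SimpleToken(NamedTuple):
    """Hypothetical, pared-down token with only a word and a POS tag."""
    word: str
    pos: str


def search_by_pos(tokens: List[SimpleToken], pattern: str) -> List[SimpleToken]:
    """Keep the tokens whose POS tag matches the regular expression."""
    regex = re.compile(pattern)
    return [token for token in tokens if regex.search(token.pos)]


sample = [
    SimpleToken("去", "V"),
    SimpleToken("旅行", "VN"),
    SimpleToken("佢", "X1"),   # made-up non-verb tag for illustration
    SimpleToken("有冇", "V1"),
    SimpleToken("快", "X2"),   # made-up non-verb tag for illustration
]

verbs = search_by_pos(sample, "^V")
print([token.word for token in verbs])
# ['去', '旅行', '有冇']
```

The ``^`` anchor is what restricts matches to tags that start with "V"; without it, any tag containing a "V" anywhere would also be returned.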

Parsing Jyutping for the onset, nucleus, coda, and tone

.. code-block:: python

    >>> pycantonese.parse_jyutping('gwong2dung1waa2')  # 廣東話
    [Jyutping(onset='gw', nucleus='o', coda='ng', tone='2'),
     Jyutping(onset='d', nucleus='u', coda='ng', tone='1'),
     Jyutping(onset='w', nucleus='aa', coda='', tone='2')]
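
For readers curious about what such a decomposition involves, the following is a rough, self-contained sketch that splits a Jyutping string into onset, nucleus, coda, and tone with a single regular expression. It is not PyCantonese's implementation: ``Syllable``, ``SYLLABLE``, and ``split_jyutping`` are hypothetical names, and the sound inventory in the pattern is a simplified assumption that does no validation beyond the pattern itself.

```python
import re
from typing import List, NamedTuple


class Syllable(NamedTuple):
    """Hypothetical container mirroring the four fields shown above."""
    onset: str
    nucleus: str
    coda: str
    tone: str


# Simplified inventory (an assumption of this sketch). Longer alternatives
# come first so that "gw" and "aa" are not split into single letters.
SYLLABLE = re.compile(
    r"(?P<onset>gw|kw|ng|[bpmfdtnlgkhwzcsj])?"
    r"(?P<nucleus>aa|oe|eo|yu|[aeiou]|ng|m)"  # "ng" and "m" can be syllabic
    r"(?P<coda>ng|[ptkmniu])?"
    r"(?P<tone>[1-6])"
)


def split_jyutping(text: str) -> List[Syllable]:
    """Split a run of Jyutping syllables, each ending in a tone digit."""
    syllables = []
    pos = 0
    while pos < len(text):
        match = SYLLABLE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse Jyutping at position {pos}: {text!r}")
        syllables.append(
            Syllable(
                onset=match["onset"] or "",   # optional groups yield None
                nucleus=match["nucleus"],
                coda=match["coda"] or "",
                tone=match["tone"],
            )
        )
        pos = match.end()
    return syllables


for syllable in split_jyutping("gwong2dung1waa2"):
    print(syllable)
```

The tone digit at the end of every syllable is what makes the boundaries unambiguous here, so the sketch can consume the string left to right one syllable at a time.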

Download and Install

To download and install the most recent stable version::

    $ pip install --upgrade pycantonese

To test your installation in the Python interpreter:

.. code-block:: python

    >>> import pycantonese
    >>> pycantonese.__version__  # show version number

Links

- Source code: https://github.com/jacksonllee/pycantonese
- Bug tracker, feature requests: https://github.com/jacksonllee/pycantonese/issues
- Email: Please contact `Jackson Lee <https://jacksonllee.com>`_.
- Social media: `Facebook <https://www.facebook.com/pycantonese>`_ and `Twitter <https://twitter.com/pycantonese>`_

How to Cite

PyCantonese is authored and maintained by `Jackson L. Lee <https://jacksonllee.com>`_.

A talk introducing PyCantonese:

Lee, Jackson L. 2015. PyCantonese: Cantonese linguistic research in the age of big data. Talk at the Childhood Bilingualism Research Centre, Chinese University of Hong Kong. September 15, 2015. `Notes+slides <https://pycantonese.org/papers/Lee-pycantonese-2015.html>`_

License

MIT License. Please see LICENSE.txt in the GitHub source code for details.

The HKCanCor dataset included in PyCantonese is substantially modified from its source in terms of format. The original dataset has a CC BY license. Please see pycantonese/data/hkcancor/README.md in the GitHub source code for details.

The rime-cantonese data (release 2020.09.09) is incorporated into PyCantonese for word segmentation and characters-to-Jyutping conversion. This data has a CC BY 4.0 license. Please see pycantonese/data/rime_cantonese/README.md in the GitHub source code for details.

Logo

The PyCantonese logo is the Chinese character 粵 meaning Cantonese, with artistic design by albino.snowman (Instagram handle).

Acknowledgments

Wonderful resources with a permissive license that have been incorporated into PyCantonese:

- HKCanCor
- rime-cantonese

Individuals who have contributed feedback, bug reports, etc. (in alphabetical order of last names):

- @cathug
- Litong Chen
- Jenny Chim
- @g-traveller
- Rachel Han
- Ryan Lai
- Charles Lam
- Hill Ma
- @richielo
- @rylanchiu
- Stephan Stiller
- Tsz-Him Tsui
- Robin Yuen

.. end-sphinx-website-index-page

Changelog

Please see CHANGELOG.md.

Setting up a Development Environment

The latest code under development is available on GitHub at `jacksonllee/pycantonese <https://github.com/jacksonllee/pycantonese>`_. You need to have `Git LFS <https://git-lfs.github.com/>`_ installed on your system. To obtain this version for experimental features or for development:

.. code-block:: bash

    $ git clone https://github.com/jacksonllee/pycantonese.git
    $ cd pycantonese
    $ git lfs pull
    $ pip install -r dev-requirements.txt
    $ pip install -e .

To run tests and styling checks:

.. code-block:: bash

    $ pytest -vv --doctest-modules --cov=pycantonese pycantonese docs
    $ flake8 pycantonese
    $ black --check pycantonese

To build the documentation website files:

.. code-block:: bash

    $ python build_docs.py
