Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).

✭ 433

python nlp text-processing tokenizer nlp-library word-segmentation
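
To illustrate how word statistics can drive hashtag splitting, here is a minimal sketch of unigram-based word segmentation with dynamic programming. It is not Ekphrasis's own code or API: the `COUNTS` table, the unknown-word penalty, and the function names are hypothetical and chosen only for illustration.

```python
import math
from functools import lru_cache

# Toy unigram counts (hypothetical values, not taken from any real corpus).
COUNTS = {
    "make": 50, "america": 30, "great": 40, "again": 35,
    "a": 100, "me": 20, "gain": 5, "am": 10,
}
TOTAL = sum(COUNTS.values())


def word_logprob(word):
    """Log-probability of a word under the toy unigram model."""
    count = COUNTS.get(word)
    if count is None:
        # Assumed penalty: unknown words get less likely as they get longer.
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def segment(text, max_word_len=20):
    """Split text into the word sequence with the highest total log-probability."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Returns (score, words) for the best segmentation of text[start:].
        if start == len(text):
            return 0.0, ()
        candidates = []
        last = min(len(text), start + max_word_len)
        for end in range(start + 1, last + 1):
            word = text[start:end]
            rest_score, rest_words = best(end)
            candidates.append((word_logprob(word) + rest_score, (word,) + rest_words))
        return max(candidates)

    return list(best(0)[1])


print(segment("makeamericagreatagain"))
```

With real corpus counts in place of the toy table, the same procedure prefers segmentations made of frequent words over ones that contain rare or unknown fragments.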

Symspellpy

Python port of SymSpell

✭ 420

python fuzzy-search spellcheck levenshtein fuzzy-matching chinese-word-segmentation word-segmentation spell-check

Bert Multitask Learning

BERT for Multitask Learning

✭ 380

jupyter-notebook nlp transformer text-classification named-entity-recognition ner pretrained-models encoder-decoder word-segmentation

Vncorenlp

A Vietnamese natural language processing toolkit (NAACL 2018)

✭ 354

java python3 nlp natural-language-processing parsing named-entity-recognition ner pos-tagging word-segmentation

Nagisa

A Japanese tokenizer based on recurrent neural networks

✭ 260

python japanese nlp-library sequence-labeling pos-tagging word-segmentation

Jumanpp

Juman++ (a Morphological Analyzer Toolkit)

✭ 254

nlp japanese tokenizer pos-tagging part-of-speech-tagger morphological-analysis word-segmentation

hashformers

Hashformers is a framework for hashtag segmentation with transformers.

✭ 18

Jupyter Notebook python nlp natural-language-processing twitter deep-learning sentiment-analysis transformers twitter-sentiment-analysis word-segmentation sentiment-polarity sentiment-classification tweet-analysis hashtag-segmentor tweets-classification transformers-gpt2

cws-tensorflow

A Chinese word segmentation model based on TensorFlow

✭ 25

python nlp tensorflow word-segmentation

rakutenma-python

Rakuten MA (Python version)

✭ 15

python nlp chinese japanese-language word-segmentation pos-tagging part-of-speech-tagger

youtokentome-ruby

High performance unsupervised text tokenization for Ruby

✭ 17

ruby C++ python unsupervised-learning word-segmentation tokenization npl bpe byte-pair-encoding

UETsegmenter

A toolkit for Vietnamese word segmentation

✭ 60

java natural-language-processing vietnamese word-segmentation

SymSpellCppPy

Fast SymSpell written in C++ and exposed to Python via pybind11

✭ 28

C++ python CMake spellcheck fuzzy-search fuzzy-matching spelling spell-check word-segmentation spelling-correction spelling-corrector text-segmentation pybind11 compound-words symspell

customized-symspell

Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm

✭ 51

java levenshtein-distance word-segmentation spelling-correction damerau-levenshtein spellchecker symspell qwerty-based-char-distance weighted-damerau-levenshtein
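
The Symmetric Delete idea mentioned above can be sketched roughly as follows: index every dictionary word under all of its delete-only variants, then look up the delete-only variants of the input term. This is a simplified illustration, not the SymSpell implementation or the API of any port listed here; all function names are hypothetical, and the final edit-distance verification and frequency ranking that a real spell checker needs are omitted.

```python
from itertools import combinations


def deletes(word, max_distance):
    """All strings obtained by deleting up to max_distance characters from word."""
    results = {word}
    for d in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - d):
            results.add("".join(word[i] for i in keep))
    return results


def build_index(vocabulary, max_distance=1):
    """Map each delete variant to the dictionary words that produce it."""
    index = {}
    for word in vocabulary:
        for variant in deletes(word, max_distance):
            index.setdefault(variant, set()).add(word)
    return index


def candidates(term, index, max_distance=1):
    """Dictionary words that share a delete variant with term.

    These are candidates only; a real checker would still verify each one
    with an edit-distance function before suggesting it.
    """
    found = set()
    for variant in deletes(term, max_distance):
        found |= index.get(variant, set())
    return found


index = build_index(["hello", "help", "world"])
print(sorted(candidates("helo", index)))
```

Because only deletions are generated on both sides, the index stays far smaller than one that also enumerates insertions, substitutions, and transpositions.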

hanzi-tools

Converts from Chinese characters to pinyin, between simplified and traditional, and does word segmentation.

✭ 69

javascript pinyin traditional-chinese simplified-chinese chinese-characters word-segmentation

Pytorch-NLU

Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

✭ 151