Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and a Twitter corpus of 330 million English tweets).

Stars: ✭ 433 (+1010.26%)

Mutual labels: word-segmentation

group-transformer

Official code for Group-Transformer (Scale down Transformer by Grouping Features for a Lightweight Character-level Language Model, COLING-2020).

Stars: ✭ 21 (-46.15%)

Mutual labels: language-modeling

rakutenma-python

Rakuten MA (Python version)

Stars: ✭ 15 (-61.54%)

Mutual labels: word-segmentation

LNEx

📍 🏢 🏦 🏣 🏪 🏬 LNEx: Location Name Extractor

Stars: ✭ 21 (-46.15%)

Mutual labels: language-modeling

Lac

Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, and word importance

Stars: ✭ 2,792 (+7058.97%)

Mutual labels: word-segmentation

MSR2021-ProgramRepair

Code for our paper "Applying CodeBERT for Automated Program Repair of Java Simple Bugs", accepted to MSR 2021.

Stars: ✭ 26 (-33.33%)

Mutual labels: mining-software-repositories

spell

Spelling correction and string segmentation written in Go

Stars: ✭ 24 (-38.46%)

Mutual labels: word-segmentation

Toiro

A comparison tool of Japanese tokenizers

Stars: ✭ 95 (+143.59%)

Mutual labels: word-segmentation

IndRNN pytorch

Independently Recurrent Neural Networks (IndRNN) implemented in PyTorch.

Stars: ✭ 112 (+187.18%)

Mutual labels: language-modeling

Pythainlp

Thai Natural Language Processing in Python.

Stars: ✭ 582 (+1392.31%)

Mutual labels: word-segmentation

SynThai

Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning

Stars: ✭ 41 (+5.13%)

Mutual labels: word-segmentation

Bert Multitask Learning

BERT for Multitask Learning

Stars: ✭ 380 (+874.36%)

Mutual labels: word-segmentation

android-source-codes

⚙️ Code analysis of common Android projects and components.

Stars: ✭ 59 (+51.28%)

Mutual labels: source-code-analysis

hashformers

Hashformers is a framework for hashtag segmentation with transformers.

Stars: ✭ 18 (-53.85%)

Mutual labels: word-segmentation

Darts

Differentiable architecture search for convolutional and recurrent networks

Stars: ✭ 3,463 (+8779.49%)

Mutual labels: language-modeling

UETsegmenter

A toolkit for Vietnamese word segmentation

Stars: ✭ 60 (+53.85%)

Mutual labels: word-segmentation

WordSegmentationDP

Word Segmentation with Dynamic Programming

Stars: ✭ 18 (-53.85%)

Mutual labels: word-segmentation

Pytorch-NLU

Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

Stars: ✭ 151 (+287.18%)

Mutual labels: word-segmentation

theano-recurrence

Recurrent Neural Networks (RNN, GRU, LSTM) and their Bidirectional versions (BiRNN, BiGRU, BiLSTM) for word & character level language modelling in Theano

Stars: ✭ 40 (+2.56%)

Mutual labels: language-modeling

Monpa

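MONPA (罔拍) is a multi-task model that provides Traditional Chinese word segmentation, part-of-speech tagging, and named entity recognition.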

Stars: ✭ 203 (+420.51%)

Mutual labels: word-segmentation

spiral

A Python 3 module that provides functions for splitting identifiers found in source code files.

Stars: ✭ 37 (-5.13%)

Mutual labels: mining-software-repositories

tape-neurips2019

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. (DEPRECATED)

Stars: ✭ 117 (+200%)

Mutual labels: language-modeling

Pycantonese

Cantonese Linguistics and NLP in Python

Stars: ✭ 147 (+276.92%)

Mutual labels: word-segmentation

JPlag

Detecting Software Plagiarism and Collusion since 1996.

Stars: ✭ 674 (+1628.21%)

Mutual labels: source-code-analysis

Kiwi

Kiwi (an intelligent Korean morphological analyzer)

Stars: ✭ 107 (+174.36%)

Mutual labels: word-segmentation

rust-code-analysis

Library to analyze and collect metrics on source code

Stars: ✭ 171 (+338.46%)

Mutual labels: source-code-analysis

Cws

Source code for an ACL 2016 paper on Chinese word segmentation

Stars: ✭ 81 (+107.69%)

Mutual labels: word-segmentation

pytorch Joint-Word-Segmentation-and-POS-Tagging

Paper: A Simple and Effective Neural Model for Joint Word Segmentation and POS Tagging

Stars: ✭ 37 (-5.13%)

Mutual labels: word-segmentation

Youtokentome

Unsupervised text tokenizer focused on computational efficiency

Stars: ✭ 728 (+1766.67%)

Mutual labels: word-segmentation

SZZUnleashed

An implementation of the SZZ algorithm, i.e., an approach to identify bug-introducing commits.

Stars: ✭ 90 (+130.77%)

Mutual labels: mining-software-repositories

Sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Stars: ✭ 5,540 (+14105.13%)

Mutual labels: word-segmentation

skt

Sanskrit compound segmentation using seq2seq model

Stars: ✭ 21 (-46.15%)

Mutual labels: word-segmentation

Symspellpy

Python port of SymSpell

Stars: ✭ 420 (+976.92%)

Mutual labels: word-segmentation

get-source

Fetch source-mapped sources. Peek by file, line, column. Node & browsers. Sync & async.

Stars: ✭ 26 (-33.33%)

Mutual labels: source-code-analysis

Vncorenlp

A Vietnamese natural language processing toolkit (NAACL 2018)

Stars: ✭ 354 (+807.69%)

Mutual labels: word-segmentation

referit3d

Code accompanying our ECCV-2020 paper on 3D Neural Listeners.

Stars: ✭ 59 (+51.28%)

Mutual labels: language-modeling

Jumanpp

Juman++ (a Morphological Analyzer Toolkit)

Stars: ✭ 254 (+551.28%)

Mutual labels: word-segmentation

esapp

An unsupervised Chinese word segmentation tool.

Stars: ✭ 13 (-66.67%)

Mutual labels: word-segmentation

cws-tensorflow

A TensorFlow-based Chinese word segmentation model

Stars: ✭ 25 (-35.9%)

Mutual labels: word-segmentation

ckipnlp

CKIP CoreNLP Toolkits

Stars: ✭ 92 (+135.9%)

Mutual labels: word-segmentation

youtokentome-ruby

High performance unsupervised text tokenization for Ruby

Stars: ✭ 17 (-56.41%)

Mutual labels: word-segmentation

Babler

Data Collection System For NLP/Speech Recognition

Stars: ✭ 21 (-46.15%)

Mutual labels: language-modeling

SymSpellCppPy

Fast SymSpell written in C++ and exposed to Python via pybind11

Stars: ✭ 28 (-28.21%)

Mutual labels: word-segmentation

pytorch-translm

An implementation of a transformer-based language model for sentence rewriting tasks such as summarization, simplification, and grammatical error correction.

Stars: ✭ 22 (-43.59%)

Mutual labels: language-modeling

hanzi-tools

Converts from Chinese characters to pinyin, between simplified and traditional, and does word segmentation.

Stars: ✭ 69 (+76.92%)

Mutual labels: word-segmentation

mozolm

MozoLM: A language model (LM) serving library

Stars: ✭ 32 (-17.95%)

Mutual labels: language-modeling

dnn-lstm-word-segment

Chinese Word Segmentation Based on Deep Learning and an LSTM Neural Network