
Top 89 tokenizer open source projects

liblex
C library for Lexical Analysis
wink-tokenizer
Multilingual tokenizer that automatically tags each token with its type
jargon
Tokenizers and lemmatizers for Go
elasticsearch-plugins
Some native scoring script plugins for elasticsearch
neural tokenizer
Tokenize English sentences using neural networks.
rustfst
Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.
psr2r-sniffer
A PSR-2-R code sniffer and code-style auto-correction tool, including many useful additions
lex
Lex is an implementation of the lex tool in Ruby.
hunspell
High-Performance Stemmer, Tokenizer, and Spell Checker for R
tokenizer
A simple tokenizer in Ruby for NLP tasks.
gd-tokenizer
A small godot project with a tokenizer written in GDScript.
python-mecab
MeCab bindings for Python 3.5+, built without SWIG or pybind. (No longer maintained.)
xontrib-output-search
Get identifiers, paths, URLs and words from the previous command output and use them for the next command in xonsh shell.
snapdragon-lexer
Converts a string into an array of tokens, with useful methods for looking ahead and behind, capturing, matching, et cetera.
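The lookahead/lookbehind idea described for snapdragon-lexer can be sketched as a small token-stream wrapper; the names below (`TokenStream`, `peek`, `next`) are illustrative assumptions, not snapdragon-lexer's actual JavaScript API.

```python
# Minimal sketch of a token stream with lookahead (assumed API, for
# illustration only -- not snapdragon-lexer's real interface).
class TokenStream:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self, offset=0):
        """Look ahead without consuming a token."""
        i = self.pos + offset
        return self.tokens[i] if i < len(self.tokens) else None

    def next(self):
        """Consume and return the current token."""
        tok = self.peek()
        self.pos += 1
        return tok

stream = TokenStream(["a", "+", "b"])
stream.peek()   # → "a" (not consumed)
stream.next()   # → "a" (consumed)
stream.peek()   # → "+"
```

A parser built on such a stream can branch on `peek()` before committing to `next()`, which is what makes lookahead useful.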
chinese-tokenizer
Tokenizes Chinese texts into words.
suika
Suika 🍉 is a Japanese morphological analyzer written in pure Ruby
Text-Classification-LSTMs-PyTorch
This repository demonstrates a baseline LSTM-based model for text classification, implemented in PyTorch. To aid understanding of the model, it uses a Tweets dataset provided by Kaggle.
Tokenizer
A tokenizer for Icelandic text
lexertk
C++ Lexer Toolkit Library (LexerTk) https://www.partow.net/programming/lexertk/index.html
Roy VnTokenizer
Vietnamese tokenizer (Maximum Matching and CRF)
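Maximum matching, one of the two techniques named in the Roy VnTokenizer entry, greedily takes the longest dictionary word at each position. A minimal sketch over space-separated syllables, with a toy vocabulary that is purely an illustrative assumption:

```python
# Greedy longest-first (maximum matching) word segmentation sketch.
# The vocabulary and example are toy assumptions for illustration.
def max_match(syllables, vocab, max_words=3):
    """Segment a syllable list by always taking the longest vocab match."""
    tokens, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, shrinking toward one syllable.
        for j in range(min(len(syllables), i + max_words), i, -1):
            cand = " ".join(syllables[i:j])
            if cand in vocab or j == i + 1:  # fall back to a single syllable
                tokens.append(cand)
                i = j
                break
    return tokens

vocab = {"thành phố", "hà nội"}
max_match("thành phố hà nội".split(), vocab)
# → ["thành phố", "hà nội"]
```

Greedy matching is fast but can mis-segment ambiguous spans, which is why such tokenizers often pair it with a statistical model like a CRF.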
greeb
Greeb is a simple Unicode-aware regexp-based tokenizer.
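A Unicode-aware regexp-based tokenizer of the kind the greeb entry describes can be sketched in a few lines; the token classes and pattern here are assumptions for illustration, not greeb's actual rules.

```python
import re

# Sketch of a Unicode-aware regexp tokenizer: runs of word characters
# become one token, any other non-space character stands alone.
# (Illustrative pattern, not greeb's real implementation.)
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def tokenize(text):
    return TOKEN_RE.findall(text)

tokenize("Привет, world!")
# → ["Привет", ",", "world", "!"]
```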