
Top 89 tokenizer open source projects

Bitextor
Bitextor generates translation memories from multilingual websites.
Js Tokens
Tiny JavaScript tokenizer.
Query Translator
Query Translator is a search query translator with AST representation
Tokenizers
Fast, Consistent Tokenization of Natural Language Text
Udpipe
R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
Lex
Replaced by foonathan/lexy
Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
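A minimal sketch of driving this library from Python, assuming the pyonmttok binding of the OpenNMT Tokenizer and its rule-based "conservative" mode; exact option names can differ between releases:

    import pyonmttok  # Python binding of the OpenNMT Tokenizer (assumed installed)

    # Rule-based tokenization; joiner_annotate marks where tokens can later be re-joined.
    tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
    tokens, _ = tokenizer.tokenize("Hello, world!")
    print(tokens)                        # e.g. ['Hello', '￭,', 'world', '￭!']
    print(tokenizer.detokenize(tokens))  # reverses the tokenization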
Fugashi
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
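A minimal usage sketch for Fugashi, assuming a MeCab dictionary such as the unidic-lite package is installed; the Tagger object is called directly on a string:

    from fugashi import Tagger  # requires a dictionary, e.g. unidic-lite

    tagger = Tagger()
    for word in tagger("麩菓子は麩を主材料とした日本の菓子。"):
        # Each node exposes the surface form plus its morphological features.
        print(word.surface, word.feature)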
Syntok
Text tokenization and sentence segmentation (segtok v2)
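A rough sketch of typical Syntok use, assuming the segmenter module's process() API, which yields paragraphs of sentences of tokens:

    import syntok.segmenter as segmenter

    document = "Dr. Smith arrived at 9 a.m. He left early."
    for paragraph in segmenter.process(document):
        for sentence in paragraph:
            # Each token carries its text value and spacing/offset information.
            print([token.value for token in sentence])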
Japanesetokenizers
Aims to make JapaneseTokenizer as easy to use as possible.
Kadot
Kadot, the unsupervised natural language processing library.
Megamark
😻 Markdown with easy tokenization, a fast highlighter, and a lean HTML sanitizer
Somajo
A tokenizer and sentence splitter for German and English web and social media texts.
Djurl
Simple yet helpful library for writing Django URLs in an easy, short, and intuitive way.
Hippo
PHP standards checker.
Sentence Splitter
Text-to-sentence splitter using the heuristic algorithm by Philipp Koehn and Josh Schroeder.
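For the Python package of the same name, usage is roughly as follows (a sketch based on its documented SentenceSplitter class and language codes):

    from sentence_splitter import SentenceSplitter

    splitter = SentenceSplitter(language='en')
    sentences = splitter.split('This is a paragraph. It contains several sentences. "But why," you ask?')
    print(sentences)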
Cols Agent Tasks
Colin's ALM Corner Custom Build Tasks
Wirb
Ruby Object Inspection for IRB
String Calc
PHP calculator library for mathematical terms (expressions) passed as strings
Greynir
The greynir.is natural language processing website for Icelandic
Py Nltools
A collection of basic python modules for spoken natural language processing
Talismane
NLP framework: sentence detector, tokeniser, pos-tagger and dependency parser
Nlp Js Tools French
POS tagger, lemmatizer, and stemmer for the French language in JavaScript
Omnicat Bayes
Naive Bayes text classification implementation as an OmniCat classifier strategy. (#ruby #naivebayes)
Lfuzzer
Fuzzing Parsers with Tokens
Laravel Token
Laravel token management
Lisp Esque Language
💠The Lel programming language
Snl Compiler
SNL (Small Nested Language) compiler. Maven, JUnit; tokenizer, lexer, syntax parser. (Compiler principles: lexical analysis and syntax analysis.)
Natasha
Solves basic Russian NLP tasks, API for lower level Natasha projects
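A minimal sketch of Natasha's segmentation step, assuming its Doc/Segmenter API; later analysis stages follow the same pattern:

    from natasha import Segmenter, Doc

    segmenter = Segmenter()
    doc = Doc('Съешь же ещё этих мягких французских булок. Да выпей чаю.')
    doc.segment(segmenter)  # fills doc.tokens and doc.sents
    print([token.text for token in doc.tokens])
    print([sent.text for sent in doc.sents])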
Mustard
🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
Soynlp
A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.
Tokenizer
A small library for converting tokenized PHP source code into XML (and potentially other formats)
Open Korean Text
Open Korean Text Processor - An Open-source Korean Text Processor
Smoothnlp
An NLP toolset with a focus on explainable inference (focused on interpretable NLP techniques)
Ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets from Twitter).
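A rough sketch of a typical Ekphrasis pipeline, assuming the TextPreProcessor/SocialTokenizer classes and option names from the project's examples; treat the exact arguments as illustrative:

    from ekphrasis.classes.preprocessor import TextPreProcessor
    from ekphrasis.classes.tokenizer import SocialTokenizer

    text_processor = TextPreProcessor(
        normalize=['url', 'email', 'user', 'number'],    # replace these with placeholder tags
        segmenter="twitter",                             # word statistics used for hashtag segmentation
        corrector="twitter",                             # word statistics used for spell correction
        unpack_hashtags=True,
        tokenizer=SocialTokenizer(lowercase=True).tokenize,
    )
    print(text_processor.pre_process_doc("CANT WAIT for the new season #GameOfThrones :)))"))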
Moo
Optimised tokenizer/lexer generator! 🐄 Uses the sticky /y regex flag for performance. Moo.
Php Parser
🌿 NodeJS PHP Parser - extract AST or tokens (PHP5 and PHP7)
Jflex
The fast scanner generator for Java™ with full Unicode support
Friso
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular, it can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
Sentences
A multilingual command line sentence tokenizer in Golang
Sacremoses
Python port of Moses tokenizer, truecaser and normalizer
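A minimal sketch using Sacremoses' MosesTokenizer and MosesDetokenizer classes:

    from sacremoses import MosesTokenizer, MosesDetokenizer

    mt = MosesTokenizer(lang='en')
    tokens = mt.tokenize("Hello World, this ain't funny.")
    print(tokens)

    md = MosesDetokenizer(lang='en')
    print(md.detokenize(tokens))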
ArabicProcessingCog
A Python package that does stemming, tokenization, sentence breaking, segmentation, normalization, and POS tagging for the Arabic language.
Hebrew-Tokenizer
A very simple python tokenizer for Hebrew text.
cang-jie
Chinese tokenizer for tantivy, based on jieba-rs
PaddleTokenizer
A Chinese tokenizer based on deep neural networks, implemented with PaddlePaddle.
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
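A small sketch of simplemma's lemmatization calls; the language-selection keyword has varied across releases, so treat lang='en' as an assumption for recent versions:

    import simplemma

    # Lemmatize individual tokens for a given language code.
    print(simplemma.lemmatize("masks", lang="en"))   # expected: "mask"
    print(simplemma.lemmatize("Bäumen", lang="de"))  # expected: "Baum"

    # A basic built-in tokenizer is also provided.
    print(simplemma.simple_tokenizer("Hello world, how are you?"))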
tokenizer
Tokenize CSS according to the CSS Syntax specification