arbox / tokenizer

Licence: other
A simple tokenizer in Ruby for NLP tasks.

Programming Languages

ruby
36898 projects - #4 most used programming language

Projects that are alternatives of or similar to tokenizer

Query Translator
Query Translator is a search query translator with AST representation
Stars: ✭ 165 (+275%)
Mutual labels:  tokenizer
Tokenizer
A tokenizer for Icelandic text
Stars: ✭ 27 (-38.64%)
Mutual labels:  tokenizer
python-mecab
A repository to bind mecab for Python 3.5+. Not using swig nor pybind. (Not Maintained Now)
Stars: ✭ 27 (-38.64%)
Mutual labels:  tokenizer
Bitextor
Bitextor generates translation memories from multilingual websites.
Stars: ✭ 168 (+281.82%)
Mutual labels:  tokenizer
grasp
Essential NLP & ML, short & fast pure Python code
Stars: ✭ 58 (+31.82%)
Mutual labels:  tokenizer
suika
Suika 🍉 is a Japanese morphological analyzer written in pure Ruby
Stars: ✭ 31 (-29.55%)
Mutual labels:  tokenizer
Udpipe
R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
Stars: ✭ 160 (+263.64%)
Mutual labels:  tokenizer
lindera
A morphological analysis library.
Stars: ✭ 226 (+413.64%)
Mutual labels:  tokenizer
lexertk
C++ Lexer Toolkit Library (LexerTk) https://www.partow.net/programming/lexertk/index.html
Stars: ✭ 26 (-40.91%)
Mutual labels:  tokenizer
xontrib-output-search
Get identifiers, paths, URLs and words from the previous command output and use them for the next command in xonsh shell.
Stars: ✭ 26 (-40.91%)
Mutual labels:  tokenizer
greeb
Greeb is a simple Unicode-aware regexp-based tokenizer.
Stars: ✭ 16 (-63.64%)
Mutual labels:  tokenizer
Roy VnTokenizer
Vietnamese tokenizer (Maximum Matching and CRF)
Stars: ✭ 49 (+11.36%)
Mutual labels:  tokenizer
chinese-tokenizer
Tokenizes Chinese texts into words.
Stars: ✭ 72 (+63.64%)
Mutual labels:  tokenizer
Js Tokens
Tiny JavaScript tokenizer.
Stars: ✭ 166 (+277.27%)
Mutual labels:  tokenizer
gd-tokenizer
A small godot project with a tokenizer written in GDScript.
Stars: ✭ 34 (-22.73%)
Mutual labels:  tokenizer
Tokenizers
Fast, Consistent Tokenization of Natural Language Text
Stars: ✭ 161 (+265.91%)
Mutual labels:  tokenizer
Text-Classification-LSTMs-PyTorch
The aim of this repository is to show a baseline model for text classification by implementing a LSTM-based model coded in PyTorch. In order to provide a better understanding of the model, it will be used a Tweets dataset provided by Kaggle.
Stars: ✭ 45 (+2.27%)
Mutual labels:  tokenizer
alexa-ruby
Ruby toolkit for Amazon Alexa service
Stars: ✭ 17 (-61.36%)
Mutual labels:  rubynlp
SwiLex
A universal lexer library in Swift.
Stars: ✭ 29 (-34.09%)
Mutual labels:  tokenizer
snapdragon-lexer
Converts a string into an array of tokens, with useful methods for looking ahead and behind, capturing, matching, et cetera.
Stars: ✭ 19 (-56.82%)
Mutual labels:  tokenizer

Tokenizer

RubyGems | Homepage | Source Code | Bug Tracker

DESCRIPTION

A simple multilingual tokenizer – a linguistic tool intended to split written text into tokens for NLP tasks. This tool provides a CLI and a library for linguistic tokenization, an unavoidable preprocessing step in many HLT (Human Language Technology) pipelines that precedes syntactic, semantic, and other higher-level processing.

Tokenization involves Sentence Segmentation and Word Segmentation, together with boundary disambiguation for both tasks.

Use it for tokenization of German, English and Dutch texts.
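As a rough illustration of what word tokenization produces, the following standalone sketch splits on whitespace and then separates punctuation with a regexp. This is a simplified stand-in for illustration only, not the gem's actual implementation:

```ruby
# Naive word tokenizer: split on whitespace, then separate alphanumeric
# runs from punctuation characters within each chunk.
# A simplified sketch -- not the algorithm used by the tokenizer gem.
def naive_tokenize(text)
  text.split.flat_map do |chunk|
    chunk.scan(/[[:alnum:]]+|[[:punct:]]/)
  end
end

naive_tokenize('Ich gehe in die Schule!')
# => ["Ich", "gehe", "in", "die", "Schule", "!"]
```

The gem refines this basic idea with configurable PRE and POST character lists, as shown in the SYNOPSIS below.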

Implemented Algorithms

to be …

INSTALLATION

Tokenizer is provided as a .gem package. Simply install it via RubyGems.

To install tokenizer issue the following command:

$ gem install tokenizer

If you want a system-wide installation, run the command as root (possibly using sudo).

Alternatively use your Gemfile for dependency management.
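For Bundler-based projects, that amounts to a single line in the Gemfile (no version constraint is required; pin one if you need reproducible builds):

```ruby
# Gemfile
source 'https://rubygems.org'

gem 'tokenizer'
```

After adding the line, run `bundle install`.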

SYNOPSIS

You can use Tokenizer in two ways.

  • As a command line tool:

    $ echo 'Hi, ich gehe in die Schule!' | tokenize
  • As a library for embedded tokenization:

    > require 'tokenizer'
    > de_tokenizer = Tokenizer::WhitespaceTokenizer.new
    > de_tokenizer.tokenize('Ich gehe in die Schule!')
    > => ["Ich", "gehe", "in", "die", "Schule", "!"]
  • Customizable PRE and POST list

    > require 'tokenizer'
    > de_tokenizer = Tokenizer::WhitespaceTokenizer.new(:de, { post: Tokenizer::Tokenizer::POST + ['|'] })
    > de_tokenizer.tokenize('Ich gehe|in die Schule!')
    > => ["Ich", "gehe", "|in", "die", "Schule", "!"]

See documentation in the Tokenizer::WhitespaceTokenizer class for details on particular methods.
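To make the customizable POST list above concrete, here is a standalone sketch (assumed behavior for illustration, not the gem's source) of how a POST list splits trailing characters off a whitespace-delimited token:

```ruby
# Sketch of POST-list handling: characters in POST are peeled off the end
# of a token one at a time and emitted as separate tokens.
# Illustrative only -- not the gem's actual implementation.
POST = %w[! ? . , ; :].freeze

def split_post(token, post = POST)
  tokens = []
  while post.include?(token[-1])
    tokens.unshift(token[-1])
    token = token[0..-2]
  end
  tokens.unshift(token)
end

split_post('Schule!')
# => ["Schule", "!"]
```

Extending the list, as with `POST + ['|']` in the SYNOPSIS, simply makes more characters eligible for splitting.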

SUPPORT

If you have questions, bug reports, or any suggestions, please drop me an email :) Any help is deeply appreciated!

CHANGELOG

For details on future plans and work in progress, see CHANGELOG.rdoc.

CAUTION

This library is a work in progress! Although the interface is mostly complete, you might encounter features that are not yet implemented.

Please contact me with your suggestions, bug reports and feature requests.

LICENSE

Tokenizer is copyrighted software by Andrei Beliankou, 2011-

You may use, redistribute and change it under the terms provided in the LICENSE.rdoc file.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].