arbox / tokenizer

Licence: other
A simple tokenizer in Ruby for NLP tasks.

Programming Languages

ruby
36898 projects - #4 most used programming language

Projects that are alternatives of or similar to tokenizer

Query Translator
Query Translator is a search query translator with AST representation
Stars: ✭ 165 (+275%)
Mutual labels:  tokenizer
Tokenizer
A tokenizer for Icelandic text
Stars: ✭ 27 (-38.64%)
Mutual labels:  tokenizer
python-mecab
A repository to bind mecab for Python 3.5+. Not using swig nor pybind. (Not Maintained Now)
Stars: ✭ 27 (-38.64%)
Mutual labels:  tokenizer
Bitextor
Bitextor generates translation memories from multilingual websites.
Stars: ✭ 168 (+281.82%)
Mutual labels:  tokenizer
grasp
Essential NLP & ML, short & fast pure Python code
Stars: ✭ 58 (+31.82%)
Mutual labels:  tokenizer
suika
Suika 🍉 is a Japanese morphological analyzer written in pure Ruby
Stars: ✭ 31 (-29.55%)
Mutual labels:  tokenizer
Udpipe
R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
Stars: ✭ 160 (+263.64%)
Mutual labels:  tokenizer
lindera
A morphological analysis library.
Stars: ✭ 226 (+413.64%)
Mutual labels:  tokenizer
lexertk
C++ Lexer Toolkit Library (LexerTk) https://www.partow.net/programming/lexertk/index.html
Stars: ✭ 26 (-40.91%)
Mutual labels:  tokenizer
xontrib-output-search
Get identifiers, paths, URLs and words from the previous command output and use them for the next command in xonsh shell.
Stars: ✭ 26 (-40.91%)
Mutual labels:  tokenizer
greeb
Greeb is a simple Unicode-aware regexp-based tokenizer.
Stars: ✭ 16 (-63.64%)
Mutual labels:  tokenizer
Roy VnTokenizer
Vietnamese tokenizer (Maximum Matching and CRF)
Stars: ✭ 49 (+11.36%)
Mutual labels:  tokenizer
chinese-tokenizer
Tokenizes Chinese texts into words.
Stars: ✭ 72 (+63.64%)
Mutual labels:  tokenizer
Js Tokens
Tiny JavaScript tokenizer.
Stars: ✭ 166 (+277.27%)
Mutual labels:  tokenizer
gd-tokenizer
A small godot project with a tokenizer written in GDScript.
Stars: ✭ 34 (-22.73%)
Mutual labels:  tokenizer
Tokenizers
Fast, Consistent Tokenization of Natural Language Text
Stars: ✭ 161 (+265.91%)
Mutual labels:  tokenizer
Text-Classification-LSTMs-PyTorch
The aim of this repository is to show a baseline model for text classification by implementing a LSTM-based model coded in PyTorch. In order to provide a better understanding of the model, it will be used a Tweets dataset provided by Kaggle.
Stars: ✭ 45 (+2.27%)
Mutual labels:  tokenizer
alexa-ruby
Ruby toolkit for Amazon Alexa service
Stars: ✭ 17 (-61.36%)
Mutual labels:  rubynlp
SwiLex
A universal lexer library in Swift.
Stars: ✭ 29 (-34.09%)
Mutual labels:  tokenizer
snapdragon-lexer
Converts a string into an array of tokens, with useful methods for looking ahead and behind, capturing, matching, et cetera.
Stars: ✭ 19 (-56.82%)
Mutual labels:  tokenizer

Tokenizer

RubyGems | Homepage | Source Code | Bug Tracker

DESCRIPTION

A simple multilingual tokenizer – a linguistic tool intended to split written text into tokens for NLP tasks. This tool provides a CLI and a library for linguistic tokenization, an unavoidable preprocessing step in many HLT (Human Language Technology) pipelines that precedes syntactic, semantic, and other higher-level processing.

Tokenization involves Sentence Segmentation and Word Segmentation, together with boundary disambiguation for both tasks.

Use it for tokenization of German, English and Dutch texts.
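As a rough illustration of what word tokenization produces, the following standalone sketch splits on whitespace and then separates punctuation with a regexp. This is a simplified stand-in for illustration only, not the gem's actual implementation:

```ruby
# Naive word tokenizer: split on whitespace, then separate alphanumeric
# runs from punctuation characters within each chunk.
# A simplified sketch -- not the algorithm used by the tokenizer gem.
def naive_tokenize(text)
  text.split.flat_map do |chunk|
    chunk.scan(/[[:alnum:]]+|[[:punct:]]/)
  end
end

naive_tokenize('Ich gehe in die Schule!')
# => ["Ich", "gehe", "in", "die", "Schule", "!"]
```

The gem refines this basic idea with configurable PRE and POST character lists, as shown in the SYNOPSIS below.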

Implemented Algorithms

to be …

INSTALLATION

Tokenizer is provided as a .gem package. Simply install it via RubyGems.

To install tokenizer issue the following command:

$ gem install tokenizer

If you want a system-wide installation, run the command as root (possibly using sudo).

Alternatively use your Gemfile for dependency management.
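For Bundler-based projects, that amounts to a single line in the Gemfile (no version constraint is required; pin one if you need reproducible builds):

```ruby
# Gemfile
source 'https://rubygems.org'

gem 'tokenizer'
```

After adding the line, run `bundle install`.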

SYNOPSIS

You can use Tokenizer in two ways.

  • As a command line tool:

    $ echo 'Hi, ich gehe in die Schule!' | tokenize
  • As a library for embedded tokenization:

    > require 'tokenizer'
    > de_tokenizer = Tokenizer::WhitespaceTokenizer.new
    > de_tokenizer.tokenize('Ich gehe in die Schule!')
    > => ["Ich", "gehe", "in", "die", "Schule", "!"]
  • Customizable PRE and POST list

    > require 'tokenizer'
    > de_tokenizer = Tokenizer::WhitespaceTokenizer.new(:de, { post: Tokenizer::Tokenizer::POST + ['|'] })
    > de_tokenizer.tokenize('Ich gehe|in die Schule!')
    > => ["Ich", "gehe", "|in", "die", "Schule", "!"]

See documentation in the Tokenizer::WhitespaceTokenizer class for details on particular methods.
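To make the customizable POST list above concrete, here is a standalone sketch (assumed behavior for illustration, not the gem's source) of how a POST list splits trailing characters off a whitespace-delimited token:

```ruby
# Sketch of POST-list handling: characters in POST are peeled off the end
# of a token one at a time and emitted as separate tokens.
# Illustrative only -- not the gem's actual implementation.
POST = %w[! ? . , ; :].freeze

def split_post(token, post = POST)
  tokens = []
  while post.include?(token[-1])
    tokens.unshift(token[-1])
    token = token[0..-2]
  end
  tokens.unshift(token)
end

split_post('Schule!')
# => ["Schule", "!"]
```

Extending the list, as with `POST + ['|']` in the SYNOPSIS, simply makes more characters eligible for splitting.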

SUPPORT

If you have questions, bug reports, or any suggestions, please drop me an email :) Any help is deeply appreciated!

CHANGELOG

For details on future plans and work in progress, see CHANGELOG.rdoc.

CAUTION

This library is a work in progress! Although the interface is mostly complete, you might encounter features that are not yet implemented.

Please contact me with your suggestions, bug reports and feature requests.

LICENSE

Tokenizer is copyrighted software by Andrei Beliankou, 2011-

You may use, redistribute and change it under the terms provided in the LICENSE.rdoc file.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].