winkjs / wink-tokenizer

Licence: MIT license

Multilingual tokenizer that automatically tags each token with its type

wink-tokenizer

Tokenize sentences in Latin and Devanagari scripts using wink-tokenizer. Some of its top features are outlined below:

1. Support for English, French, German, Hindi, Sanskrit, Marathi and many more.
2. Intelligent tokenization of sentences containing words in more than one language.
3. Automatic detection and tagging of different types of tokens based on their features:
   - These include word, punctuation, email, mention, hashtag, emoticon, emoji and more.
   - User-definable token types.
4. High performance: tokenizes a typical English sentence at over 2.4 million tokens/second, and a complex tweet containing hashtags, emoticons, emojis, mentions and an e-mail address at over 1.5 million tokens/second (benchmarked on a 2.2 GHz Intel Core i7 machine with 16 GB RAM).
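To make the idea of feature-based tagging and user-definable token types more concrete, here is a minimal, self-contained sketch of a regex-driven tagger. It is a toy illustration under stated assumptions and is not wink-tokenizer's implementation or API: `createToyTokenizer`, `addRule` and every regex below are hypothetical, and the sketch covers only a handful of token types.

```javascript
// Toy illustration only: this is NOT wink-tokenizer's implementation or API.
// createToyTokenizer, addRule and the regexes below are hypothetical.
function createToyTokenizer() {
  // Ordered rules: the first regex that matches at the current position wins.
  const rules = [
    { tag: 'email', regex: /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/y },
    { tag: 'mention', regex: /@\w+/y },
    { tag: 'hashtag', regex: /#\w+/y },
    { tag: 'emoticon', regex: /[:;]-?[)(DP]/y },
    { tag: 'number', regex: /\p{Nd}+/uy },          // any script's decimal digits
    { tag: 'word', regex: /[\p{L}\p{M}]+/uy },      // letters plus combining marks
    { tag: 'punctuation', regex: /[.,;:!?\u0964]/y } // includes the Devanagari danda
  ];

  // Register a user-defined token type; it takes precedence over the built-ins.
  function addRule(regex, tag) {
    const flags = regex.flags.replace(/[gy]/g, '') + 'y';
    rules.unshift({ tag, regex: new RegExp(regex.source, flags) });
  }

  function tokenize(text) {
    const tokens = [];
    let pos = 0;
    while (pos < text.length) {
      let matched = null;
      for (const rule of rules) {
        rule.regex.lastIndex = pos; // sticky regexes match only at lastIndex
        const m = rule.regex.exec(text);
        if (m && m[0].length > 0) {
          matched = { value: m[0], tag: rule.tag };
          break;
        }
      }
      if (matched === null) {
        // Nothing matched: skip whitespace, tag any other character as unknown.
        const ch = String.fromCodePoint(text.codePointAt(pos));
        if (!/\s/.test(ch)) tokens.push({ value: ch, tag: 'unknown' });
        pos += ch.length;
      } else {
        tokens.push(matched);
        pos += matched.value.length;
      }
    }
    return tokens;
  }

  return { addRule, tokenize };
}

const toy = createToyTokenizer();
toy.addRule(/\d+(?:am|pm)/, 'time'); // a user-defined token type
console.log(toy.tokenize('tom at 3pm:) #fun'));
```

Because rules are tried in order, placing the user-defined rule first lets `3pm` be tagged as a single `time` token instead of being split into a number and a word. The real library may resolve such conflicts differently.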

Installation

Use npm to install:

npm install wink-tokenizer --save

Getting Started

// Load tokenizer.
var tokenizer = require( 'wink-tokenizer' );
// Create its instance.
var myTokenizer = tokenizer();

// Tokenize a tweet.
var s = '@superman: hit me up on my email [email protected], 2 of us plan party🎉 tom at 3pm:) #fun';
myTokenizer.tokenize( s );
// -> [ { value: '@superman', tag: 'mention' },
//      { value: ':', tag: 'punctuation' },
//      { value: 'hit', tag: 'word' },
//      { value: 'me', tag: 'word' },
//      { value: 'up', tag: 'word' },
//      { value: 'on', tag: 'word' },
//      { value: 'my', tag: 'word' },
//      { value: 'email', tag: 'word' },
//      { value: '[email protected]', tag: 'email' },
//      { value: ',', tag: 'punctuation' },
//      { value: '2', tag: 'number' },
//      { value: 'of', tag: 'word' },
//      { value: 'us', tag: 'word' },
//      { value: 'plan', tag: 'word' },
//      { value: 'party', tag: 'word' },
//      { value: '🎉', tag: 'emoji' },
//      { value: 'tom', tag: 'word' },
//      { value: 'at', tag: 'word' },
//      { value: '3pm', tag: 'time' },
//      { value: ':)', tag: 'emoticon' },
//      { value: '#fun', tag: 'hashtag' } ]

// Tokenize a French sentence.
s = 'Mieux vaut prévenir que guérir:-)';
myTokenizer.tokenize( s );
// -> [ { value: 'Mieux', tag: 'word' },
//      { value: 'vaut', tag: 'word' },
//      { value: 'prévenir', tag: 'word' },
//      { value: 'que', tag: 'word' },
//      { value: 'guérir', tag: 'word' },
//      { value: ':-)', tag: 'emoticon' } ]

// Tokenize a sentence containing Hindi and English.
s = 'द्रविड़ ने टेस्ट में ३६ शतक जमाए, उनमें 21 विदेशी playground पर हैं।';
myTokenizer.tokenize( s );
// -> [ { value: 'द्रविड़', tag: 'word' },
//      { value: 'ने', tag: 'word' },
//      { value: 'टेस्ट', tag: 'word' },
//      { value: 'में', tag: 'word' },
//      { value: '३६', tag: 'number' },
//      { value: 'शतक', tag: 'word' },
//      { value: 'जमाए', tag: 'word' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'उनमें', tag: 'word' },
//      { value: '21', tag: 'number' },
//      { value: 'विदेशी', tag: 'word' },
//      { value: 'playground', tag: 'word' },
//      { value: 'पर', tag: 'word' },
//      { value: 'हैं', tag: 'word' },
//      { value: '।', tag: 'punctuation' } ]
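
Since `tokenize()` returns a plain array of `{ value, tag }` objects, downstream filtering needs nothing beyond standard array methods. The sketch below works on a hard-coded copy of the French example's output so that it runs without the package installed; the helper names `valuesWithTag` and `countByTag` are hypothetical and not part of wink-tokenizer.

```javascript
// Sample tokens copied from the French example's output above.
const tokens = [
  { value: 'Mieux', tag: 'word' },
  { value: 'vaut', tag: 'word' },
  { value: 'prévenir', tag: 'word' },
  { value: 'que', tag: 'word' },
  { value: 'guérir', tag: 'word' },
  { value: ':-)', tag: 'emoticon' }
];

// Hypothetical helper: keep only the values of tokens carrying the given tag.
function valuesWithTag(tokenList, tag) {
  return tokenList.filter((t) => t.tag === tag).map((t) => t.value);
}

// Hypothetical helper: count how many tokens carry each tag.
function countByTag(tokenList) {
  const counts = {};
  for (const { tag } of tokenList) {
    counts[tag] = (counts[tag] || 0) + 1;
  }
  return counts;
}

console.log(valuesWithTag(tokens, 'word'));
// -> [ 'Mieux', 'vaut', 'prévenir', 'que', 'guérir' ]
console.log(countByTag(tokens));
// -> { word: 5, emoticon: 1 }
```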

Documentation

Check out the tokenizer API documentation to learn more.

Need Help?

If you spot a bug that has not yet been reported, raise a new issue, or consider fixing it and sending a pull request.

About wink

Wink is a family of open source packages for Statistical Analysis, Natural Language Processing and Machine Learning in Node.js. The code is thoroughly documented for easy human comprehension and has a test coverage of ~100%, making it reliable enough to build production-grade solutions.

Copyright & License

wink-tokenizer is licensed under the terms of the MIT License.
