
winkjs / wink-tokenizer

Licence: MIT license
Multilingual tokenizer that automatically tags each token with its type


wink-tokenizer



Tokenize sentences in Latin and Devanagari scripts using wink-tokenizer. Some of its top features are outlined below:

  1. Support for English, French, German, Hindi, Sanskrit, Marathi and many more languages.

  2. Intelligent tokenization of sentences containing words in more than one language.

  3. Automatic detection and tagging of different types of tokens based on their features:

    • Built-in types include word, punctuation, email, mention, hashtag, emoticon and emoji, among others.
    • User-definable token types.

  4. High performance: tokenizes a typical English sentence at a speed of over 2.4 million tokens/second, and a complex tweet containing hashtags, emoticons, emojis, mentions and an e-mail address at a speed of over 1.5 million tokens/second (benchmarked on a 2.2 GHz Intel Core i7 machine with 16 GB RAM).
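
The idea of tagging tokens by their features (item 3) can be pictured with a toy sketch. This is not wink-tokenizer's implementation or API: the `rules` table, the `tagToken` and `toyTokenize` helpers, and the simplified regular expressions are all assumptions made for illustration, and they cover far fewer cases than the library does.

```javascript
// Toy illustration only: NOT wink-tokenizer's actual rules or API.
// Each hypothetical rule pairs a tag name with a pattern that must match the whole token.
const rules = [
  { tag: 'email', pattern: /^[^\s@]+@[^\s@]+\.[^\s@]+$/ },
  { tag: 'mention', pattern: /^@\w+$/ },
  { tag: 'hashtag', pattern: /^#\w+$/ },
  { tag: 'number', pattern: /^\d+$/ },
  { tag: 'word', pattern: /^\p{L}+$/u }
];

// Return the tag of the first rule that matches, or 'unknown' when none does.
function tagToken(value) {
  const rule = rules.find((r) => r.pattern.test(value));
  return { value, tag: rule ? rule.tag : 'unknown' };
}

// Split on whitespace (a deliberate simplification) and tag each piece.
function toyTokenize(text) {
  return text.split(/\s+/).filter(Boolean).map(tagToken);
}

console.log(toyTokenize('@superman 2 tickets #fun'));
```

Rule order matters in this sketch: the first matching pattern wins, so more specific rules such as `email` are listed before more general ones such as `word`.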

Installation

Use npm to install:

npm install wink-tokenizer --save

Getting Started

// Load tokenizer.
var tokenizer = require( 'wink-tokenizer' );
// Create its instance.
var myTokenizer = tokenizer();

// Tokenize a tweet.
var s = '@superman: hit me up on my email [email protected], 2 of us plan party🎉 tom at 3pm:) #fun';
myTokenizer.tokenize( s );
// -> [ { value: '@superman', tag: 'mention' },
//      { value: ':', tag: 'punctuation' },
//      { value: 'hit', tag: 'word' },
//      { value: 'me', tag: 'word' },
//      { value: 'up', tag: 'word' },
//      { value: 'on', tag: 'word' },
//      { value: 'my', tag: 'word' },
//      { value: 'email', tag: 'word' },
//      { value: '[email protected]', tag: 'email' },
//      { value: ',', tag: 'punctuation' },
//      { value: '2', tag: 'number' },
//      { value: 'of', tag: 'word' },
//      { value: 'us', tag: 'word' },
//      { value: 'plan', tag: 'word' },
//      { value: 'party', tag: 'word' },
//      { value: '🎉', tag: 'emoji' },
//      { value: 'tom', tag: 'word' },
//      { value: 'at', tag: 'word' },
//      { value: '3pm', tag: 'time' },
//      { value: ':)', tag: 'emoticon' },
//      { value: '#fun', tag: 'hashtag' } ]

// Tokenize a French sentence.
s = 'Mieux vaut prévenir que guérir:-)';
myTokenizer.tokenize( s );
// -> [ { value: 'Mieux', tag: 'word' },
//      { value: 'vaut', tag: 'word' },
//      { value: 'prévenir', tag: 'word' },
//      { value: 'que', tag: 'word' },
//      { value: 'guérir', tag: 'word' },
//      { value: ':-)', tag: 'emoticon' } ]

// Tokenize a sentence containing Hindi and English.
s = 'द्रविड़ ने टेस्ट में ३६ शतक जमाए, उनमें 21 विदेशी playground पर हैं।';
myTokenizer.tokenize( s );
// -> [ { value: 'द्रविड़', tag: 'word' },
//      { value: 'ने', tag: 'word' },
//      { value: 'टेस्ट', tag: 'word' },
//      { value: 'में', tag: 'word' },
//      { value: '३६', tag: 'number' },
//      { value: 'शतक', tag: 'word' },
//      { value: 'जमाए', tag: 'word' },
//      { value: ',', tag: 'punctuation' },
//      { value: 'उनमें', tag: 'word' },
//      { value: '21', tag: 'number' },
//      { value: 'विदेशी', tag: 'word' },
//      { value: 'playground', tag: 'word' },
//      { value: 'पर', tag: 'word' },
//      { value: 'हैं', tag: 'word' },
//      { value: '।', tag: 'punctuation' } ]
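
Because tokenize() returns a plain array of { value, tag } objects, ordinary array methods are enough for post-processing. The sketch below works on a hand-copied subset of the tweet output from the first example rather than calling the library, and the helper names `valuesWithTag` and `countByTag` are hypothetical; they are not part of wink-tokenizer.

```javascript
// A subset of the tweet tokens from the first example, copied by hand.
const tokens = [
  { value: '@superman', tag: 'mention' },
  { value: ':', tag: 'punctuation' },
  { value: 'hit', tag: 'word' },
  { value: 'me', tag: 'word' },
  { value: 'up', tag: 'word' },
  { value: '🎉', tag: 'emoji' },
  { value: '3pm', tag: 'time' },
  { value: ':)', tag: 'emoticon' },
  { value: '#fun', tag: 'hashtag' }
];

// Hypothetical helper: collect the values of all tokens carrying a given tag.
function valuesWithTag(tokenList, tag) {
  return tokenList.filter((t) => t.tag === tag).map((t) => t.value);
}

// Hypothetical helper: count how many tokens carry each tag.
function countByTag(tokenList) {
  const counts = {};
  for (const t of tokenList) {
    counts[t.tag] = (counts[t.tag] || 0) + 1;
  }
  return counts;
}

console.log(valuesWithTag(tokens, 'word')); // -> [ 'hit', 'me', 'up' ]
console.log(countByTag(tokens).word);       // -> 3
```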

Documentation

Check out the tokenizer API documentation to learn more.

Need Help?

If you spot a bug that has not yet been reported, raise a new issue, or consider fixing it and sending a pull request.

About wink

Wink is a family of open source packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS. The code is thoroughly documented for easy human comprehension and has test coverage of ~100%, making it reliable enough for building production-grade solutions.

Copyright & License

wink-tokenizer is copyright 2017-21 GRAYPE Systems Private Limited.

It is licensed under the terms of the MIT License.
