Alternatives and detailed information of tkseem

LunaSec - Dependency Security Scanner that automatically notifies you about vulnerabilities like Log4Shell or node-ipc in your Pull Requests and Builds. Protect yourself in 30 seconds with the LunaTrace GitHub App: https://github.com/marketplace/lunatrace-by-lunasec/

Stars: ✭ 1,261 (+2702.22%)

Mutual labels: tokenization

uax29

A tokenizer based on Unicode text segmentation (UAX 29), for Go

Stars: ✭ 26 (-42.22%)

Mutual labels: tokenization

Vaaku2Vec

Language Modeling and Text Classification in Malayalam Language using ULMFiT

Stars: ✭ 68 (+51.11%)

Mutual labels: tokenization

spacy russian tokenizer

Custom Russian tokenizer for spaCy

Stars: ✭ 35 (-22.22%)

Mutual labels: tokenization

Spacy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Stars: ✭ 21,978 (+48740%)

Mutual labels: tokenization

vgs-collect-ios

VGS Collect iOS SDK

Stars: ✭ 17 (-62.22%)

Mutual labels: tokenization

simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Stars: ✭ 32 (-28.89%)

Mutual labels: tokenization


tkseem (تقسيم) is a tokenization library that encapsulates different approaches for tokenization and preprocessing of Arabic text.
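
The source does not say which preprocessing steps tkseem applies. As a rough illustration of what preprocessing Arabic text often involves, here is a minimal sketch that strips diacritics (tashkeel) and the tatweel elongation character using only the standard library. The function names `strip_diacritics` and `remove_tatweel` are hypothetical and are not part of tkseem.

```python
import unicodedata


def strip_diacritics(text):
    # Drop nonspacing combining marks (Unicode category "Mn"),
    # which include the Arabic short-vowel marks.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def remove_tatweel(text):
    # U+0640 (tatweel) only stretches a word visually; it carries no meaning.
    return text.replace("\u0640", "")


# "kataba" written with three fatha marks (U+064E), given as escapes
# so the combining marks are unambiguous.
sample = "\u0643\u064E\u062A\u064E\u0628\u064E"
cleaned = remove_tatweel(strip_diacritics(sample))
print(cleaned)
```

Normalizing text this way before training usually shrinks the vocabulary, because differently vocalized spellings of a word collapse into one form.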

Documentation

The full documentation is hosted on Read the Docs.

Installation

pip install tkseem

Usage

Tokenization

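import tkseem as tk

# Train a word-level tokenizer on a text file
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')

tokenizer.tokenize("السلام عليكم")  # split the text into tokens
tokenizer.encode("السلام عليكم")    # convert the text to token IDs
tokenizer.decode([536, 829])        # convert token IDs back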
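
To make the train / tokenize / encode / decode cycle concrete, here is a minimal sketch of a whitespace-based word tokenizer in plain Python. It is not tkseem's implementation: the class `ToyWordTokenizer` is hypothetical, it trains on a string rather than a file path, and the frequency-based ID assignment is an assumption made for illustration.

```python
from collections import Counter


class ToyWordTokenizer:
    """Hypothetical whitespace tokenizer; not tkseem's implementation."""

    def __init__(self, unk_token="<UNK>"):
        self.unk_token = unk_token
        self.vocab = [unk_token]
        self.token_to_id = {unk_token: 0}

    def train(self, text):
        # Rank words by frequency (ties broken alphabetically) and assign IDs.
        counts = Counter(text.split())
        ranked = sorted(counts, key=lambda w: (-counts[w], w))
        self.vocab = [self.unk_token] + ranked
        self.token_to_id = {tok: i for i, tok in enumerate(self.vocab)}

    def tokenize(self, text):
        # Replace out-of-vocabulary words with the unknown token.
        return [w if w in self.token_to_id else self.unk_token
                for w in text.split()]

    def encode(self, text):
        return [self.token_to_id[t] for t in self.tokenize(text)]

    def decode(self, ids):
        return [self.vocab[i] for i in ids]


toy = ToyWordTokenizer()
toy.train("السلام عليكم ورحمة الله السلام")
ids = toy.encode("السلام عليكم")
print(ids, toy.decode(ids))
```

Words that never appeared during training map to ID 0, the unknown token, so `encode` never fails on unseen input.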

Caching

with open('data/raw/train.txt', encoding='utf-8') as f:
    tokenizer.tokenize(f.read(), use_cache=True)
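
The source does not describe how `use_cache` works internally. The general idea behind caching in a tokenizer is memoization: a word that has already been segmented is looked up instead of being processed again. The sketch below shows that idea with `functools.lru_cache`; `tokenize_word` and `tokenize_cached` are hypothetical stand-ins, not tkseem functions.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive path actually runs


@lru_cache(maxsize=1024)
def tokenize_word(word):
    # Stand-in for an expensive per-word segmentation step.
    calls["count"] += 1
    return tuple(word)  # split into characters; tuples are immutable


def tokenize_cached(text):
    # Repeated words hit the cache instead of being segmented again.
    return [tokenize_word(w) for w in text.split()]


tokens = tokenize_cached("السلام عليكم السلام")
print(len(tokens), calls["count"])
```

On large corpora, where a small set of frequent words accounts for most of the text, this kind of lookup can save a substantial amount of repeated work.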

Save and Load

import tkseem as tk

tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')

# save the model
tokenizer.save_model('vocab.pl')

# load the model
tokenizer = tk.WordTokenizer()
tokenizer.load_model('vocab.pl')
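
The source does not state the on-disk format of the saved model. Assuming, for illustration only, that a model boils down to a vocabulary mapping serialized with `pickle`, a save and load round trip could be sketched as follows. The `vocab` contents and the temporary path are made up for the example.

```python
import os
import pickle
import tempfile

# Hypothetical vocabulary standing in for a trained model.
vocab = {"<UNK>": 0, "السلام": 1, "عليكم": 2}

path = os.path.join(tempfile.mkdtemp(), "vocab.pl")

# Save: serialize the vocabulary to disk.
with open(path, "wb") as f:
    pickle.dump(vocab, f)

# Load: restore it, for example in a new process.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == vocab)
```

Pickle files can execute arbitrary code when loaded, so only load model files that come from a trusted source.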

Model Agnostic

import time

import tkseem as tk

def calc_time(fun):
    # Time the construction and training of one tokenizer class
    start_time = time.time()
    fun().train('samples/data.txt')  # same sample file as in the examples above
    return time.time() - start_time

running_times = {}

# Every tokenizer exposes the same interface, so one helper covers them all
running_times['Word'] = calc_time(tk.WordTokenizer)
running_times['SP'] = calc_time(tk.SentencePieceTokenizer)
running_times['Random'] = calc_time(tk.RandomTokenizer)
running_times['Disjoint'] = calc_time(tk.DisjointLetterTokenizer)
running_times['Char'] = calc_time(tk.CharacterTokenizer)
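
The comparison above works because every tokenizer class is trained through the same method. The sketch below reproduces that pattern with two toy classes so it can run without tkseem or a data file. `ToyWordTokenizer` and `ToyCharTokenizer` are hypothetical, and their `train` methods take a string instead of a file path.

```python
import time


class ToyWordTokenizer:
    def train(self, text):
        # Vocabulary of distinct whitespace-separated words.
        self.vocab = sorted(set(text.split()))


class ToyCharTokenizer:
    def train(self, text):
        # Vocabulary of distinct non-space characters.
        self.vocab = sorted(set(text.replace(" ", "")))


def calc_time(tokenizer_cls, text):
    # Works for any class exposing train(text); no per-class special cases.
    tokenizer = tokenizer_cls()
    start = time.perf_counter()
    tokenizer.train(text)
    return time.perf_counter() - start


corpus = "السلام عليكم ورحمة الله وبركاته " * 1000
running_times = {
    "Word": calc_time(ToyWordTokenizer, corpus),
    "Char": calc_time(ToyCharTokenizer, corpus),
}

# Report the fastest tokenizer first.
for name, seconds in sorted(running_times.items(), key=lambda kv: kv[1]):
    print(f"{name:<6}{seconds:.6f} s")
```

The measured times depend on the machine and the corpus, so they are only meaningful relative to each other within one run.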

Notebooks

The following notebooks show how to use tkseem to train NLP models.

Name	Description
Demo	Explains the syntax of all tokenizers.
Sentiment Classification	Uses WordTokenizer to process sentences, then trains a classifier for sentiment classification.
Meter Classification	Uses CharacterTokenizer for meter classification with bidirectional GRUs.
Translation	Sequence-to-sequence model with attention.
Question Answering	Sequence-to-sequence model.

Citation

@misc{tkseem2020,
  author = {Zaid Alyafeai and Maged Saeed},
  title = {tkseem: A Tokenization Library for Arabic.},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ARBML/tkseem}}
}

Contribution

This is an open-source project, and we encourage contributions from the community.

License

MIT license.
