ARBML / tkseem

Licence: MIT license
Arabic Tokenization Library. It provides many tokenization algorithms.

Programming Languages

Jupyter Notebook
Python


tkseem (تقسيم) is a tokenization library that encapsulates different approaches for tokenization and preprocessing of Arabic text.

Documentation

Please visit readthedocs for the full documentation.

Installation

pip install tkseem

Usage

Tokenization

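import tkseem as tk

# train a word-level tokenizer on a text file
tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')

tokenizer.tokenize("السلام عليكم")  # split text into tokens
tokenizer.encode("السلام عليكم")    # map text to token ids
tokenizer.decode([536, 829])        # map token ids back to tokens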
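
To make the train / tokenize / encode / decode cycle concrete, here is a minimal sketch of a word-level tokenizer written from scratch. It is not tkseem's implementation: the class ToyWordTokenizer, its parameters, and the "<UNK>" token are hypothetical names chosen for illustration, and the toy trains on a string rather than a file path.

```python
from collections import Counter


class ToyWordTokenizer:
    """Illustrative whitespace tokenizer; NOT tkseem's implementation."""

    def __init__(self, vocab_size=10000, unk_token="<UNK>"):
        self.vocab_size = vocab_size
        self.unk_token = unk_token
        self.token_to_id = {}
        self.id_to_token = []

    def train(self, text):
        # Keep the most frequent words; id 0 is reserved for unknown words.
        counts = Counter(text.split())
        most_common = [w for w, _ in counts.most_common(self.vocab_size - 1)]
        self.id_to_token = [self.unk_token] + most_common
        self.token_to_id = {t: i for i, t in enumerate(self.id_to_token)}

    def tokenize(self, text):
        # Words outside the vocabulary are replaced by the unknown token.
        return [w if w in self.token_to_id else self.unk_token
                for w in text.split()]

    def encode(self, text):
        return [self.token_to_id[t] for t in self.tokenize(text)]

    def decode(self, ids):
        return [self.id_to_token[i] for i in ids]


toy = ToyWordTokenizer()
toy.train("السلام عليكم ورحمة الله السلام عليكم")
ids = toy.encode("السلام عليكم")
print(ids, toy.decode(ids))
```

Encoding followed by decoding returns the original tokens as long as every word was seen during training; unseen words collapse to the unknown token.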

Caching

tokenizer.tokenize(open('data/raw/train.txt').read(), use_cache=True)
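
This page does not describe how use_cache works internally. As a rough illustration of the general idea only, the sketch below memoizes a per-word segmentation step with functools.lru_cache, so repeated words in a large corpus are processed once. The functions split_word and tokenize_cached are hypothetical and are not part of tkseem.

```python
from functools import lru_cache


@lru_cache(maxsize=100_000)
def split_word(word):
    # Stand-in for an expensive per-word segmentation step (hypothetical):
    # here the word is simply cut into chunks of three characters.
    return tuple(word[i:i + 3] for i in range(0, len(word), 3))


def tokenize_cached(text):
    # Each distinct word is segmented once; repeats are served from the cache.
    tokens = []
    for word in text.split():
        tokens.extend(split_word(word))
    return tokens


tokens = tokenize_cached("السلام عليكم السلام عليكم")
info = split_word.cache_info()
print(info.hits, info.misses)
```

Natural-language text repeats a small set of frequent words very often, which is why a word-level cache tends to pay off on large training files.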

Save and Load

import tkseem as tk

tokenizer = tk.WordTokenizer()
tokenizer.train('samples/data.txt')

# save the model
tokenizer.save_model('vocab.pl')

# load the model
tokenizer = tk.WordTokenizer()
tokenizer.load_model('vocab.pl')
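
The on-disk format used by save_model is not documented on this page. As a sketch under that caveat, the round trip below persists a toy vocabulary dictionary with Python's pickle module; save_vocab, load_vocab, and the vocabulary contents are hypothetical and only illustrate the save-then-restore pattern.

```python
import os
import pickle
import tempfile


def save_vocab(vocab, path):
    # Serialize the vocabulary mapping to a binary file.
    with open(path, "wb") as f:
        pickle.dump(vocab, f)


def load_vocab(path):
    # Restore the mapping written by save_vocab.
    with open(path, "rb") as f:
        return pickle.load(f)


vocab = {"<UNK>": 0, "السلام": 1, "عليكم": 2}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.pl")
    save_vocab(vocab, path)
    restored = load_vocab(path)

print(restored == vocab)
```

Pickle files can execute arbitrary code when loaded, so only load model files that come from a source you trust.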

Model Agnostic

import time

import tkseem as tk

def calc_time(fun):
    # time how long it takes to instantiate a tokenizer and train it
    start_time = time.perf_counter()
    fun().train('samples/data.txt')  # same training file as in the examples above
    return time.perf_counter() - start_time

running_times = {}

running_times['Word'] = calc_time(tk.WordTokenizer)
running_times['SP'] = calc_time(tk.SentencePieceTokenizer)
running_times['Random'] = calc_time(tk.RandomTokenizer)
running_times['Disjoint'] = calc_time(tk.DisjointLetterTokenizer)
running_times['Char'] = calc_time(tk.CharacterTokenizer)
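
The timing loop above works because every tokenizer exposes the same methods. A minimal sketch of that idea, using toy classes instead of tkseem: the Tokenizer protocol, ToyWord, ToyChar, and benchmark are hypothetical names, and the toys train on a string rather than a file.

```python
import time
from typing import List, Protocol


class Tokenizer(Protocol):
    # Any object with these two methods can be benchmarked below.
    def train(self, text: str) -> None: ...
    def tokenize(self, text: str) -> List[str]: ...


class ToyWord:
    def train(self, text: str) -> None:
        self.vocab = set(text.split())

    def tokenize(self, text: str) -> List[str]:
        return text.split()


class ToyChar:
    def train(self, text: str) -> None:
        self.vocab = set(text)

    def tokenize(self, text: str) -> List[str]:
        # One token per non-whitespace character.
        return [c for c in text if not c.isspace()]


def benchmark(tokenizer: Tokenizer, corpus: str) -> float:
    # Time training plus one tokenization pass over the same corpus.
    start = time.perf_counter()
    tokenizer.train(corpus)
    tokenizer.tokenize(corpus)
    return time.perf_counter() - start


corpus = "السلام عليكم " * 1000
timings = {type(t).__name__: benchmark(t, corpus)
           for t in (ToyWord(), ToyChar())}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.6f}s")
```

Because benchmark depends only on the shared interface, swapping in a different tokenizer requires no change to the calling code.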

Notebooks

The following notebooks show how to use tkseem to train NLP models.

Name                      Description
Demo                      Explains the syntax of all tokenizers.
Sentiment Classification  Uses WordTokenizer to process sentences, then trains a sentiment classifier.
Meter Classification      Uses CharacterTokenizer for meter classification with bidirectional GRUs.
Translation               Sequence-to-sequence model with attention.
Question Answering        Sequence-to-sequence model.

Citation

@misc{tkseem2020,
  author = {Zaid Alyafeai and Maged Saeed},
  title = {tkseem: A Tokenization Library for Arabic.},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ARBML/tkseem}}
}

Contribution

This is an open source project where we encourage contributions from the community.

License

MIT license.
