ARBML / nmatheg

Licence: other
A simple strategy for training and finetuning NLP models for Arabic. Specify the parameters and just wait for the results. A simple design that makes use of the different tools in our NLP pipeline.

nmatheg

nmatheg (نماذج, Arabic for "models") is an easy strategy for training Arabic NLP models on Hugging Face datasets. Just specify the name of the dataset, the preprocessing, the tokenization and the training procedure in the config file to train an NLP model for that task.

Install

pip install nmatheg

Configuration

Set up a config file for the training strategy.

[dataset]
dataset_name = ajgt_twitter_ar

[preprocessing]
segment = False
remove_special_chars = False
remove_english = False
normalize = False
remove_diacritics = False
excluded_chars = []
remove_tatweel = False
remove_html_elements = False
remove_links = False 
remove_twitter_meta = False
remove_long_words = False
remove_repeated_chars = False

[tokenization]
tokenizer_name = WordTokenizer
vocab_size = 1000
max_tokens = 128

[model]
model_name = rnn

[log]
print_every = 10

[train]
save_dir = .
epochs = 10
batch_size = 256 
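Because the file uses standard INI syntax, its values can be sanity-checked with Python's built-in `configparser` before starting a run. The snippet below is only a sketch: it is not necessarily how nmatheg reads the file internally, and `CONFIG_TEXT` is a shortened, inlined copy of the example above so that the code runs on its own.

```python
import ast
import configparser

# Shortened copy of the example config, inlined to keep the sketch self-contained.
CONFIG_TEXT = """
[dataset]
dataset_name = ajgt_twitter_ar

[preprocessing]
remove_diacritics = False
excluded_chars = []

[tokenization]
tokenizer_name = WordTokenizer
vocab_size = 1000
max_tokens = 128

[train]
save_dir = .
epochs = 10
batch_size = 256
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)  # for a file on disk: parser.read("config.ini")

# configparser stores every value as a string, so convert explicitly.
dataset_name = parser["dataset"]["dataset_name"]
remove_diacritics = parser["preprocessing"].getboolean("remove_diacritics")
excluded_chars = ast.literal_eval(parser["preprocessing"]["excluded_chars"])
vocab_size = parser["tokenization"].getint("vocab_size")
batch_size = parser["train"].getint("batch_size")

print(dataset_name, remove_diacritics, excluded_chars, vocab_size, batch_size)
```

Checking types this way catches common mistakes, such as a misspelled boolean or a non-numeric batch size, before any training time is spent.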

Main Sections

  • dataset: describes the dataset and the task type. Currently we only support classification.
  • preprocessing: a set of cleaning functions, mainly from our library tnkeeh.
  • tokenization: describes the tokenizer used for encoding the dataset. It uses our library tkseem.
  • train: the training parameters, such as the number of epochs and the batch size.
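To give a feel for what the preprocessing and tokenization settings control, here is a minimal plain-Python sketch. It does not use tnkeeh or tkseem and does not reflect their actual APIs: the functions `clean`, `build_vocab` and `encode` are hypothetical names introduced for illustration, and the reserved `<pad>`/`<unk>` ids are an assumption. It mimics the `remove_diacritics` and `remove_tatweel` switches, then a word-level tokenizer bounded by `vocab_size` and `max_tokens`.

```python
import re
from collections import Counter

# Arabic diacritics (tashkeel) U+064B..U+0652, plus superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character

def clean(text, remove_diacritics=True, remove_tatweel=True):
    """Apply two of the cleaning switches from the [preprocessing] section."""
    if remove_diacritics:
        text = DIACRITICS.sub("", text)
    if remove_tatweel:
        text = text.replace(TATWEEL, "")
    return text

def build_vocab(texts, vocab_size):
    """Keep the most frequent words; ids 0 and 1 are reserved (assumption)."""
    counts = Counter(word for text in texts for word in text.split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common(vocab_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, max_tokens):
    """Map words to ids, then truncate or pad to exactly max_tokens."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in text.split()][:max_tokens]
    return ids + [vocab["<pad>"]] * (max_tokens - len(ids))

# Usage: clean first, then build the vocabulary and encode.
texts = [clean(t) for t in ["جمــيل جدا", "جميل"]]
vocab = build_vocab(texts, vocab_size=1000)
print(encode(texts[0], vocab, max_tokens=8))
```

Every encoded sequence has the same length, which is what allows examples to be stacked into fixed-size batches during training.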

Usage

Config Files

import nmatheg as nm
strategy = nm.TrainStrategy('config.ini')
strategy.start()

Benchmarking on multiple datasets and models

import nmatheg as nm
strategy = nm.TrainStrategy(
    datasets = 'arsentd_lev,caner,arcd', 
    models   = 'qarib/bert-base-qarib,aubmindlab/bert-base-arabertv01',
    mode = 'finetune',
    runs = 5,
    lr = 1e-4,
    epochs = 1,
    batch_size = 8,
    max_tokens = 128,
    max_train_samples = 1024
)
strategy.start()
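The README does not spell out how the comma-separated `datasets` and `models` strings are combined. Assuming every model is fine-tuned on every dataset and each pairing is repeated `runs` times, the amount of work scheduled by the call above can be estimated with a short sketch like this one (the `grid` variable is purely illustrative and not part of nmatheg):

```python
from itertools import product

datasets = "arsentd_lev,caner,arcd".split(",")
models = "qarib/bert-base-qarib,aubmindlab/bert-base-arabertv01".split(",")
runs = 5

# Assumption: full cross product of datasets and models, repeated `runs` times.
grid = list(product(datasets, models, range(1, runs + 1)))

print(len(grid))  # 3 datasets * 2 models * 5 runs = 30
for dataset, model, run in grid[:3]:
    print(dataset, model, run)
```

Under that assumption, the example amounts to 30 fine-tuning runs, which is why it keeps `epochs = 1` and caps `max_train_samples` at 1024.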

Datasets

We support Hugging Face datasets for Arabic. You can find the supported datasets here.

| Dataset | Description |
| --- | --- |
| ajgt_twitter_ar | The Arabic Jordanian General Tweets (AJGT) corpus consists of 1,800 tweets annotated as positive or negative, written in Modern Standard Arabic (MSA) or Jordanian dialect. |
| metrec | The dataset contains verses and their corresponding meter classes. Meter classes are represented as numbers from 0 to 13. The dataset can be useful for further research on Arabic poem meter classification. The train set contains 47,124 records and the test set contains 8,316 records. |
| labr | This dataset contains over 63,000 book reviews in Arabic. It is the largest sentiment analysis dataset for Arabic to date. The book reviews were harvested from the website Goodreads during March 2013. Each review comes with the Goodreads review id, the user id, the book id, the rating (1 to 5) and the text of the review. |
| ar_res_reviews | Dataset of 8,364 restaurant reviews in Arabic from qaym.com for sentiment analysis. |
| arsentd_lev | The Arabic Sentiment Twitter Dataset for Levantine dialect (ArSenTD-LEV) contains 4,000 tweets written in Arabic and equally retrieved from Jordan, Lebanon, Palestine and Syria. |
| oclar | The OCLAR researchers (Marwan et al., 2019) gathered Arabic customer reviews from the Zomato website across a wide range of domains, including restaurants, hotels, hospitals, local shops, etc. The corpus contains 3,916 reviews on a 5-point rating scale. For this research, the positive class covers ratings from 3 to 5 stars (3,465 reviews), and the negative class covers ratings of 1 and 2 (about 451 reviews). |
| emotone_ar | Dataset of 10,065 tweets in Arabic for emotion detection in Arabic text. |
| hard | This dataset contains 93,700 hotel reviews in Arabic. The reviews were collected from the Booking.com website during June/July 2016. They are expressed in Modern Standard Arabic as well as dialectal Arabic. |
| caner | The Classical Arabic Named Entity Recognition corpus is a new corpus of tagged data that can be useful for handling the issues in recognition of Arabic named entities. |
| arcd | The Arabic Reading Comprehension Dataset (ARCD) is composed of 1,395 questions posed by crowdworkers on Wikipedia articles. |
| mlqa | MLQA contains QA instances in 7 languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. |
| xnli | XNLI is a subset of a few thousand examples from MNLI that has been translated into 14 different languages (some low-ish resource). |
| tatoeba_mt | The Tatoeba Translation Challenge is a multilingual dataset of machine translation benchmarks derived from user-contributed translations collected by Tatoeba.org and provided as a parallel corpus from OPUS. |

Tasks

Currently we support text classification, named entity recognition, question answering, machine translation and natural language inference.

Demo

Check this Colab notebook for a quick demo.
