AliOsm / arabic-text-diacritization

Licence: MIT license
Benchmark Arabic text diacritization dataset

Arabic Text Diacritization

This repository contains the dataset, helpers, and systems comparison for our paper on Arabic Text Diacritization:

"Arabic Text Diacritization Using Deep Neural Networks", Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, and Mahmoud Al-Ayyoub, ICCAIS 2019.

Files

dataset

  • train.txt - Contains 50,000 lines of diacritized Arabic text that can be used as the training set
  • val.txt - Contains 2,500 lines of diacritized Arabic text that can be used as the validation set
  • test.txt - Contains 2,500 lines of diacritized Arabic text that can be used as the test set
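
Since every line in these files is fully diacritized, an undiacritized model input can be derived from each line by removing the diacritic marks. The following is a minimal sketch of that idea, not the repository's own remove_diacritics.py. It assumes the diacritics are the eight marks in the Unicode range U+064B to U+0652; the set stored in DIACRITICS_LIST.pickle may differ. The function names and the file path in the usage comment are hypothetical.

```python
# Assumption: the diacritics are the eight marks U+064B..U+0652
# (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun).
# The repository's DIACRITICS_LIST.pickle may define a different set.
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def strip_diacritics(text: str) -> str:
    """Return the text with every character in DIACRITICS removed."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def count_diacritics(text: str) -> int:
    """Count how many characters of the text are diacritic marks."""
    return sum(ch in DIACRITICS for ch in text)


def make_pairs(lines):
    """Yield (undiacritized_input, diacritized_target) for each non-empty line."""
    for line in lines:
        target = line.strip()
        if target:
            yield strip_diacritics(target), target


# Hypothetical usage, assuming the file is UTF-8 encoded:
# with open("dataset/train.txt", encoding="utf-8") as f:
#     pairs = list(make_pairs(f))
```

Keeping the diacritized line as the target and the stripped line as the input gives one supervised example per dataset line.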

helpers

  • constants
    • ARABIC_LETTERS_LIST.pickle - Contains list of Arabic letters
    • CLASSES_LIST.pickle - Contains list of all possible classes
    • DIACRITICS_LIST.pickle - Contains list of all diacritics
  • count_characters.py - Counts the number of Arabic letters and diacritics in a file
  • count_fathatan.py - Counts the number of fathatan occurrences before and after Alif in all files from a folder
  • diacritization_stat.py - Calculates DER and WER using the gold data and the predicted output
  • diacritics_rate_extractor.py - Keeps only the lines whose rate of diacritics to Arabic characters is p% or more, in all files from a folder
  • file_lookup.py - Searches for a line in all files from a folder
  • fix_fathatan.py - Changes after-Alif fathatan to before-Alif fathatan in a file
  • remove_diacritics.py - Removes diacritics from a file
  • transliteration.py - Converts a file from Arabic text to Buckwalter transliteration and vice-versa
  • pre_process_tashkeela_corpus.ipynb - Pre-process Tashkeela Corpus data
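
To illustrate the two metrics named above: the diacritic error rate (DER) is the share of characters whose diacritics were predicted incorrectly, and the word error rate (WER) is the share of words containing at least one such character. The sketch below is a simplified illustration and not the logic of diacritization_stat.py, which may count characters differently or offer additional settings. It assumes the gold and predicted texts have identical base characters, that words are separated by whitespace, and that the diacritics are the marks U+064B to U+0652. The function names are hypothetical.

```python
# Assumption: the diacritics are the eight marks U+064B..U+0652.
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def split_labels(word):
    """Split a word into (base_char, diacritics_string) pairs.

    A diacritic with no preceding base character is dropped.
    """
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        elif ch not in DIACRITICS:
            pairs.append((ch, ""))
    return pairs


def der_wer(gold: str, pred: str):
    """Return (DER, WER) as fractions in [0, 1] for two aligned texts."""
    gold_words, pred_words = gold.split(), pred.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted text differ in word count")
    chars = char_errors = word_errors = 0
    for g_word, p_word in zip(gold_words, pred_words):
        g_pairs, p_pairs = split_labels(g_word), split_labels(p_word)
        if [b for b, _ in g_pairs] != [b for b, _ in p_pairs]:
            raise ValueError("gold and predicted base characters differ")
        # Sort the marks so that shadda+fatha equals fatha+shadda.
        wrong = sum(sorted(g) != sorted(p)
                    for (_, g), (_, p) in zip(g_pairs, p_pairs))
        chars += len(g_pairs)
        char_errors += wrong
        word_errors += wrong > 0
    der = char_errors / chars if chars else 0.0
    wer = word_errors / len(gold_words) if gold_words else 0.0
    return der, wer


# Two words, eight base characters; only the final diacritic differs.
gold = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
pred = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064E"
print(der_wer(gold, pred))  # (0.125, 0.5)
```

This version counts every non-diacritic character, including letters that carry no diacritic in the gold text. Restricting the count to Arabic letters or ignoring the last character of each word would give different figures.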

existing_systems

  • ali-soft - Documents some bugs found in the Ali-Soft system
  • farasa - Contains Farasa system output, fixed output, and DER/WER statistics
  • harakat - Contains Harakat system testing script, output, fixed output, and DER/WER statistics
  • madamira - Contains MADAMIRA system output, fixed output, and DER/WER statistics
  • mishkal - Contains Mishkal system output, fixed output, and DER/WER statistics
  • shakkala - Contains Shakkala system data splitting script, output, fixed output, and DER/WER statistics
  • tashkeela_model - Contains Tashkeela-Model system output, fixed output, and DER/WER statistics for each n-gram model provided by them

Note: All code in this repository was tested on Ubuntu 18.04

Contributors

  1. Ali Hamdi Ali Fadel.
  2. Ibraheem Tuffaha.
  3. Bara' Al-Jawarneh.
  4. Mahmoud Al-Ayyoub.

License

The project is available as open source under the terms of the MIT License.
