MagedSaeed / farasapy

Licence: MIT license
A Python implementation of Farasa toolkit

Disclaimer

This is a Python API wrapper for the Farasa [http://qatsdemo.cloudapp.net/farasa/] toolkit. Although this wrapper is licensed under MIT, the original work (the toolkit) is permitted for research purposes only. For any commercial use, please contact the toolkit creators [http://qatsdemo.cloudapp.net/farasa/].

Introduction

Farasa is an Arabic NLP toolkit serving the following tasks:

  1. Segmentation.
  2. Stemming.
  3. Named Entity Recognition (NER).
  4. Part Of Speech tagging (POS tagging).
  5. Diacritization.

The toolkit is built and compiled in Java. Developers who want to use it without this library can call the binaries directly from their code.

Because Python is a general-purpose language that is widely used for NLP tasks, automating these calls to the toolkit from Python code is convenient. That is the role of this wrapper.

Installation

pip install farasapy

How to use

An interactive Google Colab notebook demonstrating the library is available here [https://colab.research.google.com/drive/1xjzYwmfAszNzfR6Z2lSQi3nKYcjarXAW?usp=sharing].

AN IMPORTANT REMARK

  • Because the library is a wrapper for Java jars, Java must be installed on your system and available in your PATH. Java versions below 1.7 are not recommended.

  • Some binaries are computationally HEAVY!
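
The Java requirement above can be checked before any object is created. The following is a minimal sketch, not farasapy's own system check: the function names `parse_java_version` and `java_is_usable` are hypothetical, and the parser assumes the usual `version "..."` banner printed by `java -version`.

```python
import re
import shutil
import subprocess


def parse_java_version(version_output):
    """Extract (major, minor) from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return major, minor


def java_is_usable(minimum=(1, 7)):
    """Return True if `java` is on PATH and reports at least `minimum`."""
    if shutil.which("java") is None:
        return False
    # `java -version` usually writes its banner to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    version = parse_java_version(result.stderr or result.stdout)
    return version is not None and version >= minimum
```

Calling `java_is_usable()` returns False when Java is missing from PATH or older than 1.7. Newer releases report versions such as "11.0.2", which compare as greater than (1, 7).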

An Overview

Farasapy wraps all of the toolkit's APIs in separate classes, each in its own file. Import the class you need from its file as follows:

from farasa.pos import FarasaPOSTagger 
from farasa.ner import FarasaNamedEntityRecognizer 
from farasa.diacratizer import FarasaDiacritizer 
from farasa.segmenter import FarasaSegmenter 
from farasa.stemmer import FarasaStemmer

If you are using the library for the first time, it needs to download the Farasa toolkit binaries first. This happens automatically: whenever you instantiate an object of any of its classes, the library checks for the binaries and downloads them if they do not exist. The following is an example of instantiating a FarasaStemmer object on first use of the library.

stemmer = FarasaStemmer()
perform system check...
check java version...
Your java version is 1.8 which is compatiple with Farasa
check toolkit binaries...
some binaries are not existed..
downloading zipped binaries...
100%|███████████████████████████████████████| 200M/200M [02:39<00:00, 1.26MiB/s]
extracting...
toolkit binaries are downloaded and extracted.
Dependencies seem to be satisfied..
task [STEM] is initialized in STANDALONE mode...

Let us stem the following example:

sample = '''
يُشار إلى أن اللغة العربية يتحدثها أكثر من 422 مليون نسمة ويتوزع متحدثوها
 في المنطقة المعروفة باسم الوطن العربي بالإضافة إلى العديد من المناطق ال
أخرى المجاورة مثل الأهواز وتركيا وتشاد والسنغال وإريتريا وغيرها.وهي اللغ
ة الرابعة من لغات منظمة الأمم المتحدة الرسمية الست. 
'''
stemmed_text = stemmer.stem(sample)                                     
print(stemmed_text)
'أشار إلى أن لغة عربي تحدث أكثر من 422 مليون نسمة توزع متحدثوها في منطقة معروف اسم وطن عربي إضافة إلى عديد من منطقة آخر مجاور مثل أهواز تركيا تشاد سنغال أريتريا غير . هي لغة رابع من لغة منظمة أمة متحد رسمي ست .'

You may notice that the last line of the instantiation output states that the object is initialized in STANDALONE mode. Farasapy, like the toolkit binaries themselves, can run in two different modes: Interactive and Standalone.

Standalone Mode

In standalone mode, the instantiated object calls the binary each time it performs its task. It writes the input text to a temporary file, executes the binary on this file, and then reads the output from another temporary file. These temporary files are removed once the task ends. Be aware that some binaries, like the diacritizer, might take a very long time to start. Hence, this mode is preferred when you have a long text and want to process it only once.
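
The temporary-file flow described above can be sketched as follows. This illustrates the pattern only and is not farasapy's implementation: `run_once_via_temp_files` and the `build_command` callback are hypothetical names, the real Farasa command line is not reproduced, and a small Python script stands in for the Java binary so the sketch runs without Java.

```python
import os
import subprocess
import sys
import tempfile


def run_once_via_temp_files(build_command, text):
    """Write `text` to a temp file, run a command on it, return the output text.

    `build_command(input_path, output_path)` is a hypothetical callback that
    returns the argument list to execute.
    """
    # The directory and both files are removed when the `with` block exits.
    with tempfile.TemporaryDirectory() as workdir:
        input_path = os.path.join(workdir, "input.txt")
        output_path = os.path.join(workdir, "output.txt")
        with open(input_path, "w", encoding="utf-8") as handle:
            handle.write(text)
        # A new process is started, runs to completion, and exits on each call.
        subprocess.run(build_command(input_path, output_path), check=True)
        with open(output_path, encoding="utf-8") as handle:
            return handle.read()


# Stand-in "binary": a tiny Python script that upper-cases its input file.
UPPERCASE_SCRIPT = (
    "import sys;"
    "data = open(sys.argv[1], encoding='utf-8').read();"
    "open(sys.argv[2], 'w', encoding='utf-8').write(data.upper())"
)


def fake_command(input_path, output_path):
    return [sys.executable, "-c", UPPERCASE_SCRIPT, input_path, output_path]


print(run_once_via_temp_files(fake_command, "farasa"))
```

Because the process is launched again on every call, the startup cost is paid each time, which is why this pattern suits one long text better than many short ones.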

Interactive Mode

In interactive mode, the object starts the binary once, when it is instantiated. It then feeds text to the running binary and captures the output for each input. However, avoid passing very long lines because the output, just like in shells, might not be as expected. It is good practice to call my_obj.terminate() on such objects once they are no longer needed, to avoid unexpected behaviour in your code.

As a rule of thumb, use INTERACTIVE mode when the input text is small and you need to perform the task multiple times. STANDALONE mode is best for large input texts where the task is expected to be done only once.
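
If a long text still has to go through an interactive object, one option is to break it into shorter lines first. The helper below is a rough sketch under stated assumptions: the function names are hypothetical, the 50-word limit is an arbitrary illustration rather than a documented threshold, and splitting at arbitrary word boundaries may change the results of context-sensitive tasks.

```python
def split_into_short_lines(text, max_words=50):
    """Split `text` on whitespace into lines of at most `max_words` words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def process_in_chunks(process, text, max_words=50):
    """Apply `process` to each short line and join the results with spaces."""
    lines = split_into_short_lines(text, max_words)
    return " ".join(process(line) for line in lines)
```

Here `process` can be any callable that maps a string to a string, for example the `segment` method of an interactive segmenter.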

To work in interactive mode, pass the interactive=True option to your object's constructor.

The following is an example of the segmentation API running interactively.

segmenter = FarasaSegmenter(interactive=True)
perform system check...
check java version...
Your java version is 1.8 which is compatiple with Farasa 
check toolkit binaries...
Dependencies seem to be satisfied..
/path/to/the/library/farasa/__base.py:40: UserWarning: Be careful with large lines as they may break on interactive mode. You may switch to Standalone mode for such cases.
warnings.warn("Be careful with large lines as they may break on interactive mode. You may switch to Standalone mode for such cases.")
initializing [SEGMENT] task in INTERACTIVE mode...
task [SEGMENT] is initialized interactively.


segmented = segmenter.segment(sample)
print(segmented)
'يشار إلى أن ال+لغ+ة ال+عربي+ة يتحدث+ها أكثر من 422 مليون نسم+ة و+يتوزع متحدثوها في ال+منطق+ة ال+معروف+ة باسم ال+وطن ال+عربي ب+ال+إضاف+ة إلى ال+عديد من ال+مناطق ال+أخرى ال+مجاور+ة مثل ال+أهواز و+تركيا و+تشاد و+ال+سنغال و+إريتريا و+غير+ها . و+هي ال+لغ+ة ال+رابع+ة من لغ+ات منظم+ة ال+أمم ال+متحد+ة ال+رسمي+ة ال+ست .'
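
Since interactive objects should be terminated once they are no longer needed, it can help to tie the call to terminate() to a `with` block. The context manager below is a small sketch, not part of farasapy: `terminating` is a hypothetical helper that works with any object exposing a terminate() method.

```python
from contextlib import contextmanager


@contextmanager
def terminating(obj):
    """Yield `obj` and call `obj.terminate()` on exit, even after an error."""
    try:
        yield obj
    finally:
        obj.terminate()


# Usage with farasapy (requires Java and the toolkit binaries):
#
#     with terminating(FarasaSegmenter(interactive=True)) as segmenter:
#         print(segmenter.segment(sample))
```

With this pattern the underlying process is stopped even if the code inside the block raises an exception.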

Contribution

Want to cite?

You can find the list of publications to cite here: http://qatsdemo.cloudapp.net/farasa/.

Useful URLs
