nschneid / arabic-tagger

License: GPL-3.0
AQMAR Arabic Tagger: Sequence tagger with cost-augmented structured perceptron training

Programming languages: Java, Perl, Python, shell

AQMAR Arabic Tagger

This package provides a sequence tagger customized with features for Arabic, including a named entity detection model intended especially for Arabic Wikipedia. The model was trained on labeled ACE and ANER data as well as an unlabeled Wikipedia corpus. Learning uses the structured perceptron, optionally with cost-augmented training. Feature extraction is a separate preprocessing step that runs before learning or decoding.
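As a rough illustration of cost-augmented structured perceptron training, here is a minimal Python sketch. It is not the package's Java implementation. The feature representation (a list of feature strings per token plus a previous-label feature), the function names, and the asymmetric `recall_cost` (a missed entity token costs `beta`, any other error costs 1) are all assumptions made for the example. Weight averaging is omitted.

```python
def recall_cost(gold_label, pred_label, beta=3.0):
    """Hypothetical per-token cost: predicting O for an entity token costs beta."""
    if gold_label == pred_label:
        return 0.0
    return beta if pred_label == "O" else 1.0


def viterbi(sent, labels, w, gold=None, cost=None):
    """Return the highest-scoring label sequence for one sentence.

    sent is a list of per-token feature lists. If gold and cost are given,
    cost(gold[i], y) is added to each candidate label (cost-augmented decoding).
    """
    if not sent:
        return []

    def local(i, prev, y):
        s = sum(w.get((f, y), 0.0) for f in sent[i])
        s += w.get(("prev=" + prev, y), 0.0)
        if gold is not None and cost is not None:
            s += cost(gold[i], y)
        return s

    # chart[i][y] = (best score of a path ending in y at position i, previous label)
    chart = [{y: (local(0, "<s>", y), None) for y in labels}]
    for i in range(1, len(sent)):
        chart.append({
            y: max((chart[i - 1][p][0] + local(i, p, y), p) for p in labels)
            for y in labels
        })
    y = max(labels, key=lambda lab: chart[-1][lab][0])
    path = [y]
    for i in range(len(sent) - 1, 0, -1):
        y = chart[i][y][1]
        path.append(y)
    return path[::-1]


def feature_counts(sent, ys):
    """Count (feature, label) pairs fired by labeling sent with ys."""
    counts, prev = {}, "<s>"
    for feats, y in zip(sent, ys):
        for key in [(f, y) for f in feats] + [("prev=" + prev, y)]:
            counts[key] = counts.get(key, 0.0) + 1.0
        prev = y
    return counts


def train(data, labels, iters=10, cost=None):
    """Perceptron training; with cost set, decoding during training is cost-augmented."""
    w = {}
    for _ in range(iters):
        for sent, gold in data:
            pred = viterbi(sent, labels, w, gold, cost)
            if pred != gold:
                # Move weights toward the gold features and away from the prediction.
                for key, v in feature_counts(sent, gold).items():
                    w[key] = w.get(key, 0.0) + v
                for key, v in feature_counts(sent, pred).items():
                    w[key] = w.get(key, 0.0) - v
    return w


sent = [["w=John"], ["w=runs"]]
w = train([(sent, ["B", "O"])], ["B", "O"], iters=5, cost=recall_cost)
print(viterbi(sent, ["B", "O"], w))
```

With `cost=None` this reduces to the ordinary structured perceptron. The cost term is used only during training; at test time decoding is done without it.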

The tagger was used for the experiments reported in

  • Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith (2012), Recall-Oriented Learning of Named Entities in Arabic Wikipedia. Proceedings of EACL.

and accompanies the AQMAR Arabic Wikipedia Named Entity Corpus also described in that work; both can be obtained at

http://www.ark.cs.cmu.edu/AQMAR/

The Java tagger was adapted from Michael Heilman's supersense tagger implementation for English (http://www.ark.cs.cmu.edu/mheilman/questions/). It requires a minimum Java version of 1.6. Feature extraction uses Python and depends on the MADA toolkit (http://www1.ccls.columbia.edu/MADA/; version 3.1 was used for the Named Entity Corpus).

The AQMAR Arabic Tagger is released under the GNU General Public License (GPL) version 3 or later; see LICENSE. (Michael Heilman's supersense tagger, which we modify, was originally released in 2011 under GPL version 2 or later; the JSAP library, which we link to, was originally released by Martian Software in 2011 under the GNU Lesser General Public License.)

Contents

  • eval/

    README and scripts for NER evaluation.

  • featExtract/

    README and scripts for feature extraction.

  • lib/

    External libraries required for the Java tagger.

  • model/

    Serialized tagging models, namely the best Arabic Wikipedia tagger reported in the EACL paper.

  • src/

    Java source files for the tagger.

  • arabic-tagger.jar

    Compiled Java program for training and decoding with the tagger.

  • build.sh

    Script for compiling the Java sources.

  • sample.properties

    An example properties file that can be used to specify options for the tagger. Options may alternatively be passed as command-line flags; if an option is specified in both places, the command-line value will take precedence.

  • LICENSE

  • README

  • VERSION
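
The precedence rule described for sample.properties (command-line values override values from the properties file) can be pictured with the following Python sketch. It is only an illustration of the rule, not the tagger's own option handling. The helper names are hypothetical, the parser handles only simple `key=value` lines rather than the full Java properties syntax, and the assumption that property keys match the flag names is ours.

```python
def load_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, _, value = line.partition("=")
        opts[key.strip()] = value.strip()
    return opts


def resolve_options(properties, cli):
    """Merge two option dicts; command-line values win on conflict."""
    merged = dict(properties)
    merged.update(cli)
    return merged


props = load_properties("# example only\niters=5\nusePrevLabel=true\n")
print(resolve_options(props, {"iters": "10"}))
```

Here `iters` is set in both places, so the command-line value `10` is the one that survives, while `usePrevLabel` keeps its value from the file.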

Usage

Extracting features for text data: See featExtract/README.txt

Running the Arabic named entity tagger

For example, the following command will use the existing named entity model in the model/ directory:

java -Xmx8000m -XX:+UseCompressedOops -jar arabic-tagger.jar \
	--load model/arabic-ner-superROP200.selfROP100.ser.gz \
	--test-predict featExtract/sample.bio.nerFeats --usePrevLabel true \
	--properties sample.properties > predictions.out
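
Since feature extraction already runs in Python, it may be convenient to launch the tagger from a Python script as well. The sketch below assembles the same command shown above and writes standard output to a file. The function names and default arguments are hypothetical; only the flags that appear in the command above are used.

```python
import subprocess


def build_predict_command(model, feats, properties="sample.properties",
                          jar="arabic-tagger.jar", heap_mb=8000):
    """Assemble the prediction command as an argument list (no shell needed)."""
    return [
        "java", f"-Xmx{heap_mb}m", "-XX:+UseCompressedOops", "-jar", jar,
        "--load", model,
        "--test-predict", feats, "--usePrevLabel", "true",
        "--properties", properties,
    ]


def run_prediction(cmd, out_path):
    """Run the tagger, sending its standard output to out_path."""
    with open(out_path, "w", encoding="utf-8") as out:
        subprocess.run(cmd, stdout=out, check=True)


cmd = build_predict_command(
    "model/arabic-ner-superROP200.selfROP100.ser.gz",
    "featExtract/sample.bio.nerFeats",
)
# run_prediction(cmd, "predictions.out")  # requires Java and the jar in the working directory
```

Passing the arguments as a list avoids shell quoting issues; `check=True` raises an exception if the Java process exits with a nonzero status.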

Training a tagging model

Here is an example command for training a model on the sample feature-extracted data:

java -Xmx8000m -XX:+UseCompressedOops -jar arabic-tagger.jar \
	--save model/sample-model.ser.gz --iters 10 --no-averaging \
	--labels featExtract/sample.labels --train featExtract/sample.nerFeats --debug --disk --weights \
	--properties sample.properties > weights.out

or, to train a model that predicts entity boundaries only:

java -Xmx8000m -XX:+UseCompressedOops -jar arabic-tagger.jar \
	--save model/sample-model.ser.gz --iters 10 --no-averaging \
	--labels featExtract/bio.labels --train featExtract/sample.bio.nerFeats --debug --disk --weights \
	--properties sample.properties > weights.out
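
To make the difference between the two training setups concrete, the Python sketch below collapses typed entity labels to boundary-only labels. The label format (`B-PER`, `I-PER`, `O`) is an assumption for illustration; the labels actually used by the package are listed in featExtract/sample.labels and featExtract/bio.labels, and the function name is hypothetical.

```python
def to_boundary_labels(labels):
    """Drop the entity type from each label, keeping only the B/I/O part."""
    return [label.split("-", 1)[0] for label in labels]


print(to_boundary_labels(["B-PER", "I-PER", "O", "B-LOC"]))
```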

Until a known bug affecting averaging is fixed, we recommend specifying --no-averaging for training.

For details about options, run

java -jar arabic-tagger.jar --help