All Projects → Qutuf → Qutuf

Qutuf / Qutuf

Qutuf (قُطُوْف): An Arabic Morphological analyzer and Part-Of-Speech tagger as an Expert System.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Qutuf

Jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Stars: ✭ 254 (+202.38%)
Mutual labels:  morphological-analysis, pos-tagging, part-of-speech-tagger
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (+447.62%)
Mutual labels:  morphological-analysis, part-of-speech-tagger, stemmer
wink-nlp
Developer friendly Natural Language Processing ✨
Stars: ✭ 312 (+271.43%)
Mutual labels:  pattern-matching, pos-tagging
GrammarEngine
Грамматический Словарь Русского Языка (+ английский, японский, etc)
Stars: ✭ 68 (-19.05%)
Mutual labels:  part-of-speech-tagger, morphological-analysis
ATKSpy
this repository is a python package that supports SOAP interface to communicate with the Microsoft ATKS
Stars: ✭ 27 (-67.86%)
Mutual labels:  arabic, pos-tagging
Sudachipy
Python version of Sudachi, a Japanese tokenizer.
Stars: ✭ 207 (+146.43%)
Mutual labels:  morphological-analysis, pos-tagging
Arabicstemmer
Assem's Arabic Light Stemmer is a snowball-based stemming algorithm for Arabic aimed mainly to improve search.
Stars: ✭ 102 (+21.43%)
Mutual labels:  arabic, stemmer
udar
UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.
Stars: ✭ 15 (-82.14%)
Mutual labels:  pos-tagging, morphological-analysis
Pytorch Pos Tagging
A tutorial on how to implement models for part-of-speech tagging using PyTorch and TorchText.
Stars: ✭ 96 (+14.29%)
Mutual labels:  pos-tagging, part-of-speech-tagger
HebPipe
An NLP pipeline for Hebrew
Stars: ✭ 15 (-82.14%)
Mutual labels:  part-of-speech-tagger, morphological-analysis
Articutapi
API of Articut 中文斷詞 (兼具語意詞性標記):「斷詞」又稱「分詞」,是中文資訊處理的基礎。Articut 不用機器學習,不需資料模型,只用現代白話中文語法規則,即能達到 SIGHAN 2005 F1-measure 94% 以上,Recall 96% 以上的成績。
Stars: ✭ 252 (+200%)
Mutual labels:  pos-tagging, part-of-speech-tagger
Sudachidict
A lexicon for Sudachi
Stars: ✭ 127 (+51.19%)
Mutual labels:  morphological-analysis, pos-tagging
Jptdp
Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018)
Stars: ✭ 146 (+73.81%)
Mutual labels:  pos-tagging, part-of-speech-tagger
Camel tools
A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.
Stars: ✭ 124 (+47.62%)
Mutual labels:  arabic, morphological-analysis
Rdrpostagger
A fast and accurate POS and morphological tagging toolkit (EACL 2014)
Stars: ✭ 126 (+50%)
Mutual labels:  pos-tagging, part-of-speech-tagger
datalinguist
Stanford CoreNLP in idiomatic Clojure.
Stars: ✭ 93 (+10.71%)
Mutual labels:  pos-tagging, part-of-speech-tagger
rakutenma-python
Rakuten MA (Python version)
Stars: ✭ 15 (-82.14%)
Mutual labels:  pos-tagging, part-of-speech-tagger
Sudachi
A Japanese Tokenizer for Business
Stars: ✭ 496 (+490.48%)
Mutual labels:  morphological-analysis, pos-tagging
Kagome
Self-contained Japanese Morphological Analyzer written in pure Go
Stars: ✭ 554 (+559.52%)
Mutual labels:  morphological-analysis, pos-tagging
Phonlp
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing (NAACL 2021)
Stars: ✭ 56 (-33.33%)
Mutual labels:  pos-tagging

Arabic Morphological Analyzer (with Stemmer) and Part-Of-Speech Tagger

Qutuf (قُطُوْف): An Arabic Morphological Analyzer (Including Stemming and Root Extraction) and Part-Of-Speech Tagger as an Expert System.

Available now as an experimental web-service at https://qutuf.herokuapp.com.

Qutuf is aimed to be the Core of a Framework for Arabic NLP (Natural Language Processing)

At Qutuf, some new concepts have been identified and implemented. Like First Normalization and Second Normalization text forms at the preprocessing phase and the Premature and Overdue Tagging at the Part-Of-Speech tagging task. Moreover, the POS tagging is designed and implemented as a rule-based expert system. A POS tagset, which is built based on a morphological feature tagset, has been designed and used in Qutuf.

Morphological Analysis Includes both Stemming (light stemming) and Root Extraction (heavy stemming). It achieve this by using finite state automates and rules for agreement developed for cliticization parsing. It also uses AlKhalil Morpho Sys open source database for root extraction, pattern matching, morphological feature and POS assignment and closed nouns after enriching it.

Execution

As Qutuf serves as a framework, you can run for the processing phase you need and passing the applicable parameter for each processing phase.

The entry code for running all features can be found at:

SourceCode/Views/run_Qutuf.py

And you can check the free web-service online as you can see in Qutuf as-a-Service section. However, we encourage you to run qutuf on your machine and start adding additional text processing steps on top of the existing ones.

Text Processing

text.Tokenize();

text.Normalize(2);

text.CompoundParsing();

text.PrematureTagging();

text.ParseClitics();

text.PatternMatching(prematureTaggingPositiveThreshold, prematureTaggingNegativeThreshold);

text.OverdueTagging(overdureTaggingThreshold, overdureTaggingTopReservants);

To understand the processing happen at each step, please refer to the publication

Qutuf as-a-Service

Qutuf is available as an experimental service at https://qutuf.herokuapp.com.

You can get the output in json, xml or html (default) by using the following url parameter:

outputformat = [json|xml|html]

JSON format

To process the word غير for example and get the result in JSON:

https://qutuf.herokuapp.com/?outputformat=json&text=غير

XML format

Similarly, to process the word غير for example and get the result in XML:

https://qutuf.herokuapp.com/?outputformat=xml&text=غير

HTML format

HTML is the default format. The following will give the output for the word غير in HTML:

https://qutuf.herokuapp.com/?text=غير

Using Qutuf for Lemmatization (you may call it Stemming)

Similar to outputformat mentioned above, there is an optional URL parameter to use when getting only the possible lemma(s) / Stem(s) of every word.

[functionality=lemma]

?functionality=lemma&outputformat=html&text=الملك بالملك فالملك وملك والملك

This URL parameter is optional and not passing it will tell Qutuf to give a detailed output.

Sample JSON output

Sample Lemmatization (Stemming)

If it is intended to get only the lemma/stem of the given words. Qutuf will do all the processing but will output only the lemma(s).

Bellow, is the output for 'الملك بالملك فالملك وملك والملك'. Note that if the string was not a word (like space, semicolon, exclamation marks, and etc.), there will be no lemmas attribute.

If Qutuf could not identify the word, it will mark has_been_identified as false and will return the original word as the lemma. This will happen if the word is not in the dictionaries nor it has a valid root-and-pattern.


{"Word":[{"@number_of_possibilities":"0","@original_string":" "},{"@has_been_identified":"true","@lemmas":"ملك","@number_of_possibilities":"10","@original_string":"الملك"},{"@number_of_possibilities":"0","@original_string":" "},{"@has_been_identified":"true","@lemmas":"ملك","@number_of_possibilities":"8","@original_string":"بالملك"},{"@number_of_possibilities":"0","@original_string":" "},{"@has_been_identified":"true","@lemmas":"ملك","@number_of_possibilities":"20","@original_string":"فالملك"},{"@number_of_possibilities":"0","@original_string":" "},{"@has_been_identified":"true","@lemmas":"ملك, مل","@number_of_possibilities":"60","@original_string":"وملك"},{"@number_of_possibilities":"0","@original_string":" "},{"@has_been_identified":"true","@lemmas":"ملك","@number_of_possibilities":"28","@original_string":"والملك"},{"@number_of_possibilities":"0","@original_string":"."}]}

Sample Full output

The following is the full output of the processed word "غير".


{  
  "Word":[  
  {  
    "@number_of_possibilities":"10",
    "@original_string":"غير",
    "SurfaceFormMorphemes":[  
      {  
        "@certainty":"0.98681640625",
        "@voweled_form":"غَيْر",
        "Proclitcs":null,
        "Cliticless":{  
        "@arabic_description":"حرف, حرف استثناء, ظاهر",
        "@tag":"p,x"
        },
        "Enclitics":null
      }, {  
        "@certainty":"0.96484375",
        "@voweled_form":"غَيْر",
        "Proclitcs":null,
        "Cliticless":{  
        "@arabic_description":"اسم, مذكر, مفرد, ؟, مرفوع أو منصوب أو مجرور, اسم جامد, معرَّف بالإضافة أو نكرة, ظاهر",
        "@tag":"n,m,s,؟,nag,q,ai",
        "Pattern":{  
          "@Lemma":"غير",
          "@root":"غير",
          "@unoweled":"فعل",
          "@voweled":"فَعْل"
        }
        },
        "Enclitics":null
      }, {  
        "@certainty":"0.953125",
        "@voweled_form":"غَيْرٌ",
        "Proclitcs":null,
        "Cliticless":{  
        "@arabic_description":"اسم, مذكر, مفرد, ؟, مرفوع, مصدر أصلي, نكرة, ظاهر",
        "@tag":"n,m,s,؟,n,g,i",
        "Pattern":{  
          "@Lemma":"غير",
          "@root":"غير",
          "@unoweled":"فعل",
          "@voweled":"فَعْلٌ"
        }
        },
        "Enclitics":null
      }, {  
        "@certainty":"0.953125",
        "@voweled_form":"غَيْرُ",
        "Proclitcs":null,
        "Cliticless":{  
        "@arabic_description":"اسم, مذكر, مفرد, ؟, مرفوع, مصدر أصلي, معرَّف بالإضافة, ظاهر",
        "@tag":"n,m,s,؟,n,g,a",
        "Pattern":{  
          "@Lemma":"غير",
          "@root":"غير",
          "@unoweled":"فعل",
          "@voweled":"فَعْلُ"
        }
        },
        "Enclitics":null
      }, {  
        "@certainty":"0.9296875",
        "@voweled_form":"غَيْرَ",
        "Proclitcs":null,
        "Cliticless":{  
        "@arabic_description":"اسم, مذكر, مفرد, ؟, منصوب, مصدر أصلي, معرَّف بالإضافة, ظاهر",
        "@tag":"n,m,s,؟,a,g,a",
        "Pattern":{  
          "@Lemma":"غير",
          "@root":"غير",
          "@unoweled":"فعل",
          "@voweled":"فَعْلَ"
        }
        },
        "Enclitics":null
      }, {  
        "@certainty":"0.859375",
        "@voweled_form":"غَيَّرَ",
        "Proclitcs":null,
        "Cliticless":{  
        "@arabic_description":"فعل, مذكر, مفرد, مسند إلى الغائب, -, ماض, مبني للمعلوم, لازمٌ أو مُتعدٍ, ؟, -, مزيد",
        "@tag":"v,m,s,t,-,p,a,iobt,؟",
        "Pattern":{  
          "@Lemma":"غير",
          "@root":"غير",
          "@unoweled":"فعل",
          "@voweled":"فَعَّلَ"
        }
        },
        "Enclitics":null
      }, {  
        "@certainty":"0.8125",
        "@voweled_form":"غُيِّرَ",
        "Proclitcs":null,
        "Cliticless":{  
        "@arabic_description":"فعل, مذكر, مفرد, مسند إلى الغائب, -, ماض, مبني للمجهول, لازمٌ أو مُتعدٍ, ؟, -, مزيد",
        "@tag":"v,m,s,t,-,p,p,iobt,؟",
        "Pattern":{  
          "@Lemma":"غير",
          "@root":"غير",
          "@unoweled":"فعل",
          "@voweled":"فُعِّلَ"
        }
        },
        "Enclitics":null
      }, {  
        "@certainty":"0.8125",
        "@voweled_form":"غِيرَ",
        "Proclitcs":null,
        "Cliticless":{  
        "@arabic_description":"فعل, مذكر, مفرد, مسند إلى الغائب, -, ماض, مبني للمجهول, لازمٌ أو مُتعدٍ, ؟, -, مجرد",
        "@tag":"v,m,s,t,-,p,p,iobt,؟",
        "Pattern":{  
          "@Lemma":"غير",
          "@root":"غير",
          "@unoweled":"فعل",
          "@voweled":"فِعلَ"
        }
        },
        "Enclitics":null
      }, {  
        "@certainty":"0.8125",
        "@voweled_form":"غِيرَ",
        "Proclitcs":null,
        "Cliticless":{  
        "@arabic_description":"فعل, مذكر, مفرد, مسند إلى الغائب, -, ماض, مبني للمجهول, لازمٌ أو مُتعدٍ, ؟, -, مجرد",
        "@tag":"v,m,s,t,-,p,p,iobt,؟",
        "Pattern":{  
          "@Lemma":"غير",
          "@root":"غور",
          "@unoweled":"فيل",
          "@voweled":"فِيلَ"
        }
        },
        "Enclitics":null
      }, {  
        "@certainty":"0.5",
        "@voweled_form":"غَيِّرْ",
        "Proclitcs":null,
        "Cliticless":{  
        "@arabic_description":"فعل, مذكر, مفرد, مسند إلى المخاطب, -, أمر, -, لازمٌ أو مُتعدٍ, ؟, غير مؤكد, مزيد",
        "@tag":"v,m,s,s,-,i,-,iobt,؟",
        "Pattern":{  
          "@Lemma":"غير",
          "@root":"غير",
          "@unoweled":"فعل",
          "@voweled":"فَعِّلْ"
        }
        },
        "Enclitics":{  
        "Enclitic":{  
          "@arabic_description":"اسم, مذكر, مفرد, ؟, مرفوع, ضمير رفع, معرفة, مضمر/مقدّر",
          "@tag":"n,m,s,؟,n,p,d"
            }
          }
        }
      ]
    }
  ]
}

To use the output as an input for another processing phase/program you develop, you may use, for example, the "tag" of each Proclitcs, Cliticless (the stem) and Enclitics when applicable.

Support & Contribute!

Your are welcome to contribution and support. Start by pressing Star and Fork for this repository, implement new features, create a pull request and we will be happy to review and accept your valuable features, enhancements and bug-fixes.

For any bug, question or feature-request, feel free to open an Issue at this repository.

For more, check http://qutuf.com and contact [email protected].

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].