
motazsaad / comparable-text-miner

License: Apache-2.0


Comparable text miner

Description

Comparable documents miner: Arabic-English morphological analysis, text processing, n-gram features extraction, POS tagging, dictionary translation, documents alignment, corpus information, text classification, tf-idf computation, text similarity computation, HTML documents cleaning, and others.

This software was implemented by Motaz SAAD (motaz dot saad at gmail dot com) during his PhD work. The PhD thesis is available at: https://sites.google.com/site/motazsite/Home/publications/saad_phd.pdf

Motaz Saad. Mining Documents and Sentiments in Cross-lingual Context. PhD thesis, Université de Lorraine, January 2015.

This software processes Arabic and English text. To use this software, load it as follows:

import imp
tp = imp.load_source('textpro', 'textpro.py')
# Then you can call its functions, for example:
text = 'Some raw input text'  # any Arabic or English string
clean_text = tp.process_text(text)
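
The imp module used above is deprecated and was removed in Python 3.12. On newer interpreters, a loader built on importlib can play the same role as imp.load_source. The following is a minimal sketch, not part of this project: the helper name load_module_from_path is an assumption made for illustration, and whether textpro.py itself runs unchanged on a recent Python 3 is not stated here.

```python
import importlib.util


def load_module_from_path(name, path):
    """Load a Python source file as a module, similar to imp.load_source.

    Hypothetical helper for illustration; it is not provided by this project.
    """
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError("cannot load %r from %r" % (name, path))
    module = importlib.util.module_from_spec(spec)
    # Run the file's top-level code inside the new module object.
    spec.loader.exec_module(module)
    return module
```

With this helper, the loading step would read tp = load_module_from_path('textpro', 'textpro.py'), after which tp.process_text(text) is called as before.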

Dependencies

This software depends on the following Python packages: scipy, numpy, nltk, sklearn, and bs4. Make sure they are installed before using this software.
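
A quick way to confirm that the packages are importable is to look each one up before running the demos. The sketch below is an illustration only: the function name missing_packages is an assumption, not something this project provides. Note that sklearn and bs4 are import names; on PyPI the corresponding distributions are called scikit-learn and beautifulsoup4.

```python
import importlib.util

# Import names taken from the dependency list above.
REQUIRED = ["scipy", "numpy", "nltk", "sklearn", "bs4"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```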

References

This software uses the following resources:

  • Arabic stopwords: http://www.ranks.nl/stopwords/arabic

  • Open Multilingual WordNet (OMW) dictionaries: http://compling.hss.ntu.edu.sg/omw/ (the references for OMW are listed below):

    • Francis Bond and Kyonghee Paik (2012), A survey of wordnets and their licenses. In Proceedings of the 6th Global WordNet Conference (GWC 2012). Matsue. 64–71.
    • Francis Bond and Ryan Foster (2013), Linking and extending an open multilingual wordnet. In 51st Annual Meeting of the Association for Computational Linguistics: ACL-2013. Sofia. 1352–1362.
  • ISRI Arabic Stemmer, which is a rooting algorithm for Arabic text. The reference of ISRI Arabic Stemmer is below:

    • Taghva, K., Elkoury, R., and Coombs, J. 2005. Arabic Stemming without a root dictionary. Information Science Research Institute. University of Nevada, Las Vegas, USA.
  • This software modifies the ISRI Arabic Stemmer to perform light stemming for Arabic words.
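
To make the difference between rooting and light stemming concrete: a rooting algorithm reduces a word to its root, while a light stemmer only strips common prefixes and suffixes. The sketch below is a generic affix-stripping example, not this project's modified ISRI stemmer. The affix lists, the min_len threshold, and the function name light_stem are all assumptions chosen for illustration.

```python
# Illustrative affix lists only; the project's modified ISRI stemmer may
# use different affixes and rules.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word
```

For example, light_stem("المكتبة") removes the definite article and the final ة, leaving "مكتب", whereas a rooting algorithm would go further and reduce the word to its root.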

Usage examples (demos)

  • Dictionary translation demo
python dict-demo.py <inputfile> <outputfile> <source language>
# translate from Arabic to English
python dict-demo.py test-text-files/dict-test-ar-input.txt test-text-files/dict-out.txt ar
# translate from English to Arabic
python dict-demo.py test-text-files/dict-test-en-input.txt test-text-files/dict--out.txt en
  • Arabic morphological analysis demo
python arabic-morphological-analysis-demo.py <inputfile> <outputfile>
python arabic-morphological-analysis-demo.py test-text-files/test-in.ar.txt test-text-files/test-out.ar.txt
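
As a rough picture of what dictionary translation involves, the sketch below looks up each whitespace-separated word in a bilingual dictionary. It is not the code of dict-demo.py: the toy dictionary TOY_AR_EN, the function name translate_words, and the output shape are assumptions made for illustration. The demo itself relies on the OMW dictionaries listed in the references, and its actual output format is not described here.

```python
# Toy Arabic-to-English dictionary for illustration only; the real demo
# uses the Open Multilingual WordNet dictionaries.
TOY_AR_EN = {
    "كتاب": ["book"],
    "بيت": ["house", "home"],
}


def translate_words(text, dictionary):
    """Return (word, translations) pairs; unknown words get an empty list."""
    return [(word, dictionary.get(word, [])) for word in text.split()]
```

Words that are missing from the dictionary are kept with an empty translation list, so the caller can decide whether to drop them or pass them through untranslated.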