ArtificiAI / Multilingual Latent Dirichlet Allocation LDA

License: MIT
A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Multilingual Latent Dirichlet Allocation (LDA) Pipeline

This project clusters text using the Latent Dirichlet Allocation (LDA) algorithm. It can be adapted to any language supported by the Snowball stemmer, which is a dependency of this project.

Usage

from artifici_lda.lda_service import train_lda_pipeline_default


FR_STOPWORDS = [
    "le", "les", "la", "un", "de", "en",
    "a", "b", "c", "s",
    "est", "sur", "tres", "donc", "sont",
    # even slang/texto stop words:
    "ya", "pis", "yer"]
# Note: this stop word list is deliberately minimal and serves only as an example.

fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?"
]

transformed_comments, top_comments, _1_grams, _2_grams = train_lda_pipeline_default(
    fr_comments,
    n_topics=2,
    stopwords=FR_STOPWORDS,
    language='french')

print(transformed_comments)
print(top_comments)
print(_1_grams)
print(_2_grams)

Output:

array([[0.14218195, 0.85781805],
       [0.11032992, 0.88967008],
       [0.16960695, 0.83039305],
       [0.88967041, 0.11032959],
       [0.8578187 , 0.1421813 ],
       [0.83039303, 0.16960697]])

['Un super-chien aboie', 'Les super-chats aiment ronronner']

[[('chiens', 3.4911404011996545), ('super', 2.5000203653313933)],
 [('chats',  3.4911393765493255), ('super', 2.499979634668601 )]]

[[('super chiens', 2.4921035508342464)],
 [('super chats',  2.492102155345991 )]]
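
As a hedged reading of the output above: each row of the first array appears to hold the topic weights of one input comment (every row sums to 1), so the column with the largest weight can serve as that comment's cluster. The sketch below uses plain Python lists with the values copied from the output. The helper name `dominant_topic` is an illustrative assumption and not part of this project's API.

```python
# Values copied from the printed output: one row per comment, one column per topic.
transformed_comments = [
    [0.14218195, 0.85781805],
    [0.11032992, 0.88967008],
    [0.16960695, 0.83039305],
    [0.88967041, 0.11032959],
    [0.8578187, 0.1421813],
    [0.83039303, 0.16960697],
]

fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?",
]


def dominant_topic(topic_weights):
    """Return the index of the largest weight in one row (hypothetical helper)."""
    return max(range(len(topic_weights)), key=lambda i: topic_weights[i])


# Group comments by the topic that carries most of their weight.
clusters = {}
for comment, weights in zip(fr_comments, transformed_comments):
    clusters.setdefault(dominant_topic(weights), []).append(comment)

# For each topic, pick the comment with the highest weight on that topic.
n_topics = len(transformed_comments[0])
strongest_per_topic = [
    max(zip(fr_comments, transformed_comments), key=lambda pair: pair[1][t])[0]
    for t in range(n_topics)
]

for topic, members in sorted(clusters.items()):
    print(topic, members)
print(strongest_per_topic)
```

With these values, topic 0 collects the three dog comments and topic 1 the three cat comments, and the strongest comment per topic matches the `top_comments` list printed above.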

How it works

See Multilingual-LDA-Pipeline-Tutorial for an exhaustive example (intended to be read from top to bottom rather than skimmed). For more detail on the inverse lemmatization (inverse stemming) step, see Stemming-words-from-multiple-languages.
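
To give a rough idea of what inverse stemming can look like, here is a minimal sketch of one common approach: remember which surface forms produced each stem, then map the stem back to its most frequent form. This is an assumption about the general technique, not necessarily how this project implements it. `toy_stem` is a hypothetical stand-in for a real Snowball stemmer, and `build_inverse_stem_table` is an illustrative name.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a Snowball stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_inverse_stem_table(words, stem=toy_stem):
    """Map each stem to the most frequent surface form that produced it."""
    forms_by_stem = defaultdict(Counter)
    for word in words:
        forms_by_stem[stem(word)][word] += 1
    return {s: forms.most_common(1)[0][0] for s, forms in forms_by_stem.items()}


# Toy vocabulary loosely based on the French example in the Usage section.
words = ["chat", "chats", "chats", "chien", "chiens", "chiens", "super", "super"]
table = build_inverse_stem_table(words)

# Restore readable words for a list of stems, leaving unknown stems untouched.
stemmed_topic_words = ["chien", "super"]
restored = [table.get(s, s) for s in stemmed_topic_words]
print(table)     # {'chat': 'chats', 'chien': 'chiens', 'super': 'super'}
print(restored)  # ['chiens', 'super']
```

Choosing the most frequent surface form keeps the reported topic words readable, which is consistent with the usage output showing "chiens" and "chats" rather than truncated stems.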

Supported Languages

The following languages are supported:

  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Porter (the original Porter stemming algorithm for English)
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish

You need to bring your own list of stop words. One way to build it is to compute term frequencies on your corpus (or on a larger corpus in the same language) and use some of the most common words as stop words.
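
As a rough sketch of the term-frequency approach, the snippet below counts tokens with the standard library and returns the most frequent ones as stop word candidates. The function name `suggest_stop_words`, the `top_k` parameter, and the simple `\w+` tokenizer are illustrative assumptions and not part of this project's API.

```python
import re
from collections import Counter


def suggest_stop_words(documents, top_k=10):
    """Return the top_k most frequent lowercase tokens across all documents."""
    counts = Counter()
    for doc in documents:
        # \w+ matches Unicode word characters, so accented letters are kept.
        counts.update(re.findall(r"\w+", doc.lower()))
    return [word for word, _ in counts.most_common(top_k)]


fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?",
]

candidates = suggest_stop_words(fr_comments, top_k=5)
print(candidates)
```

On a corpus this small, content words such as "super" rank at the top alongside real function words such as "un" and "les", so the candidates are best reviewed by hand before being passed as `stopwords`.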

Dependencies and their licenses

numpy==1.14.3           # BSD-3-Clause, BSD-2-Clause, BSD-like, and Zlib
scikit-learn==0.19.1    # BSD-3-Clause
PyStemmer==1.3.0        # BSD-3-Clause and MIT
snowballstemmer==1.2.1  # BSD-3-Clause and BSD-2-Clause
translitcodec==0.4.0    # MIT License
scipy==1.1.0            # BSD-3-Clause and MIT-like

Unit tests

Run the tests with pytest via ./run_tests.sh. Coverage report:

----------- coverage: platform linux, python 3.6.7-final-0 -----------
Name                                       Stmts   Miss  Cover
--------------------------------------------------------------
artifici_lda/__init__.py                       0      0   100%
artifici_lda/data_utils.py                    39      0   100%
artifici_lda/lda_service.py                   31      0   100%
artifici_lda/logic/__init__.py                 0      0   100%
artifici_lda/logic/count_vectorizer.py         9      0   100%
artifici_lda/logic/lda.py                     23      7    70%
artifici_lda/logic/letter_splitter.py         36      4    89%
artifici_lda/logic/stemmer.py                 60      3    95%
artifici_lda/logic/stop_words_remover.py      61      5    92%
--------------------------------------------------------------
TOTAL                                        259     19    93%

License

This project is published under the MIT License (MIT).

Copyright (c) 2018 Artifici online services inc.

Coded by Guillaume Chevalier at Neuraxio Inc.
