dayyass / latent-semantic-analysis

Licence: MIT License
Pipeline for training LSA models using Scikit-Learn.

Latent Semantic Analysis

Pipeline for training LSA models using Scikit-Learn.

Usage

Instead of writing custom code for latent semantic analysis, you only need to:

  1. Install the pipeline:
pip install latent-semantic-analysis
  2. Run the pipeline:
  • either in the terminal:
lsa-train --path_to_config config.yaml
  • or in Python:
import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

NOTE: the config file is described in the Config section below.

No data preparation is needed: the only input is a CSV file with a raw text column (the column name is arbitrary).
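
As a minimal sketch of the expected input, the snippet below writes a small one-column CSV using only the standard library. The helper write_example_csv, the sample sentences, and the temporary folder are illustrative assumptions and not part of the package; the column name text and the file name data/data.csv simply mirror the defaults in config.yaml.

```python
import csv
import tempfile
from pathlib import Path


def write_example_csv(path, texts, text_column="text"):
    """Write raw texts to a one-column CSV file (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([text_column])  # header row with the column name
        for text in texts:
            writer.writerow([text])  # csv quotes fields containing commas
    return path


# Made-up sample documents; real data would be your own raw text.
texts = [
    "Stocks fell, bonds rallied.",
    "The cat sat on the mat.",
    "Investors sold technology shares.",
]

# A temporary folder keeps the sketch side-effect free; in a real project
# the file would live wherever data_path in config.yaml points.
csv_path = write_example_csv(Path(tempfile.mkdtemp()) / "data" / "data.csv", texts)
```

Any other column name works as long as text_column in the config is set to match it.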

Config

The user interface consists of a single file:

  • config.yaml - general configuration with sklearn TF-IDF and SVD parameters

Edit config.yaml to set the desired configuration, then train the LSA model with one of the following commands:

  • terminal:
lsa-train --path_to_config config.yaml
  • python:
import latent_semantic_analysis

latent_semantic_analysis.train(path_to_config="config.yaml")

Default config.yaml:

seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack

NOTE: the tf-idf and svd sections hold the parameters of sklearn's TfidfVectorizer and TruncatedSVD respectively, so you can parameterize instances of these classes however you want.
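
To illustrate how these two sections map onto scikit-learn, here is a rough sketch that passes the same parameters directly to TfidfVectorizer and TruncatedSVD. It is not the package's own code: the step names, the toy corpus, and the use of seed as random_state are assumptions, and n_components is lowered from 10 to 2 because arpack needs fewer components than the smaller dimension of the TF-IDF matrix.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Values copied from the tf-idf section of the default config.yaml.
tfidf_params = {
    "lowercase": True,
    "ngram_range": (1, 1),
    "max_df": 1.0,
    "min_df": 1,
}

# From the svd section, except n_components (10 in the default config),
# which is reduced here to fit the four-document toy corpus.
# Passing seed (42) as random_state is an assumption.
svd_params = {"n_components": 2, "algorithm": "arpack", "random_state": 42}

# Step names are hypothetical; the package may name them differently.
lsa = Pipeline(
    [
        ("tf-idf", TfidfVectorizer(**tfidf_params)),
        ("svd", TruncatedSVD(**svd_params)),
    ]
)

# Made-up documents, for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold stocks and bonds",
]

# One row per document, one column per latent topic.
doc_embeddings = lsa.fit_transform(docs)
print(doc_embeddings.shape)  # (4, 2)
```

Any other keyword argument accepted by the two classes can be added to the corresponding config section in the same way.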

Output

After training, the pipeline saves the following files:

  • model.joblib - sklearn pipeline with LSA (TF-IDF and SVD steps)
  • config.yaml - config that was used to train the model
  • logging.txt - logging file
  • doc2topic.json - document embeddings
  • term2topic.json - term embeddings
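
Since model.joblib is an sklearn pipeline, it can presumably be reloaded with joblib and applied to new text. The sketch below is self-contained, so it first trains and saves a stand-in model in a temporary folder; with a real run you would load the model.joblib written by the pipeline instead. The toy corpus and the layout of the term2topic dictionary are assumptions, and the actual JSON files may be structured differently.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up documents, for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold stocks and bonds",
]

# Stand-in for a trained model.joblib (a real one comes from lsa-train).
model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
)
model.fit(docs)
model_path = Path(tempfile.mkdtemp()) / "model.joblib"
joblib.dump(model, model_path)

# Reload the pipeline and embed an unseen document.
loaded = joblib.load(model_path)
new_doc_embeddings = loaded.transform(["cats and dogs"])  # shape (1, 2)

# Term embeddings: each column of components_ belongs to one vocabulary term.
vectorizer, svd = loaded.steps[0][1], loaded.steps[1][1]
term_embeddings = svd.components_.T  # shape (n_terms, n_components)
term2topic = {
    term: term_embeddings[index].tolist()
    for term, index in vectorizer.vocabulary_.items()
}
```

Loading a joblib file executes pickled code, so only load models from sources you trust.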

Requirements

Python >= 3.6

Citation

If you use latent-semantic-analysis in a scientific publication, we would appreciate a citation using the following BibTeX entry:

@misc{dayyass2021lsa,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training LSA models},
    howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
    year         = {2021}
}