
ashishrana160796 / online-course-recommendation-system

License: Unlicense
Built on results fetched from Pluralsight's course API. Works with a model trained with the K-means unsupervised clustering algorithm.

Programming Languages

python

Projects that are alternatives of or similar to online-course-recommendation-system

genieclust
Genie++ Fast and Robust Hierarchical Clustering with Noise Point Detection - for Python and R
Stars: ✭ 34 (+9.68%)
Mutual labels:  machine-learning-algorithms, data-analysis, clustering-algorithm
whyqd
data wrangling simplicity, complete audit transparency, and at speed
Stars: ✭ 16 (-48.39%)
Mutual labels:  pandas, data-analysis
Clustering-Python
Python Clustering Algorithms
Stars: ✭ 23 (-25.81%)
Mutual labels:  machine-learning-algorithms, clustering-algorithm
dataquest-guided-projects-solutions
My dataquest project solutions
Stars: ✭ 35 (+12.9%)
Mutual labels:  pandas, data-analysis
kobe-every-shot-ever
A Los Angeles Times analysis of Every shot in Kobe Bryant's NBA career
Stars: ✭ 66 (+112.9%)
Mutual labels:  pandas, data-analysis
Udacity-Data-Analyst-Nanodegree
Repository for the projects needed to complete the Data Analyst Nanodegree.
Stars: ✭ 31 (+0%)
Mutual labels:  pandas, data-analysis
tempo
API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc), AS OF joins, downsampling, and interpolation
Stars: ✭ 212 (+583.87%)
Mutual labels:  pandas, data-analysis
Igel
a delightful machine learning tool that allows you to train, test, and use models without writing code
Stars: ✭ 2,956 (+9435.48%)
Mutual labels:  machine-learning-algorithms, data-analysis
Data-Science-Resources
A guide to getting started with Data Science and ML.
Stars: ✭ 17 (-45.16%)
Mutual labels:  pandas, data-analysis
Data-Science-101
Notes and tutorials on how to use python, pandas, seaborn, numpy, matplotlib, scipy for data science.
Stars: ✭ 19 (-38.71%)
Mutual labels:  pandas, data-analysis
Tf-Rec
Tf-Rec is a python💻 package for building⚒ Recommender Systems. It is built on top of Keras and Tensorflow 2 to utilize GPU Acceleration during training.
Stars: ✭ 18 (-41.94%)
Mutual labels:  machine-learning-algorithms, recommender-system
PandasVersusExcel
Introduction to Python data analysis; an introduction for data analysts
Stars: ✭ 120 (+287.1%)
Mutual labels:  pandas, data-analysis
Web-Development
Created this new Repository for Open Source Contribution for Beginners
Stars: ✭ 25 (-19.35%)
Mutual labels:  up-for-grabs, beginner-friendly
PracticalMachineLearning
A collection of ML related stuff including notebooks, codes and a curated list of various useful resources such as books and softwares. Almost everything mentioned here is free (as speech not free food) or open-source.
Stars: ✭ 60 (+93.55%)
Mutual labels:  pandas, data-analysis
Statistical-Learning-using-R
This is a Statistical Learning application consisting of various Machine Learning algorithms, their implementation in R done by me, and their in-depth interpretation. Documents and reports related to the techniques mentioned below can be found on my Rpubs profile.
Stars: ✭ 27 (-12.9%)
Mutual labels:  machine-learning-algorithms, clustering-algorithm
Genetic-Algorithm-on-K-Means-Clustering
Implementing Genetic Algorithm on K-Means and compare with K-Means++
Stars: ✭ 37 (+19.35%)
Mutual labels:  k-means, clustering-algorithm
DataProfiler
What's in your data? Extract schema, statistics and entities from datasets
Stars: ✭ 843 (+2619.35%)
Mutual labels:  pandas, data-analysis
Hdbscan
A high performance implementation of HDBSCAN clustering.
Stars: ✭ 2,032 (+6454.84%)
Mutual labels:  machine-learning-algorithms, clustering-algorithm
Nmflibrary
MATLAB library for non-negative matrix factorization (NMF): Version 1.8.1
Stars: ✭ 153 (+393.55%)
Mutual labels:  machine-learning-algorithms, data-analysis
Datscan
DatScan is an initiative to build an open-source CMS that will have the capability to solve any problem using data Analysis just with the help of various modules and a vast standardized module library
Stars: ✭ 13 (-58.06%)
Mutual labels:  pandas, data-analysis

Online-Course-Recommendation-System

Built on results fetched from Pluralsight's course API. Refer to their API to use the most recent data.
Works with a model trained with the K-means unsupervised clustering algorithm on text data vectorized with the tf-idf algorithm.
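
As a rough illustration of that pipeline, the sketch below vectorizes a few descriptions with tf-idf and clusters them with k-means using scikit-learn. The four descriptions and k=2 are made up for this example; the project's real training scripts and the contents of courses.csv may differ.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy course descriptions (assumption: invented here, not taken from courses.csv).
descriptions = [
    "python pandas data analysis",
    "python numpy data analysis",
    "javascript react web frontend",
    "javascript angular web frontend",
]

# Turn each description into a tf-idf weighted term vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(descriptions)

# Group the vectors into k clusters; k=2 only suits this toy corpus.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

# Descriptions that share vocabulary should receive the same label.
for text, label in zip(descriptions, labels):
    print(label, text)
```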

Architectural Diagram of Tool

Architecture of Recommendation Engine

Experiments

Experiment 1: Using k=8, since Pluralsight has eight course categories. This is just a basic intuition to get started with.

The issue with this approach is that it results in a higher SSE (sum of squared errors) than larger values of k, as shown in the figure below.

Elbow Experiment Plot

The elbow (knee) method is a good visual experiment for finding the optimum number of clusters. Ideally it is the point after which the error stops decreasing drastically and the curve flattens out.
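
A minimal sketch of such an elbow experiment, assuming scikit-learn's KMeans, whose inertia_ attribute is the SSE of a fitted model. The synthetic 2-D points stand in for the tf-idf matrix and are not the project's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points around three centres (assumption: a stand-in for real features).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

# Fit one model per candidate k and record its SSE (inertia_).
sse = {}
for k in range(1, 7):
    sse[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

# The elbow is where the drop between consecutive k values becomes small.
for k in range(2, 7):
    print(k, round(sse[k], 1), "drop:", round(sse[k - 1] - sse[k], 1))
```

Plotting sse against k (for example with matplotlib) gives the elbow curve; on this toy data the drop becomes small after k=3.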

Experiment 2: Now using k=30, as pointed to by the elbow method. The clusters formed are much more meaningful in this experiment. Observe this by printing the top 15 terms of both trained models, with k=8 and k=30 respectively. For comparison, also see the output screenshots of each clustering experiment.
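
One way to print the top terms of a trained model is to sort each cluster centroid by weight, sketched here with scikit-learn. The helper name top_terms_per_cluster and the toy documents are hypothetical and not part of the repository.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def top_terms_per_cluster(model, vectorizer, n_terms=15):
    """Return the n_terms highest-weighted terms of each cluster centroid."""
    # Invert the vocabulary so that a column index maps back to its term.
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    # Sort the columns of every centroid by descending weight.
    order = model.cluster_centers_.argsort()[:, ::-1]
    return [[terms[i] for i in row[:n_terms]] for row in order]


# Toy documents (assumption: invented for illustration).
docs = [
    "python pandas data analysis",
    "python numpy data analysis",
    "javascript react web frontend",
    "javascript angular web frontend",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for cluster_id, terms in enumerate(top_terms_per_cluster(model, vectorizer, n_terms=3)):
    print(cluster_id, terms)
```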

To Get Started

  1. First extract the stored model's zip file (finalized_model_k_8 or 30.zip), as the model is used by recommend_util.py.
  2. Extract the courses.csv file, as recommend_util.py loads its data from it.
  3. Simply execute 'python3 recommend_util.py'. It returns results for some pre-loaded queries that are already present in this file.
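
The idea behind a cluster-based recommendation can be sketched as follows: find the cluster of the query course and return the other courses in it. This is not the code of recommend_util.py; the course ids, descriptions and the recommend function are hypothetical, and the pickle round-trip only mimics saving and reloading a trained model.

```python
import pickle

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical course ids and descriptions; the real data lives in courses.csv.
courses = {
    "python-pandas-basics": "python pandas data analysis",
    "python-numpy-basics": "python numpy data analysis",
    "react-getting-started": "javascript react web frontend",
    "angular-getting-started": "javascript angular web frontend",
}
ids = list(courses)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(list(courses.values()))
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Round-trip through pickle to mimic storing and reloading the trained model.
model = pickle.loads(pickle.dumps(model))


def recommend(course_id):
    """Return the other courses that fall in the same cluster as course_id."""
    query_vec = vectorizer.transform([courses[course_id]])
    cluster = model.predict(query_vec)[0]
    return [cid for cid, label in zip(ids, model.labels_)
            if label == cluster and cid != course_id]


print(recommend("python-pandas-basics"))
```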

Training Model

  1. Run 'python3 model_train_k_8.py' or 'python3 model_train_k_30.py' to train the k-means model and store it.

Outputs

Observe the outputs for k=8 and k=30 for certain pre-defined courses. The model trained with k=30 visibly returns better recommendations than the model trained with k=8.

Clusterization Output For K = 8

Elbow Experiment Plot

Clusterization Output For K = 30

Elbow Experiment Plot

Limitations

These k-means models perform well for course categories that are well represented in the data and have a reasonable number of keywords associated with them. For categories that contain few courses, however, the recommendations (the correlated courses predicted from a cluster) are not good. This can be clearly seen from the example below: for the following machine learning course queries, the cluster results are nowhere near good recommendations. It highlights the fact that even after minimizing the loss via SSE, the clusters formed are not that accurate, as they correlate descriptions of courses that are not related to each other.

Machine learning Course Queries:

play-by-play-machine-learning-exposed
microsoft-cognitive-services-machine-learning
python-scikit-learn-building-machine-learning-models
pandas-data-wrangling-machine-learning-engineers
xgboost-python-scikit-learn-machine-learning

Cluster results for above mentioned queries

Machine learning queries for course recommendation

A Possible Hope: LDA (Latent Dirichlet Allocation) methods based on the BOW model and the TF-IDF model

LDA is a type of statistical model for clustering the abstract topics that occur in document collections. It classifies the text in a document into a particular topic. With that idea in mind, we aim to construct clusters around our courses.csv data.

Just run the 'python3 lda_train.py' command.

Every execution step and the information related to it gets printed, along with the topic analysis related to each word. The main highlights of this file are the elimination of extreme values from the corpus, the construction of an LDA model based on the TF-IDF model derived from the BOW corpus, and a standalone LDA model based on the BOW model.

Still, the predictions for the machine learning queries mentioned above are not good with this model either. Refer to the figure below to see the words predicted for an ML course query.
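
The BOW-based variant can be sketched roughly as below. Note the assumptions: this uses scikit-learn's CountVectorizer and LatentDirichletAllocation as a stand-in, which may not be the library lda_train.py relies on, and the documents and thresholds are invented. The min_df and max_df settings play the role of eliminating extreme (too rare or too common) terms.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (assumption: invented, not taken from courses.csv).
docs = [
    "python pandas data analysis course",
    "python numpy data analysis course",
    "javascript react web frontend course",
    "javascript angular web frontend course",
]

# Bag-of-words counts; drop terms seen in fewer than 2 documents
# or in more than 90% of the documents.
bow = CountVectorizer(min_df=2, max_df=0.9)
counts = bow.fit_transform(docs)

# Fit a two-topic LDA model and get one topic distribution per document.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(sorted(bow.vocabulary_))
print(doc_topics.round(2))
```

Here "course" is dropped for being in every document, while "pandas", "numpy", "react" and "angular" are dropped for appearing only once.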

Query result for 'Play by Play: Machine Learning Exposed'

Machine learning query from Play by Play series

The major governing term in the above result, 'play', is repeated often in the dataset and is associated with courses from multiple domains that are not related to machine learning. Hence, similar problems persist even after training such good models. Another idea is to pick only key phrases related to each domain to obtain better results, or to train neural networks for better predictions in combination with the domain-related phrases idea.

Even so, the results given by this model are slightly better than those of the earlier trained models. Hence, it is better to use lda_train.py to train and save your model, and to make the corresponding changes to construct a new recommend_util.py file.
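
A simple first step in that direction could be to exclude dominant but uninformative terms such as 'play' before vectorizing. The sketch below does this with a custom stop-word list in scikit-learn; it is one possible approach and not something the repository implements, and the second title is hypothetical.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

titles = [
    "Play by Play: Machine Learning Exposed",
    "Play by Play: Docker for Web Developers",  # hypothetical non-ML title
]

# Baseline: 'play' survives and is shared by both titles.
baseline = TfidfVectorizer(stop_words="english").fit(titles)

# Extend the built-in English stop words with a domain-specific term.
custom_stop_words = list(ENGLISH_STOP_WORDS.union({"play"}))
filtered = TfidfVectorizer(stop_words=custom_stop_words).fit(titles)

print(sorted(baseline.vocabulary_))
print(sorted(filtered.vocabulary_))
```

With 'play' removed, the two titles no longer share any term, so they are less likely to be grouped together on that word alone.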

Requirements

Make sure python (3.x), pandas, sklearn, pickle and numpy are present on your system for running this module. Note that pickle ships with Python's standard library.

Kudos !!
