API for manipulating time series on top of Apache Spark: lagged time values, rolling statistics (mean, avg, sum, count, etc), AS OF joins, downsampling, and interpolation

Stars: ✭ 212 (+583.87%)

Mutual labels: pandas, data-analysis

Igel

a delightful machine learning tool that allows you to train, test, and use models without writing code

Stars: ✭ 2,956 (+9435.48%)

Mutual labels: machine-learning-algorithms, data-analysis

Data-Science-Resources

A guide to getting started with Data Science and ML.

Stars: ✭ 17 (-45.16%)

Mutual labels: pandas, data-analysis

Data-Science-101

Notes and tutorials on how to use python, pandas, seaborn, numpy, matplotlib, scipy for data science.

Stars: ✭ 19 (-38.71%)

Mutual labels: pandas, data-analysis

Tf-Rec

Tf-Rec is a python💻 package for building⚒ Recommender Systems. It is built on top of Keras and Tensorflow 2 to utilize GPU Acceleration during training.

Stars: ✭ 18 (-41.94%)

Mutual labels: machine-learning-algorithms, recommender-system

PandasVersusExcel

Introduction to Python data analysis; getting started as a data analyst

Stars: ✭ 120 (+287.1%)

Mutual labels: pandas, data-analysis

Web-Development

Created this new Repository for Open Source Contribution for Beginners

Stars: ✭ 25 (-19.35%)

Mutual labels: up-for-grabs, beginner-friendly

PracticalMachineLearning

A collection of ML related stuff including notebooks, codes and a curated list of various useful resources such as books and softwares. Almost everything mentioned here is free (as speech not free food) or open-source.

Stars: ✭ 60 (+93.55%)

Mutual labels: pandas, data-analysis

Statistical-Learning-using-R

This is a Statistical Learning application consisting of various Machine Learning algorithms, my implementations of them in R, and their in-depth interpretation. Documents and reports related to the techniques mentioned below can be found on my Rpubs profile.

Stars: ✭ 27 (-12.9%)

Mutual labels: machine-learning-algorithms, clustering-algorithm

Genetic-Algorithm-on-K-Means-Clustering

Implementing Genetic Algorithm on K-Means and compare with K-Means++

Stars: ✭ 37 (+19.35%)

Mutual labels: k-means, clustering-algorithm

DataProfiler

What's in your data? Extract schema, statistics and entities from datasets

Stars: ✭ 843 (+2619.35%)

Mutual labels: pandas, data-analysis

Hdbscan

A high performance implementation of HDBSCAN clustering.

Stars: ✭ 2,032 (+6454.84%)

Mutual labels: machine-learning-algorithms, clustering-algorithm

Nmflibrary

MATLAB library for non-negative matrix factorization (NMF): Version 1.8.1

Stars: ✭ 153 (+393.55%)

Mutual labels: machine-learning-algorithms, data-analysis

Datscan

DatScan is an initiative to build an open-source CMS that will have the capability to solve any problem using data Analysis just with the help of various modules and a vast standardized module library

Stars: ✭ 13 (-58.06%)

Mutual labels: pandas, data-analysis


Online-Course-Recommendation-System

Built on data fetched from Pluralsight's course API. Refer to their API to use the most recent data.
Works with a model trained with the K-means unsupervised clustering algorithm on text data vectorized with the TF-IDF algorithm.
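
As a rough illustration of that pipeline, here is a minimal sketch using scikit-learn. The sample descriptions, the variable names and k=2 are assumptions chosen for a tiny example; they are not taken from the project's own scripts.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical course descriptions standing in for the text in courses.csv.
descriptions = [
    "build web pages with html css and javascript",
    "javascript and css fundamentals for web pages",
    "train machine learning models with python and scikit-learn",
    "python machine learning models for data analysis",
]

# Turn each description into a TF-IDF weighted term vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(descriptions)

# Group the vectors into k clusters; k=2 only because the sample is tiny.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

for label, text in zip(labels, descriptions):
    print(label, text)
```

Courses that end up with the same label are the candidates for recommending to each other.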

Architectural Diagram of Tool

Experiments

Experiment 1: Using k=8, since there are eight categories on Pluralsight. This is just a basic intuition to get started with.

The issue with this approach is that it results in a higher SSE than larger values of k, as shown in the figure below.

Elbow Experiment Plot

The Elbow/Knee method is a good visualization experiment for finding the optimum number of clusters. Ideally, this is the point after which the error stops decreasing sharply and the curve flattens.
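
The numbers behind such a plot can be produced with a short loop. The sketch below assumes scikit-learn and uses synthetic 2-D points instead of the course vectors, so the data and the range of k are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three well-separated groups (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for several k and record the SSE (scikit-learn calls it inertia_).
sse = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_

for k, err in sse.items():
    print(f"k={k}  SSE={err:.1f}")
```

On data like this the SSE drops sharply up to k=3 and changes little afterwards; that bend is the elbow.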

Experiment 2: Now using k=30, as pointed to by the elbow method. The clusters formed are much more meaningful in this experiment. Observe this by printing the top 15 terms of both trained models, with k=8 and k=30 respectively. For comparison, also see the output screenshots of each clustering experiment.

To Get Started

First, extract the stored model's zip file (finalized_model_k_8 or 30.zip), as it is used by the recommend_util.py file.
Extract the courses.csv file, as it is used by recommend_util.py for loading the data.
Then simply execute 'python3 recommend_util.py'. It returns results for some pre-loaded queries that are already included in this file.
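
The internals of recommend_util.py are not shown here. As a sketch of the general idea only, a cluster-based recommender can return the other courses that share the query's cluster. The function name, the course ids and the labels below are all hypothetical.

```python
def recommend(query, course_ids, labels, limit=5):
    """Return up to `limit` other courses that share the query's cluster."""
    cluster_of = dict(zip(course_ids, labels))
    if query not in cluster_of:
        raise KeyError(f"unknown course: {query}")
    target = cluster_of[query]
    same_cluster = [c for c in course_ids if c != query and cluster_of[c] == target]
    return same_cluster[:limit]


# Hypothetical course ids and cluster labels, as a fitted model might assign them.
course_ids = ["intro-html", "css-layouts", "python-basics", "js-essentials"]
labels = [0, 0, 1, 0]

print(recommend("intro-html", course_ids, labels))  # ['css-layouts', 'js-essentials']
```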

Training Model

Run 'python3 model_train_k_8.py' or 'python3 model_train_k_30.py' to train the k-means model and store it.
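
Training and storing can be sketched as follows, assuming scikit-learn and pickle (both listed under Requirements). The file name kmeans_model.pkl, the sample text and k=2 are placeholders, not the names or values used by the training scripts.

```python
import os
import pickle
import tempfile

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample text in place of the course descriptions.
docs = [
    "css html web design",
    "html css web layout",
    "python machine learning",
    "machine learning python models",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Store the vectorizer with the model: new text must be vectorized the same way.
path = os.path.join(tempfile.mkdtemp(), "kmeans_model.pkl")  # hypothetical file name
with open(path, "wb") as fh:
    pickle.dump({"vectorizer": vectorizer, "model": model}, fh)

# Later, load both back and assign a new query to a cluster.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

query = restored["vectorizer"].transform(["learning python"])
print(restored["model"].predict(query))
```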

Outputs

Observe the outputs for k = 8 and k = 30 for certain pre-defined courses. It is clearly visible that the model trained with k=30 returns better recommendations than the one trained with k=8.

Clusterization Output For K = 8

Clusterization Output For K = 30

Limitations

These k-means models perform well when predicting categories for courses that are well represented in the dataset and have a reasonable number of keywords associated with them. For courses that make up only a small share of all categories, however, the recommendations (the correlated courses associated with a cluster) are not good. This can be clearly seen in the example below: for the following machine learning course queries, the cluster results are nowhere near being good recommendations. It highlights the fact that even after minimizing the SSE loss, the clusters formed are not that accurate, as they correlate descriptions of courses that are not related to each other.

Machine learning Course Queries:

play-by-play-machine-learning-exposed
microsoft-cognitive-services-machine-learning
python-scikit-learn-building-machine-learning-models
pandas-data-wrangling-machine-learning-engineers
xgboost-python-scikit-learn-machine-learning

Cluster results for above mentioned queries

A Possible Hope: LDA (Latent Dirichlet Allocation) methods based on BOW model and TF-IDF model

LDA is a type of statistical modeling for discovering the abstract topics that occur in a collection of documents. It classifies the text in a document into a particular topic. With that idea in mind, we aim to construct clusters around our courses.csv data.

Just run the 'python3 lda_train.py' command.

Every execution step and the information related to it is printed, along with the topic analysis related to each word. The main highlights of this file are the elimination of extreme values from the corpus, the construction of an LDA model based on a TF-IDF model built from the BOW corpus, and a standalone LDA model based on the BOW model.

Still, the predictions for the machine learning queries mentioned above are not good with this model either. Refer to the figure below to see the words predicted for an ML course query.

Query result for 'Play by Play: Machine Learning Exposed'

The major governing term in the above result, 'play', is repeated often in the dataset and is associated with courses from multiple domains that are not related to machine learning. Hence, similar problems persist even after training such good models. Another idea is to pick up only domain-related key phrases to obtain better results, or to train neural networks for better predictions along with the domain-related phrases idea.
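
The key-phrase idea could look roughly like the sketch below, which keeps only phrases from a domain vocabulary and so discards generic words such as 'play'. The phrase list and the function name are hypothetical; a real vocabulary would have to be curated per subject area.

```python
# Hypothetical domain vocabulary; a real list would be curated per subject area.
DOMAIN_PHRASES = ["machine learning", "scikit-learn", "data wrangling", "xgboost"]


def extract_domain_phrases(text, phrases=DOMAIN_PHRASES):
    """Return the domain phrases that occur in text, in vocabulary order."""
    lowered = text.lower()
    return [p for p in phrases if p in lowered]


title = "Play by Play: Machine Learning Exposed"
print(extract_domain_phrases(title))  # ['machine learning']
```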

Even so, the results given by this model are slightly better than those of the earlier trained models. Hence, it is better to use lda_train.py to train and save your model, and then to adapt the recommend_util.py file accordingly.

Requirements

Make sure python (3.x), pandas, sklearn, pickle, and numpy are present on your system to run this module.

Kudos !!


ashishrana160796 / online-course-recommendation-system
