A collection of ML-related material including notebooks, code, and a curated list of useful resources such as books and software. Almost everything mentioned here is free (as in speech, not as in free food) or open source.

Stars: ✭ 60 (-30.23%)

Mutual labels: data-mining, scikit-learn

NIDS-Intrusion-Detection

Simple implementation of a network intrusion detection system, built on the KddCup'99 data set. kdd_cup_10_percent is used for training and the correct set is used for testing. PCA is used for dimensionality reduction. SVM and KNN are the supervised classification algorithms. Accuracy: 83.5% for SVM, 80% for KNN.

Stars: ✭ 45 (-47.67%)

Mutual labels: data-mining, svm

Rmdl

RMDL: Random Multimodel Deep Learning for Classification

Stars: ✭ 375 (+336.05%)

Mutual labels: data-mining, text-classification

Shallowlearn

An experiment in re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText), with some additional exclusive features and a nice API. Written in Python and fully compatible with scikit-learn.

Stars: ✭ 196 (+127.91%)

Mutual labels: text-classification, scikit-learn

Python Machine Learning Book

The "Python Machine Learning (1st edition)" book code repository and info resource

Stars: ✭ 11,428 (+13188.37%)

Mutual labels: data-mining, scikit-learn

Pyss3

A Python package implementing a new machine learning model for text classification with visualization tools for Explainable AI

Stars: ✭ 191 (+122.09%)

Mutual labels: data-mining, text-classification

Amazing Feature Engineering

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.

Stars: ✭ 218 (+153.49%)

Mutual labels: data-mining, scikit-learn

multiscorer

A module that allows multiple metric functions to be used in scikit-learn's cross_val_score

Stars: ✭ 21 (-75.58%)

Mutual labels: data-mining, scikit-learn

Algorithmic-Trading

Algorithmic trading using machine learning.

Stars: ✭ 102 (+18.6%)

Mutual labels: data-mining, scikit-learn

Kaggle-project-list

Summary of my projects on kaggle

Stars: ✭ 20 (-76.74%)

Mutual labels: data-mining, text-classification

Text Classification

Machine Learning and NLP: Text Classification using python, scikit-learn and NLTK

Stars: ✭ 239 (+177.91%)

Mutual labels: text-classification, scikit-learn

Text mining resources

Resources for learning about Text Mining and Natural Language Processing

Stars: ✭ 358 (+316.28%)

Mutual labels: data-mining, text-classification

kenchi

A scikit-learn compatible library for anomaly detection

Stars: ✭ 36 (-58.14%)

Mutual labels: data-mining, scikit-learn

Doc2vec

📓 Long(er) text representation and classification using Doc2Vec embeddings

Stars: ✭ 92 (+6.98%)

Mutual labels: text-classification, scikit-learn

Ml Projects

ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python

Stars: ✭ 127 (+47.67%)

Mutual labels: text-classification, svm

Model Describer

model-describer : Making machine learning interpretable to humans

Stars: ✭ 22 (-74.42%)

Mutual labels: data-mining, scikit-learn


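Sina News Text Classification

Corpus Construction

The corpus for this project comes from the Sina News site. The spider.py crawler module collects the whole corpus: 10 categories of news text, with 100,000 articles in each category.

News text is fetched through a Sina News API whose URL is http://api.roll.news.sina.com.cn/zt_list?
A process pool runs the crawler concurrently to speed up fetching.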
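
The process-pool idea can be sketched as follows. This is only an illustration, not the code in spider.py: the names `fetch` and `crawl`, the pool size, the timeout, and the UTF-8 decoding are all assumptions, and the query string of the API URL is deliberately not filled in because the README does not give it.

```python
from multiprocessing import Pool
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its body as text (UTF-8 is an assumption)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, worker=fetch, processes=4):
    """Apply `worker` to every URL, using a process pool when processes > 1.

    With processes <= 1 the work runs serially in the current process,
    which is convenient for debugging and for small batches.
    """
    urls = list(urls)
    if processes <= 1:
        return [worker(url) for url in urls]
    with Pool(processes=processes) as pool:
        # pool.map preserves input order, so results line up with `urls`.
        return pool.map(worker, urls)
```

The worker has to be a module-level function so that it can be pickled and sent to the child processes, and on platforms that start children with spawn the call to `crawl` belongs under an `if __name__ == "__main__":` guard.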

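Data Preprocessing

Preprocessing consists of word segmentation, denoising, and vectorization, implemented by the stopwords.py, text2term.py, and vectorizer.py modules.

Word segmentation is done with the third-party library jieba.
Chinese stop words are removed using a stop-word list, and numbers (both Chinese numerals and Arabic numerals) are removed with a regular expression.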

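import re  # Python 3 form of the original Python 2 ur'' pattern; re.ASCII keeps \w and \d ASCII-only, as they were in Python 2
filter_pattern = re.compile(r'[-+]?[\w\d]+|零|一|二|三|四|五|六|七|八|九|十|百|千|万|亿', re.ASCII)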
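
How the pattern is applied is not shown in the README, so the following is a sketch under stated assumptions: tokens have already been produced by a segmenter such as `jieba.lcut`, and a token is dropped when the pattern matches it in full. The function name `denoise` is hypothetical. The pattern is written for Python 3, where the `ur''` prefix is a syntax error and `\w` would match Chinese characters unless `re.ASCII` is set.

```python
import re

# ASCII alphanumeric runs (optionally signed) or a single Chinese numeral character.
# Note that [\w\d]+ also covers Latin-letter words, not only numbers.
FILTER_PATTERN = re.compile(
    r"[-+]?[\w\d]+|零|一|二|三|四|五|六|七|八|九|十|百|千|万|亿", re.ASCII
)


def denoise(tokens, stopwords):
    """Drop blank tokens, stop words, and tokens fully matched by FILTER_PATTERN."""
    kept = []
    for token in tokens:
        token = token.strip()
        if not token or token in stopwords:
            continue
        if FILTER_PATTERN.fullmatch(token):  # assumption: whole-token match
            continue
        kept.append(token)
    return kept
```

For example, `denoise(["今天", "的", "2019", "股市"], {"的"})` keeps only `"今天"` and `"股市"`. Using a whole-token match avoids stripping a numeral character out of the middle of an ordinary word, at the cost of keeping multi-character numerals such as 三十.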

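A process pool runs segmentation and denoising concurrently to speed up preprocessing.
The dataset is split 1:1 into a training set and a test set of 500,000 documents each.
scikit-learn's CountVectorizer class performs the vectorization, producing one feature matrix for the training set and one for the test set; both are sparse matrices.
Features with a document frequency below 0.1% are removed. We consider them too rare to be a characteristic feature of any one category, so dropping them reduces the dimensionality. The final feature matrix has about 19,543 dimensions.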
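
The split and vectorization steps above might look roughly like this with scikit-learn. It is a sketch, not the contents of vectorizer.py: the function name `vectorize`, the stratified split, the random seed, fitting the vocabulary on the training half only, and the whitespace token pattern are all assumptions. Documents are expected as strings of already segmented tokens joined by spaces.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


def vectorize(docs, labels, min_df=0.001, seed=0):
    """Split 1:1, then build sparse term-count matrices for both halves."""
    train_docs, test_docs, y_train, y_test = train_test_split(
        docs, labels, test_size=0.5, random_state=seed, stratify=labels
    )
    # token_pattern=r"(?u)\S+" keeps single-character tokens, which the
    # default pattern would drop; min_df=0.001 removes terms that occur
    # in fewer than 0.1% of the training documents.
    vectorizer = CountVectorizer(token_pattern=r"(?u)\S+", min_df=min_df)
    X_train = vectorizer.fit_transform(train_docs)
    X_test = vectorizer.transform(test_docs)
    return vectorizer, X_train, X_test, y_train, y_test
```

Passing `min_df` as a float makes CountVectorizer treat it as a proportion of documents, which matches the 0.1% threshold; an integer would be read as an absolute document count.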

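Naive Bayes Classification

This project uses Naive Bayes as the text classification baseline, implemented by the baseline.py module.

Smoothing
Handling zero probabilities
Final classification results:
Recall: highest 0.95 | lowest 0.46 | average 0.79
Precision: highest 0.96 | lowest 0.55 | average 0.78
F1 score: highest 0.93 | lowest 0.50 | average 0.79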
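
To show what smoothing and zero-probability handling mean here, the following is a minimal multinomial Naive Bayes with additive (Laplace) smoothing. It is an illustrative sketch and not necessarily how baseline.py works; the class name `SmoothedNaiveBayes`, the default `alpha=1.0`, and the choice to skip terms outside the training vocabulary are assumptions.

```python
import math
from collections import Counter, defaultdict


class SmoothedNaiveBayes:
    """Multinomial Naive Bayes over token lists, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.vocab = sorted({term for doc in docs for term in doc})
        class_sizes = Counter(labels)
        term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            term_counts[label].update(doc)
        self.log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
        self.log_likelihood = {}
        for c in class_sizes:
            # Adding alpha to every count means a term never seen in class c
            # still gets a small non-zero probability instead of zero.
            total = sum(term_counts[c].values()) + self.alpha * len(self.vocab)
            self.log_likelihood[c] = {
                t: math.log((term_counts[c][t] + self.alpha) / total)
                for t in self.vocab
            }
        return self

    def predict(self, doc):
        scores = {}
        for c, prior in self.log_prior.items():
            table = self.log_likelihood[c]
            # Sum log-probabilities to avoid underflow; unknown terms are skipped.
            scores[c] = prior + sum(table[t] for t in doc if t in table)
        return max(scores, key=scores.get)
```

Without smoothing, a single term that never occurred in a class during training would drive that class's probability to zero no matter what the rest of the document says.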

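SVM Classification

This project uses an SVM as the final text classifier, implemented by the svm.py module. The SVM uses a linear kernel, and the term counts in the feature matrix are weighted before training.

TfidfTransformer applies TF-IDF weighting to the term counts.
LinearSVC provides the linear-kernel SVM.
Parameters are tuned with 5-fold cross-validation and grid search via GridSearchCV.
Final classification results:
Recall: highest 0.99 | lowest 0.77 | average 0.90
Precision: highest 0.98 | lowest 0.77 | average 0.90
F1 score: highest 0.99 | lowest 0.77 | average 0.90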
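
The three pieces named above fit together naturally as a scikit-learn pipeline. The sketch below is one plausible arrangement, not the contents of svm.py: the function name `tune_linear_svm`, the grid of `C` values, and the macro-F1 scoring are assumptions, since the README does not say which parameters were searched or how candidates were scored.

```python
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def tune_linear_svm(X_train, y_train, c_grid=(0.01, 0.1, 1.0, 10.0)):
    """Grid-search LinearSVC's C on TF-IDF weighted counts with 5-fold CV."""
    pipeline = Pipeline([
        ("tfidf", TfidfTransformer()),  # re-weights raw term counts
        ("svm", LinearSVC()),           # linear-kernel SVM
    ])
    search = GridSearchCV(
        pipeline,
        param_grid={"svm__C": list(c_grid)},  # hypothetical grid
        cv=5,
        scoring="f1_macro",                   # assumed scoring metric
    )
    search.fit(X_train, y_train)
    return search
```

Keeping the TF-IDF step inside the pipeline means the IDF weights are re-estimated on the training part of each fold, so the held-out part never influences the weighting. After fitting, `search.best_params_` holds the selected `C` and `search.predict` uses the refitted best model.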

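Visualization

The classification performance of the SVM classifier and the Naive Bayes classifier is compared by visualizing their predictions, implemented by the viewer.py module.

Confusion matrix heatmap

Performance comparison histogram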
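
The numbers behind both figures can be derived from the true and predicted labels alone. The following sketch computes a confusion matrix and the per-class precision, recall, and F1 scores, then summarizes them as highest, lowest, and average, the same shape as the results reported above. The function names are hypothetical, the plotting itself is left out, and treating "average" as the unweighted mean over classes is an assumption.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def per_class_scores(matrix):
    """Precision, recall, and F1 for each class of a square confusion matrix."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        actual = sum(matrix[i])                    # row sum: true members
        predicted = sum(row[i] for row in matrix)  # column sum: predictions
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append({"precision": precision, "recall": recall, "f1": f1})
    return scores


def summarize(scores, key):
    """Return (highest, lowest, unweighted average) of one metric."""
    values = [s[key] for s in scores]
    return max(values), min(values), sum(values) / len(values)
```

The matrix returned by `confusion_matrix` is the input a heatmap would be drawn from, and the per-class scores of the two classifiers are what a side-by-side bar chart would compare.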


qyfang / TextClassification
