qyfang / TextClassification

Licence: other
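Text classification of Sina News articles, implemented with scikit-learn. The dataset contains 1,000,000 documents in 10 categories, split 1:1 into a test set and a training set. The classification algorithms are SVM and Bayes, with Bayes serving as the baseline.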

Programming Languages

python

Projects that are alternatives of or similar to TextClassification

text-classification-cn
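Chinese text classification in practice, based on the Sogou news corpus, using traditional machine learning methods as well as pretrained models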
Stars: ✭ 81 (-5.81%)
Mutual labels:  text-classification, svm, scikit-learn
Artificial Adversary
🗣️ Tool to generate adversarial text examples and test machine learning models against them
Stars: ✭ 348 (+304.65%)
Mutual labels:  data-mining, text-classification
2018 Dc Datagrand Textintelprocess
2018-DC-"达观杯" (Datagrand Cup) Text Intelligent Processing Challenge: champion (1st/3131)
Stars: ✭ 260 (+202.33%)
Mutual labels:  data-mining, text-classification
Sktime
A unified framework for machine learning with time series
Stars: ✭ 4,741 (+5412.79%)
Mutual labels:  data-mining, scikit-learn
PracticalMachineLearning
A collection of ML related stuff including notebooks, codes and a curated list of various useful resources such as books and softwares. Almost everything mentioned here is free (as speech not free food) or open-source.
Stars: ✭ 60 (-30.23%)
Mutual labels:  data-mining, scikit-learn
NIDS-Intrusion-Detection
Simple Implementation of Network Intrusion Detection System. KddCup'99 Data set is used for this project. kdd_cup_10_percent is used for training test. correct set is used for test. PCA is used for dimension reduction. SVM and KNN supervised algorithms are the classification algorithms of project. Accuracy : %83.5 For SVM , %80 For KNN
Stars: ✭ 45 (-47.67%)
Mutual labels:  data-mining, svm
Rmdl
RMDL: Random Multimodel Deep Learning for Classification
Stars: ✭ 375 (+336.05%)
Mutual labels:  data-mining, text-classification
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+127.91%)
Mutual labels:  text-classification, scikit-learn
Python Machine Learning Book
The "Python Machine Learning (1st edition)" book code repository and info resource
Stars: ✭ 11,428 (+13188.37%)
Mutual labels:  data-mining, scikit-learn
Pyss3
A Python package implementing a new machine learning model for text classification with visualization tools for Explainable AI
Stars: ✭ 191 (+122.09%)
Mutual labels:  data-mining, text-classification
Amazing Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.
Stars: ✭ 218 (+153.49%)
Mutual labels:  data-mining, scikit-learn
multiscorer
A module for allowing the use of multiple metric functions in scikit's cross_val_score
Stars: ✭ 21 (-75.58%)
Mutual labels:  data-mining, scikit-learn
Algorithmic-Trading
Algorithmic trading using machine learning.
Stars: ✭ 102 (+18.6%)
Mutual labels:  data-mining, scikit-learn
Kaggle-project-list
Summary of my projects on kaggle
Stars: ✭ 20 (-76.74%)
Mutual labels:  data-mining, text-classification
Text Classification
Machine Learning and NLP: Text Classification using python, scikit-learn and NLTK
Stars: ✭ 239 (+177.91%)
Mutual labels:  text-classification, scikit-learn
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (+316.28%)
Mutual labels:  data-mining, text-classification
kenchi
A scikit-learn compatible library for anomaly detection
Stars: ✭ 36 (-58.14%)
Mutual labels:  data-mining, scikit-learn
Doc2vec
📓 Long(er) text representation and classification using Doc2Vec embeddings
Stars: ✭ 92 (+6.98%)
Mutual labels:  text-classification, scikit-learn
Ml Projects
ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python
Stars: ✭ 127 (+47.67%)
Mutual labels:  text-classification, svm
Model Describer
model-describer : Making machine learning interpretable to humans
Stars: ✭ 22 (-74.42%)
Mutual labels:  data-mining, scikit-learn

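Sina News Text Classification

Corpus Construction

The corpus for this project comes from the Sina News website. All of it was collected by the crawler module spider.py, which yielded 10 categories of news text with 100,000 articles in each category.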

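Data Preprocessing

Preprocessing in this project consists of word segmentation, denoising, and vectorization, implemented by the stopwords.py, text2term.py, and vectorizer.py modules.

  • Word segmentation is done with the third-party library jieba.

  • Chinese stop words are removed using a stop-word list, and numbers (both Chinese numerals and Arabic numerals) are removed with a regular expression.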

filter_pattern = re.compile(ur'[-+]?[\w\d]+|零|一|二|三|四|五|六|七|八|九|十|百|千|万|亿')
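  • A process pool runs segmentation and denoising concurrently to speed up preprocessing.

  • The dataset is split 1:1 into a training set and a test set of 500,000 documents each.

  • Vectorization uses the CountVectorizer class provided by scikit-learn, producing one feature matrix for the training set and one for the test set. Both are sparse matrices.

  • Features with a document frequency below 0.1% are removed. We consider these features too rare to be a local feature of any category of documents. Removing them reduces dimensionality, and the final feature matrix has roughly 19,543 dimensions.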
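
The denoising and vectorization steps above could be sketched roughly as follows. This is a minimal Python 3 sketch under stated assumptions, not the project's code: the stop-word set, the toy documents, and the helper names `clean_tokens` and `identity_analyzer` are hypothetical, the documents are assumed to be already segmented (for example by jieba), and how the project applies its pattern is not shown, so here matches are stripped from each token and tokens that become empty are dropped. The `re.ASCII` flag is added on the assumption that `\w` should match only ASCII letters and digits. The toy corpus uses `min_df=0.5` so that the filtering is visible; the 0.1% threshold described above would correspond to `min_df=0.001`.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Pattern adapted from the README. re.ASCII is an assumption: it stops \w
# from matching Chinese characters under Python 3.
FILTER_PATTERN = re.compile(
    r'[-+]?[\w\d]+|零|一|二|三|四|五|六|七|八|九|十|百|千|万|亿', re.ASCII
)

# Hypothetical stop words; the project loads a full stop-word list instead.
STOPWORDS = {"的", "了", "在"}


def clean_tokens(tokens):
    """Drop stop words, strip pattern matches, and discard emptied tokens."""
    cleaned = []
    for token in tokens:
        if token in STOPWORDS:
            continue
        token = FILTER_PATTERN.sub("", token)
        if token:
            cleaned.append(token)
    return cleaned


def identity_analyzer(tokens):
    """Documents are already token lists, so the analyzer returns them as-is."""
    return tokens


# Toy pre-segmented documents standing in for the real corpus.
train_docs = [
    ["体育", "的", "比赛", "冠军", "2019"],
    ["体育", "比赛", "球队", "了"],
    ["财经", "股票", "市场", "三"],
    ["财经", "在", "股票", "体育"],
]
test_docs = [["体育", "冠军", "8"], ["股票", "市场", "的"]]

train_clean = [clean_tokens(doc) for doc in train_docs]
test_clean = [clean_tokens(doc) for doc in test_docs]

# min_df as a float is a proportion of documents; terms below it are dropped.
vectorizer = CountVectorizer(analyzer=identity_analyzer, min_df=0.5)
X_train = vectorizer.fit_transform(train_clean)  # sparse term-count matrix
X_test = vectorizer.transform(test_clean)        # reuses the training vocabulary
```

The vocabulary is fitted on the training documents only and then reused for the test documents, so both matrices share the same columns.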

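Naive Bayes Classification

This project uses Naive Bayes as the baseline for text classification, implemented by the baseline.py module.

  • Smoothing

  • Handling zero probabilities

  • Final classification results:

    Highest recall: 0.95 | Lowest recall: 0.46 | Average recall: 0.79
    Highest precision: 0.96 | Lowest precision: 0.55 | Average precision: 0.78
    Highest F1 score: 0.93 | Lowest F1 score: 0.50 | Average F1 score: 0.79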
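
To illustrate why smoothing matters for the zero-probability problem, here is a small sketch. It assumes scikit-learn's `MultinomialNB` with Laplace smoothing (`alpha=1.0`); the README does not say which Naive Bayes variant or smoothing value baseline.py uses, and the count matrix below is made up.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy term-count matrix: rows are documents, columns are vocabulary terms.
# Class 0 never uses terms 2 and 3; class 1 never uses terms 0 and 1.
X_train = np.array([
    [3, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
])
y_train = np.array([0, 0, 1, 1])

# alpha=1.0 adds one pseudo-count per term and class (Laplace smoothing),
# so a term unseen in a class gets a small nonzero probability.
clf = MultinomialNB(alpha=1.0)
clf.fit(X_train, y_train)

# This document mixes terms from both classes. Without smoothing, each
# class would assign probability zero to at least one of its terms.
X_new = np.array([[2, 1, 0, 1]])
pred = clf.predict(X_new)
proba = clf.predict_proba(X_new)
```

With smoothing, every entry of `clf.feature_log_prob_` is finite, so both classes receive a nonzero posterior and the prediction follows the terms that dominate the document.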

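SVM Classification

This project uses an SVM as the final text classifier, implemented by the svm.py module. The SVM uses a linear kernel, and the term counts in the feature matrix are weighted before training.

  • Term frequencies are weighted with TF-IDF using TfidfTransformer

  • LinearSVC is used for the linear kernel

  • Hyperparameters are tuned with 5-fold cross-validation and grid search via GridSearchCV

  • Final classification results:

    Highest recall: 0.99 | Lowest recall: 0.77 | Average recall: 0.90
    Highest precision: 0.98 | Lowest precision: 0.77 | Average precision: 0.90
    Highest F1 score: 0.99 | Lowest F1 score: 0.77 | Average F1 score: 0.90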
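
The three steps above fit naturally into a scikit-learn pipeline. The following is a minimal sketch on synthetic counts, not the contents of svm.py: the parameter grid over `C`, the `f1_macro` scoring, and the generated data are all assumptions, since the README does not state which hyperparameters were searched.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic term counts: class 0 mostly uses the first five terms,
# class 1 mostly uses the last five.
rng = np.random.default_rng(0)
n_per_class, n_terms = 20, 10
X0 = rng.poisson(lam=[3] * 5 + [0.2] * 5, size=(n_per_class, n_terms))
X1 = rng.poisson(lam=[0.2] * 5 + [3] * 5, size=(n_per_class, n_terms))
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# TF-IDF weighting followed by a linear SVM.
pipeline = Pipeline([
    ("tfidf", TfidfTransformer()),
    ("svc", LinearSVC()),
])

# Hypothetical grid; cv=5 gives 5-fold cross-validation for each candidate.
param_grid = {"svc__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
search.fit(X, y)

best_C = search.best_params_["svc__C"]
best_score = search.best_score_
```

Putting the TF-IDF step inside the pipeline means the IDF weights are re-estimated on each training fold, so the validation folds do not leak into the weighting.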

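Visualization

The classification performance of the SVM classifier and the Bayes classifier is compared by visualizing their predictions, implemented by the viewer.py module.

Confusion matrix heatmap

Performance comparison histogram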
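
The numbers behind both figures can be derived from a confusion matrix. The sketch below shows one way to compute per-class recall, precision, and F1 and summarize them as highest, lowest, and average values, in the style of the results reported above. It is an illustration with made-up labels, not the code in viewer.py; the helper name `summarize` is hypothetical, and treating the reported averages as unweighted per-class means is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)


def summarize(cm):
    """Return max/min/mean of per-class recall, precision, and F1.

    Assumes every class appears at least once among both the true and the
    predicted labels, so no denominator is zero.
    """
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    recall = true_pos / cm.sum(axis=1)     # correct / all truly in class
    precision = true_pos / cm.sum(axis=0)  # correct / all predicted as class
    f1 = 2 * precision * recall / (precision + recall)
    return {
        name: {"max": values.max(), "min": values.min(), "mean": values.mean()}
        for name, values in (("recall", recall),
                             ("precision", precision),
                             ("f1", f1))
    }


summary = summarize(cm)
```

The matrix `cm` is what a heatmap would display, and the per-classifier summaries are what a side-by-side bar chart would compare.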
