
ML-based projects such as spam classification, time series analysis, and text classification, using random forest, deep learning, Bayesian methods, and XGBoost in Python

Stars: ✭ 127 (+605.56%)

Mutual labels: text-classification, svm, random-forest

handson-ml

Jupyter notebooks containing the examples and exercises from the book "Hands-On Machine Learning".

Stars: ✭ 285 (+1483.33%)

Mutual labels: random-forest, svm, xgboost

Amazon-Fine-Food-Review

Machine learning algorithms such as KNN, Naive Bayes, Logistic Regression, SVM, Decision Trees, Random Forest, k-means, and Truncated SVD applied to the Amazon Fine Food Reviews dataset

Stars: ✭ 28 (+55.56%)

Mutual labels: random-forest, svm, logistic-regression

HumanOrRobot

A solution for the Kaggle competition `Human or Robot`

Stars: ✭ 16 (-11.11%)

Mutual labels: xgboost, lightgbm

Arch-Data-Science

Archlinux PKGBUILDs for Data Science, Machine Learning, Deep Learning, NLP and Computer Vision

Stars: ✭ 92 (+411.11%)

Mutual labels: xgboost, lightgbm

Hyperparameter hunter

Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries

Stars: ✭ 648 (+3500%)

Mutual labels: xgboost, lightgbm

HyperGBM

A full pipeline AutoML tool for tabular data

Stars: ✭ 172 (+855.56%)

Mutual labels: xgboost, lightgbm


Text-Classification-Benchmark

A benchmark for text classification

Classifiers tested

Naive Bayes
Logistic regression
Linear SVM
Non-linear SVM (RBF)
Random forest
XGBoost
LightGBM

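Corpus

File name: FDU_NLP_corpus_seg_balanced.txt

Description: news articles, literature, and similar texts from various domains (Chinese)

Format: already word-segmented, one document per line, in the following form: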

{category}@{text}

{category}@{text}

...
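As a minimal sketch of reading this one-document-per-line format, the snippet below splits each line at the first `@` into a category and a text. The function name `parse_corpus`, the sample lines, and the UTF-8 encoding mentioned in the comment are assumptions made for illustration; they are not taken from the project.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_corpus(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split 'category@text' lines into parallel label and text lists."""
    labels, texts = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split only at the first '@' so any '@' inside the text is kept.
        label, sep, text = line.partition("@")
        if not sep:
            raise ValueError(f"missing '@' separator: {line[:30]!r}")
        labels.append(label)
        texts.append(text)
    return labels, texts


# Hypothetical sample lines. The real file could be read with
# open("FDU_NLP_corpus_seg_balanced.txt", encoding="utf-8"), assuming UTF-8.
sample = [
    "Sports@足球 比赛 今晚 举行",
    "Computer@操作系统 内核 调度 算法",
    "Sports@篮球 联赛 球队 夺冠",
]
labels, texts = parse_corpus(sample)
label_counts = Counter(labels)
```

Counting the labels this way on the full file is a quick check of the balance claim: if the 4050 documents are spread evenly over 9 categories, each category should hold 450.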

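Size: 4050 documents in total (balanced corpus)

Categories: 9 in total: Art, Enviornment, Space, Sports, Computer, Politics, Economy, Agriculture, History.

Source: Natural Language Processing Group, International Database Center, Department of Computer Information and Technology, Fudan University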

Feature processing

Feature selection with the chi-square test: 1000 feature words are selected as features.
Feature extraction (vectorization) with TF-IDF.
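The two steps above can be sketched with scikit-learn as a small pipeline: count the whitespace-separated tokens, keep the top-k terms by chi-square score, then apply TF-IDF weighting. This is an assumption about how the steps might be wired together, not the project's actual code. The helper name `build_features`, the toy documents, and `k=4` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline


def build_features(k: int = 1000):
    """Term counts -> chi-square selection of k terms -> TF-IDF weighting."""
    return make_pipeline(
        # The corpus is already segmented, so splitting on whitespace is enough.
        CountVectorizer(analyzer=str.split),
        SelectKBest(chi2, k=k),
        TfidfTransformer(),
    )


# Toy data (hypothetical); k must not exceed the vocabulary size.
texts = ["足球 比赛 球队", "篮球 比赛 球队", "内核 调度 算法", "编译器 算法 内核"]
labels = ["Sports", "Sports", "Computer", "Computer"]

features = build_features(k=4)
X = features.fit_transform(texts, labels)  # sparse matrix, one row per document
```

With the real corpus, `k=1000` would match the number of feature words stated above. Because the chi-square step uses the labels, fitting it inside each cross-validation fold (for example by appending the classifier to the same pipeline) avoids leaking test labels into feature selection.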

Benchmark

5-fold cross-validation is run using the default parameters of the models as shipped with scikit-learn.
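A benchmark loop of this kind might look like the sketch below. It uses scikit-learn's `cross_validate`, whose result keys (`fit_time`, `score_time`, `test_precision_micro`, and so on) match the column names in the result tables. The synthetic data, the `benchmark` helper, and the choice of `MultinomialNB` for the Naive Bayes entry are assumptions. XGBoost and LightGBM are left out so the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC

SCORING = ("precision_micro", "recall_micro", "f1_micro")


def benchmark(classifiers, X, y, cv=5):
    """Cross-validate each classifier and average every reported metric."""
    rows = {}
    for name, clf in classifiers.items():
        result = cross_validate(clf, X, y, cv=cv, scoring=SCORING)
        # Each value is an array with one entry per fold; times are in seconds.
        rows[name] = {key: round(float(np.mean(vals)), 3) for key, vals in result.items()}
    return rows


# Synthetic, non-negative stand-in for TF-IDF features (hypothetical data):
# 3 classes with 30 samples each, where class c gets a boost on feature c.
rng = np.random.RandomState(0)
y = np.repeat(np.arange(3), 30)
X = rng.rand(90, 20)
X[np.arange(90), y] += 2.0

classifiers = {
    "NB": MultinomialNB(),
    "LR": LogisticRegression(),
    "L-SVM": LinearSVC(),
    "RBF-SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(),
}
rows = benchmark(classifiers, X, y)
```

Every estimator is constructed with no arguments, which is one reading of "default parameters". Estimators with a scikit-learn compatible interface, such as `xgboost.XGBClassifier` or `lightgbm.LGBMClassifier`, could be added to the dictionary in the same way.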

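Reference results

Different algorithms respond differently to hyperparameter tuning, so the results below are for reference only:

Based on raw TF-IDF features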

+------------+----------+------------+------------------------+--------------------+---------------+
| classifier | fit_time | score_time |  test_precision_micro  | test_recall_micro  | test_f1_micro |
+------------+----------+------------+------------------------+--------------------+---------------+
|     NB     |  0.008   |   0.005    |         0.865          |       0.865        |     0.865     |
+------------+----------+------------+------------------------+--------------------+---------------+
|     LR     |  0.312   |   0.004    |         0.903          |       0.903        |     0.903     |
+------------+----------+------------+------------------------+--------------------+---------------+
|   L-SVM    |  0.124   |   0.004    |          0.91          |        0.91        |     0.91      |
+------------+----------+------------+------------------------+--------------------+---------------+
|  RBF-SVM   |  14.824  |   6.469    |         0.825          |       0.825        |     0.825     |
+------------+----------+------------+------------------------+--------------------+---------------+
|     RF     |  3.277   |   0.092    |         0.922          |       0.922        |     0.922     |
+------------+----------+------------+------------------------+--------------------+---------------+
|    XGB     |  32.498  |   0.169    |         0.938          |       0.938        |     0.938     |
+------------+----------+------------+------------------------+--------------------+---------------+
|    LGBM    |  37.79   |   0.162    |         0.942          |       0.942        |     0.942     |
+------------+----------+------------+------------------------+--------------------+---------------+
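The three score columns are identical in every row. For single-label multiclass classification this is expected under micro averaging: each misclassified document counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so micro precision, recall, and F1 all equal plain accuracy. The sketch below checks this on a small hypothetical example; `micro_scores` is an illustrative helper, not part of the project.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall and F1 for single-label predictions."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # correct prediction of this class
            elif pred == label:
                fp += 1  # predicted this class, but truth differs
            elif truth == label:
                fn += 1  # truth is this class, but prediction differs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: 4 of the 6 predictions are correct.
y_true = ["Art", "Art", "Space", "Sports", "Sports", "Space"]
y_pred = ["Art", "Space", "Space", "Sports", "Art", "Space"]
precision, recall, f1 = micro_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```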

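Based on standardized TF-IDF features (mean retained)

Note: no centering is involved, so the sparsity of the original feature matrix is preserved.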

StandardScaler(with_mean=False, with_std=True)
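A small sketch of why this setting keeps the matrix sparse: with `with_mean=False` each column is only divided by its standard deviation, so zero entries stay zero. The 4x3 matrix below is a hypothetical stand-in for the TF-IDF features.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical sparse, non-negative feature matrix (4 documents, 3 terms).
X = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.0],
    [0.2, 0.0, 0.0],
    [0.0, 0.0, 0.9],
    [0.4, 0.5, 0.0],
]))

scaler = StandardScaler(with_mean=False, with_std=True)
X_scaled = scaler.fit_transform(X)

still_sparse = sparse.issparse(X_scaled)   # output remains a sparse matrix
same_nonzeros = X_scaled.nnz == X.nnz      # no new non-zero entries appear
non_negative = X_scaled.min() >= 0         # values remain non-negative
```

Because the values stay non-negative, a Naive Bayes model that requires non-negative input can still be trained on the scaled features.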

+------------+----------+------------+------------------------+--------------------+---------------+
| classifier | fit_time | score_time |  test_precision_micro  | test_recall_micro  | test_f1_micro |
+------------+----------+------------+------------------------+--------------------+---------------+
|     NB     |  0.022   |   0.008    |          0.86          |        0.86        |     0.86      |
+------------+----------+------------+------------------------+--------------------+---------------+
|     LR     |  1.154   |   0.006    |         0.894          |       0.894        |     0.894     |
+------------+----------+------------+------------------------+--------------------+---------------+
|   L-SVM    |  1.107   |   0.006    |         0.875          |       0.875        |     0.875     |
+------------+----------+------------+------------------------+--------------------+---------------+
|  RBF-SVM   |  10.972  |    6.79    |         0.896          |       0.896        |     0.896     |
+------------+----------+------------+------------------------+--------------------+---------------+
|     RF     |  1.997   |   0.073    |         0.921          |       0.921        |     0.921     |
+------------+----------+------------+------------------------+--------------------+---------------+
|    XGB     |  75.364  |   0.097    |         0.937          |       0.937        |     0.937     |
+------------+----------+------------+------------------------+--------------------+---------------+
|    LGBM    |  45.986  |   0.182    |         0.942          |       0.942        |     0.942     |
+------------+----------+------------+------------------------+--------------------+---------------+

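Based on standardized TF-IDF features

Note: centering is involved, so the sparsity of the original feature matrix is lost and the result is effectively a dense matrix. Centering introduces negative feature values, so Naive Bayes is not tested.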

StandardScaler(with_mean=True, with_std=True)
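The effect of centering can be illustrated with the sketch below. In scikit-learn, `StandardScaler` refuses to center a sparse matrix, so the data has to be converted to a dense array first. The centered result contains negative values, which `MultinomialNB` rejects. The toy matrix and the use of `MultinomialNB` as the Naive Bayes variant are assumptions for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import StandardScaler

# Hypothetical sparse feature matrix and labels.
X_sparse = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.0],
    [0.2, 0.0, 0.0],
    [0.0, 0.0, 0.9],
    [0.4, 0.5, 0.0],
]))
y = np.array([0, 1, 0, 1])

# Centering a sparse matrix directly raises ValueError.
try:
    StandardScaler(with_mean=True, with_std=True).fit_transform(X_sparse)
    sparse_centering_failed = False
except ValueError:
    sparse_centering_failed = True

# Centering works on a dense copy, but the zeros are gone afterwards.
X_dense = StandardScaler(with_mean=True, with_std=True).fit_transform(X_sparse.toarray())
has_negatives = bool((X_dense < 0).any())
zero_fraction = float(np.mean(X_dense == 0))

# MultinomialNB requires non-negative features, so fitting fails.
try:
    MultinomialNB().fit(X_dense, y)
    nb_rejected = False
except ValueError:
    nb_rejected = True
```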

+------------+----------+------------+------------------------+--------------------+---------------+
| classifier | fit_time | score_time |  test_precision_micro  | test_recall_micro  | test_f1_micro |
+------------+----------+------------+------------------------+--------------------+---------------+
|     LR     |  10.084  |   0.006    |         0.888          |       0.888        |     0.888     |
+------------+----------+------------+------------------------+--------------------+---------------+
|   L-SVM    |  17.493  |   0.006    |         0.867          |       0.867        |     0.867     |
+------------+----------+------------+------------------------+--------------------+---------------+
|  RBF-SVM   |  9.889   |   6.029    |         0.896          |       0.896        |     0.896     |
+------------+----------+------------+------------------------+--------------------+---------------+
|     RF     |  1.897   |   0.074    |         0.921          |       0.921        |     0.921     |
+------------+----------+------------+------------------------+--------------------+---------------+
|    XGB     |  75.652  |   0.102    |         0.937          |       0.937        |     0.937     |
+------------+----------+------------+------------------------+--------------------+---------------+
|    LGBM    |  49.342  |   0.169    |         0.944          |       0.944        |     0.944     |
+------------+----------+------------+------------------------+--------------------+---------------+

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

