yassersouri / Classify Text

"20 Newsgroups" text classification with python

Salam

Text Classification with python

This is an experiment. We want to classify text with python.

Dataset

For the dataset I used the famous "Twenty Newsgroups" dataset. You can find the dataset freely here.

I've included a subset of the dataset in the repo, located in the dataset\ directory. This subset includes 6 of the 20 newsgroups: space, electronics, crypt, hockey, motorcycles and forsale.

When you run main.py it asks you for the root of the dataset. You can supply your own dataset assuming it has a similar directory structure.
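A minimal Python 3 sketch of reading a dataset root, assuming one subdirectory per newsgroup with one file per post. That layout and the function name `list_labeled_files` are assumptions for illustration; the loader in main.py may work differently.

```python
import os

def list_labeled_files(root):
    """Return (path, label) pairs, using each subdirectory name as the label.

    Assumes root/<newsgroup>/<post file>; this layout is an assumption.
    """
    pairs = []
    for label in sorted(os.listdir(root)):
        group_dir = os.path.join(root, label)
        if not os.path.isdir(group_dir):
            continue  # skip stray files at the top level
        for name in sorted(os.listdir(group_dir)):
            path = os.path.join(group_dir, name)
            if os.path.isfile(path):
                pairs.append((path, label))
    return pairs
```

Calling `list_labeled_files("dataset")` on a layout like this would give one entry per post, labeled with the name of its newsgroup directory.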

UTF-8 incompatibility

Some of the supplied text files are not valid UTF-8!

Even TextEdit.app can't open those files, and they caused problems in the code. So I delete them as part of the preprocessing.
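One way to find such files is to try decoding every file as UTF-8 and collect the ones that fail. This is a minimal Python 3 sketch, not the repo's preprocessing code, and the function name `find_non_utf8_files` is hypothetical.

```python
import os

def find_non_utf8_files(root):
    """Return the paths under root whose bytes are not valid UTF-8."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                data = handle.read()
            try:
                data.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path)
    return sorted(bad)
```

The sketch only reports the files. Each returned path could then be passed to `os.remove` once you have checked the list.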

Requirements

python 2.7
python modules:
- scikit-learn (v 0.11)
- scipy (v 0.10.1)
- colorama
- termcolor
- matplotlib (for use in plot.py)

The code

The code is pretty straightforward and well documented.

Running the code

python main.py

Experiments

For experiments I used the subset of the dataset (as described above). I assume that we like hockey, crypt and electronics newsgroups, and we dislike the others.
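The likes/dislikes setup turns the 6 newsgroups into a 2-class problem. A minimal sketch of that mapping, using the short newsgroup names from this README (the directory names in your copy of the dataset may be longer, and `to_binary_label` is a hypothetical helper):

```python
# Newsgroups treated as "likes", as stated in the text above.
LIKED_GROUPS = {"hockey", "crypt", "electronics"}

def to_binary_label(newsgroup):
    """Map a newsgroup name to 'likes' or 'dislikes'."""
    return "likes" if newsgroup in LIKED_GROUPS else "dislikes"
```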

For each experiment we use a "feature vector", a "classifier" and a train-test splitting strategy.
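The simplest splitting strategy used in these experiments holds out a fixed fraction of the documents for testing. A minimal Python 3 sketch, assuming a plain shuffled split; the `seed` parameter and the function name are illustrative, not taken from the repo:

```python
import random

def holdout_split(items, test_fraction=0.2, seed=0):
    """Shuffle items and return (train, test) with about test_fraction in test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

With `test_fraction=0.9` the same function gives the "90% test" setting, where only 10% of the data is left for training.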

Experiment 1: BOW - NB - 20% test

In this experiment we use a Bag Of Words (BOW) representation of each document, together with a Naive Bayes (NB) classifier.

We split the data so that 20% of it remains for testing.

Results:

             precision    recall  f1-score   support

   dislikes       0.95      0.99      0.97       575
      likes       0.99      0.95      0.97       621

avg / total       0.97      0.97      0.97      1196
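To make the BOW + NB combination concrete, here is a dependency-free Python 3 sketch of a multinomial Naive Bayes classifier over word counts. It assumes whitespace tokenization and add-one smoothing; the repo itself relies on scikit-learn, so this shows the idea, not the code that produced the numbers above.

```python
import math
from collections import Counter, defaultdict

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())

def train_nb(docs, labels):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        word_counts[label].update(bag_of_words(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log posterior for text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word, count in bag_of_words(text).items():
            if word in vocab:  # ignore words never seen in training
                p = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```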

Experiment 2: TF - NB - 20% test

In this experiment we use a Term Frequency (TF) representation of each document, together with a Naive Bayes (NB) classifier.

We split the data so that 20% of it remains for testing.

Results:

             precision    recall  f1-score   support

   dislikes       0.97      0.92      0.94       633
      likes       0.91      0.97      0.94       563

avg / total       0.94      0.94      0.94      1196
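Term frequency replaces raw counts with counts scaled by document length, so long and short posts become comparable. A minimal Python 3 sketch, assuming length normalization (the README does not say which normalization was used, and scikit-learn offers others):

```python
from collections import Counter

def term_frequencies(text):
    """Return each token's count divided by the total number of tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}  # empty document
    return {word: count / total for word, count in counts.items()}
```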

Experiment 3: TFIDF - NB - 20% test

In this experiment we use a TFIDF representation of each document, together with a Naive Bayes (NB) classifier.

We split the data so that 20% of it remains for testing.

Results:

             precision    recall  f1-score   support

   dislikes       0.96      0.95      0.95       584
      likes       0.95      0.96      0.96       612

avg / total       0.95      0.95      0.95      1196
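TFIDF scales each term frequency by how rare the word is across documents. The Python 3 sketch below uses the textbook weight tf * ln(N / df), which is one common variant and an assumption here; scikit-learn's version adds smoothing and normalization, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: weight} dict per document, weight = tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors
```

A word that appears in every document gets weight 0 under this variant, which is why very common words stop dominating the representation.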

Experiment 4: TFIDF - SVM - 20% test

In this experiment we use a TFIDF representation of each document, together with a linear Support Vector Machine (SVM) classifier.

We split the data so that 20% of it remains for testing.

Results:

             precision    recall  f1-score   support

   dislikes       0.96      0.97      0.97       587
      likes       0.97      0.96      0.97       609

avg / total       0.97      0.97      0.97      1196
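For readers who want to see what a linear SVM optimizes, here is a small Python 3 sketch that fits one by subgradient descent on the L2-regularized hinge loss. The hyperparameters are arbitrary illustrative values, and scikit-learn's linear SVM uses a different solver (liblinear), so this is a conceptual sketch, not the method behind the results above.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    X is a list of equal-length feature lists; y holds labels +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin or misclassified: hinge term is active
                w = [wj - lr * (reg * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:  # only the regularizer contributes
                w = [wj - lr * reg * wj for wj in w]
    return w, b

def predict_svm(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```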

Experiment 5: TFIDF - SVM - KFOLD

In this experiment we use a TFIDF representation of each document, together with a linear Support Vector Machine (SVM) classifier.

We split the data using the Stratified K-Fold algorithm with k = 5.

Results:

Mean accuracy: 0.977 (+/- 0.002 std)
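Stratified K-Fold keeps the class proportions roughly equal in every fold, and the reported figure is the mean and spread of the per-fold accuracies. A dependency-free Python 3 sketch, assuming no shuffling and a population standard deviation (the README does not say which kind of std is reported):

```python
from collections import defaultdict
from statistics import mean, pstdev

def stratified_kfold(labels, k=5):
    """Yield (train_idx, test_idx) pairs that keep class proportions per fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal each class round-robin
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test

def summarize(accuracies):
    """Return (mean, std) of per-fold accuracies, as in 'mean (+/- std)'."""
    return mean(accuracies), pstdev(accuracies)
```

Each document lands in exactly one test fold, so every document is used for testing once and for training k - 1 times.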

Experiment 5: BOW - NB - KFOLD

In this experiment we use a Bag Of Words (BOW) representation of each document, together with a Naive Bayes (NB) classifier.

We split the data using the Stratified K-Fold algorithm with k = 5.

Results:

Mean accuracy: 0.968 (+/- 0.002 std)

Experiment 6: TFIDF - SVM - 90% test

In this experiment we use a TFIDF representation of each document, together with a linear Support Vector Machine (SVM) classifier.

We split the data so that 90% of it remains for testing! Only 10% of the dataset is used for training!

Results:

             precision    recall  f1-score   support

   dislikes       0.90      0.95      0.93      2689
      likes       0.95      0.90      0.92      2693

avg / total       0.92      0.92      0.92      5382

Experiment 7: TFIDF - SVM - KFOLD - 20 classes

In this experiment we use a TFIDF representation of each document, together with a linear Support Vector Machine (SVM) classifier.

We split the data using the Stratified K-Fold algorithm with k = 5.

We also use the whole "Twenty Newsgroups" dataset, which has 20 classes.

Results:

Mean accuracy: 0.892 (+/- 0.001 std)

Experiment 7: BOW - NB - KFOLD - 20 classes

In this experiment we use a Bag Of Words (BOW) representation of each document, together with a Naive Bayes (NB) classifier.

We split the data using the Stratified K-Fold algorithm with k = 5.

We also use the whole "Twenty Newsgroups" dataset, which has 20 classes.

Results:

Mean accuracy: 0.839 (+/- 0.003 std)

Experiment 8: TFIDF - 5-NN - Distance Weights - 20% test

In this experiment we use a TFIDF representation of each document, together with a K Nearest Neighbors (KNN) classifier with k = 5 and distance weights.

We split the data so that 20% of it remains for testing.

Results:

             precision    recall  f1-score   support

   dislikes       0.93      0.88      0.90       608
      likes       0.88      0.93      0.90       588

avg / total       0.90      0.90      0.90      1196

Experiment 9: TFIDF - 5-NN - Uniform Weights - 20% test

In this experiment we use a TFIDF representation of each document, together with a K Nearest Neighbors (KNN) classifier with k = 5 and uniform weights.

We split the data so that 20% of it remains for testing.

Results:

             precision    recall  f1-score   support

   dislikes       0.95      0.90      0.92       581
      likes       0.91      0.95      0.93       615

avg / total       0.93      0.93      0.93      1196
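The two KNN experiments above differ only in how the 5 neighbors vote. A minimal sketch of both voting schemes, assuming dense feature vectors and Euclidean distance (it needs Python 3.8+ for `math.dist`, and the function name `knn_predict` is hypothetical):

```python
import math
from collections import defaultdict

def knn_predict(train_X, train_y, x, k=5, weights="distance"):
    """Classify x by a vote among its k nearest training points."""
    nearest = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    for distance, label in nearest:
        if weights == "distance":
            # Closer neighbors count more; the epsilon guards against division by zero.
            votes[label] += 1.0 / (distance + 1e-12)
        else:  # "uniform": every neighbor counts equally
            votes[label] += 1.0
    return max(votes, key=votes.get)
```

With uniform weights a majority of far-away neighbors can outvote a few very close ones; distance weights reverse that.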

Experiment 10: TFIDF - 5-NN - Distance Weights - KFOLD

In this experiment we use a TFIDF representation of each document, together with a K Nearest Neighbors (KNN) classifier with k = 5 and distance weights.

We split the data using the Stratified K-Fold algorithm with k = 5.

Results:

Mean accuracy: 0.908 (+/- 0.003 std)

Experiment 11: TFIDF - 5-NN - Distance Weights - KFOLD - 20 classes

In this experiment we use a TFIDF representation of each document, together with a K Nearest Neighbors (KNN) classifier with k = 5 and distance weights.

We split the data using the Stratified K-Fold algorithm with k = 5.

We also use the whole "Twenty Newsgroups" dataset, which has 20 classes.

Results:

Mean accuracy: 0.745 (+/- 0.002 std)

So What?

These experiments show that text classification can be done effectively with simple tools like TFIDF and SVM.

Any Conclusion?

We have found that TFIDF with SVM has the best performance.

TFIDF with SVM performs well on both the 2-class problem (0.977 mean accuracy with K-Fold) and the 20-class problem (0.892 mean accuracy with K-Fold).

If you want a suggestion from me: use TFIDF with SVM.
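As a starting point, TFIDF with a linear SVM can be put together in a few lines with scikit-learn. This sketch assumes a current scikit-learn release on Python 3, not the v0.11 listed in the requirements (module paths have changed since then), and the toy documents are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_tfidf_svm():
    """Return an untrained TF-IDF + linear SVM text classifier."""
    return make_pipeline(TfidfVectorizer(), LinearSVC())

# Toy documents standing in for newsgroup posts (hypothetical data).
docs = ["ice hockey goal", "hockey team wins", "rocket orbit launch", "space orbit moon"]
labels = ["likes", "likes", "dislikes", "dislikes"]

model = build_tfidf_svm().fit(docs, labels)
predictions = list(model.predict(["hockey goal tonight", "moon launch"]))
```

Putting the vectorizer and the classifier into one pipeline means the TFIDF statistics are learned from the training documents only, which matters when the same model is evaluated with a hold-out split or K-Fold.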
