
gokriznastic / 20-newsgroups_text-classification

Licence: other
"20 newsgroups" dataset - Text Classification using Multinomial Naive Bayes in Python.

Text Classification in Python using the 20 Newsgroups dataset.

"20 newsgroups" dataset - Text Classification using Python.

Dataset

For the dataset, I used the well-known "20 Newsgroups" collection.

The dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. I've included the dataset in the repo, in the 20_newsgroups\ directory.
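As a rough illustration of how such a directory could be read, here is a minimal sketch. It assumes Python 3 and a layout with one subdirectory per newsgroup and one file per document; the function name load_newsgroups is hypothetical and is not taken from the notebooks.

```python
from pathlib import Path


def load_newsgroups(root):
    """Read every file under root/<newsgroup>/ and return (texts, labels).

    Hypothetical helper: assumes one subdirectory per newsgroup,
    each containing one plain-text file per document.
    """
    texts, labels = [], []
    for group_dir in sorted(Path(root).iterdir()):
        if not group_dir.is_dir():
            continue
        for doc_path in sorted(group_dir.iterdir()):
            if not doc_path.is_file():
                continue
            # latin-1 maps every byte to a character, so decoding never fails
            texts.append(doc_path.read_text(encoding="latin-1"))
            labels.append(group_dir.name)  # the folder name is the class label
    return texts, labels
```

Calling load_newsgroups("20_newsgroups") from the repo root would then return two parallel lists, one of document texts and one of newsgroup names.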

The dataset is freely available here.

The code

The code is straightforward and well documented. Both the document preprocessing and the classifiers are implemented from scratch, and the results are then compared against scikit-learn's built-in classifiers. The code is organized as IPython Notebooks; each notebook corresponds to a particular "classifier" or "technique" used for classifying the dataset.
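To give a feel for what from-scratch preprocessing can look like, here is a minimal sketch. The stop-word list, the regular expression and the name tokenize are all assumptions made for illustration; the notebooks may clean the text differently.

```python
import re

# A tiny illustrative stop-word list (an assumption, not the notebooks' list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that", "for", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, drop stop words and 1-letter tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 1 and w not in STOP_WORDS]
```

For example, tokenize("The Shuttle is IN orbit, 2 days!") keeps only "shuttle", "orbit" and "days".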

Requirements

  • Python 2.7 or above

  • Python modules:

    • scikit-learn
    • numpy
    • matplotlib

Experiments

For each experiment we use a "feature vector", a "classifier" and a train-test splitting strategy.
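The splitting strategy can be sketched as follows, assuming a simple random hold-out. The helper split_train_test and its seed parameter are hypothetical; the notebooks may split the data in another way (for example with scikit-learn's own utility).

```python
import random


def split_train_test(items, labels, test_fraction=0.25, seed=0):
    """Shuffle the indices once and hold out test_fraction of them for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(round(len(items) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [items[i] for i in train_idx]
    test_x = [items[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y
```

Documents and labels are indexed with the same shuffled positions, so each document stays paired with its newsgroup.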

Experiment 1: BOW - NB - 25% test

In this experiment we represent each document as a Bag of Words (BOW) vector of term frequencies and classify it with a Multinomial Naive Bayes (NB) classifier, holding out 25% of the documents for testing.
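The combination of term-frequency counts and Multinomial Naive Bayes can be sketched in a few lines of plain Python. This is an illustrative sketch, not the notebook's implementation: the class name SimpleMultinomialNB, the Laplace smoothing parameter alpha=1.0 and the choice to ignore unseen terms at prediction time are all assumptions.

```python
import math
from collections import Counter, defaultdict


class SimpleMultinomialNB:
    """Multinomial Naive Bayes over term-frequency counts (hypothetical sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed value)

    def fit(self, docs, labels):
        # docs: list of token lists; labels: list of class names
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)  # class -> term -> frequency
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)  # bag of words: order is discarded
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        self.total_terms = {c: sum(tc.values()) for c, tc in self.term_counts.items()}
        self.n_docs = len(docs)
        return self

    def predict(self, tokens):
        bow = Counter(t for t in tokens if t in self.vocab)  # ignore unseen terms
        best_class, best_score = None, -math.inf
        for c in sorted(self.class_counts):
            score = math.log(self.class_counts[c] / self.n_docs)  # log prior
            denom = self.total_terms[c] + self.alpha * len(self.vocab)
            for term, tf in bow.items():
                # each occurrence of a term contributes its smoothed log likelihood
                score += tf * math.log((self.term_counts[c][term] + self.alpha) / denom)
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long documents.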

Experiment 2: TF-IDF - NB - 25% test

Ongoing
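Since this experiment is still in progress, the following is only an illustration of what TF-IDF weighting means, not code from the repo. It assumes the plain tf * log(N / df) form; the name tf_idf is hypothetical, and libraries such as scikit-learn use a smoothed variant by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears at least once
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({t: count * math.log(n_docs / doc_freq[t]) for t, count in tf.items()})
    return weighted
```

A term that occurs in every document gets weight 0 under this form, so very common words stop dominating the feature vector.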
