muatik / Naive Bayes Classifier

Licence: MIT
yet another general purpose naive bayesian classifier.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Naive Bayes Classifier

Gpstuff
GPstuff - Gaussian process models for Bayesian analysis
Stars: ✭ 106 (-34.57%)
Mutual labels:  bayesian
Hbayesdm
Hierarchical Bayesian modeling of RLDM tasks, using R & Python
Stars: ✭ 124 (-23.46%)
Mutual labels:  bayesian
Awesome Decision Tree Papers
A collection of research papers on decision, classification and regression trees with implementations.
Stars: ✭ 1,908 (+1077.78%)
Mutual labels:  classifier
Tensorflow Object Detection Tutorial
The purpose of this tutorial is to learn how to install and prepare the TensorFlow framework to train your own convolutional neural network object detection classifier for multiple objects, starting from scratch.
Stars: ✭ 113 (-30.25%)
Mutual labels:  classifier
Bayesiantracker
Bayesian multi-object tracking
Stars: ✭ 121 (-25.31%)
Mutual labels:  bayesian
Naivebayes
📊 Naive Bayes classifier for JavaScript
Stars: ✭ 127 (-21.6%)
Mutual labels:  classifier
Url Classification
Machine learning to classify malicious (spam) vs. benign URLs
Stars: ✭ 95 (-41.36%)
Mutual labels:  classifier
Emlearn
Machine Learning inference engine for Microcontrollers and Embedded devices
Stars: ✭ 154 (-4.94%)
Mutual labels:  classifier
Statistical Rethinking
An interactive online reading of McElreath's Statistical Rethinking
Stars: ✭ 123 (-24.07%)
Mutual labels:  bayesian
Modelselection
Tutorial on model assessment, model selection and inference after model selection
Stars: ✭ 139 (-14.2%)
Mutual labels:  bayesian
Psycho.r
An R package for experimental psychologists
Stars: ✭ 113 (-30.25%)
Mutual labels:  bayesian
Keras transfer cifar10
Object classification with CIFAR-10 using transfer learning
Stars: ✭ 120 (-25.93%)
Mutual labels:  classifier
Dl Uncertainty
"What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?", NIPS 2017 (unofficial code).
Stars: ✭ 130 (-19.75%)
Mutual labels:  bayesian
Sytora
A sophisticated smart symptom search engine
Stars: ✭ 111 (-31.48%)
Mutual labels:  classifier
Scene Text Recognition
Scene text detection and recognition based on Extremal Regions (ER)
Stars: ✭ 146 (-9.88%)
Mutual labels:  classifier
Monkeylearn
⛔️ ARCHIVED ⛔️ 🐒 R package for text analysis with Monkeylearn 🐒
Stars: ✭ 95 (-41.36%)
Mutual labels:  classifier
Digit Recognizer
A Machine Learning classifier for recognizing the digits for humans.
Stars: ✭ 126 (-22.22%)
Mutual labels:  classifier
Speech signal processing and classification
Front-end speech processing aims at extracting proper features from short-term segments of a speech utterance, known as frames. It is a prerequisite step toward any pattern recognition problem employing speech or audio (e.g., music). Here, we are interested in voice disorder classification, that is, developing two-class classifiers which can discriminate between utterances of a subject suffering from, say, vocal fold paralysis and utterances of a healthy subject. The mathematical modeling of the speech production system in humans suggests that an all-pole system function is justified [1-3]. As a consequence, linear prediction coefficients (LPCs) constitute a first choice for modeling the magnitude of the short-term spectrum of speech. LPC-derived cepstral coefficients are guaranteed to discriminate between the system (e.g., vocal tract) contribution and that of the excitation. Taking into account the characteristics of the human ear, the mel-frequency cepstral coefficients (MFCCs) emerged as descriptive features of the speech spectral envelope. Similarly to MFCCs, the perceptual linear prediction coefficients (PLPs) could also be derived. The aforementioned, so to speak traditional, features will be tested against agnostic features extracted by convolutional neural networks (CNNs) (e.g., auto-encoders) [4]. The pattern recognition step will be based on Gaussian mixture model based classifiers, K-nearest neighbor classifiers, Bayes classifiers, as well as deep neural networks. The Massachusetts Eye and Ear Infirmary dataset (MEEI-Dataset) [5] will be exploited. At the application level, a library for feature extraction and classification in Python will be developed. Credible publicly available resources will be used toward achieving our goal, such as KALDI. Comparisons will be made against [6-8].
Stars: ✭ 155 (-4.32%)
Mutual labels:  classifier
Miscellaneous R Code
Code that might be useful to others for learning/demonstration purposes, specifically along the lines of modeling and various algorithms. Now almost entirely superseded by the models-by-example repo.
Stars: ✭ 146 (-9.88%)
Mutual labels:  bayesian
Pecan
The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
Stars: ✭ 132 (-18.52%)
Mutual labels:  bayesian

# Naive Bayesian Classifier

yet another general purpose Naive Bayesian classifier.

## Installation

You can install this package using the following pip command:

$ sudo pip install naiveBayesClassifier

## Example

"""
Suppose you have some texts of news and know their categories.
You want to train a system with this pre-categorized/pre-classified 
texts. So, you have better call this data your training set.
"""
from naiveBayesClassifier import tokenizer
from naiveBayesClassifier.trainer import Trainer
from naiveBayesClassifier.classifier import Classifier

newsTrainer = Trainer(tokenizer.Tokenizer(stop_words=[], signs_to_remove=["?!#%&"]))

# You need to train the system by passing each text to the trainer, one by one.
newsSet = [
    {'text': 'not to eat too much is not enough to lose weight', 'category': 'health'},
    {'text': 'Russia is trying to invade Ukraine', 'category': 'politics'},
    {'text': 'do not neglect exercise', 'category': 'health'},
    {'text': 'Syria is the main issue, Obama says', 'category': 'politics'},
    {'text': 'eat to lose weight', 'category': 'health'},
    {'text': 'you should not eat much', 'category': 'health'}
]

for news in newsSet:
    newsTrainer.train(news['text'], news['category'])

# When you have sufficient training data, you are almost done and can start to use
# a classifier.
newsClassifier = Classifier(newsTrainer.data, tokenizer.Tokenizer(stop_words=[], signs_to_remove=["?!#%&"]))

# Now you have a classifier which you can use to classify news text whose
# category is not yet known.
unknownInstance = "Even if I eat too much, is it not possible to lose some weight"
classification = newsClassifier.classify(unknownInstance)

# The classification variable holds the possible categories sorted by
# their probability values.
print(classification)

Note: You will certainly need much more training data than the amount in the example above. A few lines of text, as shown here, are nowhere near a sufficient training set.
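If you have a larger corpus on disk, you can stream it into the trainer the same way. Below is a minimal sketch; the file name news_corpus.csv and its two-column (text, category) layout are hypothetical stand-ins for whatever data you actually have.

import csv

from naiveBayesClassifier import tokenizer
from naiveBayesClassifier.trainer import Trainer

newsTrainer = Trainer(tokenizer.Tokenizer(stop_words=[], signs_to_remove=["?!#%&"]))

# Each row of the (hypothetical) CSV file holds one training text and its category.
with open('news_corpus.csv', newline='') as f:
    for text, category in csv.reader(f):
        newsTrainer.train(text, category)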

## What is the Naive Bayes Theorem and Classifier?

There is no need to explain everything once again here. Instead, one of the most eloquent explanations is quoted below.

The following explanation is quoted from another Bayes classifier written in Go.

BAYESIAN CLASSIFICATION REFRESHER: suppose you have a set of classes (e.g. categories) C := {C_1, ..., C_n}, and a document D consisting of words D := {W_1, ..., W_k}. We wish to ascertain the probability that the document belongs to some class C_j given some set of training data associating documents and classes.

By Bayes' Theorem, we have that

P(C_j|D) = P(D|C_j)*P(C_j)/P(D).

The LHS is the probability that the document belongs to class C_j given the document itself (by which is meant, in practice, the word frequencies occurring in this document), and our program will calculate this probability for each j and spit out the most likely class for this document.

P(C_j) is referred to as the "prior" probability, or the probability that a document belongs to C_j in general, without seeing the document first. P(D|C_j) is the probability of seeing such a document, given that it belongs to C_j. Here, by assuming that words appear independently in documents (this being the "naive" assumption), we can estimate

P(D|C_j) ~= P(W_1|C_j)*...*P(W_k|C_j)

where P(W_i|C_j) is the probability of seeing the given word in a document of the given class. Finally, P(D) can be seen as merely a scaling factor and is not strictly relevant to classification, unless you want to normalize the resulting scores and actually see probabilities. In this case, note that

P(D) = SUM_j(P(D|C_j)*P(C_j))

One practical issue with performing these calculations is the possibility of float64 underflow when calculating P(D|C_j), as individual word probabilities can be arbitrarily small, and a document can have an arbitrarily large number of them. A typical method for dealing with this case is to transform the probability to the log domain and perform additions instead of multiplications:

log P(C_j|D) ~ log(P(C_j)) + SUM_i(log P(W_i|C_j))

where i = 1, ..., k. Note that by doing this, we are discarding the scaling factor P(D) and our scores are no longer probabilities; however, the monotonic relationship of the scores is preserved by the log function.
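
To make the quoted refresher concrete, here is a small self-contained sketch (independent of this package) that estimates the priors and word likelihoods from counts and scores a document in the log domain. The add-one (Laplace) smoothing is an extra practical guard so that unseen words do not produce log(0); it is not part of the quoted text.

import math
from collections import Counter, defaultdict

def train_counts(samples):
    """samples: iterable of (words, category) pairs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for words, category in samples:
        class_docs[category] += 1
        word_counts[category].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def log_scores(words, class_docs, word_counts, vocab):
    """Return classes sorted by log P(C_j) + SUM_i log P(W_i|C_j)."""
    total_docs = sum(class_docs.values())
    scores = {}
    for c, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)   # log prior
        total_words = sum(word_counts[c].values())
        for w in words:
            # add-one smoothing keeps every word probability strictly positive
            p = (word_counts[c][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[c] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

data = [("eat less to lose weight".split(), "health"),
        ("parliament votes on the budget".split(), "politics")]
print(log_scores("lose weight fast".split(), *train_counts(data)))

Because only the ranking matters, the discarded scaling factor P(D) never needs to be computed unless you want normalized probabilities.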

If you are very curious about the Naive Bayes theorem, you may find the following list helpful:

## Improvements

This classifier uses a very simple tokenizer, which is just a module that splits sentences into words. If your training set is large, you can rely on the available tokenizer; otherwise you need a better tokenizer, specialized to the language of your training texts, as in the sketch below.
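
The sketch below assumes, without verifying, that Trainer and Classifier only ever call tokenize(text) on the object they are given; check the package source before relying on that interface.

import re

from naiveBayesClassifier.trainer import Trainer

class SimpleTokenizer(object):
    """A hypothetical drop-in replacement for the bundled tokenizer."""

    def __init__(self, stop_words=()):
        self.stop_words = set(stop_words)

    def tokenize(self, text):
        # lowercase, keep alphanumeric runs, drop stop words
        words = re.findall(r"[a-z0-9']+", text.lower())
        return [w for w in words if w not in self.stop_words]

newsTrainer = Trainer(SimpleTokenizer(stop_words=["the", "a", "is"]))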

## TODO

- inline docs
- unit-tests

## AUTHORS

- muatik
