leafo / lapis-bayes

License: MIT
Naive Bayes classifier for use in Lua

Programming Languages

Lua, MoonScript, Makefile

Projects that are alternatives of or similar to lapis-bayes

lapis-community
Pluggable message board for Lapis powered websites
Stars: ✭ 41 (+57.69%)
Mutual labels:  moonscript, lapis
bayes
naive bayes in php
Stars: ✭ 61 (+134.62%)
Mutual labels:  classifier, naive-bayes-classifier
docker-lapis
Dockerized Lapis
Stars: ✭ 20 (-23.08%)
Mutual labels:  moonscript, lapis
Lapis
A web framework for Lua and OpenResty written in MoonScript
Stars: ✭ 2,621 (+9980.77%)
Mutual labels:  moonscript, lapis
Nepali-News-Classifier
Text Classification of Nepali Language Documents. This mini project was done for the partial fulfillment of an NLP course: COMP 473.
Stars: ✭ 13 (-50%)
Mutual labels:  classifier, naive-bayes-classifier
chatto
Chatto is a minimal chatbot framework in Go.
Stars: ✭ 98 (+276.92%)
Mutual labels:  classifier, naive-bayes-classifier
sentiment-analysis-using-python
Large Data Analysis Course Project
Stars: ✭ 23 (-11.54%)
Mutual labels:  classifier, naive-bayes-classifier
naive-bayes-classifier
Implements the Naive Bayes classification algorithm in PHP to classify given text as ham or spam. This application uses MySQL as the database.
Stars: ✭ 21 (-19.23%)
Mutual labels:  classifier, naive-bayes-classifier
Emlearn
Machine Learning inference engine for Microcontrollers and Embedded devices
Stars: ✭ 154 (+492.31%)
Mutual labels:  classifier
Dfl Cnn
This is a pytorch re-implementation of Learning a Discriminative Filter Bank Within a CNN for Fine-Grained Recognition
Stars: ✭ 245 (+842.31%)
Mutual labels:  classifier
Awesome Decision Tree Papers
A collection of research papers on decision, classification and regression trees with implementations.
Stars: ✭ 1,908 (+7238.46%)
Mutual labels:  classifier
Speech signal processing and classification
Front-end speech processing aims at extracting proper features from short-term segments of a speech utterance, known as frames. It is a pre-requisite step toward any pattern recognition problem employing speech or audio (e.g., music). Here, we are interested in voice disorder classification. That is, to develop two-class classifiers which can discriminate between utterances of a subject suffering from, say, vocal fold paralysis and utterances of a healthy subject. The mathematical modeling of the speech production system in humans suggests that an all-pole system function is justified [1-3]. As a consequence, linear prediction coefficients (LPCs) constitute a first choice for modeling the magnitude of the short-term spectrum of speech. LPC-derived cepstral coefficients are guaranteed to discriminate between the system (e.g., vocal tract) contribution and that of the excitation. Taking into account the characteristics of the human ear, the mel-frequency cepstral coefficients (MFCCs) emerged as descriptive features of the speech spectral envelope. Similarly to MFCCs, the perceptual linear prediction coefficients (PLPs) could also be derived. These traditional features will be tested against agnostic features extracted by convolutive neural networks (CNNs) (e.g., auto-encoders) [4]. The pattern recognition step will be based on Gaussian Mixture Model based classifiers, K-nearest neighbor classifiers, Bayes classifiers, as well as Deep Neural Networks. The Massachusetts Eye and Ear Infirmary Dataset (MEEI-Dataset) [5] will be exploited. At the application level, a library for feature extraction and classification in Python will be developed. Credible publicly available resources will be used toward achieving our goal, such as KALDI. Comparisons will be made against [6-8].
Stars: ✭ 155 (+496.15%)
Mutual labels:  classifier
Digit Recognizer
A Machine Learning classifier for recognizing the digits for humans.
Stars: ✭ 126 (+384.62%)
Mutual labels:  classifier
Scene Text Recognition
Scene text detection and recognition based on Extremal Region(ER)
Stars: ✭ 146 (+461.54%)
Mutual labels:  classifier
golinear
liblinear bindings for Go
Stars: ✭ 45 (+73.08%)
Mutual labels:  classifier
Naivebayes
📊 Naive Bayes classifier for JavaScript
Stars: ✭ 127 (+388.46%)
Mutual labels:  classifier
name2gender
Extrapolate gender from first names using Naïve-Bayes and PyTorch Char-RNN
Stars: ✭ 24 (-7.69%)
Mutual labels:  naive-bayes-classifier
DiseaseClassifier
Using a Naive Bayes Classifier gets possible diseases from symptoms
Stars: ✭ 23 (-11.54%)
Mutual labels:  naive-bayes-classifier
Errant
ERRor ANnotation Toolkit: Automatically extract and classify grammatical errors in parallel original and corrected sentences.
Stars: ✭ 208 (+700%)
Mutual labels:  classifier
Licenseclassifier
A License Classifier
Stars: ✭ 180 (+592.31%)
Mutual labels:  classifier

lapis-bayes

lapis-bayes is a Naive Bayes classifier for use in Lua. It can be used to classify text into any of the categories it has been trained on ahead of time.

It's built on top of Lapis, but can be used as a standalone library as well. It requires PostgreSQL to store and parse training data.

Install

$ luarocks install lapis-bayes

Quick start

Create a new migration that looks like this:

-- migrations.lua
{
  ...

  [1439944992] = require("lapis.bayes.schema").run_migrations
}

Run migrations:

$ lapis migrate

Train the classifier:

local bayes = require("lapis.bayes")

bayes.train_text("spam", "Cheap Prom Dresses 2014 - Buy discount Prom Dress")
bayes.train_text("spam", "Older Models Rolex Watches - View Large Selection of Rolex")
bayes.train_text("spam", "Hourglass Underwire - $125.00 : Professional swimwear")

bayes.train_text("ham", "Games I've downloaded so I remember them and stuff")
bayes.train_text("ham", "Secret Tunnel's Collection of Rad Games That I Dig")
bayes.train_text("ham", "Things I need to pay for when I get my credit card back")

Classify text:

assert("ham" == bayes.classify_text({"spam", "ham"}, "Games to download"))
assert("spam" == bayes.classify_text({"spam", "ham"}, "discount rolex watch"))

Reference

num_words = bayes.train_text(category, text)

local bayes = require("lapis.bayes")
bayes.train_text("spam", "Cheap Prom Dresses 2014 - Buy discount Prom Dress")

Inserts the tokenized words from text into the database, associated with the category named category. Categories don't need to be created ahead of time; use any name you'd like. Later, when classifying text, you'll list all the eligible categories.

The tokenizer will normalize words and remove stop words before inserting into the database. The number of words kept from the original text is returned.
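
Because the return value counts the tokens actually stored, it can be used to spot training inputs that the tokenizer reduced to nothing (for example, text made up entirely of stop words). A minimal sketch:

local bayes = require("lapis.bayes")

-- train_text returns how many tokens were kept after normalization
local num_words = bayes.train_text("spam", "Cheap Prom Dresses 2014 - Buy discount Prom Dress")

if num_words == 0 then
  print("warning: nothing was stored for this example")
end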

category, score = bayes.classify_text({category1, category2, ...}, text)

local bayes = require("lapis.bayes")
print(bayes.classify_text({"spam", "ham"}, "Games to download"))

Attempts to classify text. If none of the words in text are available in any of the listed categories then nil and an error message are returned.

Returns the name of the category that best matches, along with a probability score expressed as a natural log (math.log). The closer the score is to 0, the better the match.

The input text is normalized using the same tokenizer as the trainer: stop words are removed and stems are used. Only words that are available in at least one category are used for the classification.
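
Since classification can fail, it pairs naturally with a guard on the first return value. A minimal sketch (on failure, the second value is the error message rather than a score):

local bayes = require("lapis.bayes")

local category, score_or_err = bayes.classify_text({"spam", "ham"}, "Games to download")

if category then
  print(category, score_or_err) -- score is a natural log; closer to 0 is a stronger match
else
  print("could not classify: " .. tostring(score_or_err))
end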

Tokenization

Whenever a string is passed to any train or classify functions, it's passed through the default tokenizer to turn the string into an array of words.

  • For classification, these words are used to check the database for existing probabilities
  • For training, the words are inserted directly into the database

Tokenization is more complicated than just splitting the string by spaces: text can be normalized and extraneous data can be stripped.

Sometimes, you may want to explicitly provide the words for insertion and classification. You can bypass tokenization by passing an array of words in place of the string when calling any classify or train function.
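
For example, a sketch that trains and classifies with pre-tokenized arrays (the words here are made up for illustration):

local bayes = require("lapis.bayes")

-- arrays of words skip the tokenizer entirely and are stored as given
bayes.train_text("spam", {"cheap", "prom", "dress"})
bayes.train_text("ham", {"game", "download", "play"})

print(bayes.classify_text({"spam", "ham"}, {"discount", "prom", "dress"}))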

You can customize the tokenizer by providing a tokenize_text option. This should be a function that takes a single argument, the string of text, and returns an array of tokens. For example:

local bayes = require("lapis.bayes")
bayes.train_text("spam", "Cheap Prom Dresses 2014 - Buy discount Prom Dress", {
  tokenize_text = function(text)
    -- your custom tokenizer goes here
    return {tok1, tok2, ...}
  end
})
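
For instance, a naive whitespace tokenizer might look like the sketch below (the built-in tokenizers described next do considerably more normalization than this):

local bayes = require("lapis.bayes")

bayes.train_text("spam", "Cheap Prom Dresses 2014 - Buy discount Prom Dress", {
  tokenize_text = function(text)
    -- naive example: lowercase the text, then split on whitespace
    local tokens = {}
    for word in text:lower():gmatch("%S+") do
      table.insert(tokens, word)
    end
    return tokens
  end
})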

Built-in tokenizers

Postgres Text is the default tokenizer used when no tokenizer is provided.

Postgres Text

Uses Postgres tsvector objects to normalize text. This will remove stop words, normalize capitalization and symbols, and convert words to lexemes. Duplicates are removed.

Note: The characteristics of this tokenizer may not be appropriate for your spam detection goals: if you have very specific training data, then preserving symbols, capitalization, and duplication could actually be useful. This tokenizer tries to make text more general purpose, so it matches a wider range of text that might not have specific training.

This tokenizer requires an active connection to a PostgreSQL database (provided in the Lapis config), and it will issue queries when tokenizing. The tokenizer uses a query that is specific to English:

select unnest(tsvector_to_array(to_tsvector('english', 'my text here'))) as word

Example:

local Tokenizer = require "lapis.bayes.tokenizers.postgres_text"

local t = Tokenizer(opts)

local tokens = t:tokenize_text("Hello world This Is my tests example") --> {"exampl", "hello", "test", "world"}

local tokens2 = t:tokenize_text([[
  <div class='what is going on'>hello world<a href="http://leafo.net/hi.png">my image</a></div>
]]) --> {"hello", "imag", "world"}

Tokenizer options (see the sketch after this list):

  • min_len: minimum token length (default 2)
  • max_len: maximum token length (default 12); tokens that don't fulfill the length requirements are excluded, not truncated
  • strip_numbers: remove tokens that are numbers (default true)
  • symbols_split_tokens: split apart tokens that contain a symbol before tokenization, e.g. hello:world becomes hello world (default false)
  • filter_text: custom pre-filter function to process incoming text; takes the text as its first argument and should return text (optional, default nil)
  • filter_tokens: custom post-filter function to process output tokens; takes a token array and should return a token array (optional, default nil)
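
For example, a sketch that tightens the length limits and splits tokens on symbols (the option values are arbitrary):

local Tokenizer = require "lapis.bayes.tokenizers.postgres_text"

local t = Tokenizer({
  min_len = 3, -- drop tokens shorter than 3 characters
  max_len = 16, -- drop (not truncate) tokens longer than 16 characters
  symbols_split_tokens = true, -- "hello:world" is split apart before tokenization

  filter_tokens = function(tokens)
    -- post-filter: receives the token array and must return a token array
    return tokens
  end
})

local tokens = t:tokenize_text("hello:world HELLO WORLD")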

URL Domains

Extracts mentions of domains from the text; all other text is ignored.

local Tokenizer = require "lapis.bayes.tokenizers.url_domains"

local t = Tokenizer(opts)
local tokens = t:tokenize_text([[
  Please go to my https://leafo.net website <a href='itch.io'>hmm</a>
]]) --> {"leafo.net", "itch.io"}
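
A tokenizer like this can be plugged into training and classification through the tokenize_text option described above. A sketch (the category name and URL are made up, and it assumes the constructor accepts an empty options table):

local bayes = require("lapis.bayes")
local Tokenizer = require "lapis.bayes.tokenizers.url_domains"

local t = Tokenizer({})

-- train on the domains mentioned in the text rather than its words
bayes.train_text("spam_links", "Buy now at http://cheap-watches.example !", {
  tokenize_text = function(text)
    return t:tokenize_text(text)
  end
})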

Schema

lapis-bayes creates two tables:

  • lapis_bayes_categories
  • lapis_bayes_word_classifications

Running outside of Lapis

Creating a configuration

If you're not running lapis-bayes directly inside of Lapis you'll need to create a configuration file that instructs your script on how to connect to a database.

In your project root, create config.lua:

-- config.lua
local config = require("lapis.config")

config("development", {
  postgres = {
    database = "lapis_bayes"
  }
})

The example above provides the minimum required for lapis-bayes to connect to a PostgreSQL database. You're responsible for creating the actual database if it doesn't already exist.

For PostgreSQL you might run the command:

$ createdb -U postgres lapis_bayes

We're using the standard Lapis configuration format; you can read more about it here: http://leafo.net/lapis/reference/configuration.html
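
If your database requires credentials or runs on another host, the same postgres table accepts the usual connection fields. For example (the values are placeholders):

-- config.lua
local config = require("lapis.config")

config("development", {
  postgres = {
    host = "127.0.0.1",
    user = "postgres",
    password = "",
    database = "lapis_bayes"
  }
})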

Creating the schema

After the database connection has been established, the schema (database tables) needs to be created. This is done using Lapis migrations.

Create a file, migrations.lua, and make it look like the following:

-- migrations.lua

return {
  require("lapis.bayes.schema").run_migrations
}

You can now test your configuration by running the migrations with the following command from your shell. (Note: you must run it from the same directory as your code, migrations.lua, and config.lua.)

$ lapis migrate

You're now ready to start training and classifying text! (Go back to the top of this document for the tutorial)
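
As a quick end-to-end check, a standalone script might look like the sketch below (the file name is hypothetical, and it assumes the config.lua created above is in the working directory):

-- check.lua (hypothetical)
local bayes = require("lapis.bayes")

bayes.train_text("spam", "discount rolex watches")
bayes.train_text("ham", "games I want to play")

print(bayes.classify_text({"spam", "ham"}, "cheap rolex watch"))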

Contact

Author: Leaf Corcoran (leafo) (@moonscript)
Email: [email protected]
Homepage: http://leafo.net
License: MIT
