salestock / Fasttext.py

License: BSD-3-Clause
A Python interface for Facebook fastText

Projects that are alternatives to or similar to Fasttext.py

node-fasttext
Node.js binding for fastText representation and classification.
Stars: ✭ 39 (-96.43%)
Mutual labels:  classifier, text-classification, fasttext
nlpbuddy
A text analysis application for performing common NLP tasks through a web dashboard interface and an API
Stars: ✭ 115 (-89.46%)
Mutual labels:  text-classification, fasttext
ML4K-AI-Extension
Use machine learning in AppInventor, with easy training using text, images, or numbers through the Machine Learning for Kids website.
Stars: ✭ 18 (-98.35%)
Mutual labels:  classifier, text-classification
Ml Classify Text Js
Machine learning based text classification in JavaScript using n-grams and cosine similarity
Stars: ✭ 38 (-96.52%)
Mutual labels:  text-classification, classifier
Shallowlearn
An experiment in re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText), with some additional exclusive features and a nice API. Written in Python and fully compatible with scikit-learn.
Stars: ✭ 196 (-82.03%)
Mutual labels:  text-classification, fasttext
Ai law
All kinds of baseline models for long text classification (text categorization).
Stars: ✭ 243 (-77.73%)
Mutual labels:  text-classification, fasttext
text2class
Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT
Stars: ✭ 15 (-98.63%)
Mutual labels:  classifier, text-classification
extremeText
Library for fast text representation and extreme classification.
Stars: ✭ 141 (-87.08%)
Mutual labels:  text-classification, fasttext
Whatlang Rs
Natural language detection library for Rust. Try demo online: https://www.greyblake.com/whatlang/
Stars: ✭ 400 (-63.34%)
Mutual labels:  text-classification, classifier
Textclassification Keras
Text classification models implemented in Keras, including: FastText, TextCNN, TextRNN, TextBiRNN, TextAttBiRNN, HAN, RCNN, RCNNVariant, etc.
Stars: ✭ 621 (-43.08%)
Mutual labels:  text-classification, fasttext
Text Classification Demos
Neural models for text classification in TensorFlow, such as CNN, DPCNN, fastText, BERT, etc.
Stars: ✭ 144 (-86.8%)
Mutual labels:  text-classification, fasttext
Text classification
all kinds of text classification models and more with deep learning
Stars: ✭ 7,179 (+558.02%)
Mutual labels:  text-classification, fasttext
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (-88.36%)
Mutual labels:  text-classification, fasttext
Nepali-News-Classifier
Text classification of Nepali language documents. This mini project was done in partial fulfillment of the NLP course COMP 473.
Stars: ✭ 13 (-98.81%)
Mutual labels:  classifier, text-classification
Fastrtext
R wrapper for fastText
Stars: ✭ 103 (-90.56%)
Mutual labels:  text-classification, fasttext
support-tickets-classification
This case study shows how to create a model for text analysis and classification and deploy it as a web service in Azure cloud in order to automatically classify support tickets. This project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with Endava http://endava.com/en
Stars: ✭ 142 (-86.98%)
Mutual labels:  classifier, text-classification
Bert language understanding
Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN
Stars: ✭ 933 (-14.48%)
Mutual labels:  text-classification, fasttext
Keras Textclassification
Chinese text classification with Keras: long-text and short-sentence classification, multi-label classification, and sentence-pair similarity; base classes for building embedding layers (word/character/sentence embeddings) and network graphs; includes FastText, TextCNN, CharCNN, TextRNN, RCNN, DCNN, DPCNN, VDCNN, CRNN, Bert, Xlnet, Albert, Attention, DeepMoji, HAN, CapsuleNet, Transformer-encode, Seq2seq, SWEM, LEAM, TextGCN.
Stars: ✭ 914 (-16.22%)
Mutual labels:  text-classification, fasttext
Neural Networks
All about Neural Networks!
Stars: ✭ 34 (-96.88%)
Mutual labels:  fasttext
Pytorchtext
1st place solution for the Zhihu Machine Learning Challenge (Zhihu Kanshan Cup). Implementation of various text-classification models.
Stars: ✭ 1,022 (-6.32%)
Mutual labels:  fasttext

fasttext

fasttext is a Python interface for Facebook fastText.

Update

The fasttext package on PyPI is now maintained by the Facebook AI Research team. Read its documentation here: fastText Python binding.

Requirements

fasttext supports Python 2.6 or newer. It requires Cython in order to build the C++ extension.

Installation

pip install fasttext

Example usage

This package has two main use cases: word representation learning and text classification.

These were described in the two papers [1] and [2].

Word representation learning

In order to learn word vectors, as described in [1], we can use the fasttext.skipgram and fasttext.cbow functions as follows:

import fasttext

# Skipgram model
model = fasttext.skipgram('data.txt', 'model')
print(model.words)  # list of words in dictionary

# CBOW model
model = fasttext.cbow('data.txt', 'model')
print(model.words)  # list of words in dictionary

where data.txt is a training file containing UTF-8 encoded text. By default, the word vectors take into account character n-grams of 3 to 6 characters.
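To see what these subword features look like, here is a minimal sketch that enumerates the character n-grams of a single word the way the papers describe it, with the word bracketed by < and > so that prefixes and suffixes become distinguishable (an illustration of the idea, not the library's internal hashing):

def char_ngrams(word, minn=3, maxn=6):
    """Enumerate the character n-grams of a word, fastText-style."""
    token = '<' + word + '>'  # boundary symbols mark word start/end
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(token) - n + 1):
            ngrams.append(token[i:i + n])
    return ngrams

print(char_ngrams('king'))
# ['<ki', 'kin', 'ing', 'ng>', '<kin', 'king', 'ing>', '<king', 'king>', '<king>']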

At the end of optimization the program will save two files: model.bin and model.vec.

model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters.
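Because model.vec is plain text, it can be inspected without this package. A minimal parsing sketch, assuming the standard .vec layout (a header line with the vocabulary size and dimension, then one word and its vector per line):

import io

def load_vec(path, encoding='utf-8'):
    """Parse a .vec file into a {word: [float, ...]} dict."""
    vectors = {}
    with io.open(path, encoding=encoding) as f:
        n_words, dim = (int(x) for x in f.readline().split())  # header line
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = [float(x) for x in parts[1:dim + 1]]
    return vectors

vectors = load_vec('model.vec')
print(len(vectors))  # should match the vocabulary size in the header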

The binary file can be used later to compute word vectors or to restart the optimization.

The following fasttext(1) commands are equivalent:

# Skipgram model
./fasttext skipgram -input data.txt -output model

# CBOW model
./fasttext cbow -input data.txt -output model

Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for out-of-vocabulary words.

print(model['king'])  # get the vector of the word 'king'

The following fasttext(1) command is equivalent:

echo "king" | ./fasttext print-vectors model.bin

This will output the vector of the word king to standard output.

Load pre-trained model

We can use fasttext.load_model to load a pre-trained model:

model = fasttext.load_model('model.bin')
print(model.words)    # list of words in dictionary
print(model['king'])  # get the vector of the word 'king'

Text classification

This package can also be used to train supervised text classifiers and to load a pre-trained classifier produced by fastText.

In order to train a text classifier using the method described in [2], we can use the following function:

classifier = fasttext.supervised('data.train.txt', 'model')

This is equivalent to the fasttext(1) command:

./fasttext supervised -input data.train.txt -output model

where data.train.txt is a text file containing a training sentence per line along with the labels. By default, we assume that labels are words that are prefixed by the string __label__.
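For example, here is a minimal sketch that writes a toy training file in this format (the labels and sentences below are made up for illustration):

# Hypothetical training data: each line is "__label__<class> <sentence>".
samples = [
    ('positive', 'great product and fast delivery'),
    ('negative', 'arrived broken and late'),
    ('positive', 'exactly as described'),
]
with open('data.train.txt', 'w') as f:
    for label, sentence in samples:
        f.write('__label__{0} {1}\n'.format(label, sentence))

classifier = fasttext.supervised('data.train.txt', 'model')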

We can specify the label prefix with the label_prefix param:

classifier = fasttext.supervised('data.train.txt', 'model', label_prefix='__label__')

This is equivalent to the fasttext(1) command:

./fasttext supervised -input data.train.txt -output model -label '__label__'

This will output two files: model.bin and model.vec.

Once the model is trained, we can evaluate it by computing the precision at 1 (P@1) and the recall on a test set using the classifier.test function:

result = classifier.test('test.txt')
print('P@1:', result.precision)
print('R@1:', result.recall)
print('Number of examples:', result.nexamples)

This will print the same output to stdout as:

./fasttext test model.bin test.txt
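To make the metric concrete, here is a minimal sketch of how precision at 1 could be computed by hand from top-1 predictions and gold labels (an illustration of the metric, not the library's internal code; the data is hypothetical):

def precision_at_1(predicted, gold):
    """Fraction of examples whose top-1 prediction is among the gold labels."""
    hits = sum(1 for top, labels in zip(predicted, gold) if top in labels)
    return float(hits) / len(gold)

# Hypothetical top-1 predictions and gold label sets for three examples.
predicted = ['positive', 'negative', 'positive']
gold = [{'positive'}, {'positive'}, {'positive'}]
print(precision_at_1(predicted, gold))  # 2 of 3 correct: 0.666...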

In order to obtain the most likely label for a list of texts, we can use the classifier.predict method:

texts = ['example very long text 1', 'example very long text 2']
labels = classifier.predict(texts)
print(labels)

# Or with the probability
labels = classifier.predict_proba(texts)
print(labels)

We can specify the value of k to get the k-best labels from the classifier:

labels = classifier.predict(texts, k=3)
print(labels)

# Or with the probability
labels = classifier.predict_proba(texts, k=3)
print(labels)

This interface is equivalent to the fasttext(1) predict command: the same model with the same input set will produce the same predictions.

API documentation

Skipgram model

Train & load skipgram model

model = fasttext.skipgram(params)

List of available params and their default value:

input_file     training file path (required)
output         output file path (required)
lr             learning rate [0.05]
lr_update_rate change the rate of updates for the learning rate [100]
dim            size of word vectors [100]
ws             size of the context window [5]
epoch          number of epochs [5]
min_count      minimal number of word occurrences [5]
neg            number of negatives sampled [5]
word_ngrams    max length of word ngram [1]
loss           loss function {ns, hs, softmax} [ns]
bucket         number of buckets [2000000]
minn           min length of char ngram [3]
maxn           max length of char ngram [6]
thread         number of threads [12]
t              sampling threshold [0.0001]
silent         disable the log output from the C++ extension [1]
encoding       specify input_file encoding [utf-8]

Example usage:

model = fasttext.skipgram('train.txt', 'model', lr=0.1, dim=300)

CBOW model

Train & load CBOW model

model = fasttext.cbow(params)

List of available params and their default value:

input_file     training file path (required)
output         output file path (required)
lr             learning rate [0.05]
lr_update_rate change the rate of updates for the learning rate [100]
dim            size of word vectors [100]
ws             size of the context window [5]
epoch          number of epochs [5]
min_count      minimal number of word occurrences [5]
neg            number of negatives sampled [5]
word_ngrams    max length of word ngram [1]
loss           loss function {ns, hs, softmax} [ns]
bucket         number of buckets [2000000]
minn           min length of char ngram [3]
maxn           max length of char ngram [6]
thread         number of threads [12]
t              sampling threshold [0.0001]
silent         disable the log output from the C++ extension [1]
encoding       specify input_file encoding [utf-8]

Example usage:

model = fasttext.cbow('train.txt', 'model', lr=0.1, dim=300)

Load pre-trained model

A .bin file previously trained with this package or generated by fastText can be loaded using this function:

model = fasttext.load_model('model.bin', encoding='utf-8')

Attributes and methods for the model

Skipgram and CBOW models have the following attributes & methods:

model.model_name       # Model name
model.words            # List of words in the dictionary
model.dim              # Size of word vector
model.ws               # Size of context window
model.epoch            # Number of epochs
model.min_count        # Minimal number of word occurrences
model.neg              # Number of negatives sampled
model.word_ngrams      # Max length of word ngram
model.loss_name        # Loss function name
model.bucket           # Number of buckets
model.minn             # Min length of char ngram
model.maxn             # Max length of char ngram
model.lr_update_rate   # Rate of updates for the learning rate
model.t                # Value of sampling threshold
model.encoding         # Encoding of the model
model[word]            # Get the vector of specified word
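As a usage example, the model[word] lookup is enough to compare two words with cosine similarity (a minimal sketch; it assumes both words yield non-zero vectors):

import math

import fasttext

model = fasttext.load_model('model.bin')

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(model['king'], model['queen']))  # closer to 1.0 means more similar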

Supervised model

Train & load the classifier

classifier = fasttext.supervised(params)

List of available params and their default value:

input_file          training file path (required)
output              output file path (required)
label_prefix        label prefix ['__label__']
lr                  learning rate [0.1]
lr_update_rate      change the rate of updates for the learning rate [100]
dim                 size of word vectors [100]
ws                  size of the context window [5]
epoch               number of epochs [5]
min_count           minimal number of word occurrences [1]
neg                 number of negatives sampled [5]
word_ngrams         max length of word ngram [1]
loss                loss function {ns, hs, softmax} [softmax]
bucket              number of buckets [0]
minn                min length of char ngram [0]
maxn                max length of char ngram [0]
thread              number of threads [12]
t                   sampling threshold [0.0001]
silent              disable the log output from the C++ extension [1]
encoding            specify input_file encoding [utf-8]
pretrained_vectors  pretrained word vectors (.vec file) for supervised learning []

Example usage:

classifier = fasttext.supervised('train.txt', 'model', label_prefix='__myprefix__',
                                 thread=4)
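The pretrained_vectors param can seed the classifier with word vectors from an earlier unsupervised run. A minimal sketch, assuming model.vec was produced by one of the skipgram/cbow calls above:

classifier = fasttext.supervised('data.train.txt', 'model',
                                 pretrained_vectors='model.vec',
                                 dim=100)  # dim must match the .vec dimension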

Load pre-trained classifier

A .bin file previously trained with this package or generated by fastText can be loaded using this function:

./fasttext supervised -input train.txt -output classifier -label 'some_prefix'
classifier = fasttext.load_model('classifier.bin', label_prefix='some_prefix')

Test classifier

This is equivalent to the fasttext(1) test command. Testing with the same model and test set will produce the same values for the precision at one and the number of examples.

result = classifier.test(params)

# Properties
result.precision # Precision at one
result.recall    # Recall at one
result.nexamples # Number of test examples

The param k is optional and defaults to 1.
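For example, to evaluate at k=3 instead (test.txt is assumed to use the same labeled format as the training file):

result = classifier.test('test.txt', k=3)
print('P@3:', result.precision)
print('R@3:', result.recall)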

Predict the most-likely label of texts

This interface is equivalent to the fasttext(1) predict command.

texts is an array of strings:

labels = classifier.predict(texts, k)

# Or with probability
labels = classifier.predict_proba(texts, k)

The param k is optional and defaults to 1.

Attributes and methods for the classifier

The classifier has the following attributes & methods:

classifier.labels                  # List of labels
classifier.label_prefix            # Prefix of the label
classifier.dim                     # Size of word vector
classifier.ws                      # Size of context window
classifier.epoch                   # Number of epochs
classifier.min_count               # Minimal number of word occurrences
classifier.neg                     # Number of negatives sampled
classifier.word_ngrams             # Max length of word ngram
classifier.loss_name               # Loss function name
classifier.bucket                  # Number of buckets
classifier.minn                    # Min length of char ngram
classifier.maxn                    # Max length of char ngram
classifier.lr_update_rate          # Rate of updates for the learning rate
classifier.t                       # Value of sampling threshold
classifier.encoding                # Encoding used by the classifier
classifier.test(filename, k)       # Test the classifier
classifier.predict(texts, k)       # Predict the most likely label
classifier.predict_proba(texts, k) # Predict the most likely labels along with their probabilities

The param k for classifier.test, classifier.predict and classifier.predict_proba is optional and defaults to 1.
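As a usage example, predict_proba can be combined with a confidence threshold to keep only confident predictions. The sketch below assumes each element of the returned list holds (label, probability) pairs for the corresponding input text, as the k-best interface suggests; the threshold value is arbitrary, so verify the exact return shape against your installed version:

texts = ['example very long text 1', 'example very long text 2']
threshold = 0.8  # hypothetical confidence cut-off

for text, candidates in zip(texts, classifier.predict_proba(texts, k=2)):
    confident = [(label, prob) for label, prob in candidates if prob >= threshold]
    print(text, '->', confident if confident else 'uncertain')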

References

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

(* These authors contributed equally.)
