
giacbrd / Shallowlearn

Licence: LGPL-3.0
An experiment in re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText), with some additional exclusive features and a nice API. Written in Python and fully compatible with scikit-learn.

Programming Languages

python

Projects that are alternatives to or similar to ShallowLearn

Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+6411.73%)
Mutual labels:  word2vec, word-embeddings, fasttext, gensim
Nlp In Practice
Starter code to solve real-world text data problems. Includes: Gensim Word2Vec, phrase embeddings, text classification with Logistic Regression, word count with PySpark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+303.06%)
Mutual labels:  text-classification, word2vec, text-mining, gensim
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+611.22%)
Mutual labels:  word2vec, word-embeddings, fasttext, gensim
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (-35.2%)
Mutual labels:  text-classification, word2vec, word-embeddings, fasttext
word embedding
Sample code for training Word2Vec and FastText on a wiki corpus, and their pre-trained word embeddings.
Stars: ✭ 21 (-89.29%)
Mutual labels:  word2vec, word-embeddings, fasttext
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+106.12%)
Mutual labels:  word2vec, fasttext, gensim
Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (+264.8%)
Mutual labels:  word2vec, text-mining, word-embeddings
Nlp Journey
Documents, papers and code related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classification, Text Generation, Text Similarity, Machine Translation, etc. All code is implemented in TensorFlow 2.0.
Stars: ✭ 1,290 (+558.16%)
Mutual labels:  word2vec, fasttext, gensim
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-84.69%)
Mutual labels:  text-classification, word2vec, gensim
Tadw
An implementation of "Network Representation Learning with Rich Text Information" (IJCAI '15).
Stars: ✭ 43 (-78.06%)
Mutual labels:  word2vec, text-mining, gensim
Wordembeddings Elmo Fasttext Word2vec
Using pre-trained word embeddings (fastText, Word2Vec)
Stars: ✭ 146 (-25.51%)
Mutual labels:  word2vec, fasttext, gensim
Text-Analysis
Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.
Stars: ✭ 48 (-75.51%)
Mutual labels:  text-mining, word2vec, word-embeddings
textlytics
Text processing library for sentiment analysis and related tasks
Stars: ✭ 25 (-87.24%)
Mutual labels:  scikit-learn, word-embeddings, supervised-learning
Fastrtext
R wrapper for fastText
Stars: ✭ 103 (-47.45%)
Mutual labels:  text-classification, word-embeddings, fasttext
Germanwordembeddings
Toolkit to obtain and preprocess german corpora, train models using word2vec (gensim) and evaluate them with generated testsets
Stars: ✭ 189 (-3.57%)
Mutual labels:  word2vec, word-embeddings, gensim
Text Analytics With Python
Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer.
Stars: ✭ 1,132 (+477.55%)
Mutual labels:  scikit-learn, text-classification, gensim
nlpbuddy
A text analysis application for performing common NLP tasks through a web dashboard interface and an API
Stars: ✭ 115 (-41.33%)
Mutual labels:  text-classification, gensim, fasttext
lda2vec
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (-86.22%)
Mutual labels:  text-mining, word2vec, word-embeddings
Doc2vec
📓 Long(er) text representation and classification using Doc2Vec embeddings
Stars: ✭ 92 (-53.06%)
Mutual labels:  scikit-learn, text-classification, gensim
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+778.57%)
Mutual labels:  word2vec, text-mining, word-embeddings

ShallowLearn
============

A collection of supervised learning models based on shallow neural network approaches (e.g., word2vec and fastText) with some additional exclusive features. Written in Python and fully compatible with `scikit-learn <http://scikit-learn.org>`_.

Discussion group for users and developers: https://groups.google.com/d/forum/shallowlearn

.. image:: https://travis-ci.org/giacbrd/ShallowLearn.svg?branch=master
    :target: https://travis-ci.org/giacbrd/ShallowLearn

.. image:: https://img.shields.io/pypi/v/shallowlearn.svg
    :target: https://pypi.python.org/pypi/ShallowLearn

Getting Started
---------------

Install the latest version:

.. code:: shell

    pip install cython
    pip install shallowlearn

Import models from ``shallowlearn.models``; they implement the standard methods for supervised learning in scikit-learn, e.g., ``fit(X, y)``, ``predict(X)``, ``predict_proba(X)``, etc.

Data is raw text: each sample in the iterable ``X`` is a list of tokens (the words of a document), while each element in the iterable ``y`` (corresponding to an element of ``X``) can be a single label, or a list of labels in the case of a multi-label training set. Obviously, ``y`` must have the same length as ``X``.
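
For instance, a minimal sketch of the expected data layout (the tokens, labels and predicted output here are invented for illustration):

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> X = [['the', 'cat', 'sat'], ['dogs', 'bark', 'loudly']]  # one token list per document
    >>> y = ['feline', 'canine']  # single labels; use lists of labels for multi-label data
    >>> clf = GensimFastText(size=50, min_count=0, iter=1)
    >>> clf.fit(X, y)
    >>> clf.predict([['the', 'cat', 'sat']])
    ['feline']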

Models
------

GensimFastText
~~~~~~~~~~~~~~

**Choose this model if your goal is classification with fastText!** It is going to be the most stable and feature-rich option.

A supervised learning model based on the fastText algorithm [1]_.
The code is mostly taken and rewritten from `Gensim <https://radimrehurek.com/gensim>`_;
it takes advantage of its optimizations (e.g. Cython) and support.

It is possible to choose the Softmax loss function (default) or one of its two "approximations":
Hierarchical Softmax and Negative Sampling. 

The parameter ``bucket`` configures the feature hashing space, i.e., the *hashing trick* described in [1]_.
Using the hashing trick together with ``partial_fit(X, y)`` yields a powerful *online* text classifier (see `Online learning`_).
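
For example, a minimal sketch of such an online classifier (the parameter values and data are illustrative; ``bucket`` sizes the hashed feature space, so words unseen in earlier batches can still be mapped):

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> clf = GensimFastText(size=100, min_count=0, bucket=2**20, iter=1)
    >>> clf.partial_fit([('i', 'am', 'tall')], ['yes'])
    >>> clf.partial_fit([('you', 'are', 'fat')], ['no'])  # later batches update the same model
    >>> clf.predict([('tall', 'am', 'i')])
    ['yes']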

It is possible to load pre-trained word vectors at initialization,
passing a Gensim ``Word2Vec`` or a ShallowLearn ``LabeledWord2Vec`` instance (the latter can be retrieved from a
``GensimFastText`` model via the ``classifier`` attribute).
With the method ``fit_embeddings(X)`` it is possible to pre-train word vectors using the current parameter values of the model.
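
For instance, a hedged sketch of pre-training embeddings on unlabeled documents before the supervised step (the corpus is made up for illustration):

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> clf = GensimFastText(size=100, min_count=0, iter=3, seed=66)
    >>> unlabeled = [('i', 'am', 'tall'), ('you', 'are', 'fat'), ('we', 'are', 'thin')]
    >>> clf.fit_embeddings(unlabeled)  # unsupervised pass: learns word vectors only
    >>> clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])  # supervised step
    >>> clf.predict([('tall', 'am', 'i')])
    ['yes']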

Constructor argument names are a mix between the ones of Gensim and the ones of fastText (see this `class docstring <https://github.com/giacbrd/ShallowLearn/blob/master/shallowlearn/models.py#L74>`_).

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> clf = GensimFastText(size=100, min_count=0, loss='hs', iter=3, seed=66)
    >>> clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])
    >>> clf.predict([('tall', 'am', 'i')])
    ['yes']

FastText
~~~~~~~~
The supervised algorithm of fastText implemented in `fastText.py <https://github.com/salestock/fastText.py>`_,
which exposes an interface to the original C++ code.
The current advantages of this class over ``GensimFastText`` are the *subwords* and the *n-gram features* implemented
via the *hashing trick*.
The constructor arguments are equivalent to the original `supervised model
<https://github.com/salestock/fastText.py#supervised-model>`_, except for ``input_file``, ``output`` and
``label_prefix``.

**WARNING**: The only way of loading datasets in fastText.py is through the filesystem (as of version 0.8.2),
so data passed to ``fit(X, y)`` will be written in temporary files on disk.

.. code:: python

    >>> from shallowlearn.models import FastText
    >>> clf = FastText(dim=100, min_count=0, loss='hs', epoch=3, bucket=5, word_ngrams=2)
    >>> clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])
    >>> clf.predict([('tall', 'am', 'i')])
    ['yes']

DeepInverseRegression
~~~~~~~~~~~~~~~~~~~~~

*TODO*: Based on https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score

DeepAveragingNetworks
~~~~~~~~~~~~~~~~~~~~~

*TODO*: Based on https://github.com/miyyer/dan

Exclusive Features
------------------
Upcoming features will be listed as issues on GitHub; for now:

Persistence
~~~~~~~~~~~
Any model can be serialized and de-serialized with the two methods ``save`` and ``load``.
They overload the `SaveLoad <https://radimrehurek.com/gensim/utils.html#gensim.utils.SaveLoad>`_ interface of Gensim,
so it is possible to control the disk usage of a model, instead of simply *pickling* the object.
The original interface also makes it possible to compress the serialization output.

``save`` may create multiple files with names prefixed by the name given to the serialized model.

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> clf = GensimFastText(size=100, min_count=0, loss='hs', iter=3, seed=66)
    >>> clf.save('./model')  # this also creates ./model.CLF
    >>> loaded = GensimFastText.load('./model')
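
Since these methods follow Gensim's ``SaveLoad`` interface, its standard options should carry over; a hedged sketch, assuming Gensim's keyword arguments and filename conventions are passed through:

.. code:: python

    >>> clf.save('./model.gz')  # a .gz suffix requests compressed output, as in Gensim
    >>> loaded = GensimFastText.load('./model', mmap='r')  # memory-map large stored arrays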

Benchmarks
----------

Text classification
~~~~~~~~~~~~~~~~~~~

The script ``scripts/document_classification_20newsgroups.py`` is based on this
`scikit-learn example <http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html>`_,
in which text classifiers are compared on a reference dataset;
we added our models to the comparison.
**The current results, even if still preliminary, are comparable with the other
approaches, while achieving the best speed**.

Results as of release `0.0.5 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.5>`_,
with the *chi2_select* option set to 80%.
The times take into account the *tf-idf* vectorization in the "classic" classifiers, and the I/O operations for
training fastText.py.
The evaluation measure is *macro F1*.
The evaluation measure is *macro F1*.

.. image:: https://cdn.rawgit.com/giacbrd/ShallowLearn/master/images/benchmark.svg
    :alt: Text classifiers comparison
    :width: 888 px
    :align: center

Online learning
~~~~~~~~~~~~~~~

The script ``scripts/plot_out_of_core_classification.py`` runs a benchmark on the scikit-learn classifiers that can
learn incrementally, one batch of examples at a time.
These classifiers learn online through the scikit-learn method ``partial_fit(X, y)``.
The `original example <http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html>`_
describes the feature hashing approach, which we configure with the ``bucket`` parameter.

**The results are decent, but there is room for improvement**.
We configure our classifier with ``iter=1, size=100, alpha=0.1, sample=0, min_count=0``, so as to keep the model fast
and small, and to avoid cutting off words, given the few samples we have.

.. image:: https://cdn.rawgit.com/giacbrd/ShallowLearn/master/images/onlinelearning.svg
    :alt: Online learning
    :width: 700 px
    :align: center
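
A hedged sketch of the corresponding training loop (the batch stream below is a made-up stand-in for the script's actual data source):

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> clf = GensimFastText(iter=1, size=100, alpha=0.1, sample=0, min_count=0, bucket=2**20)
    >>> batches = [([('i', 'am', 'tall')], ['yes']), ([('you', 'are', 'fat')], ['no'])]
    >>> for X_batch, y_batch in batches:  # in the real script, batches stream from disk
    ...     clf.partial_fit(X_batch, y_batch)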

References
----------
.. [1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification