
inspirehep / Magpie

License: MIT
Deep neural network framework for multi-label text classification

Programming Languages

Python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to Magpie

text classifier
A TensorFlow 2.3 text classification project supporting various classification models and related tricks.
Stars: ✭ 135 (-78.3%)
Mutual labels:  word2vec, classification
Servenet
Service Classification based on Service Description
Stars: ✭ 21 (-96.62%)
Mutual labels:  classification, word2vec
Alink
Alink is the Machine Learning algorithm platform based on Flink, developed by the PAI team of Alibaba computing platform.
Stars: ✭ 2,936 (+372.03%)
Mutual labels:  word2vec, classification
Ml
A high-level machine learning and deep learning library for the PHP language.
Stars: ✭ 1,270 (+104.18%)
Mutual labels:  classification, prediction
Nlp research
NLP research: TensorFlow-based NLP deep learning projects supporting four major tasks: text classification, sentence matching, sequence labeling and text generation.
Stars: ✭ 141 (-77.33%)
Mutual labels:  classification, word2vec
Nlp Journey
Documents, papers and code related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classification, Text Generation, Text Similarity, Machine Translation, etc. All code is implemented in TensorFlow 2.0.
Stars: ✭ 1,290 (+107.4%)
Mutual labels:  classification, word2vec
Mlbox
MLBox is a powerful Automated Machine Learning python library.
Stars: ✭ 1,199 (+92.77%)
Mutual labels:  classification, prediction
Wordembeddings Elmo Fasttext Word2vec
Using pre-trained word embeddings (fastText, Word2Vec)
Stars: ✭ 146 (-76.53%)
Mutual labels:  classification, word2vec
Tensorflow Resources
Curated Tensorflow code resources to help you get started with Deep Learning.
Stars: ✭ 330 (-46.95%)
Mutual labels:  classification, prediction
Tensorflow Book
Accompanying source code for Machine Learning with TensorFlow. Refer to the book for step-by-step explanations.
Stars: ✭ 4,448 (+615.11%)
Mutual labels:  classification
Vehicle counting tensorflow
🚘 "MORE THAN VEHICLE COUNTING!" This project provides prediction for speed, color and size of the vehicles with TensorFlow Object Counting API.
Stars: ✭ 582 (-6.43%)
Mutual labels:  prediction
Structured Self Attention
A Structured Self-attentive Sentence Embedding
Stars: ✭ 459 (-26.21%)
Mutual labels:  classification
Introneuralnetworks
Introducing neural networks to predict stock prices
Stars: ✭ 486 (-21.86%)
Mutual labels:  prediction
Humpback Whale Identification 1st
https://www.kaggle.com/c/humpback-whale-identification
Stars: ✭ 591 (-4.98%)
Mutual labels:  classification
Mlr3
mlr3: Machine Learning in R - next generation
Stars: ✭ 463 (-25.56%)
Mutual labels:  classification
Android Yolo
Real-time object detection on Android using the YOLO network with TensorFlow
Stars: ✭ 604 (-2.89%)
Mutual labels:  prediction
Ttach
Image Test Time Augmentation with PyTorch!
Stars: ✭ 455 (-26.85%)
Mutual labels:  classification
Food Recipe Cnn
food image to recipe with deep convolutional neural networks.
Stars: ✭ 448 (-27.97%)
Mutual labels:  classification
Breast cancer classifier
Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening
Stars: ✭ 614 (-1.29%)
Mutual labels:  classification
Simplecvreproduction
Reproduce simple cv project including attention module, classification, object detection, segmentation, keypoint detection, tracking 😄 etc.
Stars: ✭ 602 (-3.22%)
Mutual labels:  classification


Magpie is a deep learning tool for multi-label text classification. It learns on the training corpus to assign labels to arbitrary text and can be used to predict those labels on unknown data. It has been developed at CERN to assign subject categories to High Energy Physics abstracts and extract keywords from them.

Very short introduction

>>> magpie = Magpie()
>>> magpie.init_word_vectors('/path/to/corpus', vec_dim=100)
>>> magpie.train('/path/to/corpus', ['label1', 'label2', 'label3'], epochs=3)
Training...
>>> magpie.predict_from_text('Well, that was quick!')
[('label1', 0.96), ('label3', 0.65), ('label2', 0.21)]

Short introduction

To train the model you need a large corpus of labeled data in plain text files encoded as UTF-8. An example corpus can be found under the data/hep-categories directory. Magpie looks for .txt files containing the text to predict on and corresponding .lab files with the assigned labels, one per line. A pair of files containing the text and the labels should share the same name and differ only in extension, e.g.

$ ls data/hep-categories
1000222.lab 1000222.txt 1000362.lab 1000362.txt 1001810.lab 1001810.txt ...
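As a sketch, such a matching pair could be created like this (the directory name, document text and labels below are made up for illustration):

```python
import os

# Build a toy corpus directory with one document/label pair.
# The .txt file holds the document text; the .lab file with the
# same base name lists the assigned labels, one per line.
os.makedirs('toy-corpus', exist_ok=True)
with open('toy-corpus/1000222.txt', 'w', encoding='utf-8') as f:
    f.write('We study black hole thermodynamics in de Sitter space.')
with open('toy-corpus/1000222.lab', 'w', encoding='utf-8') as f:
    f.write('Gravitation and Cosmology\nTheory-HEP\n')
```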

Before you train the model, you need to build appropriate word vector representations for your corpus. In theory, you can train them on a different corpus or reuse already trained ones (see the tutorial), but Magpie can also build them for you:

from magpie import Magpie

magpie = Magpie()
magpie.train_word2vec('data/hep-categories', vec_dim=100)

Then you need to fit a scaling matrix to normalize the input data; it is specific to the trained word2vec representation. Here's the one-liner:

magpie.fit_scaler('data/hep-categories')

You would usually want to combine those two steps by simply running:

magpie.init_word_vectors('data/hep-categories', vec_dim=100)

If you plan to reuse the trained word representations, you might want to save them and pass them to the Magpie constructor next time. To start the training, just type:

labels = ['Gravitation and Cosmology', 'Experiment-HEP', 'Theory-HEP']
magpie.train('data/hep-categories', labels, test_ratio=0.2, epochs=30)

By providing the test_ratio argument, the model splits the data into train and test sets (here in an 80/20 ratio) and evaluates itself after every epoch, displaying its current loss and accuracy. The default value of test_ratio is 0, meaning that all the data is used for training.
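For instance, with five documents and test_ratio=0.2, the split works out as follows (a plain-Python sketch of the arithmetic, not Magpie's internal code):

```python
# Example document IDs standing in for a corpus of five files.
filenames = ['1000222', '1000362', '1001810', '1002413', '1003001']

test_ratio = 0.2
# 80% of the corpus goes to training, the rest to evaluation.
split = int(len(filenames) * (1 - test_ratio))
train_set, test_set = filenames[:split], filenames[split:]
# train_set has 4 documents, test_set has 1
```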

If your data doesn't fit into memory, you can also run magpie.batch_train(), which has a similar API but is more memory efficient.

Trained models can be used for prediction with the following methods:

>>> magpie.predict_from_file('data/hep-categories/1002413.txt')
[('Experiment-HEP', 0.47593361),
 ('Gravitation and Cosmology', 0.055745006),
 ('Theory-HEP', 0.02692855)]

>>> magpie.predict_from_text('Stephen Hawking studies black holes')
[('Gravitation and Cosmology', 0.96627593),
 ('Experiment-HEP', 0.64958507),
 ('Theory-HEP', 0.20917746)]
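Since predictions come back as (label, confidence) pairs sorted by confidence, turning them into a final label set is a matter of thresholding. A minimal sketch (the 0.5 cut-off is an arbitrary choice for illustration, not a Magpie default):

```python
# Output of a predict_from_text() call, as shown above.
predictions = [('Gravitation and Cosmology', 0.96627593),
               ('Experiment-HEP', 0.64958507),
               ('Theory-HEP', 0.20917746)]

# Keep only the labels whose confidence clears the threshold.
threshold = 0.5
assigned = [label for label, score in predictions if score >= threshold]
# assigned == ['Gravitation and Cosmology', 'Experiment-HEP']
```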

Saving & loading the model

A Magpie object consists of three components: the word2vec mappings, a scaler and a Keras model. To train Magpie you can either provide the word2vec mappings and a scaler in advance or let the program compute them for you on the training data. Usually you would want to train them yourself on the full dataset and reuse them afterwards. You can use the provided functions for that purpose:

magpie.save_word2vec_model('/save/my/embeddings/here')
magpie.save_scaler('/save/my/scaler/here', overwrite=True)
magpie.save_model('/save/my/model/here.h5')

When you want to reinitialize your trained model, you can run:

magpie = Magpie(
    keras_model='/save/my/model/here.h5',
    word2vec_model='/save/my/embeddings/here',
    scaler='/save/my/scaler/here',
    labels=['cat', 'dog', 'cow']
)

or just pass the objects directly!

Installation

The package is not on PyPI, but you can get it directly from GitHub:

$ pip install git+https://github.com/inspirehep/[email protected]

If you encounter any problems with the installation, make sure to install the correct versions of the dependencies listed in the setup.py file.

Disclaimer & citation

The neural network models used within Magpie are based on work done by Yoon Kim and subsequently Mark Berger.

Contact

If you have any problems, feel free to open an issue. We'll do our best to help 👍
