raghakot / Keras Text

Licence: mit
Text Classification Library in Keras

Keras Text Classification Library

keras-text is a one-stop text classification library that implements various state-of-the-art models and offers a clean, extensible interface for building custom architectures.

Quick start

Create a tokenizer to build your vocabulary

  • To represent your dataset as (docs, words), use WordTokenizer.
  • To represent your dataset as (docs, sentences, words), use SentenceWordTokenizer.
  • To create arbitrary hierarchies, extend Tokenizer and implement the token_generator method.
from keras_text.processing import WordTokenizer


tokenizer = WordTokenizer()
tokenizer.build_vocab(texts)

Want to tokenize with character tokens to leverage character models? Use CharTokenizer.
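
The three dataset shapes mentioned above can be illustrated with a toy example. This is only a sketch in plain Python: keras-text tokenizes with spaCy, whereas the splitting below is naive, and the `build_vocab` helper here is a hypothetical stand-in rather than the library's method.

```python
# Toy illustration only: keras-text tokenizes with spaCy, not with str.split().

def build_vocab(token_lists):
    """Map each distinct token to an integer id; 0 is left free for padding."""
    token_index = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in token_index:
                token_index[tok] = len(token_index) + 1
    return token_index


docs = ["The cat sat. It purred.", "Dogs bark."]

# (docs, words): one flat list of word tokens per document.
doc_words = [doc.lower().replace(".", " ").split() for doc in docs]

# (docs, sentences, words): each document is a list of sentences,
# and each sentence is a list of words.
doc_sent_words = [
    [sent.lower().split() for sent in doc.split(".") if sent.strip()]
    for doc in docs
]

# (docs, characters): the view a character-level model would consume.
doc_chars = [list(doc.lower()) for doc in docs]

token_index = build_vocab(doc_words)
encoded = [[token_index[w] for w in words] for words in doc_words]
```

The nesting depth of the token lists is what distinguishes the representations; the vocabulary itself is just a token-to-id mapping in every case.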

Build a dataset

A dataset encapsulates the tokenizer, X, y and the test set. This lets you focus on trying various architectures and hyperparameters without having to worry about inconsistent evaluation. A dataset can be saved to and loaded from disk.

from keras_text.data import Dataset


ds = Dataset(X, y, tokenizer=tokenizer)
ds.update_test_indices(test_size=0.1)
ds.save('dataset')

The update_test_indices method automatically stratifies multi-class or multi-label data correctly.
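
To make the idea of stratification concrete, here is a minimal sketch of a stratified split for the multi-class case. It is not the library's implementation: the function name `stratified_test_indices` and its `seed` parameter are made up for this example, and multi-label stratification (which is more involved) is not covered.

```python
import random
from collections import defaultdict


def stratified_test_indices(labels, test_size=0.1, seed=0):
    """Pick test indices so each class contributes about `test_size` of its samples."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    test_indices = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one test sample per class so no class is missing
        # from the evaluation set.
        n_test = max(1, round(len(indices) * test_size))
        test_indices.extend(indices[:n_test])
    return sorted(test_indices)


labels = ["spam"] * 20 + ["ham"] * 80
test_idx = stratified_test_indices(labels, test_size=0.1)
```

With an 80/20 class balance and `test_size=0.1`, the test set keeps the same 80/20 balance, which a purely random 10% sample would not guarantee.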

Build text classification models

See the tests/ folder for usage examples.

Word based models

When the dataset is represented as (docs, words), word-based models can be created using TokenModelFactory.

from keras_text.models import TokenModelFactory
from keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN


# RNN models can use `max_tokens=None` to allow a variable number of words per mini-batch.
factory = TokenModelFactory(1, tokenizer.token_index, max_tokens=100, embedding_type='glove.6B.100d')
word_encoder_model = YoonKimCNN()
model = factory.build_model(token_encoder_model=word_encoder_model)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()

Currently supported models include:

  • Yoon Kim CNN
  • Stacked RNNs
  • Attention (with/without context) based RNN encoders.

TokenModelFactory.build_model encodes the input with the provided word encoder; the encoded output is then classified by a Dense block.
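
The `max_tokens` argument in the factory call above fixes the number of words per document. The sketch below shows what fixing a sequence length generally involves. It assumes truncation and zero-padding at the end of the sequence; the README does not say which side keras-text pads or truncates, and `pad_or_truncate` is a hypothetical helper, not part of the library.

```python
def pad_or_truncate(token_ids, max_tokens, pad_id=0):
    """Return exactly `max_tokens` ids: cut long sequences, right-pad short ones."""
    clipped = token_ids[:max_tokens]
    return clipped + [pad_id] * (max_tokens - len(clipped))


batch = [[4, 8, 15], [16, 23, 42, 7, 9, 11]]
fixed = [pad_or_truncate(seq, max_tokens=4) for seq in batch]
# Every row now has length 4, so the batch forms a rectangular (docs, words) array.
```

A fixed length is what convolutional encoders typically need, while RNN encoders can work with a length that varies from batch to batch, which is what `max_tokens=None` allows.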

Sentence based models

When the dataset is represented as (docs, sentences, words), sentence-based models can be created using SentenceModelFactory.

from keras_text.models import SentenceModelFactory
from keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN, AveragingEncoder


# Pad max sentences per doc to 500 and max words per sentence to 200.
# Can also use `max_sents=None` to allow variable sized max_sents per mini-batch.
factory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500, max_tokens=200, embedding_type='glove.6B.100d')
word_encoder_model = AttentionRNN()
sentence_encoder_model = AttentionRNN()

# Allows you to compose arbitrary word encoders followed by sentence encoder.
model = factory.build_model(word_encoder_model, sentence_encoder_model)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()

Currently supported models include:

  • Yoon Kim CNN
  • Stacked RNNs
  • Attention (with/without context) based RNN encoders.

SentenceModelFactory.build_model creates a tiered model in which the words within each sentence are first encoded using word_encoder_model. The resulting per-sentence encodings are then encoded using sentence_encoder_model.

  • Hierarchical attention networks (HANs) can be built by composing two attention-based RNN models. This is useful when a document is very large.
  • For smaller documents, a reasonable way to encode a sentence is to average the words within it. This can be done by using token_encoder_model=AveragingEncoder().
  • Mix and match encoders as you see fit for your problem.
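
The tiered encoding described above can be sketched in plain Python. This is an illustration of the idea only, not the library's AveragingEncoder: the 2-dimensional embeddings are made up, and averaging is reused at the sentence level purely for brevity.

```python
def average_vectors(vectors):
    """Element-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical 2-d word embeddings, invented for this example.
embeddings = {
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
    "bad": [-1.0, 0.0],
    "plot": [0.0, 3.0],
}

document = [["good", "movie"], ["bad", "plot"]]  # (sentences, words)

# Word level: each sentence becomes the mean of its word vectors.
sentence_vectors = [
    average_vectors([embeddings[word] for word in sentence])
    for sentence in document
]

# Sentence level: a second encoder turns the sentence vectors into one
# document vector. An RNN or attention encoder could replace the mean here.
document_vector = average_vectors(sentence_vectors)
```

Each tier reduces one axis: (sentences, words, dims) becomes (sentences, dims), which then becomes a single vector of size dims that a classifier can consume.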

Resources

TODO: Update documentation and add notebook examples.

Stay tuned for better documentation and examples. Until then, the best resource is the API docs.

Installation

  1. Install Keras with the Theano or TensorFlow backend. Note that this library requires Keras > 2.0.

  2. Install keras-text

From sources

sudo python setup.py install

PyPI package

sudo pip install keras-text

  3. Download target spaCy model

keras-text uses the excellent spaCy library for tokenization. See the spaCy instructions on how to download the model for your target language.

Citation

Please cite keras-text in your publications if it helped your research. Here is an example BibTeX entry:

@misc{raghakotkerastext,
  title={keras-text},
  author={Kotikalapudi, Raghavendra and contributors},
  year={2017},
  publisher={GitHub},
  howpublished={\url{https://github.com/raghakot/keras-text}},
}