
amaiya / Ktrain

License: apache-2.0
ktrain is a Python library that makes deep learning and AI more accessible and easier to apply

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Ktrain

Aws Machine Learning University Accelerated Tab
Machine Learning University: Accelerated Tabular Data Class
Stars: ✭ 718 (-5.9%)
Mutual labels:  jupyter-notebook, tabular-data
Deltapy
DeltaPy - Tabular Data Augmentation (by @firmai)
Stars: ✭ 344 (-54.91%)
Mutual labels:  jupyter-notebook, tabular-data
Gans In Action
Companion repository to GANs in Action: Deep learning with Generative Adversarial Networks
Stars: ✭ 748 (-1.97%)
Mutual labels:  jupyter-notebook
Ml Course Msu
Lecture notes and code for Machine Learning practical course on CMC MSU
Stars: ✭ 759 (-0.52%)
Mutual labels:  jupyter-notebook
Graphneuralnetwork
Companion code for the book 《深入浅出图神经网络:GNN原理解析》 (a Chinese introduction to graph neural networks and GNN principles)
Stars: ✭ 754 (-1.18%)
Mutual labels:  jupyter-notebook
Causal inference python code
Python code for part 2 of the book Causal Inference: What If, by Miguel Hernán and James Robins
Stars: ✭ 748 (-1.97%)
Mutual labels:  jupyter-notebook
Jupyterhub
Multi-user server for Jupyter notebooks
Stars: ✭ 6,488 (+750.33%)
Mutual labels:  jupyter-notebook
Spark Movie Lens
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
Stars: ✭ 745 (-2.36%)
Mutual labels:  jupyter-notebook
Jupyter2slides
Cloud Native Presentation Slides with Jupyter Notebook + Reveal.js
Stars: ✭ 762 (-0.13%)
Mutual labels:  jupyter-notebook
Machine learning refined
Notes, examples, and Python demos for the textbook "Machine Learning Refined" (published by Cambridge University Press).
Stars: ✭ 750 (-1.7%)
Mutual labels:  jupyter-notebook
Notedown
Markdown <=> IPython Notebook
Stars: ✭ 757 (-0.79%)
Mutual labels:  jupyter-notebook
Simclr
PyTorch implementation of SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
Stars: ✭ 750 (-1.7%)
Mutual labels:  jupyter-notebook
Finetune alexnet with tensorflow
Code for finetuning AlexNet in TensorFlow >= 1.2rc0
Stars: ✭ 748 (-1.97%)
Mutual labels:  jupyter-notebook
Deep Learning Coursera
Deep Learning Specialization by Andrew Ng on Coursera.
Stars: ✭ 6,615 (+766.97%)
Mutual labels:  jupyter-notebook
Deeprl Tutorials
Contains high quality implementations of Deep Reinforcement Learning algorithms written in PyTorch
Stars: ✭ 748 (-1.97%)
Mutual labels:  jupyter-notebook
Ec2 Spot Labs
Collection of tools and code examples to demonstrate best practices in using Amazon EC2 Spot Instances.
Stars: ✭ 758 (-0.66%)
Mutual labels:  jupyter-notebook
Dat4
General Assembly's Data Science course in Washington, DC
Stars: ✭ 748 (-1.97%)
Mutual labels:  jupyter-notebook
Learning From Data
Solutions to the exercises in the book Learning from Data
Stars: ✭ 751 (-1.57%)
Mutual labels:  jupyter-notebook
Automatic Watermark Detection
Project for Digital Image Processing
Stars: ✭ 754 (-1.18%)
Mutual labels:  jupyter-notebook
Superpoint
Efficient neural feature detector and descriptor
Stars: ✭ 761 (-0.26%)
Mutual labels:  jupyter-notebook

Overview | Tutorials | Examples | Installation | FAQ | How to Cite


Welcome to ktrain

News and Announcements

  • 2021-03-10:
    • ktrain v0.26.x is released and now supports transformers>=4.0.0.
      Note that transformers>=4.0.0 included a complete reorganization of the module structure. This means that, if you saved a transformers-based Predictor (e.g., DistilBERT) with an older version of ktrain and transformers, you will need to either generate a new tf_model.preproc file or manually edit the existing tf_model.preproc file before loading the predictor in the latest versions of ktrain and transformers.
      For instance, suppose you trained a DistilBERT model and saved the resultant predictor using an older version of ktrain with: predictor.save('/tmp/my_predictor/'). After upgrading to the newest version of ktrain, you will find that ktrain.load_predictor('/tmp/my_predictor') throws an error unless you follow one of the two approaches below:

      Approach 1: Manually edit tf_model.preproc file:
      Open tf_model.preproc with an editor like vim and edit it to replace old module locations with new module locations (example changes for a DistilBERT model shown below):

      # change transformers.configuration_distilbert to transformers.models.distilbert.configuration_distilbert
      # change transformers.modeling_tf_auto to transformers.models.auto.modeling_tf_auto
      # change transformers.tokenization_auto to transformers.models.auto.tokenization_auto  
      

      The above was confirmed to work using the vim editor on Linux.
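
      If you prefer to script the edit, a byte-level replacement in Python is equivalent to the manual edit above (a hedged sketch: it works here because the module paths are stored as plain strings in tf_model.preproc, which is also why editing the file in vim works):

      # scripted version of Approach 1 (assumes a DistilBERT predictor saved at /tmp/my_predictor/)
      PREPROC = '/tmp/my_predictor/tf_model.preproc'
      with open(PREPROC, 'rb') as f:
          data = f.read()
      # map old transformers module locations to their new homes
      data = data.replace(b'transformers.configuration_distilbert',
                          b'transformers.models.distilbert.configuration_distilbert')
      data = data.replace(b'transformers.modeling_tf_auto',
                          b'transformers.models.auto.modeling_tf_auto')
      data = data.replace(b'transformers.tokenization_auto',
                          b'transformers.models.auto.tokenization_auto')
      with open(PREPROC, 'wb') as f:
          f.write(data)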

      Approach 2: Re-generate tf_model.preproc file:

      # Step 1: Re-create a Preprocessor instance
      # NOTES:
      # 1. If training set is large, you can use a sample containing at least one example for each class
      # 2. Labels must be in same format as you originally used
      # 3. If original training set is not easily accessible, set preproc.preprocess_train_called=True 
      #    below instead of invoking preproc.preprocess_train(x_train, y_train)
      
      preproc = text.Transformer(MODEL_NAME, maxlen=500, class_names=class_names)
      trn = preproc.preprocess_train(x_train, y_train)
      
      # Step 2: load the transformers model from predictor folder
      from transformers import TFAutoModelForSequenceClassification
      model = TFAutoModelForSequenceClassification.from_pretrained('/tmp/my_predictor/')
      
      # Step 3: re-create/re-save Predictor
      predictor = ktrain.get_predictor(model, preproc)
      predictor.save('/tmp/my_new_predictor')
      
    • If you're using PyTorch 1.8 or above with ktrain, you will need to upgrade to ktrain>=0.26.0. If you're using ktrain<0.26.0, then you will have to downgrade PyTorch with: pip install torch==1.7.1.

  • 2020-11-08:
    • ktrain v0.25.x is released and includes out-of-the-box support for text extraction via the textract package. This can be used, for example, in the SimpleQA.index_from_folder method to perform Question-Answering on large collections of PDFs, MS Word documents, or PowerPoint files. See the Question-Answering example notebook for more information.
# End-to-End Question-Answering in ktrain

# index documents of different types into a built-in search engine
from ktrain import text
INDEXDIR = '/tmp/myindex'
text.SimpleQA.initialize_index(INDEXDIR)
corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files
text.SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction
                                multisegment=True, procs=4, # these args speed up indexing
                                breakup_docs=True)          # this slows indexing but speeds up answer retrieval

# ask questions (setting higher batch size can further speed up answer retrieval)
qa = text.SimpleQA(INDEXDIR)
answers = qa.ask('What is ktrain?', batch_size=8)

# top answer snippet extracted from https://arxiv.org/abs/2004.10703:
#   "ktrain is a low-code platform for machine learning"

Overview

ktrain is a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) to help build, train, and deploy neural networks and other machine learning models. Inspired by ML framework extensions like fastai and ludwig, ktrain is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners. With only a few lines of code, ktrain allows you to easily and quickly:

  • employ fast, accurate, and easy-to-use pre-canned models for text, vision, graph, and tabular data

  • estimate an optimal learning rate for your model given your data using a Learning Rate Finder

  • utilize learning rate schedules such as the triangular policy, the 1cycle policy, and SGDR to effectively minimize loss and improve generalization

  • build text classifiers for any language (e.g., Arabic Sentiment Analysis with BERT, Chinese Sentiment Analysis with NBSVM)

  • easily train NER models for any language (e.g., Dutch NER)

  • load and preprocess text and image data from a variety of formats

  • inspect data points that were misclassified and provide explanations to help improve your model

  • leverage a simple prediction API for saving and deploying both models and data-preprocessing steps to make predictions on new raw data (see the sketch following this list)
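
As a minimal sketch of that prediction API (assuming learner and preproc objects like those produced in the examples below), a trained model and its preprocessing steps can be wrapped, saved, and reloaded for inference on raw data:

import ktrain

# wrap the trained model and its preprocessing steps into a Predictor
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.save('/tmp/my_predictor')    # persists both model and preprocessing

# later, or in a deployment environment: reload and predict on raw inputs
reloaded_predictor = ktrain.load_predictor('/tmp/my_predictor')
reloaded_predictor.predict('The film was a delight from start to finish.')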

Tutorials

Please see the tutorial notebooks for a guide on how to use ktrain on your projects.

Some blog tutorials about ktrain are shown below:

ktrain: A Lightweight Wrapper for Keras to Help Train Neural Networks

BERT Text Classification in 3 Lines of Code

Text Classification with Hugging Face Transformers in TensorFlow 2 (Without Tears)

Build an Open-Domain Question-Answering System With BERT in 3 Lines of Code

Finetuning BERT using ktrain for Disaster Tweets Classification by Hamiz Ahmed

Examples

Tasks such as text classification and image classification can be accomplished easily with only a few lines of code.

Example: Text Classification of IMDb Movie Reviews Using BERT [see notebook]

import ktrain
from ktrain import text as txt

# load data
(x_train, y_train), (x_test, y_test), preproc = txt.texts_from_folder('data/aclImdb', maxlen=500, 
                                                                     preprocess_mode='bert',
                                                                     train_test_names=['train', 'test'],
                                                                     classes=['pos', 'neg'])

# load model
model = txt.text_classifier('bert', (x_train, y_train), preproc=preproc)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model, 
                             train_data=(x_train, y_train), 
                             val_data=(x_test, y_test), 
                             batch_size=6)

# find good learning rate
learner.lr_find()             # briefly simulate training to find good learning rate
learner.lr_plot()             # visually identify best learning rate

# train using 1cycle learning rate schedule for 3 epochs
learner.fit_onecycle(2e-5, 3) 

Example: Classifying Images of Dogs and Cats Using a Pretrained ResNet50 model [see notebook]

import ktrain
from ktrain import vision as vis

# load data
(train_data, val_data, preproc) = vis.images_from_folder(
                                              datadir='data/dogscats',
                                              data_aug = vis.get_data_aug(horizontal_flip=True),
                                              train_test_names=['train', 'valid'], 
                                              target_size=(224,224), color_mode='rgb')

# load model
model = vis.image_classifier('pretrained_resnet50', train_data, val_data, freeze_layers=80)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model=model, train_data=train_data, val_data=val_data, 
                             workers=8, use_multiprocessing=False, batch_size=64)

# find good learning rate
learner.lr_find()             # briefly simulate training to find good learning rate
learner.lr_plot()             # visually identify best learning rate

# train using triangular policy with ModelCheckpoint and implicit ReduceLROnPlateau and EarlyStopping
learner.autofit(1e-4, checkpoint_folder='/tmp/saved_weights') 

Example: Sequence Labeling for Named Entity Recognition using a randomly initialized Bidirectional LSTM CRF model [see notebook]

import ktrain
from ktrain import text as txt

# load data
(trn, val, preproc) = txt.entities_from_txt('data/ner_dataset.csv',
                                            sentence_column='Sentence #',
                                            word_column='Word',
                                            tag_column='Tag', 
                                            data_format='gmb',
                                            use_char=True) # enable character embeddings

# load model
model = txt.sequence_tagger('bilstm-crf', preproc)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model, train_data=trn, val_data=val)


# conventional training for 1 epoch using a learning rate of 0.001 (Keras default for Adam optimizer)
learner.fit(1e-3, 1) 
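
The trained tagger can then be wrapped in a predictor to extract entities from raw sentences (a brief sketch; the sample sentence is illustrative):

# predict returns a list of (word, tag) pairs for the sentence
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict('As of 2019, Paris is the capital of France.')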

Example: Node Classification on Cora Citation Graph using a GraphSAGE model [see notebook]

import ktrain
from ktrain import graph as gr

# load data with supervision ratio of 10%
(trn, val, preproc) = gr.graph_nodes_from_csv(
                                              'cora.content', # node attributes/labels
                                              'cora.cites',   # edge list
                                              sample_size=20,
                                              holdout_pct=None,
                                              holdout_for_inductive=False,
                                              train_pct=0.1, sep='\t')

# load model
model = gr.graph_node_classifier('graphsage', trn)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=64)


# find good learning rate
learner.lr_find(max_epochs=100) # briefly simulate training to find good learning rate
learner.lr_plot()               # visually identify best learning rate

# train using triangular policy with ModelCheckpoint and implicit ReduceLROnPlateau and EarlyStopping
learner.autofit(0.01, checkpoint_folder='/tmp/saved_weights')

Example: Text Classification with Hugging Face Transformers on 20 Newsgroups Dataset Using DistilBERT [see notebook]

# load text data
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
test_b = fetch_20newsgroups(subset='test', categories=categories, shuffle=True)
(x_train, y_train) = (train_b.data, train_b.target)
(x_test, y_test) = (test_b.data, test_b.target)

# build, train, and validate model (Transformer is wrapper around transformers library)
import ktrain
from ktrain import text
MODEL_NAME = 'distilbert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=train_b.target_names)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(5e-5, 4)
learner.validate(class_names=t.get_classes()) # class_names must be string values

# Output from learner.validate()
#                        precision    recall  f1-score   support
#
#           alt.atheism       0.92      0.93      0.93       319
#         comp.graphics       0.97      0.97      0.97       389
#               sci.med       0.97      0.95      0.96       396
#soc.religion.christian       0.96      0.96      0.96       398
#
#              accuracy                           0.96      1502
#             macro avg       0.95      0.96      0.95      1502
#          weighted avg       0.96      0.96      0.96      1502
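
To inspect misclassified examples and explain individual predictions (as described in the Overview), something like the following can be appended to the example above; note that predictor.explain relies on the forked eli5 library mentioned under Installation:

# view the validation example with the highest loss
learner.view_top_losses(n=1, preproc=t)

# wrap model and preprocessing steps, then explain a single prediction
predictor = ktrain.get_predictor(learner.model, preproc=t)
predictor.explain('Jesus Christ is the central figure of Christianity.')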

Example: Tabular Classification for Titanic Survival Prediction Using an MLP [see notebook]

import ktrain
from ktrain import tabular
import pandas as pd
train_df = pd.read_csv('train.csv', index_col=0)
train_df = train_df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
trn, val, preproc = tabular.tabular_from_df(train_df, label_columns=['Survived'], random_state=42)
learner = ktrain.get_learner(tabular.tabular_classifier('mlp', trn), train_data=trn, val_data=val)
learner.lr_find(show_plot=True, max_epochs=5) # estimate learning rate
learner.fit_onecycle(5e-3, 10)

# evaluate held-out labeled test set
tst = preproc.preprocess_test(pd.read_csv('heldout.csv', index_col=0))
learner.evaluate(tst, class_names=preproc.get_classes())
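
The prediction API covers tabular models as well; here is a brief sketch that returns survival probabilities for the held-out passengers (assuming the predictor's return_proba option):

# predict class probabilities for the held-out passengers
predictor = ktrain.get_predictor(learner.model, preproc)
preds = predictor.predict(pd.read_csv('heldout.csv', index_col=0), return_proba=True)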

Using ktrain on Google Colab? See these Colab examples.

Additional examples can be found here.

Installation

  1. Make sure pip is up-to-date with: pip install -U pip

  2. Install TensorFlow 2 if it is not already installed (e.g., pip install tensorflow)

  3. Install ktrain: pip install ktrain

The above should be all you need on Linux systems and cloud computing environments like Google Colab and AWS EC2. If you are using ktrain on a Windows computer, you can follow these more detailed instructions that include some extra steps.

Some important things to note about installation:

  • If using ktrain with tensorflow<=2.1, you must also downgrade the transformers library to transformers==3.1.
  • As of v0.21.x, ktrain no longer installs TensorFlow 2 automatically. As indicated above, you should install TensorFlow 2 yourself before installing and using ktrain. On Google Colab, TensorFlow 2 should already be installed. You should be able to use ktrain with any version of TensorFlow 2. Note, however, that TensorFlow 2.2 and 2.3 contain a bug affecting the Learning-Rate-Finder that will not be fixed until TensorFlow 2.4: the learning-rate finder completes all epochs even after the loss has diverged (i.e., no automatic stopping).
  • If using ktrain on a local machine with a GPU (versus Google Colab, for example), you'll need to install GPU support for TensorFlow 2 (a quick check is sketched after this list).
  • Since some ktrain dependencies have not yet been migrated to tf.keras in TensorFlow 2 (or may have other issues), ktrain is temporarily using forked versions of some libraries. Specifically, ktrain uses forked versions of the eli5 and stellargraph libraries. If not installed, ktrain will complain when a method or function needing either of these libraries is invoked. To install these forked versions, you can do the following:
pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1
pip install git+https://github.com/amaiya/stellargraph@no_tf_dep_082
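
To verify that TensorFlow 2 can see your GPU after installing GPU support (the quick check referenced above; plain TensorFlow, not ktrain-specific):

import tensorflow as tf

# prints a non-empty list of PhysicalDevice objects if GPU support is configured
print(tf.config.list_physical_devices('GPU'))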

This code was tested on Ubuntu 18.04 LTS using TensorFlow 2.3.1 and Python 3.6.9.

How to Cite

Please cite the following paper when using ktrain:

@article{maiya2020ktrain,
    title={ktrain: A Low-Code Library for Augmented Machine Learning},
    author={Arun S. Maiya},
    year={2020},
    eprint={2004.10703},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    journal={arXiv preprint arXiv:2004.10703},
}


Creator: Arun S. Maiya

Email: arun [at] maiya [dot] net
