
chakki-works / chariot

License: Apache-2.0
Deliver the ready-to-train data to your NLP model.

Programming Languages

  • Jupyter Notebook: 11667 projects
  • Python: 139335 projects (#7 most used programming language)

Projects that are alternatives of or similar to chariot

BrainPrep
Preprocessing pipeline on Brain MR Images through FSL and ANTs, including registration, skull-stripping, bias field correction, enhancement and segmentation.
Stars: ✭ 107 (-13.01%)
Mutual labels:  preprocessing
AutoTS
Automated Time Series Forecasting
Stars: ✭ 665 (+440.65%)
Mutual labels:  preprocessing
multi-imbalance
Python package for tackling multi-class imbalance problems. http://www.cs.put.poznan.pl/mlango/publications/multiimbalance/
Stars: ✭ 66 (-46.34%)
Mutual labels:  preprocessing
preprocess-conll05
Scripts for preprocessing the CoNLL-2005 SRL dataset.
Stars: ✭ 17 (-86.18%)
Mutual labels:  preprocessing
Igel
a delightful machine learning tool that allows you to train, test, and use models without writing code
Stars: ✭ 2,956 (+2303.25%)
Mutual labels:  preprocessing
remote-dataloader
PyTorch DataLoader processed in multiple remote computation machines for heavy data processings
Stars: ✭ 61 (-50.41%)
Mutual labels:  preprocessing
pywedge
Makes Interactive Chart Widget, Cleans raw data, Runs baseline models, Interactive hyperparameter tuning & tracking
Stars: ✭ 49 (-60.16%)
Mutual labels:  preprocessing
dmriprep
dMRIPrep is a robust and easy-to-use pipeline for preprocessing of diverse dMRI data. The transparent workflow dispenses of manual intervention, thereby ensuring the reproducibility of the results.
Stars: ✭ 55 (-55.28%)
Mutual labels:  preprocessing
torcharrow
High performance model preprocessing library on PyTorch
Stars: ✭ 566 (+360.16%)
Mutual labels:  preprocessing
NVTabular
NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Stars: ✭ 797 (+547.97%)
Mutual labels:  preprocessing
podium
Podium: a framework agnostic Python NLP library for data loading and preprocessing
Stars: ✭ 55 (-55.28%)
Mutual labels:  preprocessing
HyperGBM
A full pipeline AutoML tool for tabular data
Stars: ✭ 172 (+39.84%)
Mutual labels:  preprocessing
Machine Learning
A repository of resources for understanding the concepts of machine learning/deep learning.
Stars: ✭ 29 (-76.42%)
Mutual labels:  preprocessing
postcss-each
PostCSS plugin to iterate through values
Stars: ✭ 93 (-24.39%)
Mutual labels:  preprocessing
arraymancer-vision
Simple library for image loading, preprocessing and visualization for working with arraymancer.
Stars: ✭ 28 (-77.24%)
Mutual labels:  preprocessing
dropEst
Pipeline for initial analysis of droplet-based single-cell RNA-seq data
Stars: ✭ 71 (-42.28%)
Mutual labels:  preprocessing
3D Ground Segmentation
A ground segmentation algorithm for 3D point clouds based on the work described in “Fast segmentation of 3D point clouds: a paradigm on LIDAR data for Autonomous Vehicle Applications”, D. Zermas, I. Izzat and N. Papanikolopoulos, 2017. Distinguish between road and non-road points. Road surface extraction. Plane fit ground filter
Stars: ✭ 55 (-55.28%)
Mutual labels:  preprocessing
Start maja
To process a Sentinel-2 time series with MAJA cloud detection and atmospheric correction processor
Stars: ✭ 47 (-61.79%)
Mutual labels:  preprocessing
prospectr
R package: Misc. Functions for Processing and Sample Selection of Spectroscopic Data
Stars: ✭ 26 (-78.86%)
Mutual labels:  preprocessing
preprocessy
Python package for Customizable Data Preprocessing Pipelines
Stars: ✭ 34 (-72.36%)
Mutual labels:  preprocessing

chariot

[badges: PyPI version · Build Status · codecov]

Deliver the ready-to-train data to your NLP model.

  • Prepare Dataset
    • You can prepare typical NLP datasets through chazutsu.
  • Build & Run Preprocess
    • You can build a preprocessing pipeline like scikit-learn's Pipeline.
    • The preprocess for each dataset column is executed in parallel by Joblib.
    • Multi-language text tokenization is supported by spaCy.
  • Format Batch
    • Sample a batch from the preprocessed dataset and format it for training the model (padding, etc.).
    • You can use pre-trained word vectors through chakin.

chariot enables you to concentrate on training your model!

[chariot flow diagram]

Install

pip install chariot

Prepare dataset

You can download various datasets by using chazutsu.

import chazutsu
from chariot.storage import Storage


storage = Storage("your/data/root")
r = chazutsu.datasets.MovieReview.polarity().download(storage.path("raw"))

df = storage.chazutsu(r.root).data()
df.head(5)

Then

	polarity	review
0	0	synopsis : an aging master art thief , his sup...
1	0	plot : a separated , glamorous , hollywood cou...
2	0	a friend invites you to a movie . this film wo...

The Storage class manages a directory structure that follows Cookiecutter Data Science.

Project root
  └── data
       ├── external     <- Data from third party sources (ex. word vectors).
       ├── interim      <- Intermediate data that has been transformed.
       ├── processed    <- The final, canonical datasets for modeling.
       └── raw          <- The original, immutable data dump.
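
For example, the paths you pass to storage.path are resolved against this layout. A minimal sketch (the processed file name below is illustrative, not part of chariot):

from chariot.storage import Storage

storage = Storage("your/data/root")

raw_dir = storage.path("raw")  # the original, immutable data dump
vectors = storage.path("external/glove.6B.50d.txt")  # third party data such as word vectors
processed = storage.path("processed/movie_review_preprocessed.tar.gz")  # illustrative file name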

Build & Run Preprocess

Build a preprocess pipeline

All preprocessors are defined in chariot.transformer.
Transformers are implemented by extending scikit-learn's Transformer.
Because of this, the Transformer API will be familiar to you, and you can mix in scikit-learn's own preprocessors.

import chariot.transformer as ct
from chariot.preprocessor import Preprocessor


# train_data: raw texts to fit on (e.g. df["review"] from the chazutsu step above)
preprocessor = Preprocessor()
preprocessor\
    .stack(ct.text.UnicodeNormalizer())\
    .stack(ct.Tokenizer("en"))\
    .stack(ct.token.StopwordFilter("en"))\
    .stack(ct.Vocabulary(min_df=5, max_df=0.5))\
    .fit(train_data)

preprocessor.save("my_preprocessor.pkl")

loaded = Preprocessor.load("my_preprocessor.pkl")
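
Because the stacked transformers follow the scikit-learn Transformer interface, the usual fit/transform style should carry over. A minimal sketch (calling transform directly on the restored preprocessor, and the sample texts, are assumptions for illustration, not taken from the chariot docs):

# Assumption: the restored Preprocessor exposes a scikit-learn style transform().
sample_texts = [
    "chariot delivers ready-to-train data to your NLP model.",
    "Its transformers follow the scikit-learn Transformer API.",
]
indices = loaded.transform(sample_texts)  # texts -> tokens -> vocabulary indices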

There are 6 types of transformers prepared in chariot.

  • TextPreprocessor
    • Preprocess the text before tokenization.
    • TextNormalizer: Normalize the text (replace some characters, etc.).
    • TextFilter: Filter the text (delete some spans in the text, etc.).
  • Tokenizer
    • Tokenize the texts.
    • It is powered by spaCy, and you can choose MeCab or Janome for Japanese.
  • TokenPreprocessor
    • Normalize/Filter the tokens after tokenization.
    • TokenNormalizer: Normalize tokens (lowercase, original form, etc.).
    • TokenFilter: Filter tokens (extract only nouns, etc.).
  • Vocabulary
    • Make vocabulary and convert tokens to indices.
  • Formatter
    • Format (preprocessed) data for training your model.
  • Generator
    • Generate target data to train your (language) model.

Build a preprocess for dataset

When you want to apply a preprocess to each column of your dataset, you can use DatasetPreprocessor.

import chariot.transformer as ct
from chariot.dataset_preprocessor import DatasetPreprocessor
from chariot.transformer.formatter import Padding


# train_data: a DataFrame with "review" and "polarity" columns (e.g. a split of df above)
pad_length = 100  # example value: pad/truncate each review to 100 tokens

dp = DatasetPreprocessor()
dp.process("review")\
    .by(ct.text.UnicodeNormalizer())\
    .by(ct.Tokenizer("en"))\
    .by(ct.token.StopwordFilter("en"))\
    .by(ct.Vocabulary(min_df=5, max_df=0.5))\
    .by(Padding(length=pad_length))\
    .fit(train_data["review"])
dp.process("polarity")\
    .by(ct.formatter.CategoricalLabel(num_class=3))


preprocessed = dp.preprocess(data)

# DatasetPreprocessor holds multiple preprocessors,
# so its save file format is `tar.gz`.
dp.save("my_dataset_preprocessor.tar.gz")

loaded = DatasetPreprocessor.load("my_dataset_preprocessor.tar.gz")

Train your model with chariot

chariot has features to help you train your model.

# Preprocess and format the whole dataset at once
formatted = dp(train_data).preprocess().format().processed

model.fit(formatted["review"], formatted["polarity"], batch_size=32,
          validation_split=0.2, epochs=15, verbose=2)

# Or iterate over preprocessed batches
for batch in dp(train_data).preprocess().iterate(batch_size=32, epoch=10):
    model.train_on_batch(batch["review"], batch["polarity"])

You can use pre-trained word vectors through chakin.

from chariot.storage import Storage
from chariot.transformer.vocabulary import Vocabulary

# Download word vector
storage = Storage("your/data/root")
storage.chakin(name="GloVe.6B.50d")

# Make embedding matrix
vocab = Vocabulary()
vocab.set(["you", "loaded", "word", "vector", "now"])
embed = vocab.make_embedding(storage.path("external/glove.6B.50d.txt"))
print(embed.shape)  # (len(vocab.count), 50)
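
A common next step is to plug this matrix into your model's embedding layer. A minimal sketch assuming a Keras model (the layer configuration is illustrative and not part of chariot):

from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

# Initialize the embedding layer from the matrix built above and
# keep the pre-trained vectors fixed during training.
embedding_layer = Embedding(
    input_dim=embed.shape[0],   # vocabulary size
    output_dim=embed.shape[1],  # 50 for GloVe.6B.50d
    embeddings_initializer=Constant(embed),
    trainable=False)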