nyanp / Nyaggle

License: MIT
Code for Kaggle and Offline Competitions

Projects that are alternatives of or similar to Nyaggle

Machinejs
[UNMAINTAINED] Automated machine learning- just give it a data file! Check out the production-ready version of this project at ClimbsRocks/auto_ml
Stars: ✭ 412 (+97.13%)
Mutual labels:  kaggle, ml
Machinelearningcourse
A collection of notebooks of my Machine Learning class written in python 3
Stars: ✭ 35 (-83.25%)
Mutual labels:  kaggle, ml
Kaggler
Code for Kaggle Data Science Competitions
Stars: ✭ 614 (+193.78%)
Mutual labels:  kaggle, feature-engineering
Lightautoml
LAMA - automatic model creation framework
Stars: ✭ 196 (-6.22%)
Mutual labels:  kaggle, feature-engineering
Feast
Feature Store for Machine Learning
Stars: ✭ 2,576 (+1132.54%)
Mutual labels:  ml, feature-engineering
Rgf
Home repository for the Regularized Greedy Forest (RGF) library. It includes original implementation from the paper and multithreaded one written in C++, along with various language-specific wrappers.
Stars: ✭ 341 (+63.16%)
Mutual labels:  kaggle, ml
Kaggle Quora Question Pairs
Kaggle:Quora Question Pairs, 4th/3396 (https://www.kaggle.com/c/quora-question-pairs)
Stars: ✭ 705 (+237.32%)
Mutual labels:  kaggle, feature-engineering
Bike-Sharing-Demand-Kaggle
Top 5th percentile solution to the Kaggle knowledge problem - Bike Sharing Demand
Stars: ✭ 33 (-84.21%)
Mutual labels:  kaggle, feature-engineering
Ml Dl Scripts
The repository provides useful python scripts for ML and data analysis
Stars: ✭ 119 (-43.06%)
Mutual labels:  kaggle, ml
Kaggle Competitions
There are plenty of courses and tutorials that can help you learn machine learning from scratch but here in GitHub, I want to solve some Kaggle competitions as a comprehensive workflow with python packages. After reading, you can use this workflow to solve other real problems and use it as a template.
Stars: ✭ 86 (-58.85%)
Mutual labels:  kaggle, feature-engineering
sklearn-feature-engineering
Feature engineering with sklearn
Stars: ✭ 114 (-45.45%)
Mutual labels:  kaggle, feature-engineering
Transmogrifai
TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
Stars: ✭ 2,084 (+897.13%)
Mutual labels:  ml, feature-engineering
fastknn
Fast k-Nearest Neighbors Classifier for Large Datasets
Stars: ✭ 64 (-69.38%)
Mutual labels:  kaggle, feature-engineering
Open Solution Home Credit
Open solution to the Home Credit Default Risk challenge 🏡
Stars: ✭ 397 (+89.95%)
Mutual labels:  kaggle, feature-engineering
kaggle
Kaggle solutions
Stars: ✭ 17 (-91.87%)
Mutual labels:  ml, kaggle
Hyperparameter hunter
Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries
Stars: ✭ 648 (+210.05%)
Mutual labels:  ml, feature-engineering
kaggle-berlin
Material of the Kaggle Berlin meetup group!
Stars: ✭ 36 (-82.78%)
Mutual labels:  kaggle, feature-engineering
Quora-Paraphrase-Question-Identification
Paraphrase question identification using Feature Fusion Network (FFN).
Stars: ✭ 19 (-90.91%)
Mutual labels:  kaggle, feature-engineering
Home Credit Default Risk
Default risk prediction for Home Credit competition - Fast, scalable and maintainable SQL-based feature engineering pipeline
Stars: ✭ 68 (-67.46%)
Mutual labels:  kaggle, feature-engineering
Machine Learning Workflow With Python
This is a comprehensive ML techniques with python: Define the Problem- Specify Inputs & Outputs- Data Collection- Exploratory data analysis -Data Preprocessing- Model Design- Training- Evaluation
Stars: ✭ 157 (-24.88%)
Mutual labels:  kaggle, feature-engineering

nyaggle

Documentation | Slide (Japanese)

nyaggle is a utility library for Kaggle and offline competitions, particularly focused on experiment tracking, feature engineering and validation.

  • nyaggle.ensemble - Averaging & stacking
  • nyaggle.experiment - Experiment tracking
  • nyaggle.feature_store - Lightweight feature storage using feather-format
  • nyaggle.feature - sklearn-compatible features
  • nyaggle.hyper_parameters - Collection of GBDT hyper-parameters used in past Kaggle competitions
  • nyaggle.validation - Adversarial validation & sklearn-compatible CV splitters

Installation

You can install nyaggle via pip:

$ pip install nyaggle

Examples

Experiment Tracking

run_experiment() is a high-level API for running experiments with cross-validation. It outputs parameters, metrics, out-of-fold predictions, test predictions, feature importance and submission.csv under the specified directory.

It can be combined with mlflow tracking.

from sklearn.model_selection import train_test_split

from nyaggle.experiment import run_experiment
from nyaggle.testing import make_classification_df

X, y = make_classification_df()
X_train, X_test, y_train, y_test = train_test_split(X, y)

params = {
    'n_estimators': 1000,
    'max_depth': 8
}

result = run_experiment(params,
                        X_train,
                        y_train,
                        X_test)

# A single API call returns the outputs typically needed in data science competitions

print(result.test_prediction)  # Test prediction in numpy array
print(result.oof_prediction)   # Out-of-fold prediction in numpy array
print(result.models)           # Trained models for each fold
print(result.importance)       # Feature importance for each fold
print(result.metrics)          # Evaluation metrics for each fold
print(result.time)             # Elapsed time
print(result.submission_df)    # The output dataframe saved as submission.csv

# ...and all outputs have been saved under the logging directory (default: output/yyyymmdd_HHMMSS).


# You can use it with mlflow and track your experiments through mlflow-ui
result = run_experiment(params,
                        X_train,
                        y_train,
                        X_test,
                        with_mlflow=True)

nyaggle also has a low-level API with an interface similar to mlflow tracking and wandb.

import numpy as np
import pandas as pd

from nyaggle.experiment import Experiment

# placeholder data for illustration; replace with your own predictions and submission
predicted = np.zeros(10)
sub = pd.DataFrame({'id': range(10), 'y': predicted})

with Experiment(logging_directory='./output/') as exp:
    # log key-value pair as a parameter
    exp.log_param('lr', 0.01)
    exp.log_param('optimizer', 'adam')

    # log text
    exp.log('blah blah blah')

    # log metric
    exp.log_metric('CV', 0.85)

    # log numpy ndarray, pandas dataframe and any artifacts
    exp.log_numpy('predicted', predicted)
    exp.log_dataframe('submission', sub, file_format='csv')
    exp.log_artifact('path-to-your-file')
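
To make the pattern behind this low-level API concrete, here is a minimal sketch of a context-manager logger written with the standard library only. It is not nyaggle's implementation: the class name MiniExperiment and the file names params.json, metrics.json and log.txt are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


class MiniExperiment:
    """Hypothetical stand-in for an experiment logger (not nyaggle's code)."""

    def __init__(self, logging_directory):
        self.dir = Path(logging_directory)
        self.params = {}
        self.metrics = {}

    def __enter__(self):
        # Create the output directory when the experiment starts
        self.dir.mkdir(parents=True, exist_ok=True)
        return self

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, name, score):
        self.metrics[name] = score

    def log(self, text):
        # Append free-form text to a log file
        with open(self.dir / 'log.txt', 'a') as f:
            f.write(text + '\n')

    def __exit__(self, exc_type, exc_value, traceback):
        # Write parameters and metrics on exit, even if the body raised
        (self.dir / 'params.json').write_text(json.dumps(self.params, indent=2))
        (self.dir / 'metrics.json').write_text(json.dumps(self.metrics, indent=2))
        return False  # do not swallow exceptions


out_dir = Path(tempfile.mkdtemp()) / 'output'
with MiniExperiment(out_dir) as exp:
    exp.log_param('lr', 0.01)
    exp.log_metric('CV', 0.85)
    exp.log('blah blah blah')

saved_params = json.loads((out_dir / 'params.json').read_text())
saved_metrics = json.loads((out_dir / 'metrics.json').read_text())
print(saved_params, saved_metrics)
```

Writing the files in __exit__ means a run that crashes halfway still leaves its parameters on disk, which is the main reason to prefer a with-block over manual open/close calls.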

Feature Engineering

Target Encoding with K-Fold

import pandas as pd

from sklearn.model_selection import KFold
from nyaggle.feature.category_encoder import TargetEncoder


train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

# categorical columns are the ones stored with object dtype
cat_cols = [c for c in train.columns if train[c].dtype == object]
target_col = 'y'

kf = KFold(5)

# Target encoding with K-fold
te = TargetEncoder(kf.split(train))

# call fit/fit_transform on the training data, then apply transform to the test data
train.loc[:, cat_cols] = te.fit_transform(train[cat_cols], train[target_col])
test.loc[:, cat_cols] = te.transform(test[cat_cols])

# ... or just call fit_transform on the concatenated data
all_df.loc[:, cat_cols] = te.fit_transform(all_df[cat_cols], all_df[target_col])
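
The reason for passing K-fold splits to the encoder is leakage: each row should be encoded with a target mean that was computed without that row's own label. The following standard-library sketch shows the idea on a toy column. It is an illustration only, not how TargetEncoder is implemented; the helper names kfold_indices and oof_target_encode are hypothetical, and the fallback to the fold mean for unseen categories is a choice made for this sketch.

```python
from collections import defaultdict


def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) for contiguous, unshuffled folds."""
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, valid_idx
        start = stop


def oof_target_encode(categories, targets, n_splits):
    """Encode each row with its category's target mean from the other folds."""
    encoded = [None] * len(categories)
    for train_idx, valid_idx in kfold_indices(len(categories), n_splits):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for i in train_idx:
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1
        # Fallback for categories unseen in the training part of this fold
        fold_mean = sum(targets[i] for i in train_idx) / len(train_idx)
        for i in valid_idx:
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if counts[c] else fold_mean
    return encoded


categories = ['a', 'b', 'a', 'b', 'a', 'b']
targets = [1, 0, 1, 1, 0, 0]
encoded = oof_target_encode(categories, targets, n_splits=3)
print(encoded)  # [0.5, 0.5, 0.5, 0.0, 1.0, 0.5]
```

Row 4 (category 'a', target 0) is encoded as 1.0 because the two other 'a' rows visible to its fold both have target 1; its own label never contributes to its encoding.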

Text Vectorization using BERT

You need to install pytorch in your virtual environment to use BertSentenceVectorizer. MeCab and mecab-python3 are also required if you use the Japanese BERT model.

import pandas as pd
from nyaggle.feature.nlp import BertSentenceVectorizer


train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_df = pd.concat([train, test]).copy()

text_cols = ['body']
target_col = 'y'
group_col = 'user_id'


# extract BERT-based sentence vector
bv = BertSentenceVectorizer(text_columns=text_cols)

text_vector = bv.fit_transform(train)


# BERT + SVD, with cuda
bv = BertSentenceVectorizer(text_columns=text_cols, use_cuda=True, n_components=40)

text_vector_svd = bv.fit_transform(train)

# Japanese BERT
bv = BertSentenceVectorizer(text_columns=text_cols, lang='jp')

japanese_text_vector = bv.fit_transform(train)

Adversarial Validation

import pandas as pd
from nyaggle.validation import adversarial_validate

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

auc, importance = adversarial_validate(train, test, importance_type='gain')
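
Adversarial validation labels training rows 0 and test rows 1, then asks how well a model can tell them apart: an AUC near 0.5 suggests the two sets look alike, while an AUC near 1.0 signals a distribution shift. The sketch below is a deliberately tiny stand-in that uses the raw value of a single feature as the score instead of a trained classifier. It is not how adversarial_validate works internally, and the function names are hypothetical.

```python
def auc(scores, labels):
    """AUC by pair counting: P(positive score > negative score), ties count 0.5."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def single_feature_adversarial_auc(train_values, test_values):
    """Label train rows 0 and test rows 1, then score them by the feature value."""
    scores = list(train_values) + list(test_values)
    labels = [0] * len(train_values) + [1] * len(test_values)
    return auc(scores, labels)


train_feature = [1, 2, 3, 4]
similar_test = [1, 2, 3, 4]      # same distribution as train
shifted_test = [10, 11, 12, 13]  # clearly shifted

auc_similar = single_feature_adversarial_auc(train_feature, similar_test)
auc_shifted = single_feature_adversarial_auc(train_feature, shifted_test)
print(auc_similar, auc_shifted)  # 0.5 1.0
```

In practice a real classifier is trained on all features with cross-validation, and its feature importance points at the columns responsible for the shift.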

Validation Splitters

nyaggle provides a set of validation splitters that are compatible with the sklearn interface.

import pandas as pd
from sklearn.model_selection import cross_validate, KFold
from nyaggle.validation import TimeSeriesSplit, Take, Skip, Nth

train = pd.read_csv('train.csv', parse_dates=['dt'])

# time-series split
ts = TimeSeriesSplit(train['dt'])
ts.add_fold(train_interval=('2019-01-01', '2019-01-10'), test_interval=('2019-01-10', '2019-01-20'))
ts.add_fold(train_interval=('2019-01-06', '2019-01-15'), test_interval=('2019-01-15', '2019-01-25'))

cross_validate(..., cv=ts)

# take the first 3 folds out of 10
cross_validate(..., cv=Take(3, KFold(10)))

# skip the first 3 folds, and evaluate the remaining 7 folds
cross_validate(..., cv=Skip(3, KFold(10)))

# evaluate 1st fold
cross_validate(..., cv=Nth(1, ts))
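
The semantics of Take, Skip and Nth can be pictured as slicing the stream of (train, validation) index pairs produced by a base splitter. The sketch below reproduces that behaviour with itertools.islice on a hand-written K-fold generator. It is an illustration under assumptions, not nyaggle's implementation: the lowercase helpers are hypothetical, and the 1-based indexing of nth is inferred from the "evaluate 1st fold" comment above.

```python
from itertools import islice


def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) for contiguous, unshuffled folds."""
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, valid_idx
        start = stop


def take(n, splits):
    """Keep only the first n folds."""
    return islice(splits, n)


def skip(n, splits):
    """Drop the first n folds and keep the rest."""
    return islice(splits, n, None)


def nth(n, splits):
    """Keep only the n-th fold (1-indexed)."""
    return islice(splits, n - 1, n)


def valid_parts(folds):
    """Collect the validation indices of each fold for easy inspection."""
    return [valid_idx for _, valid_idx in folds]


taken = valid_parts(take(2, kfold_indices(10, 5)))
skipped = valid_parts(skip(3, kfold_indices(10, 5)))
first = valid_parts(nth(1, kfold_indices(10, 5)))
print(taken)    # [[0, 1], [2, 3]]
print(skipped)  # [[6, 7], [8, 9]]
print(first)    # [[0, 1]]
```

Evaluating only a few folds this way keeps each fold's train/validation boundary identical to the full K-fold run, so scores stay comparable while the run gets cheaper.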

Other Awesome Repositories

Here is a list of awesome repositories that provide general utility functions for data science competitions. Please let me know if you have another one :)
