
duxuhao / Feature Selection

License: MIT
Feature selector based on a user-selected algorithm, loss function, and validation method

Programming Languages

Python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to Feature Selection

Tsfel
An intuitive library to extract features from time series
Stars: ✭ 202 (-62.17%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Deep Learning Machine Learning Stock
Stock for Deep Learning and Machine Learning
Stars: ✭ 240 (-55.06%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Amazing Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered applied machine learning itself.
Stars: ✭ 218 (-59.18%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Kaggle Competitions
There are plenty of courses and tutorials that can help you learn machine learning from scratch, but here on GitHub I want to solve some Kaggle competitions as a comprehensive workflow with Python packages. After reading, you can use this workflow to solve other real problems and use it as a template.
Stars: ✭ 86 (-83.9%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Deltapy
DeltaPy - Tabular Data Augmentation (by @firmai)
Stars: ✭ 344 (-35.58%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Awesome Feature Engineering
A curated list of resources dedicated to Feature Engineering Techniques for Machine Learning
Stars: ✭ 433 (-18.91%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Blurr
Data transformations for the ML era
Stars: ✭ 96 (-82.02%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Nni
An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
Stars: ✭ 10,698 (+1903.37%)
Mutual labels:  data-science, feature-extraction, feature-engineering
tsflex
Flexible time series feature extraction & processing
Stars: ✭ 252 (-52.81%)
Mutual labels:  feature-extraction, feature-engineering
50-days-of-Statistics-for-Data-Science
This repository consists of a 50-day program. All the statistics required for a complete understanding of data science will be uploaded here.
Stars: ✭ 19 (-96.44%)
Mutual labels:  feature-extraction, feature-engineering
autoencoders tensorflow
Automatic feature engineering using deep learning and Bayesian inference, implemented in TensorFlow.
Stars: ✭ 66 (-87.64%)
Mutual labels:  feature-extraction, feature-engineering
gan tensorflow
Automatic feature engineering using Generative Adversarial Networks, implemented in TensorFlow.
Stars: ✭ 48 (-91.01%)
Mutual labels:  feature-extraction, feature-engineering
Lightautoml
LAMA - automatic model creation framework
Stars: ✭ 196 (-63.3%)
Mutual labels:  data-science, feature-engineering
fastknn
Fast k-Nearest Neighbors Classifier for Large Datasets
Stars: ✭ 64 (-88.01%)
Mutual labels:  feature-extraction, feature-engineering
mistql
A miniature lisp-like language for querying JSON-like structures. Tuned for clientside ML feature extraction.
Stars: ✭ 260 (-51.31%)
Mutual labels:  feature-extraction, feature-engineering
Bike-Sharing-Demand-Kaggle
Top 5th percentile solution to the Kaggle knowledge problem - Bike Sharing Demand
Stars: ✭ 33 (-93.82%)
Mutual labels:  feature-extraction, feature-engineering
featurewiz
Use advanced feature engineering strategies and select best features from your data set with a single line of code.
Stars: ✭ 229 (-57.12%)
Mutual labels:  feature-extraction, feature-engineering
Nlpython
This repository contains code related to Natural Language Processing using the Python scripting language. All the code is related to my book "Python Natural Language Processing".
Stars: ✭ 265 (-50.37%)
Mutual labels:  feature-extraction, feature-engineering
My Data Competition Experience
A summary of my experience from multiple Top-5 finishes in machine learning and big data competitions, packed with practical takeaways. You're welcome.
Stars: ✭ 271 (-49.25%)
Mutual labels:  data-science, feature-engineering
Color recognition
🎨 Color recognition & classification & detection on a webcam stream / video / single image, using K-Nearest Neighbors (KNN) trained with color histogram features in OpenCV.
Stars: ✭ 154 (-71.16%)
Mutual labels:  data-science, feature-extraction

MLFeatureSelection

License: MIT · PyPI version

General feature selection based on a chosen machine learning algorithm and evaluation method

Diverse, flexible, and easy to use

More feature selection methods will be added in the future!

Quick Installation

pip3 install MLFeatureSelection
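
A quick sanity check that the installation succeeded is to import the package (nothing beyond the package name above is assumed here):

python3 -c "import MLFeatureSelection"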

Modules in version 0.0.9.5.1

  • Module for selecting features with a greedy algorithm (from MLFeatureSelection import sequence_selection)

  • Module for removing features based on feature importance (from MLFeatureSelection import importance_selection)

  • Module for removing features based on the correlation coefficient (from MLFeatureSelection import coherence_selection)

  • Module for reading a feature combination from the log file (from MLFeatureSelection.tools import readlog)

This feature selection method has achieved:

  • 1st in Rong360

-- https://github.com/duxuhao/rong360-season2

  • 6th in JData-2018

-- https://github.com/duxuhao/JData-2018

  • 12th in IJCAI-2018, 1st round

-- https://github.com/duxuhao/IJCAI-2018-2

Module Usage

Example

  • sequence_selection
from MLFeatureSelection import sequence_selection
from sklearn.linear_model import LogisticRegression

sf = sequence_selection.Select(Sequence=True, Random=True, Cross=False)
sf.ImportDF(df, label='Label') # import the dataframe and the label column name
sf.ImportLossFunction(lossfunction, direction='ascend') # loss function handle and optimization direction: 'ascend' for AUC, accuracy, etc.; 'descend' for logloss, etc.
sf.InitialNonTrainableFeatures(notusable) # features in the dataframe that are not trainable: user_id, strings, etc.
sf.InitialFeatures(initialfeatures) # initial feature combination, as a list
sf.GenerateCol() # generate candidate features for selection
sf.SetFeatureEachRound(50, False) # number of features per round, and how they are drawn from all features (True: random sampling; False: chunk by chunk)
sf.clf = LogisticRegression() # set the algorithm; can be any classifier with a scikit-learn-style interface
sf.SetLogFile('record.log') # log file
sf.run(validate) # run with the validation function handle; returns the best feature combination
  • importance_selection
from MLFeatureSelection import importance_selection
import xgboost as xgb

sf = importance_selection.Select()
sf.ImportDF(df, label='Label') # import the dataframe and the label column name
sf.ImportLossFunction(lossfunction, direction='ascend') # loss function handle and optimization direction
sf.InitialFeatures() # initial feature set
sf.SelectRemoveMode(batch=2) # remove 2 features per round
sf.clf = xgb.XGBClassifier() # algorithm whose feature importance is used
sf.SetLogFile('record.log') # log file
sf.run(validate) # run with the validation function handle; returns the best feature combination
  • coherence_selection
from MLFeatureSelection import coherence_selection
import xgboost as xgb

sf = coherence_selection.Select()
sf.ImportDF(df, label='Label') # import the dataframe and the label column name
sf.ImportLossFunction(lossfunction, direction='ascend') # loss function handle and optimization direction
sf.InitialFeatures() # initial feature set
sf.SelectRemoveMode(batch=2) # remove 2 features per round
sf.clf = xgb.XGBClassifier()
sf.SetLogFile('record.log') # log file
sf.run(validate) # run with the validation function handle; returns the best feature combination
  • tools.readlog: read previously selected features from the log
from MLFeatureSelection.tools import readlog

logfile = 'record.log'
logscore = 0.5 # any score recorded in the log file
features_combination = readlog(logfile, logscore) # returns the feature combination that achieved that score
  • tools.filldf: complete the dataset when there are cross-term features
from MLFeatureSelection.tools import readlog, filldf
import pandas as pd

def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

def times(x, y):
    return x * y

def divide(x, y):
    return x / y

def sq(x, y):
    return x ** 2 # squares the first argument; y is ignored (not used in CrossMethod below)


CrossMethod = {'+': add,
               '-': subtract,
               '*': times,
               '/': divide,
               } # set your own cross methods

df = pd.read_csv('XXX')
logfile = 'record.log'
logscore = 0.5 # any score recorded in the log file
features_combination = readlog(logfile, logscore)
df = filldf(df, features_combination, CrossMethod) # rebuild the cross-term features in the dataframe
  • format of validate and lossfunction

define your own:

validate: the validation method, implemented as a function, e.g. k-fold, last-time-slice validation, random-sampling validation, etc.

lossfunction: the model performance metric, e.g. logloss, AUC, accuracy, etc.

import numpy as np

def validate(X, y, features, clf, lossfunction):
    """Define your own validation function with 5 parameters:
    X, y, features, clf, lossfunction.
    clf is the algorithm you assigned to sf.clf;
    lossfunction is the loss function imported earlier;
    features is generated automatically.
    Returns the score and the trained classifier.
    """
    clf.fit(X[features], y)
    y_pred = clf.predict(X[features])
    score = lossfunction(y_pred, y)
    return score, clf

def lossfunction(y_pred, y_test):
    """Define your own loss function with y_pred and y_test.
    Returns the score.
    """
    return np.mean(y_pred == y_test) # accuracy
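
Putting the pieces together, here is a minimal end-to-end sketch using the sequence_selection API shown above; the synthetic dataframe, the column names f1-f4, and the parameter values are illustrative assumptions, not taken from the project:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from MLFeatureSelection import sequence_selection

def lossfunction(y_pred, y_test):
    return np.mean(y_pred == y_test) # accuracy, hence direction='ascend'

def validate(X, y, features, clf, lossfunction):
    clf.fit(X[features], y)
    return lossfunction(clf.predict(X[features]), y), clf

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(200, 4), columns=['f1', 'f2', 'f3', 'f4'])
df['Label'] = (df['f1'] + df['f2'] > 1).astype(int) # toy binary target

sf = sequence_selection.Select(Sequence=True, Random=True, Cross=False)
sf.ImportDF(df, label='Label')
sf.ImportLossFunction(lossfunction, direction='ascend')
sf.InitialNonTrainableFeatures(['Label']) # exclude the label from the candidate features
sf.InitialFeatures(['f1'])
sf.GenerateCol()
sf.SetFeatureEachRound(4, False)
sf.clf = LogisticRegression()
sf.SetLogFile('record.log')
best_features = sf.run(validate)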

Multiprocessing

Multiprocessing can be used inside the validate function when doing N-fold validation, as in the sketch below.
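
A sketch of one way to parallelize the folds inside validate, using joblib and scikit-learn's KFold; joblib, the 5-fold setup, and the helper _fit_fold are illustrative choices, not part of MLFeatureSelection:

import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.model_selection import KFold

def _fit_fold(clf, X, y, features, lossfunction, train_idx, test_idx):
    # fit a fresh clone on the training fold, score on the held-out fold
    clf.fit(X.iloc[train_idx][features], np.asarray(y)[train_idx])
    y_pred = clf.predict(X.iloc[test_idx][features])
    return lossfunction(y_pred, np.asarray(y)[test_idx])

def validate(X, y, features, clf, lossfunction):
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = Parallel(n_jobs=5)(
        delayed(_fit_fold)(clone(clf), X, y, features, lossfunction, tr, te)
        for tr, te in kf.split(X))
    clf.fit(X[features], y) # refit on the full data so a trained classifier is returned
    return np.mean(scores), clf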

DEMO

More examples can be found in the example folder, including:

  • A demo containing all modules (demo)

  • A simple Titanic example with 5-fold validation, evaluated by accuracy (demo)

  • A demo of the S1 and S2 score improvement in the JData 2018 purchase-time prediction competition (demo)

  • A demo for IJCAI 2018 CTR prediction (demo)

Function Parameters

Parameters

Algorithm details

Details
