
duxuhao / Feature Selection

License: MIT
Feature selector based on a user-selected algorithm, loss function, and validation method

Programming Languages

Python
139,335 projects - #7 most used programming language

Projects that are alternatives of or similar to Feature Selection

Tsfel
An intuitive library to extract features from time series
Stars: ✭ 202 (-62.17%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Deep Learning Machine Learning Stock
Stock for Deep Learning and Machine Learning
Stars: ✭ 240 (-55.06%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Amazing Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered applied machine learning itself.
Stars: ✭ 218 (-59.18%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Kaggle Competitions
There are plenty of courses and tutorials that can help you learn machine learning from scratch, but here on GitHub I want to solve some Kaggle competitions as a comprehensive workflow with Python packages. After reading, you can use this workflow to solve other real problems and use it as a template.
Stars: ✭ 86 (-83.9%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Deltapy
DeltaPy - Tabular Data Augmentation (by @firmai)
Stars: ✭ 344 (-35.58%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Awesome Feature Engineering
A curated list of resources dedicated to Feature Engineering Techniques for Machine Learning
Stars: ✭ 433 (-18.91%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Blurr
Data transformations for the ML era
Stars: ✭ 96 (-82.02%)
Mutual labels:  data-science, feature-extraction, feature-engineering
Nni
An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
Stars: ✭ 10,698 (+1903.37%)
Mutual labels:  data-science, feature-extraction, feature-engineering
tsflex
Flexible time series feature extraction & processing
Stars: ✭ 252 (-52.81%)
Mutual labels:  feature-extraction, feature-engineering
50-days-of-Statistics-for-Data-Science
This repository consists of a 50-day program. All the statistics required for a complete understanding of data science will be uploaded here.
Stars: ✭ 19 (-96.44%)
Mutual labels:  feature-extraction, feature-engineering
autoencoders tensorflow
Automatic feature engineering using deep learning and Bayesian inference, implemented in TensorFlow.
Stars: ✭ 66 (-87.64%)
Mutual labels:  feature-extraction, feature-engineering
gan tensorflow
Automatic feature engineering using Generative Adversarial Networks, implemented in TensorFlow.
Stars: ✭ 48 (-91.01%)
Mutual labels:  feature-extraction, feature-engineering
Lightautoml
LAMA - automatic model creation framework
Stars: ✭ 196 (-63.3%)
Mutual labels:  data-science, feature-engineering
fastknn
Fast k-Nearest Neighbors Classifier for Large Datasets
Stars: ✭ 64 (-88.01%)
Mutual labels:  feature-extraction, feature-engineering
mistql
A miniature lisp-like language for querying JSON-like structures. Tuned for clientside ML feature extraction.
Stars: ✭ 260 (-51.31%)
Mutual labels:  feature-extraction, feature-engineering
Bike-Sharing-Demand-Kaggle
Top 5th percentile solution to the Kaggle knowledge problem - Bike Sharing Demand
Stars: ✭ 33 (-93.82%)
Mutual labels:  feature-extraction, feature-engineering
featurewiz
Use advanced feature engineering strategies and select best features from your data set with a single line of code.
Stars: ✭ 229 (-57.12%)
Mutual labels:  feature-extraction, feature-engineering
Nlpython
This repository contains code related to Natural Language Processing using the Python scripting language. All the code is related to my book "Python Natural Language Processing".
Stars: ✭ 265 (-50.37%)
Mutual labels:  feature-extraction, feature-engineering
My Data Competition Experience
A summary of my experience from multiple Top-5 finishes in machine learning and big data competitions, packed with practical takeaways. You're welcome.
Stars: ✭ 271 (-49.25%)
Mutual labels:  data-science, feature-engineering
Color recognition
🎨 Color recognition & classification & detection on a webcam stream / video / single image, using K-Nearest Neighbors (KNN) trained with color histogram features in OpenCV.
Stars: ✭ 154 (-71.16%)
Mutual labels:  data-science, feature-extraction

MLFeatureSelection

License: MIT · PyPI version

General feature selection based on a chosen machine learning algorithm and evaluation method

Diverse, flexible, and easy to use

More feature selection methods will be added in the future!

Quick Installation

pip3 install MLFeatureSelection
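
A quick sanity check that the installation succeeded is to import the package (nothing beyond the package name above is assumed here):

python3 -c "import MLFeatureSelection"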

Modules in version 0.0.9.5.1

  • Module for selecting features with a greedy algorithm (from MLFeatureSelection import sequence_selection)

  • Module for removing features based on feature importance (from MLFeatureSelection import importance_selection)

  • Module for removing features based on the correlation coefficient (from MLFeatureSelection import coherence_selection)

  • Module for reading a feature combination from the log file (from MLFeatureSelection.tools import readlog)

This feature selection method has achieved:

  • 1st in Rong360

-- https://github.com/duxuhao/rong360-season2

  • 6th in JData-2018

-- https://github.com/duxuhao/JData-2018

  • 12th in IJCAI-2018, 1st round

-- https://github.com/duxuhao/IJCAI-2018-2

Module Usage

Example

  • sequence_selection
from MLFeatureSelection import sequence_selection
from sklearn.linear_model import LogisticRegression

sf = sequence_selection.Select(Sequence=True, Random=True, Cross=False)
sf.ImportDF(df, label='Label') # import the dataframe and the label column name
sf.ImportLossFunction(lossfunction, direction='ascend') # loss function handle and optimization direction: 'ascend' for AUC, accuracy, etc.; 'descend' for logloss, etc.
sf.InitialNonTrainableFeatures(notusable) # features in the dataframe that are not trainable: user_id, strings, etc.
sf.InitialFeatures(initialfeatures) # initial feature combination, as a list
sf.GenerateCol() # generate candidate features for selection
sf.SetFeatureEachRound(50, False) # number of features per round, and how they are drawn from all features (True: random sampling; False: chunk by chunk)
sf.clf = LogisticRegression() # set the algorithm; can be any classifier with a scikit-learn-style interface
sf.SetLogFile('record.log') # log file
sf.run(validate) # run with the validation function handle; returns the best feature combination
  • importance_selection
from MLFeatureSelection import importance_selection
import xgboost as xgb

sf = importance_selection.Select()
sf.ImportDF(df, label='Label') # import the dataframe and the label column name
sf.ImportLossFunction(lossfunction, direction='ascend') # loss function handle and optimization direction
sf.InitialFeatures() # initial feature set
sf.SelectRemoveMode(batch=2) # remove 2 features per round
sf.clf = xgb.XGBClassifier() # algorithm whose feature importance is used
sf.SetLogFile('record.log') # log file
sf.run(validate) # run with the validation function handle; returns the best feature combination
  • coherence_selection
from MLFeatureSelection import coherence_selection
import xgboost as xgb

sf = coherence_selection.Select()
sf.ImportDF(df, label='Label') # import the dataframe and the label column name
sf.ImportLossFunction(lossfunction, direction='ascend') # loss function handle and optimization direction
sf.InitialFeatures() # initial feature set
sf.SelectRemoveMode(batch=2) # remove 2 features per round
sf.clf = xgb.XGBClassifier()
sf.SetLogFile('record.log') # log file
sf.run(validate) # run with the validation function handle; returns the best feature combination
  • tools.readlog: read previously selected features from the log
from MLFeatureSelection.tools import readlog

logfile = 'record.log'
logscore = 0.5 # any score recorded in the log file
features_combination = readlog(logfile, logscore) # returns the feature combination that achieved that score
  • tools.filldf: complete the dataset when there are cross-term features
from MLFeatureSelection.tools import readlog, filldf
import pandas as pd

def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

def times(x, y):
    return x * y

def divide(x, y):
    return x / y

def sq(x, y):
    return x ** 2 # squares the first argument; y is ignored (not used in CrossMethod below)


CrossMethod = {'+': add,
               '-': subtract,
               '*': times,
               '/': divide,
               } # set your own cross methods

df = pd.read_csv('XXX')
logfile = 'record.log'
logscore = 0.5 # any score recorded in the log file
features_combination = readlog(logfile, logscore)
df = filldf(df, features_combination, CrossMethod) # rebuild the cross-term features in the dataframe
  • format of validate and lossfunction

define your own:

validate: the validation method, implemented as a function, e.g. k-fold, last-time-slice validation, random-sampling validation, etc.

lossfunction: the model performance metric, e.g. logloss, AUC, accuracy, etc.

import numpy as np

def validate(X, y, features, clf, lossfunction):
    """Define your own validation function with 5 parameters:
    X, y, features, clf, lossfunction.
    clf is the algorithm you assigned to sf.clf;
    lossfunction is the loss function imported earlier;
    features is generated automatically.
    Returns the score and the trained classifier.
    """
    clf.fit(X[features], y)
    y_pred = clf.predict(X[features])
    score = lossfunction(y_pred, y)
    return score, clf

def lossfunction(y_pred, y_test):
    """Define your own loss function with y_pred and y_test.
    Returns the score.
    """
    return np.mean(y_pred == y_test) # accuracy
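
Putting the pieces together, here is a minimal end-to-end sketch using the sequence_selection API shown above; the synthetic dataframe, the column names f1-f4, and the parameter values are illustrative assumptions, not taken from the project:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from MLFeatureSelection import sequence_selection

def lossfunction(y_pred, y_test):
    return np.mean(y_pred == y_test) # accuracy, hence direction='ascend'

def validate(X, y, features, clf, lossfunction):
    clf.fit(X[features], y)
    return lossfunction(clf.predict(X[features]), y), clf

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(200, 4), columns=['f1', 'f2', 'f3', 'f4'])
df['Label'] = (df['f1'] + df['f2'] > 1).astype(int) # toy binary target

sf = sequence_selection.Select(Sequence=True, Random=True, Cross=False)
sf.ImportDF(df, label='Label')
sf.ImportLossFunction(lossfunction, direction='ascend')
sf.InitialNonTrainableFeatures(['Label']) # exclude the label from the candidate features
sf.InitialFeatures(['f1'])
sf.GenerateCol()
sf.SetFeatureEachRound(4, False)
sf.clf = LogisticRegression()
sf.SetLogFile('record.log')
best_features = sf.run(validate)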

Multiprocessing

Multiprocessing can be used inside the validate function when doing N-fold validation, as in the sketch below.
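
A sketch of one way to parallelize the folds inside validate, using joblib and scikit-learn's KFold; joblib, the 5-fold setup, and the helper _fit_fold are illustrative choices, not part of MLFeatureSelection:

import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.model_selection import KFold

def _fit_fold(clf, X, y, features, lossfunction, train_idx, test_idx):
    # fit a fresh clone on the training fold, score on the held-out fold
    clf.fit(X.iloc[train_idx][features], np.asarray(y)[train_idx])
    y_pred = clf.predict(X.iloc[test_idx][features])
    return lossfunction(y_pred, np.asarray(y)[test_idx])

def validate(X, y, features, clf, lossfunction):
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = Parallel(n_jobs=5)(
        delayed(_fit_fold)(clone(clf), X, y, features, lossfunction, tr, te)
        for tr, te in kf.split(X))
    clf.fit(X[features], y) # refit on the full data so a trained classifier is returned
    return np.mean(scores), clf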

DEMO

More examples can be found in the example folder, including:

  • A demo containing all modules (demo)

  • A simple Titanic example with 5-fold validation, evaluated by accuracy (demo)

  • A demo of the S1 and S2 score improvement in the JData 2018 purchase-time prediction competition (demo)

  • A demo for IJCAI 2018 CTR prediction (demo)

Function Parameters

Parameters

Algorithm details

Details
