
AutoViML / featurewiz

License: Apache-2.0
Use advanced feature engineering strategies and select best features from your data set with a single line of code.


Projects that are alternatives of or similar to featurewiz

exemplary-ml-pipeline
Exemplary, annotated machine learning pipeline for any tabular data problem.
Stars: ✭ 23 (-89.96%)
Mutual labels:  feature-selection, feature-engineering, featuretools
50-days-of-Statistics-for-Data-Science
This repository consists of a 50-day program. All the statistics required for the complete understanding of data science will be uploaded in this repository.
Stars: ✭ 19 (-91.7%)
Mutual labels:  feature-selection, feature-extraction, feature-engineering
feature engine
Feature engineering package with sklearn like functionality
Stars: ✭ 758 (+231%)
Mutual labels:  feature-selection, feature-extraction, feature-engineering
skrobot
skrobot is a Python module for designing, running and tracking Machine Learning experiments / tasks. It is built on top of scikit-learn framework.
Stars: ✭ 22 (-90.39%)
Mutual labels:  feature-selection, feature-engineering
tsflex
Flexible time series feature extraction & processing
Stars: ✭ 252 (+10.04%)
Mutual labels:  feature-extraction, feature-engineering
FIFA-2019-Analysis
This is a project based on the FIFA World Cup 2019 and Analyzes the Performance and Efficiency of Teams, Players, Countries and other related things using Data Analysis and Data Visualizations
Stars: ✭ 28 (-87.77%)
Mutual labels:  feature-selection, feature-engineering
Amazing Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.
Stars: ✭ 218 (-4.8%)
Mutual labels:  feature-extraction, feature-engineering
msda
Library for multi-dimensional, multi-sensor, uni/multivariate time series data analysis, unsupervised feature selection, unsupervised deep anomaly detection, and prototype of explainable AI for anomaly detector
Stars: ✭ 80 (-65.07%)
Mutual labels:  feature-selection, feature-engineering
featuretoolsOnSpark
A simplified version of featuretools for Spark
Stars: ✭ 24 (-89.52%)
Mutual labels:  feature-engineering, featuretools
AutoTabular
Automatic machine learning for tabular data. ⚡🔥⚡
Stars: ✭ 51 (-77.73%)
Mutual labels:  xgboost, feature-engineering
dominance-analysis
This package can be used for dominance analysis or Shapley Value Regression for finding relative importance of predictors on given dataset. This library can be used for key driver analysis or marginal resource allocation models.
Stars: ✭ 111 (-51.53%)
Mutual labels:  feature-selection, feature-engineering
Market-Mix-Modeling
Market Mix Modelling for an eCommerce firm to estimate the impact of various marketing levers on sales
Stars: ✭ 31 (-86.46%)
Mutual labels:  feature-selection, feature-engineering
fastknn
Fast k-Nearest Neighbors Classifier for Large Datasets
Stars: ✭ 64 (-72.05%)
Mutual labels:  feature-extraction, feature-engineering
kaggle-berlin
Material of the Kaggle Berlin meetup group!
Stars: ✭ 36 (-84.28%)
Mutual labels:  xgboost, feature-engineering
Deep Learning Machine Learning Stock
Stock for Deep Learning and Machine Learning
Stars: ✭ 240 (+4.8%)
Mutual labels:  feature-extraction, feature-engineering
Bike-Sharing-Demand-Kaggle
Top 5th percentile solution to the Kaggle knowledge problem - Bike Sharing Demand
Stars: ✭ 33 (-85.59%)
Mutual labels:  feature-extraction, feature-engineering
Machine Learning Workflow With Python
This is a comprehensive ML techniques with python: Define the Problem- Specify Inputs & Outputs- Data Collection- Exploratory data analysis -Data Preprocessing- Model Design- Training- Evaluation
Stars: ✭ 157 (-31.44%)
Mutual labels:  feature-extraction, feature-engineering
Tsfel
An intuitive library to extract features from time series
Stars: ✭ 202 (-11.79%)
Mutual labels:  feature-extraction, feature-engineering
NVTabular
NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Stars: ✭ 797 (+248.03%)
Mutual labels:  feature-selection, feature-engineering
pyHSICLasso
Versatile Nonlinear Feature Selection Algorithm for High-dimensional Data
Stars: ✭ 125 (-45.41%)
Mutual labels:  feature-selection, feature-extraction

featurewiz

banner

Update (March 2022)

  1. featurewiz as of version 0.1.04 or higher can read `feather-format` files at blazing speeds. See the example below on how to convert your CSV files to feather. You can then feed those '.ftr' files to featurewiz and it will read them 10-100X faster!

    feather_example

  2. featurewiz now runs at blazing speeds thanks to using GPUs by default. So if you are running a large data set on Colab and/or Kaggle, make sure you turn on the GPU kernel. featurewiz will automatically detect that a GPU is available and will run XGBoost with its GPU (`gpu_hist`) tree method, which crunches your datasets even faster. I have tested it with a very large data set and it reduced the running time from 52 mins to 1 minute! That's a 98% reduction in running time using GPU compared to CPU!
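
  For example, converting a CSV to feather (as mentioned in item 1) is a one-time pandas step. This is only a minimal sketch; it assumes pandas with pyarrow installed, and the file names are placeholders:

    import pandas as pd

    df = pd.read_csv('train.csv')     # read the original CSV once
    df.to_feather('train.ftr')        # write it out in feather format
    # from then on, point featurewiz at 'train.ftr' instead of the CSV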

    Update (Jan 2022)

    1. FeatureWiz as of version 0.0.90 or higher is a scikit-learn compatible feature selection transformer. You can call fit and transform as follows, and you will get a Transformer that selects the top variables from your dataset. You can also use it in sklearn pipelines as a Transformer (see the pipeline sketch after this list).
      from featurewiz import FeatureWiz
      features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', 
      dask_xgboost_flag=False, nrows=None, verbose=2)
      X_train_selected = features.fit_transform(X_train, y_train)
      X_test_selected = features.transform(X_test)
      features.features  ### provides the list of selected features ###
      
    2. Featurewiz is now upgraded to XGBoost 1.5.1 with Dask for blazing-fast performance even on very large data sets! Set `dask_xgboost_flag = True` to run dask + xgboost.
    3. Featurewiz now runs with a default setting of `nrows=None`. This means it will run using all rows. But if you want it to run faster, then you can change `nrows` to 1000 or whatever, so it will sample that many rows and run.
    4. Featurewiz has several new fast model-builder functions that you can use to build highly performant models with the features selected by featurewiz. They are:
      1. `simple_LightGBM_model()` - simple regression and classification with one target label
      2. `simple_XGBoost_model()` - simple regression and classification with one target label
      3. `complex_LightGBM_model()` - more complex multi-label and multi-class models
      4. `complex_XGBoost_model()` - more complex multi-label and multi-class models
      5. `Stacking_Classifier()`: Stacking model that can handle multi-label, multi-class problems
      6. `Stacking_Regressor()`: Stacking model that can handle multi-label, regression problems
      7. `Blending_Regressor()`: Blending model that can handle multi-label, regression problems
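
    As mentioned in update 1 above, the FeatureWiz transformer can also be dropped into a scikit-learn Pipeline. Here is a minimal sketch (the LogisticRegression estimator is only an illustration, and FeatureWiz defaults are assumed for the parameters not shown):

      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import Pipeline
      from featurewiz import FeatureWiz

      pipe = Pipeline([
          ('select', FeatureWiz(corr_limit=0.70, verbose=0)),  # feature selection step
          ('model', LogisticRegression(max_iter=1000)),        # any downstream estimator
      ])
      pipe.fit(X_train, y_train)      # FeatureWiz selects features inside the pipeline
      preds = pipe.predict(X_test)    # the selected columns are applied to X_test automatically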

    One word of CAUTION while installing featurewiz in Kaggle and other environments:

    You must install featurewiz without any dependencies and ignore previously installed versions (see below). You MUST execute these TWO steps if you want featurewiz installed and working smoothly.

    pip install xlrd

    pip install featurewiz --ignore-installed --no-deps

    What is featurewiz?

    featurewiz is a new Python library for creating and selecting the best features in your data set, fast! featurewiz can be used in one of two ways; both are explained below.

    1. Feature Engineering

    The first step is not absolutely necessary but it can be used to create new features that may or may not be helpful (be careful with automated feature engineering tools!).

    1. Performing Feature Engineering: One of the gaps in open source AutoML tools, and especially Auto_ViML, has been the lack of the feature engineering capabilities that high-powered competitions such as Kaggle require. Creating "interaction" variables, adding "group-by" features, or "target-encoding" categorical variables was difficult, and sifting through those hundreds of new features to find the best ones was painstaking and left only to "experts" or "professionals". featurewiz was created to help you in this endeavor.

    featurewiz now enables you to add hundreds of such features with a single line of code. Set the "feature_engg" flag to "interactions", "groupby" or "target" and featurewiz will select the best encoders for each of those options and create hundreds (perhaps thousands) of features in one go. Not only that: in the next step, featurewiz will sift through those numerous variables and keep only the least correlated and most relevant features for your model. All in one step!

    feature_engg
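
    For instance, using the featurewiz call documented in the API section below (a hedged sketch; `train_df` and 'target' are placeholders for your own dataframe and target column):

    import featurewiz as FW

    # create interaction features, then keep only the least correlated, most relevant ones;
    # you can also try feature_engg='groupby' or feature_engg='target'
    outputs = FW.featurewiz(dataname=train_df, target='target',
                            feature_engg='interactions', corr_limit=0.70, verbose=2)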

    2. Feature Selection

    The second step is Feature Selection. `featurewiz` uses the MRMR (Minimum Redundancy Maximum Relevance) algorithm as the basis for its feature selection.
    Why do feature selection? Once you have created hundreds of new features, you still have three questions left to answer:
    1. How do we interpret those newly created features?
    2. Which of these features are important and which are useless? How many of them are highly correlated to each other, causing redundancy?
    3. Does the model now overfit on these new features and perform better or worse than before?
    All are very important questions and featurewiz answers them by using the SULOV method and Recursive XGBoost to reduce the features in your dataset to the best "minimum optimal" set for the model.

    SULOV: SULOV stands for `Searching for Uncorrelated List of Variables`. The SULOV algorithm is based on the Minimum-Redundancy-Maximum-Relevance (MRMR) algorithm, described in this article as one of the best feature selection methods. To understand how MRMR works and how it differs from `Boruta` and other feature selection methods, see the chart below. Here "minimal optimal" refers to MRMR-style algorithms such as featurewiz, while "all-relevant" refers to Boruta-style algorithms.

    MRMR_chart
    The working of the SULOV algorithm is as follows:

    1. Find all the pairs of highly correlated variables exceeding a correlation threshold (say, an absolute correlation of 0.7).
    2. Then find each variable's MIS (Mutual Information Score) to the target variable. MIS is a non-parametric scoring method, so it is suitable for all kinds of variables and targets.
    3. Now take each pair of correlated variables and knock off the one with the lower MIS score.
    4. What's left are the variables with the highest information scores and the least correlation with each other.

    sulov
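
    The following is only an illustrative sketch of those four steps, using pandas and scikit-learn's mutual information scorer; it is not featurewiz's actual implementation and assumes a purely numeric dataframe X and a classification target y:

    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif

    def sulov_sketch(X, y, corr_limit=0.7):
        # Step 1: find all pairs of features whose absolute correlation exceeds the threshold
        corr = X.corr().abs()
        pairs = [(a, b) for i, a in enumerate(X.columns)
                        for b in X.columns[i + 1:]
                        if corr.loc[a, b] > corr_limit]
        # Step 2: score every feature by its mutual information with the target
        mis = dict(zip(X.columns, mutual_info_classif(X, y)))
        # Step 3: for each correlated pair, knock off the feature with the lower MIS score
        dropped = set()
        for a, b in pairs:
            dropped.add(a if mis[a] < mis[b] else b)
        # Step 4: what's left is informative and only weakly correlated
        return [c for c in X.columns if c not in dropped]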

    Recursive XGBoost: Once SULOV has selected variables that have high mutual information scores and the least correlation amongst them, we use XGBoost repeatedly to find the best features among the remaining variables. The Recursive XGBoost method is explained in the chart below. Here is how it works:

    1. Select all variables in the data set and split the full data into train and valid sets.
    2. Find the top X features (X could be 10) on train, using valid for early stopping (to prevent over-fitting).
    3. Then take the next set of variables and find the top X among them.
    4. Do this 5 times. Combine all selected features and de-duplicate them.

    xgboost
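
    A rough illustration of this idea (not featurewiz's actual code; it assumes a numeric dataframe X, a classification target y, and a recent xgboost version where early_stopping_rounds is a constructor argument):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    def recursive_xgboost_sketch(X, y, top_x=10, n_rounds=5):
        X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)
        selected = []
        # take successive chunks of the columns and keep the top features of each chunk
        for chunk in np.array_split(np.array(X.columns), n_rounds):
            cols = list(chunk)
            model = XGBClassifier(n_estimators=100, early_stopping_rounds=10, verbosity=0)
            model.fit(X_train[cols], y_train,
                      eval_set=[(X_valid[cols], y_valid)], verbose=False)
            ranked = sorted(zip(cols, model.feature_importances_),
                            key=lambda t: t[1], reverse=True)
            selected += [c for c, _ in ranked[:top_x]]
        # combine all selected features and de-duplicate them, preserving order
        return list(dict.fromkeys(selected))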

    Building the simplest and most "interpretable" model: featurewiz represents the "next best" step you must perform after doing feature engineering since you might have added some highly correlated or even useless features when you use automated feature engineering. featurewiz ensures you have the least number of features needed to build a high performing or equivalent model.

    A WORD OF CAUTION: Just because you can engineer new features doesn't mean you should always create tons of new features. You must make sure you understand what the new features stand for before you attempt to build a model with these (sometimes useless) features. featurewiz displays the SULOV chart, which shows you how the hundreds of newly created variables added to your dataset were highly correlated to each other and which of them were removed. This will help you understand how feature selection works in featurewiz.


    Background

    background

    To learn more about how featurewiz works under the hood, watch this video

    featurewiz was designed for selecting High Performance variables with the fewest steps.

    In most cases, featurewiz builds models with 20%-99% fewer features than your original data set with nearly the same or slightly lower performance (this is based on my trials. Your experience may vary).

    featurewiz is every Data Scientist's feature wizard that will:

    1. Automatically pre-process data: you can send in your entire dataframe "as is" and featurewiz will classify and label-encode categorical variables to help XGBoost process them. It classifies variables as numeric, categorical, NLP or date-time variables automatically so it can use them correctly in modeling.
    2. Perform feature engineering automatically: The ability to create "interaction" variables or adding "group-by" features or "target-encoding" categorical variables is difficult and sifting through those hundreds of new features is painstaking and left only to "experts". Now, with featurewiz you can create hundreds or even thousands of new features with the click of a mouse. This is very helpful when you have a small number of features to start with. However, be careful with this option. You can very easily create a monster with this option.
    3. Perform feature reduction automatically. When you have small data sets and you know your domain well, it is easy to do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features for your model can mean the difference between a bloated, highly complex model and a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!
    4. Explain the SULOV method graphically using the networkx library, so you can automatically see which variables are highly correlated with which ones, and which of those have high or low mutual information scores. Just set verbose = 2 to see the graph.
    5. Build a fast LightGBM model using the features selected by featurewiz. There is a function called "simple_lightgbm_model" which you can use to build a fast model. It is a new module, so check it out.

    *** Notes of Gratitude ***:

    1. Alex Lekov (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).
    2. Category Encoders library in Python : This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html

    Install

    Prerequisites:

    1. featurewiz is built using xgboost, dask, numpy, pandas and matplotlib. It should run on most Python 3 Anaconda installations. You won't have to install any special libraries other than "dask", "XGBoost" and "networkx". Optionally, it uses LightGBM for fast modeling, which it installs automatically.
    2. We use the "networkx" library for charts and interpretability.
      If you don't have these libraries, featurewiz will install them for you automatically.
    - [Anaconda](https://docs.anaconda.com/anaconda/install/)

    To install featurewiz, it is better to create a new environment and install the required dependencies:

    To install from PyPi:

    conda create -n <your_env_name> python=3.7 anaconda
    conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>`
    pip install featurewiz --ignore-installed --no-deps
    or
    pip install git+https://github.com/AutoViML/featurewiz.git
    

    To install from source:

    cd <featurewiz_Destination>
    git clone git@github.com:AutoViML/featurewiz.git
    # or download and unzip https://github.com/AutoViML/featurewiz/archive/master.zip
    conda create -n <your_env_name> python=3.7 anaconda
    conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>`
    cd featurewiz
    pip install -r requirements.txt
    

    Usage

    As of Jan 2022, you can invoke featurewiz as a scikit-learn compatible transformer with fit and transform methods. See the syntax below.

    from featurewiz import FeatureWiz
    features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
    X_train_selected = features.fit_transform(X_train, y_train)
    X_test_selected = features.transform(X_test)
    features.features  ### provides the list of selected features ###
    

    Alternatively, you can continue to use the existing featurewiz function as it is now:

    import featurewiz as FW
    

    Load a data set (any CSV or text file) into a Pandas dataframe and give featurewiz the name of the target variable(s). If you have more than one target, it will handle multi-label targets too; just give it a list of target variables in that case. If you don't have a dataframe, you can simply enter the name and path of the file to load into featurewiz:

    outputs = FW.featurewiz(dataname, target, corr_limit=0.70, verbose=2, sep=',', 
    		header=0, test_data='',feature_engg='', category_encoders='',
    		dask_xgboost_flag=False, nrows=None)
    

    outputs: There will always be multiple objects in the output tuple, and they can vary:

    1. "features" and "train": a list (of selected features) and one dataframe (if you sent in train only).
    2. "trainm" and "testm": two dataframes (when you send in both train and test), each containing only the selected features.

    A few notes:
    1. Both the selected features and the dataframes are ready for you to do further modeling.
    2. featurewiz works on any multi-class, multi-label data set, so you can have as many target labels as you want.
    3. You don't have to tell featurewiz whether it is a regression or classification problem; it will decide that automatically.
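
    For example (a hedged sketch; `train_df`, `test_df` and 'target' are placeholders for your own data):

    # train only: get the selected-feature list and a reduced train dataframe
    features, trainm = FW.featurewiz(dataname=train_df, target='target',
                                     corr_limit=0.70, verbose=0)

    # train + test: get reduced train and test dataframes with the same selected features
    trainm, testm = FW.featurewiz(dataname=train_df, target='target',
                                  test_data=test_df, corr_limit=0.70, verbose=0)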

    API

    Arguments

    • dataname: could be a datapath+filename or a dataframe. It will detect whether your input is a filename or a dataframe and load it automatically.
    • target: name of the target variable in the data set.
    • corr_limit: if you want to set your own threshold for removing variables as highly correlated, give it here. The default is 0.7, which means variables with a Pearson correlation below -0.7 or above 0.7 will be candidates for removal.
    • verbose: This has 3 possible states:
      • 0 limited output. Great for running this silently and getting fast results.
      • 1 more verbose output. Great for understanding the results and adjusting the input flags.
      • 2 SULOV charts and output. Great for finding out what happens under the hood for SULOV method.
    • test_data: If you want to transform test data in the same way you are transforming dataname, you can. test_data could be the name of a datapath+filename or a dataframe. featurewiz will detect whether your input is a filename or a dataframe and load it automatically. Default is empty string.
    • feature_engg: You can let featurewiz select its best encoders for your data set by setting this flag for adding feature engineering. There are three choices. You can choose one, two or all three.
      • interactions: This will add interaction features to your data such as x1*x2, x2*x3, x1², x2², etc.
      • groupby: This will generate group-by features for your numeric variables by grouping on all your categorical variables.
      • target: This will encode and transform all your categorical features using certain target encoders.
        Default is empty string (which means no additional features)
    • category_encoders: Instead of above method, you can choose your own kind of category encoders from the list below. Recommend you do not use more than two of these. Featurewiz will automatically select only two from your list. Default is empty string (which means no encoding of your categorical features)
      These descriptions are derived from the excellent category_encoders python library. Please check it out!
      • HashingEncoder: HashingEncoder is a multivariate hashing implementation with configurable dimensionality/precision. The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.
      • SumEncoder: SumEncoder is a Sum contrast coding for the encoding of categorical features.
      • PolynomialEncoder: PolynomialEncoder is a Polynomial contrast coding for the encoding of categorical features.
      • BackwardDifferenceEncoder: BackwardDifferenceEncoder is a Backward difference contrast coding for encoding categorical variables.
      • OneHotEncoder: OneHotEncoder is the traditional Onehot (or dummy) coding for categorical features. It produces one feature per category, each being a binary.
      • HelmertEncoder: HelmertEncoder uses the Helmert contrast coding for encoding categorical features.
      • OrdinalEncoder: OrdinalEncoder uses Ordinal encoding to designate a single column of integers to represent the categories in your data. Integers however start in the same order in which the categories are found in your dataset. If you want to change the order, just sort the column and send it in for encoding.
      • FrequencyEncoder: FrequencyEncoder is a count encoding technique for categorical features. For a given categorical feature, it replaces the names of the categories with the group counts of each category.
      • BaseNEncoder: BaseNEncoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), a base of 2 is equivalent to binary encoding. N=number of actual categories is equivalent to vanilla ordinal encoding.
      • TargetEncoder: TargetEncoder performs Target encoding for categorical features. It supports following kinds of targets: binary and continuous. For multi-class targets it uses a PolynomialWrapper.
      • CatBoostEncoder: CatBoostEncoder performs CatBoost coding for categorical features. It supports the following kinds of targets: binary and continuous. For polynomial target support, it uses a PolynomialWrapper. This is very similar to leave-one-out encoding, but calculates the values “on-the-fly”. Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
      • WOEEncoder: WOEEncoder uses the Weight of Evidence technique for categorical features. It supports only one kind of target: binary. For polynomial target support, it uses a PolynomialWrapper. It cannot be used for Regression.
      • JamesSteinEncoder: JamesSteinEncoder uses the James-Stein estimator. It supports 2 kinds of targets: binary and continuous. For polynomial target support, it uses PolynomialWrapper. For feature value i, James-Stein estimator returns a weighted average of: The mean target value for the observed feature value i. The mean target value (regardless of the feature value).
    • dask_xgboost_flag: Default is False. Set it to True to use the dask_xgboost estimator. You can turn it off if it gives an error; featurewiz will then use pandas and regular xgboost to do the job.
    • nrows: Default is None. You can set the number of rows to read from your data file if it is too large to fit into either dask or pandas. But you won't have to if you use dask.

    Return values
    • outputs: Output is always a tuple of two objects, which we can call out1 and out2.
      • out1 and out2: If you sent in just one dataframe or filename as input, you will get:
          1. features: a list of the selected features, and
          2. trainm: a dataframe with those features (built from the file or dataframe you sent in).
      • out1 and out2: If you sent in two files or dataframes (train and test), you will get:
          1. trainm: a modified train dataframe with engineered and selected features from dataname, and
          2. testm: a modified test dataframe with engineered and selected features from test_data.

    Maintainers

    Contributing

    See the contributing file!

    PRs accepted.

    License

    Apache License 2.0 © 2020 Ram Seshadri

    DISCLAIMER

    This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.
