ksanjeevan / randomforest-density-python

License: MIT
Random Forests for Density Estimation in Python


Projects that are alternatives to or similar to randomforest-density-python

AverageShiftedHistograms.jl
⚡ Lightning fast density estimation in Julia ⚡
Stars: ✭ 52 (+116.67%)
Mutual labels:  density-estimation, kernel-density-estimation
LSTM-Time-Series-Analysis
Using LSTM network for time series forecasting
Stars: ✭ 41 (+70.83%)
Mutual labels:  random-forest
Machine-Learning-Models
In this repository I implement machine learning methods from simple to complex, trying to build template-style code.
Stars: ✭ 30 (+25%)
Mutual labels:  random-forest
BifurcationInference.jl
learning state-space targets in dynamical systems
Stars: ✭ 24 (+0%)
Mutual labels:  kernel-density-estimation
receiptdID
Receipt.ID is a multi-label, multi-class, hierarchical classification system implemented in a two layer feed forward network.
Stars: ✭ 22 (-8.33%)
Mutual labels:  random-forest
introduction-to-machine-learning
A document covering machine learning basics. 🤖📊
Stars: ✭ 17 (-29.17%)
Mutual labels:  random-forest
R-stats-machine-learning
Misc Statistics and Machine Learning codes in R
Stars: ✭ 33 (+37.5%)
Mutual labels:  random-forest
bitcoin-prediction
bitcoin prediction algorithms
Stars: ✭ 21 (-12.5%)
Mutual labels:  random-forest
rfvis
A tool for visualizing the structure and performance of Random Forests 🌳
Stars: ✭ 20 (-16.67%)
Mutual labels:  random-forest
goscore
Go Scoring API for PMML
Stars: ✭ 85 (+254.17%)
Mutual labels:  random-forest
Github-Stars-Predictor
It's a github repo star predictor that tries to predict the stars of any github repository having greater than 100 stars.
Stars: ✭ 34 (+41.67%)
Mutual labels:  random-forest
cheapml
Machine Learning algorithms coded from scratch
Stars: ✭ 17 (-29.17%)
Mutual labels:  random-forest
Shapley regressions
Statistical inference on machine learning or general non-parametric models
Stars: ✭ 37 (+54.17%)
Mutual labels:  random-forest
wetlandmapR
Scripts, tools and example data for mapping wetland ecosystems using data driven R statistical methods like Random Forests and open source GIS
Stars: ✭ 16 (-33.33%)
Mutual labels:  random-forest
scoruby
Ruby Scoring API for PMML
Stars: ✭ 69 (+187.5%)
Mutual labels:  random-forest
Loan-Web
ML-powered Loan-Marketer Customer Filtering Engine
Stars: ✭ 13 (-45.83%)
Mutual labels:  random-forest
xforest
A super-fast and scalable Random Forest library based on fast histogram decision tree algorithm and distributed bagging framework. It can be used for binary classification, multi-label classification, and regression tasks. This library provides both Python and command line interface to users.
Stars: ✭ 20 (-16.67%)
Mutual labels:  random-forest
Gumbel-CRF
Implementation of NeurIPS 20 paper: Latent Template Induction with Gumbel-CRFs
Stars: ✭ 51 (+112.5%)
Mutual labels:  density-estimation
Online-Category-Learning
ML algorithm for real-time classification
Stars: ✭ 67 (+179.17%)
Mutual labels:  density-estimation
handson-ml
Jupyter notebooks with the examples and exercises from the book "Hands-On Machine Learning" (Korean edition).
Stars: ✭ 285 (+1087.5%)
Mutual labels:  random-forest

Density Estimation Forests in Python

Kernel Density Estimation in Random Forests

Usage

Running --help

usage: density_forest.py [-h] [-l LEAF] [-d DATA] [-g GRANULARITY]

randomforest-density: density estimation using random forests.

optional arguments:
  -h, --help            show this help message and exit
  -l LEAF, --leaf LEAF  Choose what leaf estimator to use (Gaussian ['gauss']
                        or KDE ['kde'])
  -d DATA, --data DATA  Path to data (.npy file, shape (sample_size, 2)).
  -g GRANULARITY, --granularity GRANULARITY
                        Number of divisions for the grid.

Run the demo (this will produce all the plots seen below):

python3 density_forest.py -l kde

Run on your own data:

python3 density_forest.py -d data_test.npy -l gauss
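
The file passed with -d should be a NumPy array of shape (sample_size, 2). A minimal sketch of producing such a file (the mixture parameters below are illustrative, not taken from the project):

import numpy as np

# Two-component 2-D Gaussian mixture, saved in the (sample_size, 2) format
# expected by the -d/--data flag.
rng = np.random.default_rng(0)
a = rng.multivariate_normal([0, 0], [[1.0, 0.2], [0.2, 1.0]], size=600)
b = rng.multivariate_normal([4, 3], [[0.5, 0.0], [0.0, 0.5]], size=400)
data = np.vstack([a, b])

np.save("data_test.npy", data)   # then: python3 density_forest.py -d data_test.npy -l gauss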

Introduction

In probability and statistics, density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. The unobservable density function is thought of as the density according to which a large population is distributed; the data are usually thought of as a random sample from that population.

Random forests are an ensemble learning method that operates by constructing a multitude of decision trees and combining their results for prediction. Random decision forests correct for decision trees' tendency to overfit their training set.

In this project, a random forest method for density estimation is implemented in Python. What follows is a presentation of some of the steps, results, tests, and comparisons.

Random Forest Implementation

In this implementation, axis-aligned split functions (stumps) are used to build binary trees by optimizing the entropy gain at each node. The key parameters to select for this method are tree depth / entropy-gain threshold, forest size, and randomness.
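
As an illustration of the node-splitting step, here is a minimal sketch (not the repository's code) that scores axis-aligned stump candidates by the entropy gain of Gaussian fits to the two sides of the split; the helper names gaussian_entropy and best_stump are hypothetical:

import numpy as np

def gaussian_entropy(points):
    # Differential entropy of a Gaussian fit: 0.5 * log((2*pi*e)^d * |cov|).
    d = points.shape[1]
    cov = np.cov(points, rowvar=False) + 1e-9 * np.eye(d)   # ridge keeps det > 0
    return 0.5 * np.log(((2 * np.pi * np.e) ** d) * np.linalg.det(cov))

def best_stump(points):
    # Exhaustive search over axis-aligned thresholds, maximizing entropy gain.
    n = len(points)
    parent_h = gaussian_entropy(points)
    best = (None, None, -np.inf)                             # (axis, threshold, gain)
    for axis in range(points.shape[1]):
        for t in np.unique(points[:, axis])[1:-1]:           # interior candidate thresholds
            left, right = points[points[:, axis] < t], points[points[:, axis] >= t]
            if len(left) < 3 or len(right) < 3:
                continue
            gain = parent_h - (len(left) / n) * gaussian_entropy(left) \
                            - (len(right) / n) * gaussian_entropy(right)
            if gain > best[2]:
                best = (axis, t, gain)
    return best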

The optimal depth of a tree will be case-dependent. For that reason we first train a small set of trees at a fixed depth (the tune_entropy_threshold method, parameters n and depth). Unlike forest size, where an increase will never yield worse results, a lax stop condition will lead to overfitting. The entropy gain is strictly decreasing with depth, as can be seen in the animation below.

Optimizing the entropy-gain threshold is an ill-posed regularization problem, which is handled in this implementation by finding the elbow point of the 'maximum depth' curve (the point furthest from the line connecting the function's extremes) and averaging it out over n, as we can see here.
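
For reference, a minimal sketch of that elbow heuristic, assuming the depths and their corresponding entropy gains have already been computed (the function name elbow_point is hypothetical):

import numpy as np

def elbow_point(depths, gains):
    # Return the depth whose (depth, gain) point lies furthest from the straight
    # line joining the first and last points of the curve.
    p = np.column_stack([depths, gains]).astype(float)
    start, end = p[0], p[-1]
    direction = (end - start) / np.linalg.norm(end - start)
    rel = p - start
    dist = np.abs(rel[:, 0] * direction[1] - rel[:, 1] * direction[0])  # perpendicular distance
    return depths[int(np.argmax(dist))]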

This step is expensive, since the depth is fixed with no a priori indication of where the optimal threshold is, and the number of leaves that need to be fitted grows exponentially. A better approach would be to implement an online L-curve method (such as the ones discussed here) as a first pass to avoid initial over-splitting (pending).

From Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning:

A key aspect of decision forests is the fact that its component trees are all randomly different from one another. This leads to de-correlation between the individual tree predictions and, in turn, to improved generalization. Forest randomness also helps achieve high robustness with respect to noisy data. Randomness is injected into the trees during the training phase. Two of the most popular ways of doing so are:

  • random training data set sampling (e.g. bagging), and
  • randomized node optimization (RNO).

These two techniques are not mutually exclusive and could be used together.

The method is tested by sampling a combination of Gaussians. In order to introduce randomness, the node optimization is randomized by the parameter rho, with the parameter search space available at each node split proportional to it. With rho at 50% and 5 trees, we see the first results below.
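
A minimal sketch of that kind of rho-restricted candidate search (illustrative only; the generator random_candidates and its defaults are assumptions, not the repository's API):

import numpy as np

def random_candidates(points, rho=0.5, rng=None):
    # Yield only a rho-fraction of the axis-aligned thresholds at this node,
    # which de-correlates the trees of the forest.
    rng = rng or np.random.default_rng()
    for axis in range(points.shape[1]):
        thresholds = np.unique(points[:, axis])[1:-1]
        if len(thresholds) == 0:
            continue
        k = max(1, int(rho * len(thresholds)))
        for t in rng.choice(thresholds, size=k, replace=False):
            yield axis, t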

Performance is harder to measure when using random forests for density estimation (as opposed to regression or classification), since we are in an unsupervised setting. Here, the Jensen-Shannon divergence is used as a comparison measure whenever test data from a known distribution is available.
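
A minimal sketch of that comparison, assuming the estimated and the true density have both been evaluated on the same grid (the function js_divergence is hypothetical):

import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discretized densities, each
    # normalized to sum to one before comparison.
    p = np.ravel(p).astype(float); p = p / p.sum()
    q = np.ravel(q).astype(float); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)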

Leaf prediction using KDE

One of the main problems of kernel density estimation is the choice of bandwidth. Many of the approaches to finding it rely on assumptions about the underlying distribution and perform poorly on clustered, real-world data (although there are methods that incorporate an adaptive bandwidth effectively).

The module can work with any implementation of the Node class. In the first examples the NodeGauss class is used, fitting a Gaussian distribution at each leaf. Below are the results of using NodeKDE, where the compactness measure is still based on the Gaussian differential entropy, but the leaf prediction is the result of the KDE method. By looking for splits that optimize fitting a Gaussian function, many of the multivariate bandwidth problems that KDE has are avoided, and Silverman's rule for bandwidth selection can be used with good results.
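
A minimal sketch of what such a KDE leaf could evaluate, using a per-dimension Silverman rule-of-thumb bandwidth and a product Gaussian kernel (the helpers silverman_bandwidth and kde_leaf are illustrative, not the repository's API):

import numpy as np

def silverman_bandwidth(points):
    # Rule-of-thumb bandwidth per dimension: sigma_i * (4 / ((d + 2) n))^(1 / (d + 4)).
    n, d = points.shape
    return points.std(axis=0, ddof=1) * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))

def kde_leaf(points, query):
    # Average of product Gaussian kernels centered on the leaf's training points.
    h = silverman_bandwidth(points)
    z = (query[None, :] - points) / h
    kernels = np.exp(-0.5 * np.sum(z ** 2, axis=1)) / np.prod(np.sqrt(2.0 * np.pi) * h)
    return kernels.mean()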

Although it produces an overall much better JSD, it is worth noting that the top-right 'bump' overfits the noise more. This is expected, since the underlying distribution of our test data is a combination of Gaussians, and if a leaf totally encompasses a bump (as can be seen in the forest representation below), then fitting a Gaussian function will perform better than any non-parametric technique.

To do

  • Try other entropy gain functions / compactness measures.
  • Use online L-curve method for entropy threshold optimization.
  • Other bottlenecks.
  • Refactor to reuse framework in classification and regression.
  • Fit unknown distributions / performance.
  • Use EMD as comparison metric.