
ericqu / LinearRegression.jl

License: MIT
Linear Regression for Julia

Programming Languages

julia

Projects that are alternatives to or similar to LinearRegression.jl

regression-python
In this repository you can find many different, small, projects which demonstrate regression techniques using python programming language
Stars: ✭ 15 (+25%)
Mutual labels:  regression
combining3Dmorphablemodels
Project Page of Combining 3D Morphable Models: A Large scale Face-and-Head Model - [CVPR 2019]
Stars: ✭ 80 (+566.67%)
Mutual labels:  regression
ml course
"Learning Machine Learning" Course, Bogotá, Colombia 2019 #LML2019
Stars: ✭ 22 (+83.33%)
Mutual labels:  regression
FixedEffectjlr
R interface for Fixed Effect Models
Stars: ✭ 20 (+66.67%)
Mutual labels:  regression
smogn
Synthetic Minority Over-Sampling Technique for Regression
Stars: ✭ 238 (+1883.33%)
Mutual labels:  regression
DataScience ArtificialIntelligence Utils
Examples of Data Science projects and Artificial Intelligence use cases
Stars: ✭ 302 (+2416.67%)
Mutual labels:  regression
SimpleGP
Simple Genetic Programming for Symbolic Regression in Python3
Stars: ✭ 20 (+66.67%)
Mutual labels:  regression
R-stats-machine-learning
Misc Statistics and Machine Learning codes in R
Stars: ✭ 33 (+175%)
Mutual labels:  regression
GaNDLF
A generalizable application framework for segmentation, regression, and classification using PyTorch
Stars: ✭ 77 (+541.67%)
Mutual labels:  regression
lyapy
Library for simulation of nonlinear control systems, control design, and Lyapunov-based learning.
Stars: ✭ 35 (+191.67%)
Mutual labels:  regression
MachineLearning
Machine learning for beginners (Data Science enthusiasts)
Stars: ✭ 104 (+766.67%)
Mutual labels:  regression
StoreItemDemand
(117th place - Top 26%) Deep learning using Keras and Spark for the "Store Item Demand Forecasting" Kaggle competition.
Stars: ✭ 24 (+100%)
Mutual labels:  regression
prediction
Tidy, Type-Safe 'prediction()' Methods
Stars: ✭ 86 (+616.67%)
Mutual labels:  regression
polyssifier
run a multitude of classifiers on your data and get an AUC report
Stars: ✭ 64 (+433.33%)
Mutual labels:  regression
Regression
Multiple Regression Package for PHP
Stars: ✭ 88 (+633.33%)
Mutual labels:  regression
tensorscript
REPO MOVED TO https://repetere.github.io/jsonstack-model - Deep Learning Classification, Clustering, LSTM Time Series and Regression with Tensorflow
Stars: ✭ 37 (+208.33%)
Mutual labels:  regression
lolo
A random forest
Stars: ✭ 37 (+208.33%)
Mutual labels:  regression
MLweb
Machine learning and scientific computing (linear algebra, statistics, optimization) javascript libraries, with an online lab.
Stars: ✭ 85 (+608.33%)
Mutual labels:  regression
broomExtra
Helpers for regression analyses using `{broom}` & `{easystats}` packages 📈 🔍
Stars: ✭ 45 (+275%)
Mutual labels:  regression
pyowl
Ordered Weighted L1 regularization for classification and regression in Python
Stars: ✭ 52 (+333.33%)
Mutual labels:  regression

Change of name

The package has been renamed LinearRegressionKit.jl because the name LinearRegression.jl was already registered. The package LinearRegression.jl won't be updated anymore, and you should use LinearRegressionKit.jl from now on. If you have any questions, please ask them on LinearRegressionKit.jl.
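
To migrate, assuming LinearRegressionKit.jl is available through the Julia General registry (an assumption; otherwise add it from its repository URL), the switch is a package swap in the Pkg REPL:

pkg> rm LinearRegression
pkg> add LinearRegressionKit

followed by changing `using LinearRegression` to `using LinearRegressionKit` in your code.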

LinearRegression.jl

LinearRegression.jl implements linear regression using the least-squares algorithm (relying on the sweep operator). This package is in the alpha stage, so it is likely that some bugs exist. Furthermore, the API might change in future versions. Feedback from users or prospective users is welcome.

Installation

Enter the Pkg REPL by pressing ] from the Julia REPL, then install the package with:

pkg> add https://github.com/ericqu/LinearRegression.jl.git

To uninstall, use:

pkg> rm LinearRegression
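
Alternatively, the same installation can be done programmatically with the Pkg API:

using Pkg
Pkg.add(url="https://github.com/ericqu/LinearRegression.jl.git")   # install
# Pkg.rm("LinearRegression")                                       # uninstall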

Usage

The following is a simple usage example:

using LinearRegression, DataFrames, StatsModels

x = [0.68, 0.631, 0.348, 0.413, 0.698, 0.368, 0.571, 0.433, 0.252, 0.387, 0.409, 0.456, 0.375, 0.495, 0.55, 0.576, 0.265, 0.299, 0.612, 0.631]
y = [15.72, 14.86, 6.14, 8.21, 17.07, 9.07, 14.68, 10.37, 5.18, 9.36, 7.61, 10.43, 8.93, 10.33, 14.46, 12.39, 4.06, 4.67, 13.73, 14.75]

df = DataFrame(y=y, x=x)

lr = regress(@formula(y ~ 1 + x), df)

which outputs the following information:

Model definition:       y ~ 1 + x
Used observations:      20
Model statistics:
  R²: 0.938467                  Adjusted R²: 0.935049
  MSE: 1.01417                  RMSE: 1.00706
  σ̂²: 1.01417
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │       Coefs      Std err            t     Pr(>|t|)       low ci      high ci
──────────────┼─────────────────────────────────────────────────────────────────────────────
(Intercept)   │    -2.44811     0.819131     -2.98867     0.007877     -4.16904    -0.727184
x             │     27.6201      1.66699      16.5688  2.41337e-12      24.1179      31.1223

Contrasts with Julia Stats GLM package

First, the GLM package provides more than linear regression with Ordinary Least-Squares: it offers Generalized Linear Models fitted through Maximum Likelihood Estimation.

LinearRegression now accepts models without an intercept. As with models made with GLM, the intercept is implicit, and to fit a model without one the user must specify it in the formula (for instance y ~ 0 + x), as sketched below.
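
For example, a minimal no-intercept fit looks like this (a sketch reusing the regress call shown above, with made-up data):

using LinearRegression, DataFrames, StatsModels

df0 = DataFrame(x = [1.0, 2.0, 3.0, 4.0], y = [2.1, 3.9, 6.2, 7.8])
lr0 = regress(@formula(y ~ 0 + x), df0)   # the 0 removes the implicit intercept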

LinearRegression now supports analytical weights; GLM supports frequency weights.
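
A hypothetical sketch of an analytically weighted fit follows; the weights keyword naming a weight column is an assumption not documented in this README, so check the package documentation before relying on it:

using LinearRegression, DataFrames, StatsModels

wdf = DataFrame(y = [1.2, 2.1, 2.9, 4.2], x = [1.0, 2.0, 3.0, 4.0],
                w = [1.0, 2.0, 2.0, 1.0])                 # made-up analytical weights
lr_w = regress(@formula(y ~ 1 + x), wdf, weights = "w")   # assumed keyword, not confirmed here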

Both LinearRegression and GLM rely on StatsModels.jl for the model description (@formula); hence it is easy to move between the two packages. Similarly, contrasts and categorical variables are defined in the same way, facilitating moving from one package to the other when needed.
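
For instance, the same formula object, including a categorical predictor, can be passed to both packages (a sketch with made-up data; the GLM call is shown only for comparison):

using LinearRegression, GLM, DataFrames, StatsModels, CategoricalArrays

cdf = DataFrame(y = randn(12) .+ repeat([0.0, 1.0, 2.0], inner = 4),
                group = categorical(repeat(["a", "b", "c"], inner = 4)))

f = @formula(y ~ 1 + group)   # categorical predictor declared identically for both
lr_c  = regress(f, cdf)       # LinearRegression
ols_c = lm(f, cdf)            # GLM, same formula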

LinearRegression relies on the Sweep operator to estimate the coefficients, and GLM depends on Cholesky and QR factorizations.
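
For readers unfamiliar with it, here is a compact sketch of the sweep operator itself (Goodnight's formulation), shown as an illustration of the technique rather than the package's internal implementation:

using LinearAlgebra

# Sweep the symmetric matrix A on pivot k (in place).
function sweep!(A::AbstractMatrix, k::Int)
    d = A[k, k]
    A[k, :] ./= d                      # scale the pivot row
    for i in axes(A, 1)
        i == k && continue
        b = A[i, k]
        A[i, :] .-= b .* A[k, :]       # eliminate column k from row i
        A[i, k] = -b / d
    end
    A[k, k] = 1 / d
    return A
end

# Sweeping the augmented cross-product matrix [X'X X'y; y'X y'y] on the
# predictor pivots yields (X'X)⁻¹ (top-left), the coefficient estimates
# (top-right) and the SSE (bottom-right) in one pass.
X = [ones(5) collect(1.0:5.0)]         # made-up design matrix with intercept
y = [1.1, 1.9, 3.2, 3.9, 5.1]
M = [X'X X'y; y'X y'y]
foreach(k -> sweep!(M, k), 1:2)        # sweep both predictor pivots
β̂   = M[1:2, end]                      # coefficient estimates
sse = M[end, end]                      # sum of squared errors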

The Akaike information criterion (AIC) is calculated with a formula relevant only to linear regression, hence enabling comparisons between linear regressions (AIC = n log(SSE / n) + 2p, where SSE is the Sum of Squared Errors and p is the number of predictors). On the other hand, the AIC calculated with GLM is more general (based on the log-likelihood), enabling comparisons between a broader range of models.
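
As a quick illustration of that formula (a throwaway helper, not a function from the package):

# AIC for linear regression as defined above: AIC = n*log(SSE/n) + 2p
aic_ols(sse, n, p) = n * log(sse / n) + 2p
aic_ols(25.0, 20, 2)   # made-up SSE, n and p, purely for illustration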

The LinearRegression package provides access to some robust covariance estimators (for heteroscedasticity: White, HC0, HC1, HC2, and HC3; for HAC: Newey-West).

List of Statistics

List of Statistics calculated about the linear regression model:

  • AIC: Akaike information criterion, computed as AIC = n log(SSE / n) + 2p, where SSE is the Sum of Squared Errors and p is the number of predictors.
  • SSE: Sum of Squared Errors, as output from the sweep operator.
  • SST: Total Sum of Squares, the sum over all squared differences between the observations and their overall mean.
  • R²: computed as 1 - SSE/SST.
  • Adjusted R².
  • σ̂² (sigma): estimate of the error variance.
  • Variance Inflation Factor (VIF).
  • CI: the confidence interval based on the default α value of 0.05, giving the 95% confidence interval.
  • The t-statistic.
  • The mean squared error (MSE).
  • The root mean squared error (RMSE).
  • The standard errors and their equivalents with a heteroscedasticity or HAC covariance estimator.
  • The t values and their equivalents with a heteroscedasticity or HAC covariance estimator.
  • The p values for each predictor and their equivalents with a heteroscedasticity or HAC covariance estimator.
  • Type 1 and Type 2 Sums of Squares.
  • Squared partial correlation coefficients and squared semi-partial correlation coefficients.
  • PRESS: the sum of squares of the predicted residual errors.
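
Individual statistics can be requested through the req_stats argument used in the examples further below, for instance (with made-up data):

using LinearRegression, DataFrames, StatsModels

sdf = DataFrame(x = [0.5, 1.0, 1.5, 2.0, 2.5], y = [1.1, 2.2, 2.8, 4.1, 5.0])
lr_s = regress(@formula(y ~ 1 + x), sdf, req_stats = ["default", "vif", "AIC"])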

List of Statistics about the predicted values:

  • The predicted values.
  • The residuals (the actual values minus the predicted ones).
  • The leverage, i.e. the i-th diagonal element of the projection matrix.
  • STDI: the standard error of the individual predicted value.
  • STDP: the standard error of the mean predicted value.
  • STDR: the standard error of the residual.
  • Student: the studentized residuals, also known as the standardized or internally studentized residuals.
  • Rstudent: the studentized residual with the current observation deleted.
  • LCLI: the lower bound of the confidence interval for the individual prediction.
  • UCLI: the upper bound of the confidence interval for the individual prediction.
  • LCLP: the lower bound of the confidence interval for the expected (mean) value.
  • UCLP: the upper bound of the confidence interval for the expected (mean) value.
  • Cook's Distance.
  • PRESS: the predicted residual errors.
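
To make a few of these definitions concrete, here is how the leverage, the internally studentized residuals, Cook's distance and PRESS can be computed by hand for a small design matrix; this is plain linear algebra for illustration, independent of the package's API:

using LinearAlgebra

X = [ones(5) collect(1.0:5.0)]    # made-up design matrix with intercept
y = [1.1, 1.9, 3.2, 3.9, 5.1]
n, p = size(X)

β̂ = X \ y                          # least-squares coefficients
e = y - X * β̂                      # residuals
h = diag(X * inv(X'X) * X')        # leverage: diagonal of the projection matrix
σ̂² = sum(abs2, e) / (n - p)        # estimate of the error variance
student = e ./ sqrt.(σ̂² .* (1 .- h))           # internally studentized residuals
cook = (student .^ 2 ./ p) .* (h ./ (1 .- h))  # Cook's distance
press = sum(abs2, e ./ (1 .- h))               # PRESS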

Questions and Feedback

Please post your questions, feedback, or issues in the Issues tab. As much as possible, please provide relevant contextual information.

Credits and additional information

Examples

The following is a short example illustrating some statistics about the predicted data. First, we simulate some data with a polynomial function.

using LinearRegression, DataFrames, StatsModels
using Distributions # for the data generation with Normal() and Uniform()
using VegaLite

# Data simulation
f(x) = @. (x^3 + 2.2345x - 1.2345 + rand(Normal(0, 20)))
xs = [x for x in -2:0.1:8]
ys = f(xs)
vdf = DataFrame(y=ys, x=xs)

Then we can make the first model and look at the results:

lr, ps = regress(@formula(y ~ 1 + x), vdf, "all", 
    req_stats=["default", "vif", "AIC"], 
    plot_args=Dict("plot_width" => 200 ))
lr
Model definition:       y ~ 1 + x
Used observations:      101
Model statistics:
  R²: 0.750957                  Adjusted R²: 0.748441
  MSE: 5693.68                  RMSE: 75.4565
  σ̂²: 5693.68                   AIC: 875.338
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │       Coefs      Std err            t     Pr(>|t|)       low ci      high ci          VIF
──────────────┼──────────────────────────────────────────────────────────────────────────────────────────
(Intercept)   │    -24.5318      10.7732     -2.27711    0.0249316     -45.9082     -3.15535          0.0
x             │     44.4953      2.57529      17.2778  1.20063e-31      39.3854      49.6052          1.0

This is pretty good, so let's further review some diagnostic plots.

[[ps["fit"] ps["residuals"]]
    [ps["histogram density"] ps["qq plot"]]]

Overview Plots

Please note that in the fit plot, the orange line shows the regression line, dark grey the confidence interval for the mean, and light grey the interval for individual predictions.

The plots indicate the potential presence of a polynomial component, so one might try to add one by doing the following:

lr, ps = regress(@formula(y ~ 1 + x^3 ), vdf, "all", 
    req_stats=["default", "vif", "AIC"], 
    plot_args=Dict("plot_width" => 200 ))

Giving:

Model definition:       y ~ 1 + :(x ^ 3)
Used observations:      101
Model statistics:
  R²: 0.979585                  Adjusted R²: 0.979379
  MSE: 466.724                  RMSE: 21.6038
  σ̂²: 466.724                   AIC: 622.699
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │       Coefs      Std err            t     Pr(>|t|)       low ci      high ci          VIF
──────────────┼──────────────────────────────────────────────────────────────────────────────────────────
(Intercept)   │     1.23626      2.65774     0.465157     0.642841     -4.03726      6.50979          0.0
x ^ 3         │     1.04075    0.0151001      68.9236  1.77641e-85      1.01079      1.07071          1.0

Overview Plots

Further, in addition to the diagnostic plots that help confirm whether the residuals are normally distributed, a few tests can be requested:

# Data simulation
f(x) = @. (x^3 + 2.2345x - 1.2345 + rand(Uniform(0, 20)))
xs = [x for x in -2:0.001:8]
ys = f(xs)
vdf = DataFrame(y=ys, x=xs)

lr = regress(@formula(y ~ 1 + x^3 ), vdf, 
    req_stats=["default", "vif", "AIC", "diag_normality"])

Giving:

Model definition:       y ~ 1 + :(x ^ 3)
Used observations:      10001
Model statistics:
  R²: 0.997951                  Adjusted R²: 0.997951
  MSE: 43.4392                  RMSE: 6.59084
  σ̂²: 43.4392                   AIC: 37719.4
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │       Coefs      Std err            t     Pr(>|t|)       low ci      high ci          VIF
──────────────┼──────────────────────────────────────────────────────────────────────────────────────────
(Intercept)   │     11.3151    0.0815719      138.714          0.0      11.1552       11.475          0.0
x ^ 3         │     1.03984  0.000471181      2206.87          0.0      1.03892      1.04076          1.0

Diagnostic Tests:

Kolmogorov-Smirnov test (Normality of residuals):
  KS statistic: 3.47709    observations: 10001    p-value: 0.0
  with 95.0% confidence: reject null hypothesis.
Anderson–Darling test (Normality of residuals):
  A² statistic: 24.924901    observations: 10001    p-value: 0.0
  with 95.0% confidence: reject null hypothesis.
Jarque-Bera test (Normality of residuals):
  JB statistic: 241.764504    observations: 10001    p-value: 0.0
  with 95.0% confidence: reject null hypothesis.

Here is how to request the robust covariance estimators:

lr = regress(@formula(y ~ 1 + x^3 ), vdf, cov=["white", "nw"])

Giving:

Model definition:       y ~ 1 + :(x ^ 3)
Used observations:      101
Model statistics:
  R²: 0.979585                  Adjusted R²: 0.979379
  MSE: 466.724                  RMSE: 21.6038
Confidence interval: 95%

White's covariance estimator (HC3):
Terms ╲ Stats │       Coefs      Std err            t     Pr(>|t|)       low ci      high ci
──────────────┼─────────────────────────────────────────────────────────────────────────────
(Intercept)   │     1.23626      2.66559     0.463785      0.64382     -4.05285      6.52538
x ^ 3         │     1.04075    0.0145322      71.6169  4.30034e-87      1.01192      1.06959

Newey-West's covariance estimator:
Terms ╲ Stats │       Coefs      Std err            t     Pr(>|t|)       low ci      high ci
──────────────┼─────────────────────────────────────────────────────────────────────────────
(Intercept)   │     1.23626       2.4218     0.510472     0.610857     -3.56912      6.04165
x ^ 3         │     1.04075    0.0129463      80.3897  5.60424e-92      1.01506      1.06644

Finally, if you would like more examples, the documentation provides several additional ones.

Notable changes since version 0.72

  • Addition of cross-validation features: K-fold cross-validation and the PRESS statistic at the model level (a generic sketch of K-fold cross-validation follows this list).
  • Addition of several model statistics: Type 1 and Type 2 SS, squared partial correlation coefficients based on Type 1 and Type 2 SS, and squared semi-partial correlation coefficients based on Type 1 and Type 2 SS.
  • Additional diagnostics using frequency tables to identify categories with no observations.
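
The exact cross-validation interface is not shown in this README; as a generic illustration of what K-fold cross-validation computes for an ordinary least-squares fit, here is a plain-Julia sketch (not the package's API):

using Random, Statistics

# Average out-of-fold mean squared error over k folds.
function kfold_mse(X, y; k = 5, rng = MersenneTwister(42))
    n = size(X, 1)
    idx = shuffle(rng, 1:n)
    folds = [idx[i:k:n] for i in 1:k]          # k roughly equal folds
    mses = Float64[]
    for test in folds
        train = setdiff(idx, test)
        β̂ = X[train, :] \ y[train]             # fit on the training folds
        e = y[test] - X[test, :] * β̂           # out-of-fold residuals
        push!(mses, mean(abs2, e))
    end
    return mean(mses)
end

X = [ones(20) rand(20)]                        # made-up data
y = X * [1.0, 2.0] .+ 0.1 .* randn(20)
kfold_mse(X, y; k = 5)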