sylvaticus / BetaML.jl

Licence: MIT
Beta Machine Learning Toolkit

Programming language: Julia

Projects that are alternatives of or similar to BetaML.jl

Mlr
Machine Learning in R
Stars: ✭ 1,542 (+2309.38%)
Mutual labels:  clustering, regression
Uci Ml Api
Simple API for UCI Machine Learning Dataset Repository (search, download, analyze)
Stars: ✭ 190 (+196.88%)
Mutual labels:  clustering, regression
Tiny ml
NumPy implementations of the algorithms in Zhou Zhihua's book "Machine Learning" (《机器学习》), plus some other traditional machine learning algorithms
Stars: ✭ 129 (+101.56%)
Mutual labels:  clustering, regression
Ml
A high-level machine learning and deep learning library for the PHP language.
Stars: ✭ 1,270 (+1884.38%)
Mutual labels:  clustering, regression
scicloj.ml
A Clojure machine learning library
Stars: ✭ 152 (+137.5%)
Mutual labels:  clustering, regression
Machine learning code
Examples of machine learning and deep learning algorithms
Stars: ✭ 88 (+37.5%)
Mutual labels:  clustering, regression
Machine Learning With Python
Practice and tutorial-style notebooks covering wide variety of machine learning techniques
Stars: ✭ 2,197 (+3332.81%)
Mutual labels:  clustering, regression
Smile
Statistical Machine Intelligence & Learning Engine
Stars: ✭ 5,412 (+8356.25%)
Mutual labels:  clustering, regression
ml-book
Source code and errata for my book "A tu per tu col Machine Learning"
Stars: ✭ 16 (-75%)
Mutual labels:  clustering, regression
Machine-Learning-Algorithms
All Machine Learning Algorithms
Stars: ✭ 24 (-62.5%)
Mutual labels:  clustering, regression
Mlj.jl
A Julia machine learning framework
Stars: ✭ 982 (+1434.38%)
Mutual labels:  clustering, regression
R-stats-machine-learning
Misc Statistics and Machine Learning codes in R
Stars: ✭ 33 (-48.44%)
Mutual labels:  clustering, regression
Tribuo
Tribuo - A Java machine learning library
Stars: ✭ 882 (+1278.13%)
Mutual labels:  clustering, regression
Neuroflow
Artificial Neural Networks for Scala
Stars: ✭ 105 (+64.06%)
Mutual labels:  clustering, regression
Machine Learning Octave
🤖 MatLab/Octave examples of popular machine learning algorithms with code examples and mathematics being explained
Stars: ✭ 637 (+895.31%)
Mutual labels:  clustering, regression
Machine Learning Projects
This repository consists of all my Machine Learning Projects.
Stars: ✭ 135 (+110.94%)
Mutual labels:  clustering, regression
R
All Algorithms implemented in R
Stars: ✭ 294 (+359.38%)
Mutual labels:  clustering, regression
Tensorflow Book
Accompanying source code for Machine Learning with TensorFlow. Refer to the book for step-by-step explanations.
Stars: ✭ 4,448 (+6850%)
Mutual labels:  clustering, regression
Orange3
🍊 📊 💡 Orange: Interactive data analysis
Stars: ✭ 3,152 (+4825%)
Mutual labels:  clustering, regression
FixedEffectjlr
R interface for Fixed Effect Models
Stars: ✭ 20 (-68.75%)
Mutual labels:  clustering, regression

Beta Machine Learning Toolkit

Machine Learning made simple :-)


The Beta Machine Learning Toolkit is a package including many algorithms and utilities to implement machine learning workflows in Julia.


Currently the following models are available:

BetaML name                  MLJ Interface                                                                        Category
PerceptronClassifier         LinearPerceptron                                                                     Supervised classifier
KernelPerceptronClassifier   KernelPerceptron                                                                     Supervised classifier
PegasosClassifier            Pegasos                                                                              Supervised classifier
DecisionTreeEstimator        DecisionTreeClassifier, DecisionTreeRegressor                                        Supervised regressor and classifier
RandomForestEstimator        RandomForestClassifier, RandomForestRegressor                                        Supervised regressor and classifier
NeuralNetworkEstimator       NeuralNetworkRegressor, MultitargetNeuralNetworkRegressor, NeuralNetworkClassifier   Supervised regressor and classifier
GMMRegressor1                -                                                                                    Supervised regressor
GMMRegressor2                GaussianMixtureRegressor, MultitargetGaussianMixtureRegressor                        Supervised regressor
KMeansClusterer              KMeans                                                                               Unsupervised hard clusterer
KMedoidsClusterer            KMedoids                                                                             Unsupervised hard clusterer
GMMClusterer                 GaussianMixtureClusterer                                                             Unsupervised soft clusterer
FeatureBasedImputer          SimpleImputer                                                                        Unsupervised missing data imputer
GMMImputer                   GaussianMixtureImputer                                                               Unsupervised missing data imputer
RFImputer                    RandomForestImputer                                                                  Unsupervised missing data imputer
UniversalImputer             GeneralImputer                                                                       Unsupervised missing data imputer
MinMaxScaler                 -                                                                                    Data transformer
StandardScaler               -                                                                                    Data transformer
Scaler                       -                                                                                    Data transformer
PCA                          -                                                                                    Data transformer
OneHotEncoder                -                                                                                    Data transformer
OrdinalEncoder               -                                                                                    Data transformer
ConfusionMatrix              -                                                                                    Predictions assessment

Theoretical notes describing many of these algorithms are at the companion repository https://github.com/sylvaticus/MITx_6.86x.

All models are implemented entirely in Julia and are hosted in the repository itself (i.e. they are not wrappers around third-party models). If your favorite option or model is missing, you can try to implement it yourself and open a pull request to share it (see the section Contribute below), or request its implementation by opening an issue. Thanks to its JIT compiler, Julia sits in a sweet spot where models can be written in a high-level language and still run efficiently.
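To give a flavour of the unified API shared by all these models, here is a minimal sketch fitting a random forest on some toy data. This is an illustration, not taken verbatim from the package: the `n_trees` hyperparameter name follows the library's snake_case convention, but verify it against the API reference.

using BetaML

x = [1.0 10.5; 1.5 10.8; 1.8 8.0; 1.7 15.0; 3.2 40.0; 3.6 32.0; 3.3 38.0; 5.1 -2.3; 5.2 -2.4]
y = [0.1, 0.2, 0.2, 0.3, 0.5, 0.6, 0.5, 0.9, 0.9]

mod  = RandomForestEstimator(n_trees=30)  # 1. choose a model and set its hyperparameters
ŷ    = fit!(mod, x, y)                    # 2. fit it; supervised models return the in-sample predictions
ŷnew = predict(mod, [2.0 9.5; 4.0 35.0])  # 3. predict on new data

The same three steps (construct, fit!, predict) apply to clusterers, imputers and transformers, with unsupervised models taking only x in fit!.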

Documentation

Please refer to the package documentation or use the Julia inline help system (press the question mark ? and then, at the special help prompt help?>, type the module or function name). The package documentation has two distinct parts: the first is an extensively commented tutorial that covers most of the library, the second is the reference manual covering the library's API.
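For example, the inline help workflow looks like this (output omitted):

julia> using BetaML

help?> OneHotEncoder   # press `?` at the `julia>` prompt to get `help?>`, then type the name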

If you are looking for introductory material on Julia, have a look at the book "Julia Quick Syntax Reference" (Apress, 2019) or the online course "Scientific Programming and Machine Learning in Julia".

While implemented in Julia, this package can easily be used from R or Python employing JuliaCall or PyJulia respectively; see the relevant section in the documentation.
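For instance, a minimal sketch of driving BetaML from Python via PyJulia could look like the following (assuming Julia, PyJulia and BetaML are already installed; the `compile_modules=False` workaround is only needed with statically linked Python builds, and the Julia-side code is kept in eval strings so that names such as `fit!`, invalid in Python, need no name mangling):

import numpy as np
from julia.api import Julia

jl = Julia(compile_modules=False)  # workaround commonly needed with statically linked Python
from julia import Main

Main.eval("using BetaML")
Main.X = np.array([[1.0, 10.5], [1.5, 10.8], [1.8, 8.0], [3.2, 40.0], [3.6, 32.0]])
Main.y = np.array([0.1, 0.2, 0.2, 0.5, 0.6])

yhat = Main.eval("m = RandomForestEstimator(); fit!(m, X, y)")  # fit and get in-sample predictions
print(yhat)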

Examples

  • Using an Artificial Neural Network for multinomial categorisation

In this example we see how to train a neural network model to predict the species name (5th column) given the sepal and petal measurements (first 4 columns) in the famous iris flower dataset.

# Load Modules
using DelimitedFiles, Random
using Pipe, Plots, BetaML # Load BetaML and other auxiliary modules
Random.seed!(123);  # Fix the random seed (to obtain reproducible results).

# Load the data
iris     = readdlm(joinpath(dirname(Base.find_package("BetaML")),"..","test","data","iris.csv"),',',skipstart=1)
x        = convert(Array{Float64,2}, iris[:,1:4])
y        = convert(Array{String,1}, iris[:,5])
# Encode the categories (levels) of y using a separate column per each category (aka "one-hot" encoding) 
ohmod    = OneHotEncoder()
y_oh     = fit!(ohmod,y) 
# Split the data in training/testing sets
((xtrain,xtest),(ytrain,ytest),(ytrain_oh,ytest_oh)) = partition([x,y,y_oh],[0.8,0.2])
(ntrain, ntest) = size.([xtrain,xtest],1)

# Define the Artificial Neural Network model
l1   = DenseLayer(4,10,f=relu) # The activation function is `ReLU`
l2   = DenseLayer(10,3)        # The activation function is `identity` by default
l3   = VectorFunctionLayer(3,f=softmax) # Add a (parameterless) layer whose activation function (`softmax` in this case) is applied to all its nodes at once
mynn = NeuralNetworkEstimator(layers=[l1,l2,l3],loss=crossentropy,descr="Multinomial logistic regression Model Sepal", batch_size=2, epochs=200) # Build the NN and use cross-entropy as the loss function. Switch to auto-tuning with `autotune=true`

# Scale the data: fit the scaler on the training set only and reuse it for the test set
scaler        = Scaler()
xtrain_scaled = fit!(scaler,xtrain)    # Fit the scaler and return the scaled training data
xtest_scaled  = predict(scaler,xtest)  # Apply the scaling learned on the training data

# Train the model (using the ADAM optimizer by default)
res = fit!(mynn,xtrain_scaled,ytrain_oh) # Fit the model to the (scaled) data

# Obtain predictions and test them against the ground truth observations
ŷtrain         = @pipe predict(mynn,xtrain_scaled) |> inverse_predict(ohmod,_)  # Note the inverse one-hot encoding
ŷtest          = @pipe predict(mynn,xtest_scaled)  |> inverse_predict(ohmod,_)
train_accuracy = accuracy(ytrain,ŷtrain) # 0.975
test_accuracy  = accuracy(ytest,ŷtest)   # 0.96

# Analyse model performances
cm = ConfusionMatrix()
fit!(cm,ytest,ŷtest)
print(cm)
A ConfusionMatrix BetaMLModel (fitted)

-----------------------------------------------------------------

*** CONFUSION MATRIX ***

Scores actual (rows) vs predicted (columns):

4×4 Matrix{Any}:
 "Labels"       "virginica"    "versicolor"   "setosa"
 "virginica"   8              1              0
 "versicolor"  0             14              0
 "setosa"      0              0              7
Normalised scores actual (rows) vs predicted (columns):

4×4 Matrix{Any}:
 "Labels"       "virginica"   "versicolor"   "setosa"
 "virginica"   0.888889      0.111111       0.0
 "versicolor"  0.0           1.0            0.0
 "setosa"      0.0           0.0            1.0

 *** CONFUSION REPORT ***

- Accuracy:               0.9666666666666667
- Misclassification rate: 0.033333333333333326
- Number of classes:      3

  N Class      precision   recall  specificity  f1score  actual_count  predicted_count
                             TPR       TNR                 support                  

  1 virginica      1.000    0.889        1.000    0.941            9               8
  2 versicolor     0.933    1.000        0.938    0.966           14              15
  3 setosa         1.000    1.000        1.000    1.000            7               7

- Simple   avg.    0.978    0.963        0.979    0.969
- Weighted avg.    0.969    0.967        0.971    0.966
ϵ = info(mynn)["loss_per_epoch"]
plot(1:length(ϵ),ϵ, xlabel="epoch",ylabel="avg. error",legend=nothing,title="Avg. error per epoch on the Sepal dataset")
heatmap(info(cm)["categories"],info(cm)["categories"],info(cm)["normalised_scores"],c=cgrad([:white,:blue]),xlabel="Predicted",ylabel="Actual", title="Confusion Matrix")

  • Other examples

Further examples, with more models and more advanced techniques to improve predictions, are provided in the documentation tutorial. At the other extreme, very "micro" examples of usage of the various functions can be studied in the unit tests available in the test folder.
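Those unit tests can be run locally through the standard Julia package manager:

using Pkg
Pkg.test("BetaML")   # runs the unit tests in the package's `test` folder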

Limitations and alternative packages

The focus of the library is skewed toward user-friendliness rather than computational efficiency. While the code is (relatively) easy to read, it is not heavily optimised, and currently all models run on the CPU and only on data that fits in the computer's memory. For very large datasets we suggest the specialised packages listed below.

Category                 Packages
ML toolkits/pipelines    ScikitLearn.jl, AutoMLPipeline.jl, MLJ.jl
Neural Networks          Flux.jl, Knet
Decision Trees           DecisionTree.jl
Clustering               Clustering.jl, GaussianMixtures.jl
Missing imputation       Impute.jl

TODO

Short term

  • Implement autotuning of GMMClusterer using BIC or AIC

Mid/Long term

  • Add convolutional layers and RNN support
  • Reinforcement learning (Markov decision processes)

Contribute

Contributions to the library are welcome. We are particularly interested in the areas covered in the "TODO" list above, but we are open to other areas as well. Please consider, however, that the focus is mostly didactic/research, so clear, easy-to-read (and well-documented) code and a simple API with reasonable defaults are more important than highly optimised algorithms. For the same reason, it is fine to use verbose names. Please open an issue to discuss your ideas, or directly make a well-documented pull request to the repository. While not required by any means, if you are customising BetaML and writing, for example, your own neural network layer type (by subclassing AbstractLayer), your own sampler (by subclassing AbstractDataSampler) or your own mixture component (by subclassing AbstractMixture), please consider giving it back to the community and opening a pull request to integrate it in BetaML (see the sketch below for the rough shape of a custom layer).
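As a hedged illustration only, a custom parameterless layer (an element-wise absolute value) might look like the sketch below. The method set (forward, backward, get_params, get_gradient, set_params!, size) is recalled from the Nn module's extension interface; verify the exact names, signatures and return conventions against the reference manual before relying on it.

# Sketch of a custom layer extending BetaML's neural network API -- method names
# and the Learnable(()) convention for parameterless layers are assumptions to be
# checked against the BetaML.Nn reference manual.
using BetaML
import BetaML.Nn: AbstractLayer, Learnable, forward, backward, get_params, get_gradient, set_params!
import Base.size

struct AbsLayer <: AbstractLayer
    n::Int                                     # number of nodes (input size == output size)
end

forward(l::AbsLayer,x)                     = abs.(x)                    # layer output for input x
backward(l::AbsLayer,x,next_gradient)      = sign.(x) .* next_gradient  # dloss/dx given dloss/dy
get_params(l::AbsLayer)                    = Learnable(())              # no learnable parameters
get_gradient(l::AbsLayer,x,next_gradient)  = Learnable(())              # hence no parameter gradients
set_params!(l::AbsLayer,w)                 = nothing
size(l::AbsLayer)                          = (l.n,l.n)                  # (input size, output size); some versions may expect tuples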

Citations

If you use BetaML please cite it as:

  • Lobianco, A., (2021). BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia. Journal of Open Source Software, 6(60), 2849, https://doi.org/10.21105/joss.02849
@article{Lobianco2021,
  doi       = {10.21105/joss.02849},
  url       = {https://doi.org/10.21105/joss.02849},
  year      = {2021},
  publisher = {The Open Journal},
  volume    = {6},
  number    = {60},
  pages     = {2849},
  author    = {Antonello Lobianco},
  title     = {BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia},
  journal   = {Journal of Open Source Software}
}

Acknowledgements

The development of this package at the Bureau d'Economie Théorique et Appliquée (BETA, Nancy) was supported by the French National Research Agency through the Laboratory of Excellence ARBRE, a part of the “Investissements d'Avenir” Program (ANR 11 – LABX-0002-01).

