vascokk / Numer

Numeric Erlang - vector and matrix operations with CUDA. Heavily inspired by Pteracuda - https://github.com/kevsmith/pteracuda

This is a collection of Erlang NIF functions for BLAS operations on vectors and matrices with CUDA. Vectors and matrices are both implemented natively as Thrust host/device vectors, and special "buffer" classes transfer them from Erlang to CUDA and back.

Installation on Windows x64

git clone https://github.com/vascokk/NumEr.git
cd NumEr

All commands from this point on should be executed in a VC++ 10.0 command-line window:

set TARGET_ARCH=x64

Make sure you have the following bat file:

C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64\vcvars64.bat

with this line inside:

call "C:\Program Files\Microsoft SDKs\Windows\v7.1\Bin\SetEnv.cmd" /x64

Execute the above bat file and compile:

mkdir priv
rebar compile
rebar eunit suites=numer_helpers_tests

You should see:

==> numer (eunit)
======================== EUnit ========================
module 'numer_helpers_tests'
  numer_helpers_tests: gemm_test...[0.421 s] ok
  numer_helpers_tests: gemm_2_test...[0.375 s] ok
  numer_helpers_tests: sum_by_cols_test...[0.375 s] ok
  numer_helpers_tests: gemv_test...[0.437 s] ok
  numer_helpers_tests: gemv_2_test...[0.421 s] ok
  numer_helpers_tests: gemv_3_test...[0.405 s] ok
  numer_helpers_tests: saxpy_test...[0.390 s] ok
  numer_helpers_tests: smm_test...[0.390 s] ok
  numer_helpers_tests: m2v_test...ok
  numer_helpers_tests: v2m_test...ok
  numer_helpers_tests: transpose_test...[0.390 s] ok
  numer_helpers_tests: sigmoid_test...[0.390 s] ok
  numer_helpers_tests: sigmoid_2_test...[0.453 s] ok
  numer_helpers_tests: tanh_test...[0.390 s] ok
  numer_helpers_tests: tanh_2_test...[0.406 s] ok
  numer_helpers_tests: log_test...[0.390 s] ok
  numer_helpers_tests: log_2_test...[0.390 s] ok
  numer_helpers_tests: ones_test...ok
  numer_helpers_tests: ones_2_test...ok
  numer_helpers_tests: zeros_test...ok
  numer_helpers_tests: zeros_2_test...ok
  [done in 6.349 s]
=======================================================
  All 21 tests passed.

Mac OS X

Follow NVIDIA's Mac OS X Getting Started guide closely, then set the environment variables:

export PATH=/Developer/NVIDIA/CUDA-5.0/bin:$PATH
export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-5.0/lib:$DYLD_LIBRARY_PATH

Compile and run eunit:

mkdir priv
./rebar compile
./rebar eunit suites=numer_helpers_tests

TODO: Linux

Operations with vectors and matrices

% this is a row-major matrix:
A = [[4.0,6.0,8.0,2.0],[5.0,7.0,9.0,3.0]].

% this is a vector:
X = [2.0,5.0,1.0,7.0].

% create a CUDA context and transfer to "buffers"
{ok, Ctx} = numer_nifs:new_context().
{ok, Buf_A} = numer_nifs:new_matrix_float_buffer(Ctx, A, ?ROW_MAJOR).
{ok, Buf_X} = numer_nifs:new_float_buffer(Ctx).
numer_nifs:write_buffer(Buf_X, X).

As you can see, one of the parameters of the matrix buffer is "?ROW_MAJOR". The idea is borrowed from the Boost library, but it is not yet fully implemented in NumEr: currently only row-major matrices are supported as input. Under the hood, however, the numbers are stored in the Thrust vectors in column-major format. I chose this approach because the CUBLAS library uses column-major storage, being a derivative of the FORTRAN BLAS library.
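
The difference between the two layouts can be sketched in plain Erlang. This is only an illustration of the reordering, not the code NumEr runs (the real conversion happens inside the NIFs); the module name layout_sketch and both functions are hypothetical.

```erlang
-module(layout_sketch).
-export([to_column_major/1, from_column_major/2]).

%% Flatten a row-major matrix (a list of equal-length rows) into a flat
%% list in column-major order: all of column 1, then column 2, and so on.
to_column_major([]) -> [];
to_column_major([[] | _]) -> [];
to_column_major(Rows) ->
    Heads = [hd(Row) || Row <- Rows],   % current column, top to bottom
    Tails = [tl(Row) || Row <- Rows],   % remaining columns
    Heads ++ to_column_major(Tails).

%% Rebuild the list of rows from a column-major flat list and the row count.
%% The element at 0-based flat index I belongs to row (I rem NumRows).
from_column_major(Flat, NumRows) ->
    Indexed = lists:zip(lists:seq(0, length(Flat) - 1), Flat),
    [[V || {I, V} <- Indexed, I rem NumRows =:= R]
     || R <- lists:seq(0, NumRows - 1)].
```

For the matrix A above, to_column_major(A) gives [4.0,5.0,6.0,7.0,8.0,9.0,2.0,3.0], and from_column_major/2 with two rows turns that list back into A.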

There are several modules that wrap the NIF functions, for example numer_blas.erl for BLAS operations and numer_buffer.erl for buffer operations (new, delete, read, write).

Using the numer_buffer module, the above example looks like this:

 {ok, Ctx} = numer_context:new().
 {ok, Buf_A} = numer_buffer:new(Ctx, matrix, float, row_major, A).
 {ok, Buf_X} = numer_buffer:new(Ctx, float).
 numer_buffer:write(Buf_X, X).

BLAS GEMV example:

% GEMV: y <- α op(A) x + β y
gemv_test()->
    {ok, Ctx} = numer_context:new(),
    A = [[4.0,6.0,8.0,2.0],[5.0,7.0,9.0,3.0]],
    _m = 2, %rows A
    _n = 4, %columns A
    _alpha = 1.0,
    _beta = 0.0,
    X = [2.0,5.0,1.0,7.0],
    Y = [0.0, 0.0], 
    {ok, Buf_A} = numer_buffer:new(Ctx, matrix, float, row_major, A),
    {ok, Buf_X} = numer_buffer:new(Ctx, float),
    numer_buffer:write(Buf_X, X),
    {ok, Buf_Y} = numer_buffer:new(Ctx, float),
    numer_buffer:write(Buf_Y, Y),
    ok = numer_blas:gemv(Ctx, no_transpose , _m, _n, _alpha, Buf_A, Buf_X, _beta, Buf_Y),
    {ok, [60.0,75.0]} = numer_buffer:read(Buf_Y),
    ok = numer_buffer:destroy(Buf_A),
    ok = numer_buffer:destroy(Buf_X),
    ok = numer_buffer:destroy(Buf_Y),
    ok = numer_context:destroy(Ctx).
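
To check the expected result by hand: the first row gives 4*2 + 6*5 + 8*1 + 2*7 = 60 and the second gives 5*2 + 7*5 + 9*1 + 3*7 = 75. The same calculation can be sketched on the CPU in plain Erlang. The module name gemv_ref and its gemv/5 function are hypothetical and not part of NumEr; this is only a reference for what the GPU call is expected to return in the no_transpose case.

```erlang
-module(gemv_ref).
-export([gemv/5]).

%% CPU reference for y <- Alpha * A * x + Beta * y.
%% A is a row-major matrix (a list of rows); no transpose is applied.
gemv(Alpha, A, X, Beta, Y) ->
    [Alpha * dot(Row, X) + Beta * Yi || {Row, Yi} <- lists:zip(A, Y)].

%% Dot product of two equal-length lists.
dot(Xs, Ys) ->
    lists:sum([Xi * Yi || {Xi, Yi} <- lists:zip(Xs, Ys)]).
```

Calling gemv_ref:gemv(1.0, A, X, 0.0, [0.0, 0.0]) with the A and X from the test above returns [60.0,75.0], matching the value read back from Buf_Y.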

Using the "helpers" module

Since buffer operations can make the code awkward to read, there is also a helper module, numer_helpers.erl, which can be used for prototyping algorithms. WARNING: do not use this module in iterative algorithms (e.g. Machine Learning). Use it for prototyping or one-off calculations with big matrices/vectors. Here is how:

gemv_2_test()->
    A = [[4.0,6.0,8.0,2.0],[5.0,7.0,9.0,3.0]],
    _alpha = 1.0,
    _beta = 0.0,
    X = [2.0,5.0,1.0,7.0],
    Res = numer_helpers:gemv(no_transpose , _alpha, A, X, _beta, []),
    ?assertEqual([60.0,75.0], Res).

This is much more readable and is useful for one-off calculations, but in the ML "training" stage (with hundreds of iterations) it will be unusable, because the buffer transfers are repeated on every call.

Machine Learning with Erlang & CUDA - "Logistic Regression"

NumEr includes an implementation of the Logistic Regression learning function (without regularization), optimized with Gradient Descent. Take a look at learn_buf() and gradient_descent() in the numer_logreg.erl module and run the eunit test:

Windows 7, GPU - Quadro FX 1800M 1GB:

rebar eunit suites=numer_logreg_tests tests=learn_buf2_test

NOTICE: Using experimental option 'tests'
    Running test function(s):
      numer_logreg_tests:learn_buf2_test/0
======================== EUnit ========================
numer_logreg_tests: learn_buf2_test...test/numer_logreg_tests.erl:108:<0.187.0>:
 Learned:[-0.02788488380610943,0.010618738830089569,6.68175402097404e-4]
[3.557 s] ok
=======================================================
  Test passed.

In learn_buf2_test() all the "buffers" needed are created up front and passed to the NIFs, in order to avoid repeated buffer creations and transfers during the iterations.

The same test on a MacBook Pro with a GeForce 9400M (256 MB):

NOTICE: Using experimental option 'tests'
    Running test function(s):
      numer_logreg_tests:learn_buf2_test/0
======================== EUnit ========================
numer_logreg_tests: learn_buf2_test...test/numer_logreg_tests.erl:108:<0.197.0>: 
 Learned:[-0.02788488380610943,0.010618738830089569,6.68175402097404e-4]
[0.430 s] ok
=======================================================
  Test passed.

The numer_ml.erl module contains a C++ implementation of Logistic Regression (via a single NIF function), while numer_logreg.erl uses the numer_blas.erl module. I used the first one to compare the speed of the "native" implementation against the one built on NumEr modules. There is a considerable difference between the two on Windows (NumEr modules: 3.5 sec, single NIF: under 1 sec) and almost no difference on Mac (both under 0.5 sec).
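
For readers who want to see what both implementations compute, here is a minimal CPU-only sketch of batch gradient descent for unregularized logistic regression in plain Erlang. It is not the code from numer_logreg.erl or numer_ml.erl; the module name logreg_sketch, the function names and the argument order are all hypothetical, and there is no guarding against overflow in math:exp/1 for large inputs.

```erlang
-module(logreg_sketch).
-export([sigmoid/1, gradient_step/4, gradient_descent/5]).

%% Logistic function.
sigmoid(Z) -> 1.0 / (1.0 + math:exp(-Z)).

%% Dot product of two equal-length lists.
dot(Xs, Ys) -> lists:sum([A * B || {A, B} <- lists:zip(Xs, Ys)]).

%% One batch update over M samples:
%%   Theta_j <- Theta_j - (Alpha / M) * sum_i (sigmoid(x_i . Theta) - y_i) * x_ij
%% X is a list of feature rows, Y the list of 0.0/1.0 labels.
gradient_step(X, Y, Theta, Alpha) ->
    M = length(X),
    Errors = [sigmoid(dot(Row, Theta)) - Yi || {Row, Yi} <- lists:zip(X, Y)],
    %% Accumulate the gradient one sample at a time.
    Grad = lists:foldl(
             fun({Row, E}, Acc) ->
                     [G + E * Xij || {G, Xij} <- lists:zip(Acc, Row)]
             end,
             [0.0 || _ <- Theta],
             lists:zip(X, Errors)),
    [T - Alpha * G / M || {T, G} <- lists:zip(Theta, Grad)].

%% Repeat the update a fixed number of times.
gradient_descent(_X, _Y, Theta, _Alpha, 0) -> Theta;
gradient_descent(X, Y, Theta, Alpha, Iters) when Iters > 0 ->
    NewTheta = gradient_step(X, Y, Theta, Alpha),
    gradient_descent(X, Y, NewTheta, Alpha, Iters - 1).
```

Each iteration is a matrix-vector product, an element-wise sigmoid and a second matrix-vector product, which is why keeping the operands on the GPU between iterations matters for the timings above.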

The project is still a work in progress and needs a lot of polishing. If anyone is willing to give a hand, I'll be more than happy. Any suggestions to improve the framework are also very welcome.