NumEr is a collection of Erlang NIF functions for BLAS operations on vectors and matrices with CUDA. Both vectors and matrices are implemented natively as Thrust host/device vectors, and special "buffer" classes are used to transfer them from Erlang to CUDA and back.

Installation on Windows x64

git clone git://github.com/vascokk/NumEr.git
cd NumEr

All commands from this point on should be executed in a VC++ 10.0 command-line window.

set TARGET_ARCH=x64

Make sure you have the following bat file:

C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64\vcvars64.bat

with this line inside:

call "C:\Program Files\Microsoft SDKs\Windows\v7.1\Bin\SetEnv.cmd" /x64

Execute the above bat file and compile:

mkdir priv
rebar compile
rebar eunit suites=numer_helpers_tests

You should see:

==> numer (eunit)
======================== EUnit ========================
module 'numer_helpers_tests'
  numer_helpers_tests: gemm_test...[0.421 s] ok
  numer_helpers_tests: gemm_2_test...[0.375 s] ok
  numer_helpers_tests: sum_by_cols_test...[0.375 s] ok
  numer_helpers_tests: gemv_test...[0.437 s] ok
  numer_helpers_tests: gemv_2_test...[0.421 s] ok
  numer_helpers_tests: gemv_3_test...[0.405 s] ok
  numer_helpers_tests: saxpy_test...[0.390 s] ok
  numer_helpers_tests: smm_test...[0.390 s] ok
  numer_helpers_tests: m2v_test...ok
  numer_helpers_tests: v2m_test...ok
  numer_helpers_tests: transpose_test...[0.390 s] ok
  numer_helpers_tests: sigmoid_test...[0.390 s] ok
  numer_helpers_tests: sigmoid_2_test...[0.453 s] ok
  numer_helpers_tests: tanh_test...[0.390 s] ok
  numer_helpers_tests: tanh_2_test...[0.406 s] ok
  numer_helpers_tests: log_test...[0.390 s] ok
  numer_helpers_tests: log_2_test...[0.390 s] ok
  numer_helpers_tests: ones_test...ok
  numer_helpers_tests: ones_2_test...ok
  numer_helpers_tests: zeros_test...ok
  numer_helpers_tests: zeros_2_test...ok
  [done in 6.349 s]
=======================================================
  All 21 tests passed.

Mac OS X

Follow NVIDIA's Getting Started guide for Mac OS X closely, then set the environment variables:

export PATH=/Developer/NVIDIA/CUDA-5.0/bin:$PATH
export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-5.0/lib:$DYLD_LIBRARY_PATH

Compile and run eunit:

mkdir priv
./rebar compile
./rebar eunit suites=numer_helpers_tests

TODO: Linux

Operations with vectors and matrices

% this is a row-major matrix:
A = [[4.0,6.0,8.0,2.0],[5.0,7.0,9.0,3.0]].

% this is a vector:
X = [2.0,5.0,1.0,7.0].

% create a CUDA context and transfer to "buffers"
{ok, Ctx} = numer_nifs:new_context().
{ok, Buf_A} = numer_nifs:new_matrix_float_buffer(Ctx, A, ?ROW_MAJOR).
{ok, Buf_X} = numer_nifs:new_float_buffer(Ctx).
numer_nifs:write_buffer(Buf_X, X).

As you can see, one of the parameters of the matrix buffer is "?ROW_MAJOR". The idea is borrowed from the Boost library, but it is not yet fully implemented in NumEr: currently only row-major matrices are supported as input. Under the hood, however, the numbers are stored in the Thrust vectors in column-major format. I chose this approach because the CUBLAS library uses column-major storage, being a derivative of the FORTRAN BLAS library.
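To make the two layouts concrete, here is a minimal pure-Erlang sketch that flattens a row-major list of rows into column-major order and back. It only illustrates the storage orders; it is not NumEr's conversion code, and the module and function names (layout_sketch, to_col_major/1, from_col_major/2) are made up for this example.

```erlang
-module(layout_sketch).
-export([to_col_major/1, from_col_major/2]).

%% Flatten a row-major matrix (a list of equal-length rows) into a flat
%% list in column-major order: all of column 1, then column 2, and so on.
to_col_major([]) -> [];
to_col_major([[] | _]) -> [];
to_col_major(Rows) ->
    Heads = [hd(R) || R <- Rows],   % the current column
    Tails = [tl(R) || R <- Rows],   % the remaining columns
    Heads ++ to_col_major(Tails).

%% Rebuild the list-of-rows form from a column-major flat list,
%% given the number of rows M.
from_col_major([], _M) -> [];
from_col_major(Flat, M) when M > 0, length(Flat) rem M =:= 0 ->
    transpose(chunk(Flat, M)).

%% Split a flat list into consecutive chunks of N elements (the columns).
chunk([], _N) -> [];
chunk(List, N) ->
    {Col, Rest} = lists:split(N, List),
    [Col | chunk(Rest, N)].

%% Turn a list of columns into a list of rows.
transpose([]) -> [];
transpose([[] | _]) -> [];
transpose(Cols) ->
    [[hd(C) || C <- Cols] | transpose([tl(C) || C <- Cols])].
```

For the matrix A above, layout_sketch:to_col_major(A) returns [4.0,5.0,6.0,7.0,8.0,9.0,2.0,3.0], and layout_sketch:from_col_major/2 with M = 2 turns that list back into A.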

Several modules wrap the NIF functions, for example numer_blas.erl for BLAS operations and numer_buffer.erl for operations with buffers (new, delete, read, write).

Using the numer_buffer module, the example above looks like this:

{ok, Ctx} = numer_context:new().
{ok, Buf_A} = numer_buffer:new(Ctx, matrix, float, row_major, A).
{ok, Buf_X} = numer_buffer:new(Ctx, float).
numer_buffer:write(Buf_X, X).

BLAS GEMV example:

% GEMV: y <- α op(A) x + β y
gemv_test()->
    {ok, Ctx} = numer_context:new(),
    A = [[4.0,6.0,8.0,2.0],[5.0,7.0,9.0,3.0]],
    _m = 2, % rows of A
    _n = 4, % columns of A
    _alpha = 1.0,
    _beta = 0.0,
    X = [2.0,5.0,1.0,7.0],
    Y = [0.0, 0.0],
    % transfer A, X and Y to buffers
    {ok, Buf_A} = numer_buffer:new(Ctx, matrix, float, row_major, A),
    {ok, Buf_X} = numer_buffer:new(Ctx, float),
    numer_buffer:write(Buf_X, X),
    {ok, Buf_Y} = numer_buffer:new(Ctx, float),
    numer_buffer:write(Buf_Y, Y),
    % compute y <- α A x + β y; the result ends up in Buf_Y
    ok = numer_blas:gemv(Ctx, no_transpose, _m, _n, _alpha, Buf_A, Buf_X, _beta, Buf_Y),
    {ok, [60.0,75.0]} = numer_buffer:read(Buf_Y),
    % release the buffers and the context
    ok = numer_buffer:destroy(Buf_A),
    ok = numer_buffer:destroy(Buf_X),
    ok = numer_buffer:destroy(Buf_Y),
    ok = numer_context:destroy(Ctx).
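
As a cross-check of the expected result, the same GEMV can be computed on the CPU with plain Erlang lists. This is only a reference sketch for the no-transpose case; the module name gemv_sketch and its gemv/5 function are hypothetical and not part of NumEr.

```erlang
-module(gemv_sketch).
-export([gemv/5]).

%% CPU reference for y <- Alpha * A * X + Beta * Y (no transpose),
%% with A given as a list of rows and X, Y as flat lists.
gemv(Alpha, A, X, Beta, Y) ->
    [Alpha * dot(Row, X) + Beta * Yi || {Row, Yi} <- lists:zip(A, Y)].

%% Dot product of two equal-length lists.
dot(U, V) ->
    lists:sum([Ui * Vi || {Ui, Vi} <- lists:zip(U, V)]).
```

With the A, X and Y from the test above, gemv_sketch:gemv(1.0, A, X, 0.0, Y) returns [60.0,75.0], matching the value read back from Buf_Y.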

Using "helpers" module

Since buffer operations can make the code awkward to read, there is also a helper module, numer_helpers.erl, which can be used for prototyping algorithms. WARNING: do not use this module in iterative algorithms (e.g. Machine Learning). Use it for prototyping or for one-off calculations with big matrices/vectors. Here is how:

gemv_2_test()->
    A = [[4.0,6.0,8.0,2.0],[5.0,7.0,9.0,3.0]],
    _alpha = 1.0,
    _beta = 0.0,
    X = [2.0,5.0,1.0,7.0],
    Res = numer_helpers:gemv(no_transpose , _alpha, A, X, _beta, []),
    ?assertEqual([60.0,75.0], Res).

This is much more readable and is useful for one-off calculations, but in the ML "training" stage (with hundreds of iterations) it will be unusable because of the repeated buffer transfers.

Machine Learning with Erlang & CUDA - "Logistic Regression"

NumEr includes an implementation of the Logistic Regression learning function (without regularization) with Gradient Descent optimization. Take a look at learn_buf() and gradient_descent() in the numer_logreg.erl module and run the eunit test:

Windows 7, GPU - Quadro FX 1800M 1GB:

rebar eunit suites=numer_logreg_tests tests=learn_buf2_test

NOTICE: Using experimental option 'tests'
    Running test function(s):
      numer_logreg_tests:learn_buf2_test/0
======================== EUnit ========================
numer_logreg_tests: learn_buf2_test...test/numer_logreg_tests.erl:108:<0.187.0>:
 Learned:[-0.02788488380610943,0.010618738830089569,6.68175402097404e-4]
[3.557 s] ok
=======================================================
  Test passed.

In learn_buf2_test() all the "buffers" needed are created up front and passed to the NIFs, in order to avoid repeated buffer creation and transfers during the iterations.

The same test on a MacBook Pro with a GeForce 9400M 256 MB:

NOTICE: Using experimental option 'tests'
    Running test function(s):
      numer_logreg_tests:learn_buf2_test/0
======================== EUnit ========================
numer_logreg_tests: learn_buf2_test...test/numer_logreg_tests.erl:108:<0.197.0>: 
 Learned:[-0.02788488380610943,0.010618738830089569,6.68175402097404e-4]
[0.430 s] ok
=======================================================
  Test passed.

The numer_ml.erl module contains a C++ implementation of Logistic Regression (via a single NIF function), while numer_logreg.erl uses the numer_blas.erl module. I used the first one to compare the speed of the "native" implementation with the one built on NumEr modules. There is a considerable difference between the two on Windows (using NumEr modules: 3.5 sec, using a single NIF: under 1 sec) and almost no difference on Mac (both under 0.5 sec).
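
For readers unfamiliar with the computation being timed, here is a minimal CPU-only sketch of unregularized logistic regression trained by batch gradient descent, written with plain Erlang lists. It is a textbook formulation, not the code in numer_logreg.erl or numer_ml.erl; the module name logreg_sketch, the function signatures, and the learning rate and iteration count in the usage note are all assumptions made for illustration.

```erlang
-module(logreg_sketch).
-export([sigmoid/1, gradient/3, gradient_descent/5]).

%% Logistic function. math:exp/1 can raise badarith for very large
%% arguments, so this sketch is only meant for small, well-scaled data.
sigmoid(Z) -> 1.0 / (1.0 + math:exp(-Z)).

%% Dot product of two equal-length lists.
dot(U, V) -> lists:sum([A * B || {A, B} <- lists:zip(U, V)]).

%% Gradient of the unregularized logistic loss:
%%   grad_j = (1/M) * sum_i (sigmoid(Theta . x_i) - y_i) * x_ij
%% X is a list of rows (one per example), Y a list of 0.0/1.0 labels.
gradient(Theta, X, Y) ->
    M = length(Y),
    Errors = [sigmoid(dot(Theta, Row)) - Yi || {Row, Yi} <- lists:zip(X, Y)],
    Zero = [0.0 || _ <- Theta],
    Sums = lists:foldl(
             fun({Row, E}, Acc) ->
                     [S + E * Xij || {S, Xij} <- lists:zip(Acc, Row)]
             end, Zero, lists:zip(X, Errors)),
    [S / M || S <- Sums].

%% Run a fixed number of gradient descent steps and return the final Theta.
gradient_descent(Theta, _X, _Y, _LearningRate, 0) ->
    Theta;
gradient_descent(Theta, X, Y, LearningRate, Iter) when Iter > 0 ->
    Grad = gradient(Theta, X, Y),
    NewTheta = [T - LearningRate * G || {T, G} <- lists:zip(Theta, Grad)],
    gradient_descent(NewTheta, X, Y, LearningRate, Iter - 1).
```

A call such as logreg_sketch:gradient_descent([0.0, 0.0], X, Y, 0.1, 100) returns the learned parameters as a list. Every iteration needs a matrix-vector product and an element-wise sigmoid, which is why keeping the data in buffers across iterations matters for the GPU versions.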

The project is still a work in progress and needs a lot of polishing. If anyone is willing to give a hand, I'll be more than happy. Any suggestions to improve the framework are also very welcome.
