fml-fam / fml

Licence: BSL-1.0
Fused Matrix Library

Programming Languages

  • C++
  • Cuda
  • C
  • Makefile
  • Singularity
  • Dockerfile

Projects that are alternatives to or similar to fml

Armadillo Code
Armadillo: fast C++ library for linear algebra & scientific computing - http://arma.sourceforge.net
Stars: ✭ 388 (+1516.67%)
Mutual labels:  hpc, matrix, linear-algebra, blas
monolish
monolish: MONOlithic LInear equation Solvers for Highly-parallel architecture
Stars: ✭ 166 (+591.67%)
Mutual labels:  hpc, matrix, linear-algebra, blas
dbcsr
DBCSR: Distributed Block Compressed Sparse Row matrix library
Stars: ✭ 65 (+170.83%)
Mutual labels:  hpc, linear-algebra, mpi, blas
Fmatvec
A fast vector/matrix library
Stars: ✭ 5 (-79.17%)
Mutual labels:  matrix, linear-algebra, blas
Vectorious
Linear algebra in TypeScript.
Stars: ✭ 616 (+2466.67%)
Mutual labels:  matrix, linear-algebra, blas
Eigen Git Mirror
THIS MIRROR IS DEPRECATED -- New url: https://gitlab.com/libeigen/eigen
Stars: ✭ 1,659 (+6812.5%)
Mutual labels:  matrix, linear-algebra, blas
Lacaml
OCaml bindings for BLAS/LAPACK (high-performance linear algebra Fortran libraries)
Stars: ✭ 101 (+320.83%)
Mutual labels:  matrix, linear-algebra, blas
float
Single precision (float) matrices for R.
Stars: ✭ 41 (+70.83%)
Mutual labels:  hpc, matrix, linear-algebra
pbdML
No description or website provided.
Stars: ✭ 13 (-45.83%)
Mutual labels:  hpc, linear-algebra, mpi
Blasjs
Pure Javascript manually written 👌 implementation of BLAS, Many numerical software applications use BLAS computations, including Armadillo, LAPACK, LINPACK, GNU Octave, Mathematica, MATLAB, NumPy, R, and Julia.
Stars: ✭ 241 (+904.17%)
Mutual labels:  matrix, linear-algebra, blas
PartitionedArrays.jl
Vectors and sparse matrices partitioned into pieces for parallel distributed-memory computations.
Stars: ✭ 45 (+87.5%)
Mutual labels:  hpc, linear-algebra, mpi
t8code
Parallel algorithms and data structures for tree-based AMR with arbitrary element shapes.
Stars: ✭ 37 (+54.17%)
Mutual labels:  hpc, mpi
Singularity-tutorial
Singularity 101
Stars: ✭ 31 (+29.17%)
Mutual labels:  hpc, mpi
linnea
Linnea is an experimental tool for the automatic generation of optimized code for linear algebra problems.
Stars: ✭ 60 (+150%)
Mutual labels:  linear-algebra, blas
pressio
Model reduction for linear and nonlinear dynamical systems: core C++ library
Stars: ✭ 35 (+45.83%)
Mutual labels:  hpc, linear-algebra
az-hop
The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
Stars: ✭ 33 (+37.5%)
Mutual labels:  hpc, mpi
arbor
The Arbor multi-compartment neural network simulation library.
Stars: ✭ 87 (+262.5%)
Mutual labels:  hpc, mpi
hpc
Learning and practice of high performance computing (CUDA, Vulkan, OpenCL, OpenMP, TBB, SSE/AVX, NEON, MPI, coroutines, etc. )
Stars: ✭ 39 (+62.5%)
Mutual labels:  hpc, mpi
gslib
sparse communication library
Stars: ✭ 22 (-8.33%)
Mutual labels:  hpc, mpi
sparse
Sparse matrix formats for linear algebra supporting scientific and machine learning applications
Stars: ✭ 136 (+466.67%)
Mutual labels:  matrix, blas

fml

fml is the Fused Matrix Library, a multi-source, header-only C++ library for dense matrix computing. The emphasis is on real-valued matrix types (float, double, and __half) for numerical operations useful for data analysis.

The goal of fml is to be "medium-level": high-level compared to working directly with, e.g., the BLAS or CUDA™, but low(er)-level compared to other C++ matrix frameworks. Some familiarity with LAPACK will make many of fml's design choices easier to understand.
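To give a rough sense of where "medium-level" sits, compare the raw LAPACK interface for a singular value decomposition against the single fml call used in the examples later in this README. The prototype below is the standard LAPACK sgesvd interface and is shown only for contrast.

// Raw LAPACK: the caller manages workspace queries, leading dimensions,
// and error codes explicitly.
extern "C" void sgesvd_(const char *jobu, const char *jobvt, const int *m,
                        const int *n, float *a, const int *lda, float *s,
                        float *u, const int *ldu, float *vt, const int *ldvt,
                        float *work, const int *lwork, int *info);

// fml: one call on the matrix object; memory management is handled by the
// cpumat/cpuvec classes themselves.
//   fml::linalg::svd(x, s);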

The library provides 4 main classes: cpumat, gpumat, parmat, and mpimat. These are mostly what they sound like, but the particular details are:

  • CPU: Single-node CPU computing (multi-threaded if using a multi-threaded BLAS and linking with OpenMP).
  • GPU: Single-GPU computing.
  • MPI: Multi-node computing via ScaLAPACK (plus GPUs if using SLATE).
  • PAR: Multi-node and/or multi-GPU computing.

Object construction differs somewhat from one class to another, but the high-level APIs are largely the same across them. The goal is to be able to quickly create laptop-scale prototypes that are then easily converted into large-scale GPU, multi-node, multi-GPU, or multi-node + multi-GPU codes.
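As a sketch of what that portability looks like in practice (this is illustrative and not taken from fml's documentation), a routine templated over the matrix and vector types compiles unchanged against any backend, since it only touches the shared high-level API:

#include <fml/cpu.hh>
using namespace fml;

// Templated over the matrix/vector types, so the same routine works with
// cpumat/cpuvec, mpimat/cpuvec, etc.; only how x and s are constructed differs.
template <typename MAT, typename VEC>
void singular_values(MAT &x, VEC &s)
{
  linalg::svd(x, s);
}

int main()
{
  cpumat<float> x(3, 2);
  x.fill_linspace(1.f, 6.f);
  
  cpuvec<float> s;
  singular_values(x, s);
  s.print();
  
  return 0;
}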

Installation

The library is header-only so no installation is strictly necessary. You can just include a copy/submodule in your project. However, if you want some analogue of make install, then you could do something like:

ln -s ./src/fml /usr/include/
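If you would rather vendor the library, a git submodule works just as well; the URL below assumes the project is hosted under the fml-fam organization on GitHub, as the page header suggests:

git submodule add https://github.com/fml-fam/fml.git

Then add the submodule's src/ directory to your include path (e.g. -I./fml/src), as in the build commands shown below.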

Dependencies and Other Software

There are no external header dependencies, but there are some shared libraries you will need to have; see the per-class guidelines below.

Other software we use:

  • Tests use catch2 (a copy of which is included under tests/).

You can find some examples of how to use the library in the examples/ tree. Right now there is no real build system beyond some ad hoc makefiles; but ad hoc is better than no hoc.

Depending on which class(es) you want to use, here are some general guidelines for using the library in your own project:

  • CPU: cpumat
    • Compile with your favorite C++ compiler.
    • Link with LAPACK and BLAS (and ideally with OpenMP).
  • GPU: gpumat
    • Compile with nvcc.
    • For most functionality, link with libcudart, libcublas, and libcusolver. Link with libcurand if using the random generators. Link with libnvidia-ml if using nvml (if you're only using this, then you don't need nvcc; an ordinary C++ compiler will do). If you have CUDA installed and do not know what to link with, there is no harm in linking with all of these.
  • MPI: mpimat
    • Compile with mpicxx.
    • Link with libscalapack.
  • PAR: parmat
    • Compile with mpicxx.
    • Link with CPU stuff if using parmat_cpu; link with GPU stuff if using parmat_gpu (you can use both).

Check the makefiles in the examples/ tree if none of that makes sense.
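As a rough illustration of the GPU case (the source file name here is just a placeholder, and library locations/names vary across CUDA installations), a build line might look something like:

nvcc -I/path/to/fml/src example.cu -o example -lcudart -lcublas -lcusolver

Add -lcurand and/or -lnvidia-ml only if you use the random generators or nvml, per the list above.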

Example

Here's a simple example computing the SVD with some data held on a single CPU:

#include <fml/cpu.hh>
using namespace fml;

int main()
{
  len_t m = 3;
  len_t n = 2;
  
  // 3x2 single-precision matrix filled column-wise with 1, 2, ..., 6
  cpumat<float> x(m, n);
  x.fill_linspace(1.f, (float)m*n);
  
  x.info();
  x.print(0);
  
  // compute the singular values of x
  cpuvec<float> s;
  linalg::svd(x, s);
  
  s.info();
  s.print();
  
  return 0;
}

Save as svd.cpp and build with:

g++ -I/path/to/fml/src -fopenmp svd.cpp -o svd -llapack -lblas

You should see output like:

# cpumat 3x2 type=f
1 4 
2 5 
3 6 

# cpuvec 2 type=f
9.5080 0.7729 

The API is largely the same if we change the object storage; what changes is how the objects are initialized. For example, if x is an object of class mpimat, we still call linalg::svd(x, s). Here is how we might change the above example to use distributed data:

#include <fml/mpi.hh>
using namespace fml;

int main()
{
  // set up a square process grid for the distributed matrix
  grid g = grid(PROC_GRID_SQUARE);
  g.info();
  
  len_t m = 3;
  len_t n = 2;
  
  // 3x2 matrix distributed over the grid with a 1x1 blocking factor
  mpimat<float> x(g, m, n, 1, 1);
  x.fill_linspace(1.f, (float)m*n);
  
  x.info();
  x.print(0);
  
  // the singular values come back in an ordinary (non-distributed) cpuvec
  cpuvec<float> s;
  linalg::svd(x, s);
  
  // print only once, from the first process in the grid
  if (g.rank0())
  {
    s.info();
    s.print();
  }
  
  // shut down the grid and MPI
  g.exit();
  g.finalize();
  
  return 0;
}

In practice, using such small block sizes for an MPI matrix is probably not a good idea; we only do so for the sake of demonstration (we want each process to own some data). We can build this new example via:

mpicxx -I/path/to/fml/src -fopenmp svd.cpp -o svd -lscalapack-openmpi

We can launch the example with multiple processes via

mpirun -np 4 ./svd

And here we see:

## Grid 0 2x2

# mpimat 3x2 on 2x2 grid type=f
1 4 
2 5 
3 6 

# cpuvec 2 type=f
9.5080 0.7729 

High-Level Language Bindings

Header and API Stability

tldr:

  • Use the super headers (or read the long explanation)
    • CPU - fml/cpu.hh
    • GPU - fml/gpu.hh
    • MPI - fml/mpi.hh
    • PAR still evolving
  • Existing APIs are largely stable. Most changes will be additions rather than modifications.

The project is young and things are still mostly evolving. The current status is:

Headers

There are currently "super headers" for the CPU (fml/cpu.hh), GPU (fml/gpu.hh), and MPI (fml/mpi.hh) backends. These include all relevant sub-headers, and they are "frozen" in the sense that they will not move and will always include everything; as more namespaces are added, those too will be included in the super headers. The headers one folder level deep (e.g. those in fml/cpu) are similarly frozen, although more may be added over time. Headers two folder levels deep should be considered internal and may change at any time.
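Concretely, user code should stick to the super headers, which is exactly what the examples above do:

#include <fml/cpu.hh>  // everything used by the CPU example: cpumat, cpuvec, linalg
#include <fml/mpi.hh>  // everything used by the MPI example: grid, mpimat, linalg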

API

  • Frozen: Existing APIs will not be developed further.
    • none
  • Stable: Existing APIs are not expected to change. Some new features may be added slowly.
    • cpumat/gpumat/mpimat classes
    • copy namespace functions
    • linalg namespace functions (all but parmat)
  • Stabilizing: Core class naming and construction/destruction are probably finalized. Function/method names and arguments are solidifying, but may still change somewhat. New features are still being developed.
    • dimops namespace functions
  • Evolving: Function/method names and arguments are subject to change. New features are actively being developed.
    • stats namespace functions
  • Experimental: Nothing is remotely finalized.
    • parmat - all functions and methods

Internals are evolving and subject to change at basically any time. Notable changes will be mentioned in the changelog.

Philosophy and Similar Projects

Some similar C/C++ projects worth mentioning are Armadillo, Eigen, and PETSc.

These are all great libraries which have stood the test of time. Armadillo in particular is worth a look, as it has a very nice interface and a very extensive set of functions. However, to my knowledge, all of these focus exclusively on CPU computing. There are some extensions to Armadillo and Eigen for GPU computing, and for gemm-heavy codes you can use nvblas to offload some work to the GPU, but this doesn't always achieve good performance. And none of the above include distributed computing, except for PETSc, which focuses on sparse matrices.

There are probably many other C++ frameworks in this arena, but none to my knowledge have a similar scope to fml.

Probably the biggest influence on my thinking for this library is the pbdR package ecosystem for HPC with the R language, which I have worked on for many years now and with which fml shares many obvious parallels.

The basic philosophy of fml is:

  • Be relatively small and self-contained.
  • Follow general C++ conventions by default (like RAII and exceptions), but give the ability to break these for the sake of performance.
  • Changing a code from one object type to another should be very simple, ideally with no changes to the source (the internals will simply Do The Right Thing (tm)), with the exception of:
    • object creation
    • printing (e.g. printing on only one MPI rank)
  • Use a permissive open source license.