
facebookresearch / Fbtt Embedding

License: MIT
This is a Tensor Train (TT) based compression library for the sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing. We showed that this library can reduce the total model size of Facebook's open-sourced DLRM model by up to 100x while achieving the same model quality. Our implementation is faster than state-of-the-art implementations. Existing state-of-the-art libraries also decompress the whole embedding tables on the fly, so they provide no memory reduction during training. Our library decompresses only the requested rows and can therefore reduce the memory footprint per embedding table by up to 10,000x. The library also includes a software cache that stores a portion of the table entries in decompressed format for faster lookup and processing.

Labels

cuda

Projects that are alternatives of or similar to Fbtt Embedding

2016 super resolution
ICCV2015 Image Super-Resolution Using Deep Convolutional Networks
Stars: ✭ 78 (-15.22%)
Mutual labels:  cuda
Minhashcuda
Weighted MinHash implementation on CUDA (multi-gpu).
Stars: ✭ 88 (-4.35%)
Mutual labels:  cuda
Aurora
Minimal Deep Learning library is written in Python/Cython/C++ and Numpy/CUDA/cuDNN.
Stars: ✭ 90 (-2.17%)
Mutual labels:  cuda
Modulated Deform Conv
deformable convolution 2D 3D DeformableConvolution DeformConv Modulated Pytorch CUDA
Stars: ✭ 81 (-11.96%)
Mutual labels:  cuda
Python Opencv Cuda
custom opencv_contrib module which exposes opencv cuda optical flow methods with python bindings
Stars: ✭ 86 (-6.52%)
Mutual labels:  cuda
Deep Learning With Cats
Deep learning with cats (^._.^)
Stars: ✭ 1,290 (+1302.17%)
Mutual labels:  cuda
Hiop
HPC solver for nonlinear optimization problems
Stars: ✭ 75 (-18.48%)
Mutual labels:  cuda
Tutorial Ubuntu 18.04 Install Nvidia Driver And Cuda And Cudnn And Build Tensorflow For Gpu
Ubuntu 18.04 How to install Nvidia driver + CUDA + CUDNN + build tensorflow for gpu step by step command line
Stars: ✭ 91 (-1.09%)
Mutual labels:  cuda
Deep Learning Boot Camp
A community run, 5-day PyTorch Deep Learning Bootcamp
Stars: ✭ 1,270 (+1280.43%)
Mutual labels:  cuda
Deeppipe2
Deep Learning library using GPU(CUDA/cuBLAS)
Stars: ✭ 90 (-2.17%)
Mutual labels:  cuda
Pytorch Emdloss
PyTorch 1.0 implementation of the approximate Earth Mover's Distance
Stars: ✭ 82 (-10.87%)
Mutual labels:  cuda
Knn cuda
pytorch knn [cuda version]
Stars: ✭ 86 (-6.52%)
Mutual labels:  cuda
Weighted softmax loss
Weighted Softmax Loss Layer for Caffe
Stars: ✭ 89 (-3.26%)
Mutual labels:  cuda
Nnabla Ext Cuda
A CUDA Extension of Neural Network Libraries
Stars: ✭ 79 (-14.13%)
Mutual labels:  cuda
Matconvnet
MatConvNet: CNNs for MATLAB
Stars: ✭ 1,299 (+1311.96%)
Mutual labels:  cuda
Cuda Design Patterns
Some CUDA design patterns and a bit of template magic for CUDA
Stars: ✭ 78 (-15.22%)
Mutual labels:  cuda
Thundersvm
ThunderSVM: A Fast SVM Library on GPUs and CPUs
Stars: ✭ 1,282 (+1293.48%)
Mutual labels:  cuda
Numer
Numeric Erlang - vector and matrix operations with CUDA. Heavily inspired by Pteracuda - https://github.com/kevsmith/pteracuda
Stars: ✭ 91 (-1.09%)
Mutual labels:  cuda
Elasticfusion
Real-time dense visual SLAM system
Stars: ✭ 1,298 (+1310.87%)
Mutual labels:  cuda
Halloc
A fast and highly scalable GPU dynamic memory allocator
Stars: ✭ 89 (-3.26%)
Mutual labels:  cuda

FBTT-Embedding

The FBTT-Embedding library compresses the sparse embedding tables commonly used in machine learning models such as recommendation and natural language processing. It can be used as a drop-in replacement for PyTorch's EmbeddingBag: it provides the same forward and backward propagation functionality, the only difference being compression. In addition, our implementation includes a software cache that stores a portion of the embedding table entries (or "bags") in decompressed format for faster lookup and processing, removing the need to decompress and recompress an entry every time it is accessed during training or inference.

Read more in "TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models", accepted to appear in the Proceedings of the Conference on Machine Learning and Systems (MLSys 2021).

Installing FBTT-Embedding
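
The project builds as a PyTorch C++/CUDA extension; assuming the standard workflow for such extensions, installation is:

python setup.py install

A quick run of the included benchmark should then print statistics similar to: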

INFO:root:sparse: True, optimizer: sgd
INFO:root:p_shapes: [200, 220, 250], q_shapes: [4, 4, 4], ranks: [32, 32]
INFO:root:B: 512, E: 11000000, D: 64, nnz: 10240
INFO:root:TTEmbeddingBag FWD-BWD time/nnz: 0.416 usecs, GFLOPS: 2657.631, BW: 18.456
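
Here B is the batch size, E the number of rows in the uncompressed embedding table (the product of p_shapes: 200 x 220 x 250 = 11,000,000), D the embedding dimension, and nnz the total number of indices looked up per batch.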

How FBTT-Embedding works

Parameters

  • num_embeddings (int) — size of the dictionary of embeddings.
  • embedding_dim (int) — the length of each embedding vector.
  • tt_ranks (List[int]) — the ranks of the TT cores.
  • tt_p_shapes (Optional[List[int]]) — the factorization of num_embeddings; the product of all elements must be no smaller than num_embeddings (see the sketch after this list).
  • tt_q_shapes (Optional[List[int]]) — the factorization of embedding_dim; the product of all elements must equal embedding_dim.
  • sparse (bool) — if True, the weight update is fused with the gradient computation, and no gradients are returned by backward propagation. Otherwise, the gradients w.r.t. the TT cores or cache are returned to an external optimizer.
  • optimizer (OptimType) — the type of optimizer used by the fused kernel.
  • learning_rate (float) — learning rate of the optimizer.
  • eps (float) — term added to the denominator to improve numerical stability (Adagrad only).
  • use_cache (bool) — if True, a software cache is used to store the most frequently accessed embedding vectors.
  • cache_size (int) — the maximum number of embedding vectors stored in the cache.
  • hashtbl_size (int) — the maximum number of entries in the hash table used for frequency counting.
  • weight_dist (str) — one of "uniform", "normal", "approx-uniform", or "approx-normal". With "uniform" or "normal", the weights of the TT cores are drawn i.i.d. from the specified distribution. With "approx-uniform" or "approx-normal", the TT cores are initialized so that the entries of the full embedding table follow the specified distribution.
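
A quick sanity check of the two factorization constraints, in plain Python (the values are the ones used in the Initialization example below):

import math

num_embeddings, embedding_dim = 1000000, 64
tt_p_shapes, tt_q_shapes = [120, 90, 110], [4, 4, 4]

# product of tt_p_shapes must be no smaller than num_embeddings
assert math.prod(tt_p_shapes) >= num_embeddings  # 1,188,000 >= 1,000,000
# product of tt_q_shapes must equal embedding_dim
assert math.prod(tt_q_shapes) == embedding_dim   # 4 * 4 * 4 == 64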

Initialization

The initialization of TT-Emb is similar to PyTorch's EmbeddingBag:

tt_emb = TTEmbeddingBag(
        num_embeddings=1000000,
        embedding_dim=64,
        tt_p_shapes=[120, 90, 110],
        tt_q_shapes=[4, 4, 4],
        tt_ranks=[12, 14],
        sparse=False,
        use_cache=False,
        weight_dist="uniform"
    )

This method generates TT cores representing an embedding table of size 1000000 x 64 (num_embeddings x embedding_dim), where TT core i has size ranks[i] x tt_p_shapes[i] x tt_q_shapes[i] x ranks[i+1], with ranks = [1] + tt_ranks + [1]. In this case, the shapes of the 3 TT cores are 1 x 120 x 4 x 12, 12 x 90 x 4 x 14, and 14 x 110 x 4 x 1. When tt_p_shapes and tt_q_shapes are specified, the product of tt_p_shapes must be no smaller than num_embeddings, and the product of tt_q_shapes must equal embedding_dim. When these two parameters are passed as None, TTEmbeddingBag factorizes num_embeddings and embedding_dim automatically.
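
The parameter savings can be verified with a few lines of plain Python (a back-of-the-envelope check using the shapes above, not part of the library API):

tt_ranks = [12, 14]
tt_p_shapes, tt_q_shapes = [120, 90, 110], [4, 4, 4]
ranks = [1] + tt_ranks + [1]

# ranks[i] x tt_p_shapes[i] x tt_q_shapes[i] x ranks[i+1] parameters per core
tt_params = sum(
    ranks[i] * tt_p_shapes[i] * tt_q_shapes[i] * ranks[i + 1]
    for i in range(3)
)  # 5,760 + 60,480 + 6,160 = 72,400
full_params = 1000000 * 64  # 64,000,000 for the uncompressed table
print(full_params / tt_params)  # roughly 884x fewer parameters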

>>> # an EmbeddingBag module containing 10 tensors of size 3
>>> embedding_sum = TTEmbeddingBag(10, 3, None, None, tt_ranks=[2, 2], sparse=False, use_cache=False)
>>> # a batch of 2 samples of 4 indices each
>>> input = torch.LongTensor([1, 2, 4, 5, 4, 3, 2, 9])
>>> offsets = torch.LongTensor([0, 4])
>>> embedding_sum(input, offsets)
tensor([[-0.8861, -5.4350, -0.0523],
        [ 1.1306, -2.5798, -1.0044]])
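
As with PyTorch's EmbeddingBag, each entry of offsets marks where a bag starts in input: the two bags here are [1, 2, 4, 5] and [4, 3, 2, 9], and each output row pools the corresponding embedding vectors (a sum, as the name embedding_sum suggests).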

Fused Kernel

TT-Emb supports fused gradient computation and weight update for better efficiency: the weights of the embedding tables are updated as part of backward propagation. When the fused kernel is used, gradients are no longer returned to an external optimizer. To enable the fused kernel, specify sparse=True and pass the corresponding optimizer type and parameters to TTEmbeddingBag. For example,

tt_emb = TTEmbeddingBag(
        num_embeddings=1000000,
        embedding_dim=64,
        tt_p_shapes=[120, 90, 110],
        tt_q_shapes=[4, 4, 4],
        tt_ranks=[12, 14],
        sparse=True,
        optimizer=OptimType.SGD,
        learning_rate=0.05,
        eps=1.0e-10,  # for Adagrad only
        use_cache=False,
    )
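
With sparse=True, a training step needs no optimizer.step() for the TT cores, since the update is applied during backward propagation. A minimal sketch (the indices, offsets, and loss are placeholders; tt_emb is the module constructed above):

indices = torch.LongTensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.LongTensor([0, 4])

output = tt_emb(indices, offsets)  # forward pass decompresses only the requested rows
loss = output.sum()                # placeholder loss
loss.backward()                    # fused kernel: TT cores are updated during backward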

Software Cache

Embedding lookup in TT-Rec requires explicitly computing the embedding vectors from the TT cores via two consecutive matrix multiplications (GEMMs). Similarly, in backward propagation the gradient of each tensor core is computed through a chain of matrix multiplications in reverse order. To reduce this computation during training, TT-Emb implements a software cache that stores an uncompressed copy of the most frequently queried embedding vectors; when such vectors are queried, they are loaded directly from the cache without computation. The cache size can be chosen per hardware platform and dataset so that the cached rows capture as many embedding lookups as possible while minimizing the memory requirement during training. We implemented a 32-way set-associative Least-Frequently-Used (LFU) cache with an open-addressing hash table for frequency counting. To enable the cache for TT-Emb, specify use_cache=True and set cache_size (the maximum number of embedding vectors stored in the cache) and hashtbl_size (the maximum hash table size):

tt_emb = TTEmbeddingBag(
        num_embeddings=1000000,
        embedding_dim=64,
        tt_p_shapes=[120, 90, 110],
        tt_q_shapes=[4, 4, 4],
        tt_ranks=[12, 14],
        sparse=True,
        optimizer=OptimType.SGD,
        learning_rate=0.05,
        eps=1.0e-10,  # for Adagrad only
        use_cache=True,
        cache_size=1000,
        hashtbl_size=1000,
    )

During forward propagation, the access frequencies of the embedding vectors are updated, but the cache itself is refreshed only when tt_emb.cache_populate() is called. The cached rows are chosen by access frequency, and their values are initialized from the TT cores.
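
For example, the cache could be rebuilt once per epoch (a minimal sketch; train_one_epoch is a hypothetical stand-in for the actual training loop):

for epoch in range(num_epochs):
    train_one_epoch(model, data_loader)  # forward passes update the access frequency counts
    tt_emb.cache_populate()              # rebuild the LFU cache; rows are initialized from the TT cores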

License

FBTT-Embedding is MIT licensed, as found in the LICENSE file.
