All Projects → dblalock → Bolt

dblalock / Bolt

Licence: mpl-2.0
Fast approximate vector operations

Projects that are alternatives of or similar to Bolt

Pgm Index
🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
Stars: ✭ 499 (+612.86%)
Mutual labels:  database, compression
Game Datasets
🎮 A curated list of awesome game datasets, and tools to artificial intelligence in games
Stars: ✭ 261 (+272.86%)
Mutual labels:  data-mining, database
Pygm
🐍 Python library implementing sorted containers with state-of-the-art query performance and compressed memory usage
Stars: ✭ 156 (+122.86%)
Mutual labels:  database, compression
Linkedingiveaway
👨🏽‍🏫You can learn about anything over here. What Giveaways I do and why it's important in today's modern world. Are you interested in Giveaway's?🔋
Stars: ✭ 67 (-4.29%)
Mutual labels:  data-mining, database
Zabbixdba
Zabbix Database Monitoring Service (Oracle, Pg, MySQL, MS SQL, DB2, etc.)
Stars: ✭ 68 (-2.86%)
Mutual labels:  database
Tlddatabase
[DEPRECATED] Abstraction layer for Public Suffix List in PHP
Stars: ✭ 67 (-4.29%)
Mutual labels:  database
Gorse
An open source recommender system service written in Go
Stars: ✭ 1,148 (+1540%)
Mutual labels:  data-mining
Oblecto
Oblecto is a media server, which streams media you already own, and is designed to be at the heart of your entertainment experience. It runs on your home server to index and analyze your media such as Movies and TV Shows and presents them in an interface tailored for your media consupmtion needs.
Stars: ✭ 67 (-4.29%)
Mutual labels:  database
Keydb
high performance key value database written in Go
Stars: ✭ 70 (+0%)
Mutual labels:  database
Ffbe
Datamining for FFBE GL
Stars: ✭ 69 (-1.43%)
Mutual labels:  data-mining
Directus Docker
Directus 6 Docker — Legacy Container [EOL]
Stars: ✭ 68 (-2.86%)
Mutual labels:  database
Dbmigrations
A library for the creation, management, and installation of schema updates for relational databases.
Stars: ✭ 67 (-4.29%)
Mutual labels:  database
Awesome Business Intelligence
Actively curated list of awesome BI tools. PRs welcome!
Stars: ✭ 1,157 (+1552.86%)
Mutual labels:  database
Books
Awesome Books
Stars: ✭ 66 (-5.71%)
Mutual labels:  database
Node Blog
🔥✨ A react blog project base on nodejs, nestjs, mongoose, typescript, react, ant-design,nextjs
Stars: ✭ 69 (-1.43%)
Mutual labels:  database
Covenantsql
A decentralized, trusted, high performance, SQL database with blockchain features
Stars: ✭ 1,148 (+1540%)
Mutual labels:  database
Dbx
A neat codegen-based database wrapper written in Go
Stars: ✭ 68 (-2.86%)
Mutual labels:  database
Scan
Scan database/sql rows directly to structs, slices, and primitive types
Stars: ✭ 69 (-1.43%)
Mutual labels:  database
Etl with python
ETL with Python - Taught at DWH course 2017 (TAU)
Stars: ✭ 68 (-2.86%)
Mutual labels:  database
Numcompress
Python package to compress numerical series & numpy arrays into strings
Stars: ✭ 68 (-2.86%)
Mutual labels:  compression

Bolt

Bolt is an algorithm for compressing vectors of real-valued data and running mathematical operations directly on the compressed representations.

If you have a large collection of mostly-dense vectors and can tolerate lossy compression, Bolt can probably save you 10-200x space and compute time.

Bolt also has theoretical guarantees bounding the errors in its approximations.

Installing

Python

  $ brew install swig  # for wrapping C++; use apt-get, yum, etc, if not OS X
  $ pip install numpy  # bolt installation needs numpy already present
  $ git clone https://github.com/dblalock/bolt.git
  $ cd bolt && python setup.py install
  $ pytest tests/  # optionally, run the tests

If you run into any problems, please don't hesitate to mention it in the Python build problems issue.

C++

Install Bazel, Google's open-source build system. Then

  $ git clone https://github.com/dblalock/bolt.git
  $ cd bolt/cpp && bazel run :main

The bazel run command will build the project and run the tests and benchmarks.

If you want to integrate Bolt with another C++ project, include cpp/src/include/public.hpp and add the remaining files under cpp/src to your builds. You should let me know if you're interested in doing such an integration because I'm hoping to see Bolt become part of many libraries and thus would be happy to help you.

Notes

Bolt currently only supports machines with AVX2 instructions, which basically means x86 machines from fall 2013 or later. Contributions for ARM support are welcome. Also note that the Bolt Python wrapper is currently configured to require Clang, since GCC apparently runs into issues.

How does it work?

Bolt is based on vector quantization. For details, see the Bolt paper or slides.

Benchmarks

Bolt includes a thorough set of speed and accuracy benchmarks. See the experiments/ directory. This is also what you want if you want to reproduce the results in the paper.

Note that all of the timing results use the raw C++ implementation. At present, the Python wrapper is slightly slower due to Python overhead. If you're interested in having a full-speed wrapper, let me know and I'll allocate time to making this happen.

Basic usage

X, queries = some N x D array, some iterable of length D arrays

# these are approximately equal (though the latter are shifted and scaled)
enc = bolt.Encoder(reduction='dot').fit(X)
[np.dot(X, q) for q in queries]
[enc.transform(q) for q in queries]

# same for these
enc = bolt.Encoder(reduction='l2').fit(X)
[np.sum((X - q) * (X - q), axis=1) for q in queries]
[enc.transform(q) for q in queries]

# but enc.transform() is 10x faster or more

Example: Matrix-vector multiplies

import bolt
import numpy as np
from scipy.stats import pearsonr as corr
from sklearn.datasets import load_digits
import timeit

# for simplicity, use the sklearn digits dataset; we'll split
# it into a matrix X and a set of queries Q
X, _ = load_digits(return_X_y=True)
nqueries = 20
X, Q = X[:-nqueries], X[-nqueries:]

enc = bolt.Encoder(reduction='dot', accuracy='lowest') # can tweak acc vs speed
enc.fit(X)

dot_corrs = np.empty(nqueries)
for i, q in enumerate(Q):
    dots_true = np.dot(X, q)
    dots_bolt = enc.transform(q)
    dot_corrs[i] = corr(dots_true, dots_bolt)[0]

# dot products closely preserved despite compression
print "dot product correlation: {} +/- {}".format(
    np.mean(dot_corrs), np.std(dot_corrs))  # > .97

# massive space savings
print(X.nbytes)  # 1777 rows * 64 cols * 8B = 909KB
print(enc.nbytes)  # 1777 * 2B = 3.55KB

# massive time savings (~10x here, but often >100x on larger
# datasets with less Python overhead; see the paper)
t_np = timeit.Timer(
    lambda: [np.dot(X, q) for q in Q]).timeit(5)        # ~9ms
t_bolt = timeit.Timer(
    lambda: [enc.transform(q) for q in Q]).timeit(5)    # ~800us
print "Numpy / BLAS time, Bolt time: {:.3f}ms, {:.3f}ms".format(
    t_np * 1000, t_bolt * 1000)

# can get output without offset/scaling if needed
dots_bolt = [enc.transform(q, unquantize=True) for q in Q]

Example: K-Nearest Neighbor / Maximum Inner Product Search

# search using squared Euclidean distances
# (still using the Digits dataset from above)
enc = bolt.Encoder('l2', accuracy='high').fit(X)
bolt_knn = [enc.knn(q, k_bolt) for q in Q]  # knn for each query

# search using dot product (maximum inner product search)
enc = bolt.Encoder('dot', accuracy='medium').fit(X)
bolt_knn = [enc.knn(q, k_bolt) for q in Q]  # knn for each query

Miscellaneous

Bolt stands for "Based On Lookup Tables". Feel free to use this exciting fact at parties.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].