All Projects → netrasys → Pgann

netrasys / Pgann

Licence: mit
Fast Approximate Nearest Neighbor (ANN) searches with a PostgreSQL database.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Pgann

Pg stat monitor
PostgreSQL Statistics Collector
Stars: ✭ 145 (-7.05%)
Mutual labels:  postgres
Django Celery Docker Example
Example Docker setup for a Django app behind an Nginx proxy with Celery workers
Stars: ✭ 149 (-4.49%)
Mutual labels:  postgres
Cosmostore
F# Event store for Azure Cosmos DB, Table Storage, Postgres, LiteDB & ServiceStack
Stars: ✭ 154 (-1.28%)
Mutual labels:  postgres
Sqlcell
SQLCell is a magic function for the Jupyter Notebook that executes raw, parallel, parameterized SQL queries with the ability to accept Python values as parameters and assign output data to Python variables while concurrently running Python code. And *much* more.
Stars: ✭ 145 (-7.05%)
Mutual labels:  postgres
Node Express Postgresql Sequelize
Node.js, Express.js, Sequelize.js and PostgreSQL RESTful API
Stars: ✭ 148 (-5.13%)
Mutual labels:  postgres
Data Engineering Nanodegree
Projects done in the Data Engineering Nanodegree by Udacity.com
Stars: ✭ 151 (-3.21%)
Mutual labels:  postgres
Go Notify
An email automation solution, written in Golang.
Stars: ✭ 143 (-8.33%)
Mutual labels:  postgres
Atdatabases
TypeScript clients for databases that prevent SQL Injection
Stars: ✭ 154 (-1.28%)
Mutual labels:  postgres
Daemo
The Daemo crowdsourcing platform
Stars: ✭ 148 (-5.13%)
Mutual labels:  postgres
Quill
Compile-time Language Integrated Queries for Scala
Stars: ✭ 1,998 (+1180.77%)
Mutual labels:  postgres
Postgres Meta
A RESTful API for managing your Postgres. Fetch tables, add roles, and run queries
Stars: ✭ 146 (-6.41%)
Mutual labels:  postgres
Elephant Shed
PostgreSQL Management Appliance
Stars: ✭ 146 (-6.41%)
Mutual labels:  postgres
Grimoire
Database access layer for golang
Stars: ✭ 151 (-3.21%)
Mutual labels:  postgres
Nanopq
Pure python implementation of product quantization for nearest neighbor search
Stars: ✭ 145 (-7.05%)
Mutual labels:  nearest-neighbor-search
Data Science Stack Cookiecutter
🐳📊🤓Cookiecutter template to launch an awesome dockerized Data Science toolstack (incl. Jupyster, Superset, Postgres, Minio, AirFlow & API Star)
Stars: ✭ 153 (-1.92%)
Mutual labels:  postgres
Go Cache
This project encapsulates multiple db servers, redis、ledis、memcache、file、memory、nosql、postgresql
Stars: ✭ 143 (-8.33%)
Mutual labels:  postgres
Django Pgviews
Fork of django-postgres that focuses on maintaining and improving support for Postgres SQL Views.
Stars: ✭ 150 (-3.85%)
Mutual labels:  postgres
Cmd
Command line tools for database operation written by Go, moved to https://gitea.com/xorm/cmd
Stars: ✭ 154 (-1.28%)
Mutual labels:  postgres
Psql2csv
Run a query in psql and output the result as CSV.
Stars: ✭ 153 (-1.92%)
Mutual labels:  postgres
Safe Pg Migrations
Make your PostgreSQL migrations safe
Stars: ✭ 150 (-3.85%)
Mutual labels:  postgres

pgANN

Approximate Nearest Neighbor (ANN) searches using a PostgreSQL backend.

Background

Approximate Nearest Neighbor approaches are a powerful tool for various AI/ML tasks, however many existing tools (faiss,Annoy, Nearpy etc.) are "in memory"i.e. the vectors needed to be loaded into RAM, and then model trained therefrom. Furthermore, such models once trained could not be updated i.e. any CRUD operation necessitates a fresh training run.

The challenge for us was to:

  • hold extremely large datasets in memory and
  • support CRUDs which makes it challenging in an "online" environment where fresh data is continuously accumulated.

We are open-sourcing a simple, but effective approach that provides ANN searches using the powerful PostgreSQL database. At Netra we use this tool internally for managing & searching our image collections for further processing and/or feed into our Deep learning models. We consistently see sub-second response times on the order of a few million rows on a 32Gb/8 vcpu Ubuntu 16 box. We hope this is of use to the AI community.

Feedback and PRs very welcome!

Advantages

  • leverages postgres database queries & indexes (no additional tooling needed)
  • associated metadata is fetched alongwith the "neighbors" (i.e. fewer moving parts)
  • no re-training needed for CRUDs due to query time ranking ("online mode")
  • should scale with tried & tested database scaling techniques (partioning etc.)

Challenges

  • cube type doesn't seem to work for > 100 dimensions, so we need to perform dimensionality reduction. Example for dim. reduction included in the sample code
  • haven't tested with sparse vectors, but in theory should work decently with appropriate dimensionality reduction techniques
  • pgANN might not perform as accurately as some of the better known approaches, but you can use pgANN to fetch a subset of (say) a fw thousand and then rerank based on your favorite metric. Unfortunately, there are no easy wins in ANN approaches, hopefully pgANN gets you a "good enough" subset for your reranking.

Update Oct 2, 2019: There is a docker instance available here that claims "Postgres DB docker image with increased limit for CUBE datatype (set to 2048)". I haven't had a chance to try it, but this seems pretty useful (Credit due to @andrey-avdeev for this suggestion.)

Requirements

  • Postgres 10.x+ or higher (we haven't tested on PG 9.6+, but cube,GIST and distance operators are available on 9.6+, so it might work)
  • Cube extension from Postgres

Setup

  1. Make sure you are logged in as superuser into pg and run: create extension cube;

  2. We shall use the example of an images table to illustrate the approach, the images table stores the url, any metadata tags, vectors and the embeddings for the vectors in the table. You can of course modify table structure to your needs.

CREATE TABLE images(
   id serial PRIMARY KEY,
   image_url text UNIQUE NOT NULL,
   tags text,
   vectors double precision[],
   embeddings cube   
);
  1. Create a GIST index on the embeddings column which stores a 100-dimensional embedding of the original vector:

create index ix_gist on images using GIST (embeddings);

Note: you might need to create other indexes (b-tree etc.) on other fields for efficient searching & sorting, but that's outside our scope

Populating db

Now we are ready to populate the database with vectors and associated embeddings.

Note: we are using the dataset library for interfacing with postgres, but this should work just as well with your favorite driver (psycopg2 etc.)

db = dataset.connect('postgresql://user:[email protected]@localhost:5432/yourdb')
tbl = db['images']

for image in images:
   vect = get_image_vector(image) # <-- use imagenet or such model to generate a vector from one of the final layers
   emb_vect = get_embedding(vect)
   emb_string = "({0})".format(','.join("%10.8f" % x for x in emb_vect)) # <-- pg fails beyond 100 dimensions for cube, so reduce dimensionality
   row_dict["embeddings"] = emb_string
   row_dict["vectors"] = vect.tolist()
   row_id = tbl.insert(row_dict)

Querying

We can start querying the database even as population is in progress

    query_vector = [...]
    query_emb = get_embedding(query_vector)
    thresh = 0.25 # <-- making this larger will likely give more relevant results at the expense of time
	

    print ("[+] doing ANN search:")
    emb_string = "'({0})'".format(','.join("%10.8f" % x for x in query_emb))
    sql = "select id,url,tags from images where embeddings <-> cube({0}) < thresh order by embeddings <-> cube({0}) asc limit 10".format((emb_string))
    results = db.query(sql)
    for result in results:
      print(result['url'], result['tags'])
  

Note: pgsql cube extension supports multiple distance parameters. Here is a quick summary:

  • <-> Euclidean distance,
  • <=> Chebyshev (L-inf metric) distance, and
  • <#> Taxicab distance.

More details here.

Improving Performance

Here are some steps you can use to squeeze more performance out of pgANN:

  • Reduce dimensionality (using UMAP, t-SNEor <insert your favorite approach>) and using that as an embedding
  • Horizontal partitioning data across multiple tables and using parallelism to combine results
  • Use some hashing technique (LSH/MinHash for example) to create a common signature for each row and use it as a filter during your query (reducing the lookup space)
  • Try different distance operators (Euclidean, vs Chebyshev vs Taxicab),
  • Remove sorting from your query. e.g sql = "select id,embeddings from images where embeddings <-> cube({0}) <0.5".format((emb_string))

In these cases, you will need to fetch a significant N from the DB query and then re-rank based on your favorite similarity metric. Some combination of those might get you to some query times you can live with. Unfortunately ANN is largely ignored by the AI/DL community and there is significant research that needs to happen.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].