
jina-ai / executor-hnsw-postgres

Licence: other
A production-ready, scalable Indexer for the Jina neural search framework, based on HNSW and PSQL

Programming Languages

python
Dockerfile


🌟 HNSW + PostgreSQL Indexer

HNSWPostgresIndexer is a production-ready, scalable Indexer for the Jina neural search framework.

It combines the reliability of PostgreSQL with the speed and efficiency of the HNSWlib nearest neighbor library.

It thus provides all the CRUD operations expected of a database system, while also offering fast and reliable vector lookup.

The Indexer requires a running PostgreSQL database service. For quick testing, you can run a containerized instance locally with:

docker run -e POSTGRES_PASSWORD=123456 -p 127.0.0.1:5432:5432/tcp postgres:13.2

Syncing between PSQL and HNSW

By default, all data is stored in a PSQL database (as defined in the arguments). To build an HNSW index from this data, or to add new data to an existing index, you need to call the /sync endpoint manually. This iterates through the stored data and adds it to the HNSW index. By default, the sync is incremental: it builds on whatever data the HNSW index already holds. To rebuild the index completely, set the rebuild parameter, like so:

flow.post(on='/sync', parameters={'rebuild': True})
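
To make the difference between the two modes concrete, here is a toy model in plain Python. It is only a sketch: ToyStore, ToyIndex, and sync are hypothetical names, the "index" is an ordinary dict rather than HNSWlib, and tracking changes with a per-row version counter is an assumption, not necessarily how the executor decides what has changed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Stand-in for the PSQL table: doc id -> (vector, version)."""
    rows: dict = field(default_factory=dict)
    clock: int = 0

    def upsert(self, doc_id, vector):
        # Every write gets a new, strictly increasing version number.
        self.clock += 1
        self.rows[doc_id] = (vector, self.clock)


@dataclass
class ToyIndex:
    """Stand-in for the HNSW index: doc id -> vector."""
    vectors: dict = field(default_factory=dict)
    synced_up_to: int = 0


def sync(store, index, rebuild=False):
    """Copy rows from the store into the index; return how many were copied."""
    if rebuild:
        # Full rebuild: discard the index and start from scratch.
        index.vectors.clear()
        index.synced_up_to = 0
    copied = 0
    for doc_id, (vector, version) in store.rows.items():
        # Incremental mode only touches rows written since the last sync.
        if version > index.synced_up_to:
            index.vectors[doc_id] = vector
            copied += 1
    index.synced_up_to = store.clock
    return copied


store, index = ToyStore(), ToyIndex()
store.upsert("a", [0.1, 0.2])
store.upsert("b", [0.3, 0.4])
first = sync(store, index)                  # copies both rows
store.upsert("c", [0.5, 0.6])
incremental = sync(store, index)            # copies only "c"
rebuilt = sync(store, index, rebuild=True)  # copies all three again
print(first, incremental, rebuilt)          # prints: 2 1 3
```

The incremental call does less work because it skips rows the index has already seen; the rebuild pays the full cost again but leaves no stale entries behind.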

At start-up time, the data from PSQL is synced into HNSW automatically. You can disable this with:

from jina import Flow

Flow().add(
    uses='jinahub://HNSWPostgresIndexer',
    uses_with={'startup_sync': False}
)

Automatic background syncing

WARNING: Experimental feature

Optionally, you can enable automatic background syncing of the data into HNSW. This starts a background thread that performs the synchronization at regular intervals. Enable it with the sync_interval constructor argument, like so:

from jina import Flow

Flow().add(
    uses='jinahub://HNSWPostgresIndexer',
    uses_with={'sync_interval': 5}
)

The sync_interval argument accepts an integer: the number of seconds to wait between synchronization attempts. Adjust it to the volume of your data. For the duration of a background sync, the HNSW index is locked to avoid an invalid state, so searches are queued. The same applies during search operations: the index is locked and indexing is queued.
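
The locking behaviour described above can be sketched with the standard threading module. This is an illustration only: ToyIndexer and its methods are hypothetical, the lists stand in for PSQL and HNSW, and the executor's actual thread and lock handling may differ.

```python
import threading
import time


class ToyIndexer:
    """Hypothetical stand-in for the indexer; not the executor's real class."""

    def __init__(self, sync_interval=None):
        self._lock = threading.Lock()   # shared by add, search, and sync
        self._stop = threading.Event()
        self.pending = []               # stands in for unsynced PSQL rows
        self.indexed = []               # stands in for the HNSW index
        self._thread = None
        if sync_interval is not None:
            self._thread = threading.Thread(
                target=self._sync_loop, args=(sync_interval,), daemon=True
            )
            self._thread.start()

    def _sync_loop(self, interval):
        # Event.wait returns False on timeout and True once close() sets the event.
        while not self._stop.wait(interval):
            self.sync()

    def add(self, item):
        with self._lock:                # protects the toy's pending list
            self.pending.append(item)

    def sync(self):
        with self._lock:                # searches wait until the sync is done
            self.indexed.extend(self.pending)
            self.pending.clear()

    def search(self, item):
        with self._lock:                # syncs wait until the search is done
            return item in self.indexed

    def close(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()


indexer = ToyIndexer(sync_interval=0.05)
indexer.add("doc-1")
time.sleep(0.5)                         # give the background thread time to run
found = indexer.search("doc-1")
indexer.close()
print(found)
```

Because a single lock guards both paths, a long sync delays searches and a burst of searches delays the sync, which is why the interval is worth tuning to the amount of data written between runs.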

CRUD operations

You can perform all the usual operations on the respective endpoints:

  • /index. Add new data to PostgreSQL.
  • /search. Query the HNSW index with your Documents.
  • /update. Update Documents in PostgreSQL.
  • /delete. Delete Documents in PostgreSQL.

Note. By default, /delete only performs a soft deletion, so that looking up a Document by id after a search does not break. For a hard delete, add 'soft_delete': False to the parameters of the delete request. You can also perform a cleanup after a full rebuild of the HNSW index by calling /cleanup.
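
The difference between a soft delete, a hard delete, and a later cleanup can be sketched with an in-memory SQLite table. Everything here is an assumption made for illustration: the executor uses PostgreSQL, and the table name, the is_deleted column, and the helper functions are hypothetical rather than its real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs ("
    "doc_id TEXT PRIMARY KEY, body TEXT, is_deleted INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO docs (doc_id, body) VALUES (?, ?)",
    [("a", "first"), ("b", "second"), ("c", "third")],
)


def delete(conn, doc_id, soft_delete=True):
    if soft_delete:
        # Soft delete: keep the row so id look-ups still resolve, but flag it.
        conn.execute("UPDATE docs SET is_deleted = 1 WHERE doc_id = ?", (doc_id,))
    else:
        # Hard delete: remove the row immediately.
        conn.execute("DELETE FROM docs WHERE doc_id = ?", (doc_id,))


def cleanup(conn):
    """Physically remove every row that was previously soft-deleted."""
    conn.execute("DELETE FROM docs WHERE is_deleted = 1")


def count(conn, include_deleted=True):
    query = "SELECT COUNT(*) FROM docs"
    if not include_deleted:
        query += " WHERE is_deleted = 0"
    return conn.execute(query).fetchone()[0]


delete(conn, "a")                          # soft: row stays, flagged
delete(conn, "b", soft_delete=False)       # hard: row is gone
total_before = count(conn)                 # "a" (flagged) and "c"
live = count(conn, include_deleted=False)  # only "c"
cleanup(conn)
total_after = count(conn)                  # only "c"
print(total_before, live, total_after)     # prints: 2 1 1
```

Until the cleanup runs, the flagged row still counts towards the table size, which matches the way soft-deleted entries are included in the stored-Document count.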

Status endpoint

You can also get information about the status of your data via the /status endpoint. This returns a dict whose tags contain the relevant information. The information can be accessed via the following keys in the parameters.__results__ of a full flow response:

  • 'psql_docs': the number of Documents stored in the PSQL database (includes entries that have been "soft-deleted")
  • 'hnsw_docs': the number of Documents indexed in the HNSW index
  • 'last_sync': the time of the last synchronization of PSQL into HNSW
  • 'pea_id': the shard number

In a sharded environment (parallel>1), you will get one dict from each shard. Each shard has its own 'hnsw_docs', 'last_sync', and 'pea_id', but they all report the same 'psql_docs', because the PSQL database is shared by all your shards. To get the total number of indexed Documents, sum 'hnsw_docs' across these dictionaries, like so:

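# Assumes `flow` is a running Flow that includes the indexer.
results = flow.post('/status', None, return_responses=True)
status_results = results[0].parameters["__results__"]
# Sum the per-shard counts to get the total number of indexed Documents.
total_hnsw_docs = sum(v['hnsw_docs'] for v in status_results.values())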
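
The same aggregation can be tried without a running Flow. The payload below is made up for illustration: the shard keys, the counts, and the ISO-formatted 'last_sync' strings are assumptions about the shape of the response, and summarize is a hypothetical helper.

```python
from datetime import datetime

# Hypothetical payload shaped like parameters["__results__"]: one dict per shard.
status_results = {
    "shard-0": {"psql_docs": 1000, "hnsw_docs": 480,
                "last_sync": "2022-01-01T10:00:00", "pea_id": 0},
    "shard-1": {"psql_docs": 1000, "hnsw_docs": 500,
                "last_sync": "2022-01-01T10:00:05", "pea_id": 1},
}


def summarize(status_results):
    shards = list(status_results.values())
    total_hnsw = sum(s["hnsw_docs"] for s in shards)
    # Every shard reports the same PSQL count, so any one of them will do.
    psql_docs = shards[0]["psql_docs"]
    # The least recently synced shard bounds how fresh the whole index is.
    oldest_sync = min(datetime.fromisoformat(s["last_sync"]) for s in shards)
    return {
        "psql_docs": psql_docs,
        "hnsw_docs": total_hnsw,
        # Rough gap only: psql_docs also counts soft-deleted entries.
        "gap": psql_docs - total_hnsw,
        "oldest_sync": oldest_sync,
    }


summary = summarize(status_results)
print(summary["hnsw_docs"], summary["gap"])  # prints: 980 20
```

A non-zero gap does not necessarily mean a sync is overdue, since soft-deleted entries are counted on the PSQL side.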