
jina-ai / docarray

License: Apache-2.0
The data structure for unstructured data


Projects that are alternatives of or similar to docarray

nimpb
Protocol Buffers for Nim
Stars: ✭ 29 (-94.83%)
Mutual labels:  protobuf
sqlite-dotnet-core
.NET Core 2.1 Console Application using SQLite with Entity Framework and Dependency Injection
Stars: ✭ 17 (-96.97%)
Mutual labels:  sqlite
italy
Free open public domain football data (football.db) for Italy / Europe - Serie A etc.
Stars: ✭ 35 (-93.76%)
Mutual labels:  sqlite
fieldmask-utils
Protobuf Field Mask Go utils
Stars: ✭ 127 (-77.36%)
Mutual labels:  protobuf
ocaml-protoc-plugin
ocaml-protoc-plugin
Stars: ✭ 36 (-93.58%)
Mutual labels:  protobuf
pb3-gen-sol
Generate solidity decoders from proto3 files
Stars: ✭ 15 (-97.33%)
Mutual labels:  protobuf
Diverse-Structure-Inpainting
CVPR 2021: "Generating Diverse Structure for Image Inpainting With Hierarchical VQ-VAE"
Stars: ✭ 131 (-76.65%)
Mutual labels:  multimodal
ppx deriving protobuf
A Protocol Buffers codec generator for OCaml
Stars: ✭ 76 (-86.45%)
Mutual labels:  protobuf
yaramanager
Simple yara rule manager
Stars: ✭ 60 (-89.3%)
Mutual labels:  sqlite
scalapb-playjson
JSON/Protobuf converters for ScalaPB using play-json
Stars: ✭ 15 (-97.33%)
Mutual labels:  protobuf
rails-microservices-book
A guide to building distributed Ruby on Rails applications using Protocol Buffers, NATS and RabbitMQ
Stars: ✭ 23 (-95.9%)
Mutual labels:  protobuf
aiosqlite3
sqlite3 on asyncio, using a loop.run_in_executor proxy
Stars: ✭ 21 (-96.26%)
Mutual labels:  sqlite
graphgrove
A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search
Stars: ✭ 29 (-94.83%)
Mutual labels:  nearest-neighbor-search
librdf.sqlite
♊️ Mirror of https://code.mro.name/mro/librdf.sqlite | 🛠 improved SQLite RDF triple store for Redland librdf
Stars: ✭ 21 (-96.26%)
Mutual labels:  sqlite
protoc-gen-micro
Protobuf code generation
Stars: ✭ 287 (-48.84%)
Mutual labels:  protobuf
vtprotobuf
A Protocol Buffers compiler that generates optimized marshaling & unmarshaling Go code for ProtoBuf APIv2
Stars: ✭ 418 (-25.49%)
Mutual labels:  protobuf
finetuner
Finetuning any DNN for better embedding on neural search tasks
Stars: ✭ 442 (-21.21%)
Mutual labels:  neural-search
room-populate-demo
Room database pre-population demo
Stars: ✭ 17 (-96.97%)
Mutual labels:  sqlite
mdb2sqlite
Conversion tool for converting Microsoft Access databases to SQLite.
Stars: ✭ 79 (-85.92%)
Mutual labels:  sqlite
tsrpc
A TypeScript RPC framework, with runtime type checking and serialization, support both HTTP and WebSocket. It is very suitable for website / APP / games, and absolutely comfortable to full-stack TypeScript developers.
Stars: ✭ 866 (+54.37%)
Mutual labels:  protobuf

DocArray logo: The data structure for unstructured data

Supports Python 3.7, 3.8, 3.9 and 3.10 · Available on PyPI

DocArray is a library for nested, unstructured data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer the data with a Pythonic API.

🌌 Rich data types: super-expressive data structure for representing complicated/mixed/nested text, image, video, audio, 3D mesh data.

🐍 Pythonic experience: designed to be as easy as a Python list. If you know how to Python, you know how to DocArray. Intuitive idioms and type annotation simplify the code you write.

🧑‍🔬 Data science powerhouse: greatly accelerate data scientists' work on embedding, matching, visualizing, evaluating via Torch/TensorFlow/ONNX/PaddlePaddle on CPU/GPU.

🚡 Data in transit: optimized for network communication, ready to wire at any time with fast and compressed serialization in Protobuf, bytes, base64, JSON, CSV, DataFrame (see the sketch after this list).

🎡 Scale to big data: handle out-of-memory data via an on-disk document store while keeping the exact same API experience. Support for classic databases and vector databases enables faster nearest neighbour search.

👒 For modern apps: GraphQL support makes your server versatile on request and response; built-in data validation and JSON Schema (OpenAPI) help you build reliable webservices.

🛸 Integrate with IDE: pretty-print and visualization on Jupyter notebook & Google Colab; comprehensive auto-complete and type hint in PyCharm & VS Code.
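
For example, here is a minimal sketch of that wire-readiness using the bytes and base64 round-trips (method names follow docarray v0.x; verify against your installed version):

from docarray import Document, DocumentArray

da = DocumentArray([Document(text='hello'), Document(text='world')])

raw = da.to_bytes()                   # compact binary, ready for the wire
b64 = da.to_base64()                  # string-safe encoding for JSON/HTTP payloads

da2 = DocumentArray.from_bytes(raw)   # reconstruct on the receiving side
assert da2[0].text == 'hello'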

Read more on why you should use DocArray and how it compares to alternatives.

Install

Requires Python 3.7+ and numpy only:

pip install docarray

or via Conda:

conda install -c conda-forge docarray

Commonly used features can be enabled via pip install "docarray[common]".

Documentation

Get Started

DocArray consists of two simple concepts:

  • Document: a data structure for easily representing nested, unstructured data.
  • DocumentArray: a container for efficiently accessing, manipulating, and understanding multiple Documents.
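
A minimal sketch of the two concepts, using only constructors that appear later in this README:

from docarray import Document, DocumentArray

d1 = Document(text='hello, world')
d2 = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt')  # a Document can also point at remote content

da = DocumentArray([d1, d2])  # behaves like a Python list of Documents
print(len(da), da[0].text)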

Text matching in 10 lines

Let's search for the top-5 sentences most similar to "she smiled too much" in Pride and Prejudice.

from docarray import Document, DocumentArray

d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').load_uri_to_text()
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())
da.apply(lambda d: d.embed_feature_hashing())

q = (Document(text='she smiled too much')
     .embed_feature_hashing()
     .match(da, metric='jaccard', use_scipy=True))

print(q.matches[:5, ('text', 'scores__jaccard__value')])
[['but she smiled too much.', 
  '_little_, she might have fancied too _much_.', 
  'She perfectly remembered everything that had passed in', 
  'tolerably detached tone. While she spoke, an involuntary glance', 
  'much as she chooses.”'], 
  [0.3333333333333333, 0.6666666666666666, 0.7, 0.7272727272727273, 0.75]]

Here the embedding is done by simple feature hashing and the distance metric is Jaccard distance. You have better embeddings? Of course you do! We look forward to seeing your results!

A complete workflow of visual search

Let's use DocArray and the Totally Looks Like dataset to build a simple meme image search. The dataset contains 6,016 image pairs stored in /left and /right. Images that share the same filename are perceptually similar. For example:

[Example image pairs: left/00018.jpg ↔ right/00018.jpg and left/00131.jpg ↔ right/00131.jpg]

Our problem: given an image from /left, can we find its most similar image in /right (without looking at the filename, of course)?

Load images

First we load the images. You can download the dataset from the Totally Looks Like website, unzip it, and load the images as below:

from docarray import DocumentArray

left_da = DocumentArray.from_files('left/*.jpg')

Or you can simply pull it from Jina Cloud:

left_da = DocumentArray.pull('demo-leftda', show_progress=True)

A progress bar indicates the download progress.

To get a feel for the data you will handle, plot it in one sprite image:

left_da.plot_image_sprites()

[Sprite image: the Totally Looks Like dataset loaded with the DocArray API]

Apply preprocessing

Let's do some standard computer vision pre-processing:

from docarray import Document

def preproc(d: Document):
    return (d.load_uri_to_image_tensor()  # load
             .set_image_tensor_normalization()  # normalize color 
             .set_image_tensor_channel_axis(-1, 0))  # switch color axis for the PyTorch model later

left_da.apply(preproc)

Did I mention apply works in parallel?
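
Here is a sketch of what that looks like; backend and num_worker follow docarray v0.x's parallelization API, so treat them as assumptions if your version differs:

# Run preproc across the whole array with a process pool instead of sequentially.
left_da.apply(preproc, backend='process', num_worker=4)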

Embed images

Now convert images into embeddings using a pretrained ResNet50:

import torchvision
model = torchvision.models.resnet50(pretrained=True)  # load ResNet50
left_da.embed(model, device='cuda')  # embed via GPU to speed up

This step takes ~30 seconds on GPU. Besides PyTorch, you can also use TensorFlow, PaddlePaddle, or ONNX models in .embed(...).
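
For instance, a Keras model can drop in the same way. A sketch with caveats: batch_size as an .embed(...) parameter is a docarray v0.x assumption, and Keras models expect channel-last input, so you would skip the channel-axis switch in preproc:

import tensorflow as tf

# Any model that maps images to vectors works; pooling='avg' yields one vector per image.
tf_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, pooling='avg')
left_da.embed(tf_model, batch_size=128)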

Visualize embeddings

You can visualize the embeddings via tSNE in an interactive embedding projector:

left_da.plot_embeddings()

Visualizing embedding via tSNE and embedding projector

Fun is fun, but recall that our goal is to match left images against right images, and so far we have only handled the left. Let's repeat the same procedure for the right:

You can pull it from Jina Cloud:

right_da = (DocumentArray.pull('demo-rightda', show_progress=True)
                         .apply(preproc)
                         .embed(model, device='cuda'))

Or download, unzip, and load it from local files:

right_da = (DocumentArray.from_files('right/*.jpg')
                         .apply(preproc)
                         .embed(model, device='cuda'))

Match nearest neighbours

We can now match the left to the right and take the top-9 results.

left_da.match(right_da, limit=9)

Let's inspect what's inside left_da's matches now:

for d in left_da:
    for m in d.matches:
        print(d.uri, m.uri, m.scores['cosine'].value)
left/02262.jpg right/03459.jpg 0.21102
left/02262.jpg right/02964.jpg 0.13871843
left/02262.jpg right/02103.jpg 0.18265384
left/02262.jpg right/04520.jpg 0.16477376
...

Or shorten the loop into a one-liner using the element & attribute selector:

print(left_da['@m', ('uri', 'scores__cosine__value')])

Better to see it:

(DocumentArray(left_da[8].matches, copy=True)
    .apply(lambda d: d.set_image_tensor_channel_axis(0, -1)
                      .set_image_tensor_inv_normalization())
    .plot_image_sprites())

Visualizing top-9 matches using DocArray API

What we did here is revert the preprocessing steps (i.e. switching the axis and normalizing) on the copied matches, so that you can visualize them using image sprites.

Quantitative evaluation

Serious as you are, visual inspection is surely not enough. Let's calculate the recall@K. First we construct the groundtruth matches:

groundtruth = DocumentArray(
    Document(uri=d.uri, matches=[Document(uri=d.uri.replace('left', 'right'))]) for d in left_da)

Here we create a new DocumentArray with real matches by simply replacing the filename, e.g. left/00001.jpg to right/00001.jpg. That's all we need: if a predicted match has an identical uri to the groundtruth match, it is correct.

Now let's check recall rate from 1 to 5 over the full dataset:

for k in range(1, 6):
    print(f'recall@{k}',
          left_da.evaluate(
            groundtruth,
            hash_fn=lambda d: d.uri,
            metric='recall_at_k',
            k=k,
            max_rel=1))
recall@1 0.02726063829787234
recall@2 0.03873005319148936
recall@3 0.04670877659574468
recall@4 0.052194148936170214
recall@5 0.0573470744680851

More metrics can be used, such as precision_at_k, ndcg_at_k, hit_at_k.
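
Swapping them in is a one-word change; a sketch assuming the same evaluate signature as above:

for metric in ('precision_at_k', 'ndcg_at_k', 'hit_at_k'):
    print(metric,
          left_da.evaluate(
            groundtruth,
            hash_fn=lambda d: d.uri,
            metric=metric,
            k=5,
            max_rel=1))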

If you think a pretrained ResNet50 is good enough, let me tell you that with Finetuner you can do much better in just 10 extra lines of code. Here is how.

Save results

You can save a DocumentArray to binary, JSON, dict, DataFrame, CSV or a Protobuf message, with or without compression. In its simplest form:

left_da.save('left_da.bin')

To reuse it, do left_da = DocumentArray.load('left_da.bin').
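
As a sketch of two of the other formats mentioned above, JSON and compressed Protobuf bytes (the protocol and compress values follow docarray v0.x; verify on your version):

left_da2 = DocumentArray.from_json(left_da.to_json())  # human-readable round-trip

raw = left_da.to_bytes(protocol='protobuf', compress='gzip')  # compact, compressed wire format
left_da3 = DocumentArray.from_bytes(raw, protocol='protobuf', compress='gzip')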

If you want to transfer a DocumentArray from one machine to another or share it with your colleagues, you can do:

left_da.push(token='my_shared_da')

Now anyone who knows the token my_shared_da can pull and work on it.

left_da = DocumentArray.pull(token='my_shared_da')

Intrigued? That's only scratching the surface of what DocArray is capable of. Read our docs to learn more.

Support

Join Us

DocArray is backed by Jina AI and licensed under Apache-2.0. We are actively hiring AI engineers and solution engineers to build the next neural search ecosystem in open source.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].