
VIDA-NYU / auctus

License: Apache-2.0
Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index

Programming Languages

Python, TypeScript, Jsonnet, HTML, CSS, Shell

Projects that are alternatives of or similar to auctus

flow-indexer
Flow-Indexer indexes flows found in chunked log files from bro, nfdump, syslog, or pcap files
Stars: ✭ 43 (+26.47%)
Mutual labels:  search-engine, index
Mhtextsearch
A fast full-text search library for Objective-C
Stars: ✭ 79 (+132.35%)
Mutual labels:  search-engine, index
Riot
Go open-source, distributed, simple, and efficient search engine. Warning: this is the V1 beta; because of its large memory consumption, V2 will be a complete rewrite.
Stars: ✭ 6,025 (+17620.59%)
Mutual labels:  search-engine, index
Dig Etl Engine
Download DIG to run on your laptop or server.
Stars: ✭ 81 (+138.24%)
Mutual labels:  search-engine, crawling
Blast
Blast is a full text search and indexing server, written in Go, built on top of Bleve.
Stars: ✭ 934 (+2647.06%)
Mutual labels:  search-engine, index
Sonic
🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.
Stars: ✭ 12,347 (+36214.71%)
Mutual labels:  search-engine, index
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+261.76%)
Mutual labels:  crawling
openblockchain
{START HERE} docker engine to roll your own openblockchain
Stars: ✭ 16 (-52.94%)
Mutual labels:  search-engine
solr
Apache Solr open-source search software
Stars: ✭ 651 (+1814.71%)
Mutual labels:  search-engine
gsc-logger
Google Search Console Logger for Google App Engine
Stars: ✭ 38 (+11.76%)
Mutual labels:  search-engine
docker-compose-search
command line utility to search docker-compose projects
Stars: ✭ 32 (-5.88%)
Mutual labels:  search-engine
the-seinfeld-chronicles
A dataset for textual analysis on arguably the best written comedy television show ever.
Stars: ✭ 14 (-58.82%)
Mutual labels:  crawling
flipper
Search/Recommendation engine and metainformation server for fanfiction.net
Stars: ✭ 29 (-14.71%)
Mutual labels:  search-engine
hldig
hl://Dig is a fork of ht://Dig, a web indexing and searching system for a small domain or intranet
Stars: ✭ 17 (-50%)
Mutual labels:  search-engine
milli
Search engine library for Meilisearch ⚡️
Stars: ✭ 433 (+1173.53%)
Mutual labels:  search-engine
mudrod
Mining and Utilizing Dataset Relevancy from Oceanographic Datasets to Improve Data Discovery and Access, online demo: https://mudrod.jpl.nasa.gov/#/
Stars: ✭ 15 (-55.88%)
Mutual labels:  search-engine
shifting
A privacy-focused list of alternatives to mainstream services to help the competition.
Stars: ✭ 31 (-8.82%)
Mutual labels:  search-engine
zephyr
Mirror Android notifications to VR
Stars: ✭ 78 (+129.41%)
Mutual labels:  index
scan
DeFi Scan, a one-stop location for everything on DeFi Blockchain. Powered by jellyfish & ocean network.
Stars: ✭ 31 (-8.82%)
Mutual labels:  index
stackoverflow-semantic-search
Word2Vec encodings based search engine for Stackoverflow questions
Stars: ✭ 23 (-32.35%)
Mutual labels:  search-engine

Auctus

This project is a web crawler and search engine for datasets, specifically meant for data augmentation tasks in machine learning. It can find datasets in different repositories and index them for later retrieval.

Documentation is available here

It is divided into multiple components:

  • Libraries
    • Geospatial database datamart_geo. This contains data about administrative areas extracted from Wikidata and OpenStreetMap. It lives in its own repository and is used here as a submodule.
    • Profiling library datamart_profiler. This can be installed by clients; it allows the client library to profile datasets locally instead of sending them to the server. It is also used by the apiserver and profiler services.
    • Materialization library datamart_materialize. This is used to materialize datasets from the various sources that Auctus supports. It can be installed by clients, allowing them to materialize datasets locally instead of using the server as a proxy.
    • Data augmentation library datamart_augmentation. This performs the join or union of two datasets and is used by the apiserver service, but could conceivably be used stand-alone.
    • Core server library datamart_core. This contains common code for the services and is only used by the server components. The filesystem locking code is separate, as datamart_fslock, for performance reasons (it has to import fast).
  • Services
    • Discovery services: these are responsible for discovering datasets. Each plugin can talk to a specific repository. Materialization metadata is recorded for each dataset to allow future retrieval.
    • Profiler: this service downloads a discovered dataset and computes additional metadata that can be used for search (for example, dimensions, semantic types, value distributions). Uses the profiling and materialization libraries.
    • Lazo Server: this service is responsible for indexing textual and categorical attributes using Lazo. The code for the server and client is available here.
    • apiserver: this service responds to requests from clients to search for datasets in the index (triggering on-demand queries by discovery services that support them), upload new datasets, profile datasets, or perform augmentation. It uses the profiling and materialization libraries and implements a JSON API with the Tornado web framework (a sketch of a client request follows this list).
    • The cache-cleaner: this service makes sure the dataset cache stays under a given size limit by removing least-recently-used datasets when the configured size is reached.
    • The coordinator: this service collects some metrics and offers a maintenance interface for the system administrator.
    • The frontend: this is a React app implementing a user-friendly web interface on top of the API.
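For client-side use, the client libraries can be installed on their own. The package names below assume they are published on PyPI under their repository names:

$ pip install datamart-profiler datamart-materialize

As a sketch of a client request to the apiserver, the hypothetical call below assumes a search endpoint at /api/v1/search that accepts the query as a JSON form field; the endpoint path and query format are assumptions, not confirmed by this document:

$ curl -X POST https://auctus.vida-nyu.org/api/v1/search -F 'query={"keywords": ["taxi"]}'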

Auctus Architecture

Elasticsearch is used as the search index, storing one document per known dataset.
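During development you can inspect the index by querying Elasticsearch directly, assuming the container publishes the default port 9200 on localhost:

$ curl -s 'localhost:9200/_cat/indices?v'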

The services exchange messages through RabbitMQ, which gives us complex messaging patterns with queueing and retrying semantics, such as the on-demand querying.
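For a quick look at the message flow, you can list the queues and their depths; this assumes the service is named rabbitmq, as in the docker-compose commands below:

$ docker-compose exec rabbitmq rabbitmqctl list_queues name messages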

AMQP Overview

Deployment

The system is currently running at https://auctus.vida-nyu.org/. You can see the system status at https://grafana.auctus.vida-nyu.org/.

Local deployment / development setup

To deploy the system locally using docker-compose, follow these steps:

Set up environment

Make sure you have checked out the submodule with git submodule init && git submodule update

Make sure you have Git LFS installed and configured (git lfs install)

Copy env.default to .env and update the variables there. You might want to update the password for a production deployment.
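For example:

$ cp env.default .env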

Make sure your node is set up for running Elasticsearch. You will probably have to raise the mmap limit.
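On Linux, Elasticsearch's documented minimum for vm.max_map_count is 262144, which you can set with:

$ sudo sysctl -w vm.max_map_count=262144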

The API_URL is the URL at which the apiserver containers will be visible to clients. In a production deployment, this is probably a public-facing HTTPS URL. It can be the same URL that the "coordinator" component will be served at if using a reverse proxy (see nginx.conf).
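For example, a production .env entry might look like the following; the hostname and path here are placeholders, not values prescribed by this document:

API_URL=https://auctus.example.org/api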

To run scripts locally, you can load the environment variables into your shell by running: . scripts/load_env.sh (that's dot space scripts...)

Prepare data volumes

Run scripts/setup.sh to initialize the data volumes. This will set the correct permissions on the volumes/ subdirectories.

Should you ever want to start from scratch, you can delete volumes/ but make sure to run scripts/setup.sh again afterwards to set permissions.
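In other words:

$ rm -rf volumes/
$ scripts/setup.sh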

Build the containers

$ docker-compose build --build-arg version=$(git describe) apiserver

Start the base containers

$ docker-compose up -d elasticsearch rabbitmq redis minio lazo

These will take a few seconds to get up and running. Then you can start the other components:

$ docker-compose up -d cache-cleaner coordinator profiler apiserver apilb frontend

You can use the --scale option to start more profiler or apiserver containers, for example:

$ docker-compose up -d --scale profiler=4 --scale apiserver=8 cache-cleaner coordinator profiler apiserver apilb frontend

Ports:

Import a snapshot of our index (optional)

$ scripts/docker_import_snapshot.sh

This will download an Elasticsearch dump from auctus.vida-nyu.org and import it into your local Elasticsearch container.
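To verify that the import succeeded, you can check the document count, again assuming Elasticsearch is reachable on localhost:9200:

$ curl -s 'localhost:9200/_cat/count?v'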

Start discovery plugins (optional)

$ docker-compose up -d socrata zenodo

Start metric dashboard (optional)

$ docker-compose up -d elasticsearch_exporter prometheus grafana

Prometheus is configured to automatically find the containers (see prometheus.yml)

A custom RabbitMQ image is used, with added plugins (management and prometheus).
