informagi / GeeseDB

Licence: MIT License
Graph Engine for Exploration and Search

GeeseDB

Graph Engine for Exploration and Search

GeeseDB is a Python toolkit for solving information retrieval research problems that leverage graphs as data structures. It aims to simplify information retrieval research by allowing researchers to easily formulate graph queries through a graph query language. GeeseDB is built on top of DuckDB, an embedded column-store relational database designed for analytical workloads.

GeeseDB is available as an easy-to-install Python package. In only a few lines of code, users can create a first-stage retrieval ranking using BM25. Queries read and write NumPy arrays and Pandas DataFrames at zero or negligible data transformation cost (depending on the base datatype). Results of a first-stage ranker expressed in GeeseDB can therefore be used in later stages of the ranking process, giving access to Python machine learning libraries with minimal overhead. Because data representation and processing are strictly separated, GeeseDB also forms an ideal basis for reproducible IR research.

Package Installation

Install the latest version of GeeseDB from PyPI:

pip install geesedb==0.0.2

GeeseDB depends on a couple of packages that can also be installed using pip. It is also possible to install the development version of GeeseDB using pip:

pip install git+https://github.com/informagi/GeeseDB.git

If you plan to contribute to the package, you can clone the repository and install it with pip in editable mode:

git clone git@github.com:informagi/GeeseDB.git && cd GeeseDB && pip install -e .

You can run our tests from the repository folder to confirm that everything works as intended:

pytest

How do I index?

The fastest way to load text data into GeeseDB is through CSV files. Three CSV files are needed: one for terms, one for documents, and one that connects the terms to the documents. Small examples of these files can be found in the repository: docs.csv, term_dict.csv, and term_doc.csv.

These files can be generated with the CIFF to_csv class from CIFF collections, or created in any other way you like. The documents can be loaded using the following code:

from geesedb.index import FullTextFromCSV

index = FullTextFromCSV(
    database='/path/to/database',
    docs_file='/path/to/docs.csv',
    term_dict_file='/path/to/term_dict.csv',
    term_doc_file='/path/to/term_doc.csv'
)
index.load_data()
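
If no CIFF export is at hand, the three files can also be written directly. The sketch below builds them from a tiny in-memory collection using only the standard library. The column order (docs: collection_id, doc_id, len; term_dict: term_id, string, df; term_doc: term_id, doc_id, tf), the absence of a header row, the `|` delimiter, and the helper name `write_index_csvs` are assumptions made for illustration; check them against the example files in the repository before loading real data.

```python
import csv
import os
import tempfile
from collections import Counter


def write_index_csvs(collection, out_dir, delimiter='|'):
    """Write docs.csv, term_dict.csv and term_doc.csv for a {doc_name: text} mapping.

    The column layout and delimiter are assumptions, not taken from the
    GeeseDB documentation.
    """
    term_ids = {}          # term string -> integer term id
    doc_freq = Counter()   # term id -> number of documents containing the term
    docs_rows, term_doc_rows = [], []

    for doc_id, (name, text) in enumerate(collection.items()):
        tokens = text.lower().split()   # naive whitespace tokenizer
        docs_rows.append((name, doc_id, len(tokens)))
        for term, tf in sorted(Counter(tokens).items()):
            term_id = term_ids.setdefault(term, len(term_ids))
            doc_freq[term_id] += 1
            term_doc_rows.append((term_id, doc_id, tf))

    term_dict_rows = [(tid, term, doc_freq[tid]) for term, tid in term_ids.items()]

    paths = {}
    for fname, rows in (('docs.csv', docs_rows),
                        ('term_dict.csv', term_dict_rows),
                        ('term_doc.csv', term_doc_rows)):
        paths[fname] = os.path.join(out_dir, fname)
        with open(paths[fname], 'w', newline='') as f:
            csv.writer(f, delimiter=delimiter).writerows(rows)
    return paths


collection = {'d1': 'the cat sat', 'd2': 'the dog chased the cat'}
out_dir = tempfile.mkdtemp()
paths = write_index_csvs(collection, out_dir)
```

The returned paths can then be passed as docs_file, term_dict_file, and term_doc_file in the loading code above.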

How do I search?

After indexing the data, it is easy to construct a first-stage ranking using BM25:

from geesedb.search import Searcher

searcher = Searcher(
    database='/path/to/database', 
    n=10
)
hits = searcher.search_topic('cat')

In this case the searcher returns the top 10 documents for the query: cat.
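
For readers unfamiliar with BM25, the sketch below scores a small in-memory collection in plain Python. It illustrates the ranking function in a common textbook form and is not GeeseDB's implementation; the function name `bm25_rank`, the whitespace tokenizer, and the parameter values `k1=0.9` and `b=0.4` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_rank(collection, query, n=10, k1=0.9, b=0.4):
    """Return the top-n (doc_name, score) pairs for a whitespace-tokenized query."""
    tokenized = {name: text.lower().split() for name, text in collection.items()}
    num_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / num_docs

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (num_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents need more matches.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[name] = score

    # Sort by descending score; ties are broken by document name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


collection = {
    'd1': 'the cat sat on the mat',
    'd2': 'the dog chased the cat and the other cat',
    'd3': 'birds sing in the morning',
}
hits = bm25_rank(collection, 'cat', n=10)
```

Only documents that contain at least one query term receive a score, so `hits` here holds two entries even though `n` is 10.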

How can I use SQL with GeeseDB?

GeeseDB is built on top of DuckDB and inherits all of its functionality, so the data in GeeseDB can be queried directly using SQL. The following example shows how to run SQL on the data loaded above:

from geesedb.connection import get_connection

db_path = '/path/to/database'
cursor = get_connection(db_path)
cursor.execute("SELECT count(*) FROM docs;")
cursor.fetchall()
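
Because the index is stored as ordinary tables, collection statistics can be computed with plain SQL joins. The sketch below shows one such query, listing the terms of a document with their frequencies. To stay runnable without GeeseDB installed, it uses the standard-library `sqlite3` module as a stand-in for DuckDB, and the table and column names other than `docs` are assumptions about the schema.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # stand-in for the DuckDB database file
conn.executescript("""
    CREATE TABLE docs (collection_id TEXT, doc_id INTEGER, len INTEGER);
    CREATE TABLE term_dict (term_id INTEGER, string TEXT, df INTEGER);
    CREATE TABLE term_doc (term_id INTEGER, doc_id INTEGER, tf INTEGER);
    INSERT INTO docs VALUES ('d1', 0, 3), ('d2', 1, 5);
    INSERT INTO term_dict VALUES (0, 'cat', 2), (1, 'sat', 1), (2, 'the', 2),
                                 (3, 'chased', 1), (4, 'dog', 1);
    INSERT INTO term_doc VALUES (0, 0, 1), (1, 0, 1), (2, 0, 1),
                                (0, 1, 1), (3, 1, 1), (4, 1, 1), (2, 1, 2);
""")

# Terms of document 'd2', most frequent first, ties ordered alphabetically.
rows = conn.execute("""
    SELECT td.string, tdoc.tf
    FROM docs AS d
    JOIN term_doc AS tdoc ON d.doc_id = tdoc.doc_id
    JOIN term_dict AS td ON tdoc.term_id = td.term_id
    WHERE d.collection_id = ?
    ORDER BY tdoc.tf DESC, td.string
""", ('d2',)).fetchall()
```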

How can I use Cypher with GeeseDB?

GeeseDB also supports a subset of the Cypher graph query language, in particular the following keywords: MATCH, RETURN, WHERE, AND, DISTINCT, ORDER BY, SKIP, and LIMIT. We plan to support the full Cypher query language in the future. To use Cypher with GeeseDB, a metadata file needs to be loaded first.

The metadata describes the graph structure stored in the database; the table name _meta is used for this. The metadata is represented as a Python dictionary object with the following structure:

{
    'from_node':
    {
        'to_node':
        [['join_table',
          'from_node_join_key',
          'join_table_from_node_join_key',
          'join_table_to_node_join_key',
          'to_node_join_key'
          ]]
    }
}

Using this structure, we know how the tables in the database relate to each other. With this information, Cypher queries can be translated to SQL queries. An example of a Cypher query and its SQL translation is shown below:

Cypher:

MATCH (d:docs)-[]-(:authors)-[]-(d2:docs)
WHERE d.collection_id = "96ab542e"
RETURN DISTINCT d2.collection_id

SQL:

SELECT DISTINCT d2.collection_id
FROM docs AS d2
JOIN doc_author AS da2 ON (d2.collection_id = da2.doc)
JOIN authors AS a2 ON (da2.author = a2.author)
JOIN doc_author AS da3 ON (a2.author = da3.author)
JOIN docs AS d ON (d.collection_id = da3.doc)
WHERE d.collection_id = '96ab542e'
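
To make the role of the metadata concrete, the sketch below derives the JOIN chain of such a query from a path of node tables. It is a simplified illustration, not GeeseDB's translator: the function `joins_for_path`, the alias scheme (`e1`, `e2`, and the alias `a` for the anonymous authors node), and the metadata entries are assumptions, and only a single path with unlabeled edges is handled.

```python
# Assumed metadata for the docs/authors example; each entry is
# [join_table, from_node_join_key, join_table_from_node_join_key,
#  join_table_to_node_join_key, to_node_join_key].
METADATA = {
    'docs': {
        'authors': [['doc_author', 'collection_id', 'doc', 'author', 'author']]
    },
    'authors': {
        'docs': [['doc_author', 'author', 'author', 'doc', 'collection_id']]
    },
}


def joins_for_path(path, metadata):
    """Build FROM/JOIN clauses for a path of (alias, table) node pairs."""
    first_alias, first_table = path[0]
    clauses = [f'FROM {first_table} AS {first_alias}']
    for i in range(1, len(path)):
        (prev_alias, prev_table), (alias, table) = path[i - 1], path[i]
        # Use the first registered edge between the two node tables.
        join_table, from_key, jt_from_key, jt_to_key, to_key = \
            metadata[prev_table][table][0]
        jt_alias = f'e{i}'   # alias for the join (edge) table
        clauses.append(
            f'JOIN {join_table} AS {jt_alias} '
            f'ON ({prev_alias}.{from_key} = {jt_alias}.{jt_from_key})')
        clauses.append(
            f'JOIN {table} AS {alias} '
            f'ON ({jt_alias}.{jt_to_key} = {alias}.{to_key})')
    return '\n'.join(clauses)


# MATCH (d:docs)-[]-(:authors)-[]-(d2:docs)
path = [('d', 'docs'), ('a', 'authors'), ('d2', 'docs')]
sql = (
    'SELECT DISTINCT d2.collection_id\n'
    + joins_for_path(path, METADATA)
    + "\nWHERE d.collection_id = '96ab542e'"
)
print(sql)
```

The generated query walks the path starting from d rather than d2, so its join order differs from the SQL shown above; for inner joins the two forms return the same rows.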

Queries can be translated as follows:

from geesedb.interpreter import Translator

c_query = "cypher query"
translator = Translator('path/to/database')
sql_query = translator.translate(c_query)