
SANSA-Stack / Archived-SANSA-Query

License: Apache-2.0
SANSA Query Layer

Programming Languages

  • Scala
  • Java
  • Shell

Projects that are alternatives to or similar to Archived-SANSA-Query

SANSA-Stack
Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/
Stars: ✭ 130 (+319.35%)
Mutual labels:  rdf, distributed-computing, flink
pyspark-algorithms
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (+132.26%)
Mutual labels:  distributed-computing, partitioning
sparklis
Sparklis is a query builder in natural language that allows people to explore and query SPARQL endpoints with all the power of SPARQL and without any knowledge of SPARQL.
Stars: ✭ 28 (-9.68%)
Mutual labels:  sparql, rdf
open-stream-processing-benchmark
This repository contains the code base for the Open Stream Processing Benchmark.
Stars: ✭ 37 (+19.35%)
Mutual labels:  distributed-computing, flink
SolRDF
An RDF plugin for Solr
Stars: ✭ 115 (+270.97%)
Mutual labels:  sparql, rdf
skos-play
SKOS-Play allows printing SKOS files as HTML or PDF. It also embeds xls2rdf to generate RDF from Excel.
Stars: ✭ 58 (+87.1%)
Mutual labels:  sparql, rdf
everything
The semantic desktop search engine
Stars: ✭ 22 (-29.03%)
Mutual labels:  sparql, rdf
ontobio
Python library for working with ontologies and ontology associations
Stars: ✭ 104 (+235.48%)
Mutual labels:  sparql, rdf
semantic-python-overview
(subjective) overview of projects which are related both to python and semantic technologies (RDF, OWL, Reasoning, ...)
Stars: ✭ 406 (+1209.68%)
Mutual labels:  sparql, rdf
amazon-neptune-csv-to-rdf-converter
Amazon Neptune CSV to RDF Converter is a tool for Amazon Neptune that converts property graphs stored as comma separated values into RDF graphs.
Stars: ✭ 27 (-12.9%)
Mutual labels:  sparql, rdf
tentris
Tentris is a tensor-based RDF triple store with SPARQL support.
Stars: ✭ 34 (+9.68%)
Mutual labels:  sparql, rdf
corese
Software platform implementing and extending the standards of the Semantic Web.
Stars: ✭ 55 (+77.42%)
Mutual labels:  sparql, rdf
LD-Connect
LD Connect is a Linked Data portal for IOS Press in collaboration with the STKO Lab at UC Santa Barbara.
Stars: ✭ 0 (-100%)
Mutual labels:  sparql, rdf
matcha
🍵 SPARQL-like DSL for querying in memory Linked Data Models
Stars: ✭ 18 (-41.94%)
Mutual labels:  sparql, rdf
QuitStore
🖧 Quads in Git - Distributed Version Control for RDF Knowledge Bases
Stars: ✭ 87 (+180.65%)
Mutual labels:  sparql, rdf
OLGA
an Ontology SDK
Stars: ✭ 36 (+16.13%)
Mutual labels:  sparql, rdf
pyfuseki
A Python library for connecting to and manipulating Jena Fuseki, providing both sync and async methods.
Stars: ✭ 22 (-29.03%)
Mutual labels:  sparql, rdf
trio
Datatype agnostic triple store & query engine API
Stars: ✭ 78 (+151.61%)
Mutual labels:  sparql, rdf
sparql-proxy
SPARQL-proxy: provides cache, job control, and logging for any SPARQL endpoint
Stars: ✭ 26 (-16.13%)
Mutual labels:  sparql, rdf
Processor
Ontology-driven Linked Data processor and server for SPARQL backends. Apache License.
Stars: ✭ 54 (+74.19%)
Mutual labels:  sparql, rdf

Archived Repository - Do not use this repository anymore!

SANSA got easier to use! All its code has been consolidated into a single repository at https://github.com/SANSA-Stack/SANSA-Stack

SANSA Query


Description

SANSA Query is a library for performing SPARQL queries over RDF data using the big data engines Apache Spark and Apache Flink. It can query RDF data residing either in HDFS or in a local file system. Queries are executed in a distributed, parallel fashion across Spark RDDs/DataFrames or Flink DataSets. Furthermore, SANSA Query can use Spark to query non-RDF data stored in databases such as MongoDB, Cassandra, and MySQL, or in file formats such as Parquet.

For RDF data, SANSA uses a vertical partitioning (VP) approach and is designed to support extensible partitioning of RDF data. Instead of dealing with a single triple table (s, p, o), the data is partitioned into multiple tables based on the RDF predicates used, the RDF term types, and the literal datatypes. The first column of these tables is always a string representing the subject. The second column always represents the literal value as a Scala/Java datatype. Tables for storing literals with language tags have an additional third string column for the language tag. SANSA uses Sparqlify as a scalable SPARQL-to-SQL rewriter.
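As a hedged illustration (plain Scala over made-up data, not the SANSA API), the sketch below shows how a few triples are split into predicate-specific tables under this scheme:

// Input triples (hypothetical example data).
val triples = Seq(
  ("ex:alice", "foaf:name", "\"Alice\"@en"),
  ("ex:bob",   "foaf:name", "\"Bob\"@en"),
  ("ex:alice", "ex:age",    "\"23\"^^xsd:int")
)

// foaf:name partition (language-tagged literals): subject, literal value, language tag.
val nameTable: Seq[(String, String, String)] =
  Seq(("ex:alice", "Alice", "en"), ("ex:bob", "Bob", "en"))

// ex:age partition (xsd:int literals): subject, literal value as a Scala Int.
val ageTable: Seq[(String, Int)] =
  Seq(("ex:alice", 23))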

For heterogeneous data sources (data lake), SANSA uses virtual property table (PT) partitioning, whereby the data relevant to a query is loaded on the fly into Spark DataFrames whose attributes correspond to the properties of the query.

SANSA Query SPARK - RDF

In SANSA Query Spark for RDF, the method for partitioning an RDD[Triple] is located in RdfPartitionUtilsSpark. It uses an RdfPartitioner, which maps a Triple to a single RdfPartition instance.

  • RdfPartition - as the name suggests, represents a partition of the RDF data and defines two methods:
    • matches(Triple): Boolean: tests whether a triple fits into the partition.
    • layout: TripleLayout: returns the TripleLayout associated with the partition, as explained below.
    • Furthermore, RdfPartitions are expected to be serializable and to define equals and hashCode.
  • TripleLayout instances are used to obtain framework-agnostic, compact tabular representations of triples according to a partition. For this purpose, a layout defines two methods:
    • fromTriple(triple: Triple): Product: for a given triple, returns its representation as a Product (the superclass of all Scala tuples).
    • schema: Type: returns the exact Scala type of the objects returned by fromTriple, such as typeOf[Tuple2[String, Double]]. Hence, a layout is expected to yield instances of only one specific type.

See the available layouts for details.
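As a rough sketch of these two abstractions (assuming Jena's Triple and Scala runtime reflection; the actual SANSA definitions may differ in package and signature details):

import org.apache.jena.graph.Triple
import scala.reflect.runtime.universe._

// Hedged re-statement of the TripleLayout contract described above.
trait TripleLayout {
  def schema: Type
  def fromTriple(triple: Triple): Product
}

// Example layout for triples whose object is a plain string literal,
// represented as (subject, lexical value), i.e. Tuple2[String, String].
object StringLiteralLayout extends TripleLayout {
  def schema: Type = typeOf[(String, String)]
  def fromTriple(triple: Triple): Product =
    (triple.getSubject.toString, triple.getObject.getLiteralLexicalForm)
}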

SANSA Query SPARK - Heterogeneous Data Sources

SANSA Query Spark for heterogeneous data sources (data lake) is composed of four main components:

  • Analyzer: extracts the SPARQL triple patterns and groups them by subject; it also extracts any operations on the subjects, such as filter, group by, order by, distinct, and limit.
  • Planner: extracts the joins between the subject-based triple pattern groups and generates a join plan accordingly. The join order followed is left-deep.
  • Mapper: accesses the (RML) mappings and matches the properties of a subject-based triple pattern group against the attributes of the individual data sources. If a match exists for every property of the group, the respective data source is declared relevant and loaded into a Spark DataFrame. The loading into DataFrames is performed using Spark connectors.
  • Executor: analyzes the SPARQL query and generates the equivalent Spark SQL functions over the DataFrames for SELECT, WHERE, GROUP BY, ORDER BY, and LIMIT. Connections between subject-based triple pattern groups are translated into joins between the relevant Spark DataFrames; a sketch of the result is shown below.
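As a hedged illustration (plain Spark code over made-up in-memory data, not the actual SANSA implementation), the following sketch shows the kind of DataFrame program the Executor produces when two subject-based triple pattern groups share a subject variable:

import org.apache.spark.sql.SparkSession

// Sketch for a query such as:
//   SELECT ?name ?price WHERE { ?p ex:name ?name . ?p ex:price ?price }
// (ex:name, ex:price, and the data below are hypothetical examples.)
val spark = SparkSession.builder().appName("executor-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-ins for two relevant data sources loaded by the Mapper.
val names  = Seq(("p1", "Widget"), ("p2", "Gadget")).toDF("s", "name")
val prices = Seq(("p1", 9.99)).toDF("s", "price")

// Left-deep join on the shared subject column, then the SELECT projection.
val result = names.join(prices, "s").select("name", "price")
result.show()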

Usage

The following Scala code shows how to query an RDF file using SPARQL (be it a local file or a file residing in HDFS):

// Import paths as used by the SANSA 0.x examples; adjust to your version.
import org.apache.jena.riot.Lang
import org.apache.spark.sql.SparkSession
import net.sansa_stack.rdf.spark.io._
import net.sansa_stack.rdf.spark.partition.core.RdfPartitionUtilsSpark
import net.sansa_stack.query.spark.sparqlify.{QueryExecutionFactorySparqlifySpark, SparqlifyUtils3}
import org.aksw.jena_sparql_api.server.utils.FactoryBeanSparqlServer

val spark: SparkSession = ...

// Load an N-Triples file (local or HDFS) into an RDD[Triple].
val lang = Lang.NTRIPLES
val triples = spark.rdf(lang)("path/to/rdf.nt")

// Vertically partition the triples and set up the Sparqlify SPARQL-to-SQL rewriter.
val partitions = RdfPartitionUtilsSpark.partitionGraph(triples)
val rewriter = SparqlifyUtils3.createSparqlSqlRewriter(spark, partitions)

val qef = new QueryExecutionFactorySparqlifySpark(spark, rewriter)

// Expose the query engine as a SPARQL endpoint on the given port.
val port = 7531
val server = FactoryBeanSparqlServer.newInstance.setSparqlServiceFactory(qef).setPort(port).create()
server.join()
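Besides serving queries over HTTP, the query execution factory can also be used directly; assuming qef follows the jena-sparql-api QueryExecutionFactory convention (a hedged sketch):

// Run a query programmatically instead of (or in addition to) starting the server.
val qe = qef.createQueryExecution("SELECT * WHERE { ?s ?p ?o } LIMIT 10")
val rs = qe.execSelect()
while (rs.hasNext) println(rs.next())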

An overview is given in the FAQ section of the SANSA project page. Further documentation about the builder objects can also be found on the ScalaDoc page.

For querying heterogeneous data sources, refer to the documentation of the dedicated SANSA-DataLake component.

How to Contribute

We always welcome new contributors to the project! Please see our contribution guide for more details on how to get started contributing to SANSA.
