Licence: other
scalable knowledge graph construction from unstructured text

Programming Languages

scala
shell

dstlr

dstlr is an open-source platform for scalable, end-to-end knowledge graph construction from unstructured text. The platform takes a collection of documents, extracts mentions and relations to populate a raw knowledge graph, links mentions to entities in Wikidata, and then enriches the knowledge graph with facts from Wikidata. See dstlr.ai for an overview of the platform.

The current dstlr demo "distills" the TREC Washington Post Corpus containing around 600K documents into a raw knowledge graph comprised of approximately 97M triples, enriched with facts from Wikidata for the 324K distinct entities discovered in the corpus. On top of this knowledge graph, we have implemented a subgraph-matching approach to align extracted relations with facts from Wikidata using the declarative Cypher query language. This simple demo shows that fact verification, locating textual support for asserted facts, detecting inconsistent and missing facts, and extracting distantly-supervised training data can all be performed within the same framework.

This README provides instructions on how to replicate our work.

Setup

Clone dstlr:

git clone https://github.com/dstlry/dstlr.git

sbt is the build tool used for Scala projects. Download it if you don't have it yet.

Build the JAR using sbt:

sbt assembly

There is a known issue between recent Spark versions and CoreNLP 3.8. To fix this, delete the protobuf-java-2.5.0.jar file in $SPARK_HOME/jars and replace it with version 3.0.0.

Anserini

Download and build Anserini.

Follow the Solrini instructions to set up a SolrCloud instance and index a document collection into SolrCloud, such as the TREC Washington Post Corpus.

neo4j

Start a neo4j instance via Docker with the command:

docker run -d --name neo4j --publish=7474:7474 --publish=7687:7687 \
    --volume=`pwd`/neo4j:/data \
    -e NEO4J_dbms_memory_pagecache_size=2G \
    -e NEO4J_dbms_memory_heap_initial__size=4G \
    -e NEO4J_dbms_memory_heap_max__size=16G \
    neo4j

Note: You may wish to update the memory settings based on the amount of available memory on your machine.

neo4j should be available shortly at http://localhost:7474/ with the default username/password of neo4j/neo4j. You will be prompted to change the password. The new password is the one you will pass to the load script.
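
Because the container can take a little while to start, it can help to poll the web interface before moving on. The following Python sketch is illustrative only and is not part of dstlr. The function name wait_for_http and its retry parameters are assumptions made for this example. Only the URL comes from the instructions above.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, attempts=30, delay_s=2.0):
    """Poll url until it answers over HTTP; return False if it never does.

    Hypothetical helper for illustration, not part of dstlr.
    """
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered, even with an error status, so it is up.
            return True
        except OSError:
            # Connection refused, timeout, or DNS failure: not ready yet.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False


# Example call (blocks for up to roughly attempts * delay_s seconds):
# wait_for_http("http://localhost:7474/")
```

A True result only means that something answered on that port. You still need to log in once to change the default password.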

Running

Extraction

For each document in the collection, we extract mentions of named entities, the relations between them, and links to entities in an external knowledge graph.

Run ExtractTriples using default options:

./bin/extract.sh

Note: Modify extract.sh based on your environment (e.g., available memory, number of executors, Solr, neo4j password, etc.) - options available here.

For example, the following command runs the query music against the index core18, then applies dstlr to the top 5 hits.

spark-submit --class io.dstlr.ExtractTriples \
        --num-executors 32 --executor-cores 8 \
        --driver-memory 64G --executor-memory 48G \
        --conf spark.executor.heartbeatInterval=10000 \
        --conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-9-openjdk-amd64 \
        target/scala-2.11/dstlr-assembly-0.1.jar \
        --solr.uri localhost:9983 --solr.index core18 --max_rows 5 --query contents:music --partitions 2048 --output triples --sent-length-threshold 256

After the extraction is done, check that an output folder (called triples/ by default) has been created and that it contains several Parquet files.
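
The check on the output folder can be scripted. The sketch below is a minimal Python example, not part of dstlr. The helper name list_parquet_files is an assumption, and it relies on Spark's usual behaviour of writing part files with a .parquet suffix next to a _SUCCESS marker.

```python
from pathlib import Path


def list_parquet_files(output_dir):
    """Return the sorted names of the *.parquet files directly under output_dir.

    Hypothetical helper for illustration; returns an empty list if the
    folder does not exist.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob("*.parquet"))


# Example call, using the default output folder name from the text above:
# files = list_parquet_files("triples")
# print(len(files), "Parquet files found")
```

An empty result suggests that the job did not finish or that it wrote to a different folder than the one passed via --output.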

If you want to inspect a Parquet file:

  • Install parquet-tools. Note: On macOS, you can install it with Homebrew: brew install parquet-tools
  • View the Parquet file in JSON format:
parquet-tools cat --json [filename]

Enrichment

We augment the raw knowledge graph with facts from the external knowledge graph (Wikidata in our case).

Run EnrichTriples:

./bin/enrich.sh

Note: Modify enrich.sh based on your environment.

After the enrichment is done, check if an output folder (called triples-enriched/ by default) is created with output Parquet files.

Load

Load the raw and enriched knowledge graphs produced by the above commands into neo4j.

Set --input triples in load.sh, then run LoadTriples:

./bin/load.sh

Note: Modify load.sh based on your environment.

Set --input triples-enriched in load.sh, then run LoadTriples again:

./bin/load.sh

Open http://localhost:7474/ to view the loaded knowledge graph in neo4j.

Data Cleaning Queries

The following queries can be run against the knowledge graph in neo4j to discover sub-graphs of interest.

Supporting Information

This query finds sub-graphs where the value extracted from the document matches the ground-truth from Wikidata.

MATCH (d:Document)-->(s:Mention)-->(r:Relation)-->(o:Mention)
MATCH (s)-->(e:Entity)-->(f:Fact {relation: r.type})
WHERE o.span = f.value
RETURN d, s, r, o, e, f

To see only sub-graphs with a specific relationship such as "city of headquarters", run:

MATCH (d:Document)-->(s:Mention)-->(r:Relation {type: "ORG_CITY_OF_HEADQUARTERS"})-->(o:Mention)
MATCH (s)-->(e:Entity)-->(f:Fact {relation: r.type})
WHERE o.span = f.value
RETURN d, s, r, o, e, f

Inconsistent Information

This query finds sub-graphs where the value extracted from the document does not match the ground-truth from Wikidata.

MATCH (d:Document)-->(s:Mention)-->(r:Relation)-->(o:Mention)
MATCH (s)-->(e:Entity)-->(f:Fact {relation: r.type})
WHERE NOT(o.span = f.value)
RETURN d, s, r, o, e, f

Missing Information

This query finds sub-graphs where the value extracted from the document does not have a corresponding ground-truth in Wikidata.

MATCH (d:Document)-->(s:Mention)-->(r:Relation)-->(o:Mention)
MATCH (s)-->(e:Entity)
OPTIONAL MATCH (e)-->(f:Fact {relation: r.type})
WITH d, s, r, o, e, f
WHERE f IS NULL
RETURN d, s, r, o, e, f
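
The three queries above split extracted relations into supporting, inconsistent, and missing cases. The Python sketch below mirrors that logic on in-memory data to make the distinction concrete. It is an illustration under stated assumptions, not dstlr code: the Extraction class, the classify function, the entity ID "E1", the relation EXAMPLE_RELATION, and all sample values are invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extraction:
    """One extracted relation; hypothetical stand-in for the graph pattern
    (d:Document)-->(s:Mention)-->(r:Relation)-->(o:Mention), with s linked to e."""
    doc_id: str
    subject: str
    relation: str
    obj: str
    entity_id: Optional[str]  # None if the subject mention was not linked


def classify(extractions, facts):
    """Yield (label, extraction, fact) triples.

    facts maps an entity ID to a list of (relation, value) pairs.
    As in the Cypher queries, unlinked subjects are skipped, and each
    fact with the same relation type is compared separately.
    """
    for ex in extractions:
        if ex.entity_id is None:
            continue
        same_relation = [f for f in facts.get(ex.entity_id, []) if f[0] == ex.relation]
        if not same_relation:
            yield ("missing", ex, None)
            continue
        for fact in same_relation:
            label = "supporting" if ex.obj == fact[1] else "inconsistent"
            yield (label, ex, fact)


# Toy data, invented for illustration only.
sample_facts = {"E1": [("ORG_CITY_OF_HEADQUARTERS", "Springfield")]}
sample_extractions = [
    Extraction("doc1", "Acme", "ORG_CITY_OF_HEADQUARTERS", "Springfield", "E1"),
    Extraction("doc2", "Acme", "ORG_CITY_OF_HEADQUARTERS", "Shelbyville", "E1"),
    Extraction("doc3", "Acme", "EXAMPLE_RELATION", "1990", "E1"),
]
for label, ex, fact in classify(sample_extractions, sample_facts):
    print(label, ex.doc_id, fact)
```

As in the Cypher versions, an entity with several facts for the same relation type can make a single extraction appear as both supporting and inconsistent, because each fact is compared separately.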

Delete Relationships

This query deletes all nodes in the database, together with all of their relationships.

MATCH (n) DETACH DELETE n