fractalego / pynsett

License: MIT
A programmable relation extraction tool

Programming Languages

Python
HTML
JavaScript

Projects that are alternatives of or similar to pynsett

nlp-cheat-sheet-python
NLP Cheat Sheet, Python, spaCy, LexNLP, NLTK, tokenization, stemming, sentence detection, named entity recognition
Stars: ✭ 69 (+176%)
Mutual labels:  spacy
deplacy
CUI-based Tree Visualizer for Universal Dependencies and Immediate Catena Analysis
Stars: ✭ 97 (+288%)
Mutual labels:  spacy
bert-tensorflow-pytorch-spacy-conversion
Instructions for how to convert a BERT Tensorflow model to work with HuggingFace's pytorch-transformers, and spaCy. This walk-through uses DeepPavlov's RuBERT as example.
Stars: ✭ 26 (+4%)
Mutual labels:  spacy
anonymization-api
How to build and deploy an anonymization API with FastAPI
Stars: ✭ 51 (+104%)
Mutual labels:  spacy
Relation-Classification
Relation Classification - SEMEVAL 2010 task 8 dataset
Stars: ✭ 46 (+84%)
Mutual labels:  relation-extraction
spacy-server
🦜 Containerized HTTP API for industrial-strength NLP via spaCy and sense2vec
Stars: ✭ 58 (+132%)
Mutual labels:  spacy
SkillNER
A (smart) rule based NLP module to extract job skills from text
Stars: ✭ 69 (+176%)
Mutual labels:  spacy
InformationExtractionSystem
Information Extraction System can perform NLP tasks like Named Entity Recognition, Sentence Simplification, Relation Extraction etc.
Stars: ✭ 27 (+8%)
Mutual labels:  relation-extraction
SMMT
Social Media Mining Toolkit (SMMT) main repository
Stars: ✭ 116 (+364%)
Mutual labels:  spacy
Shukongdashi
An expert system for fault diagnosis in the CNC (numerical control) domain, built in Python using knowledge graphs, natural language processing, convolutional neural networks, and related techniques
Stars: ✭ 109 (+336%)
Mutual labels:  relation-extraction
ling
Natural Language Processing Toolkit in Golang
Stars: ✭ 57 (+128%)
Mutual labels:  spacy
m3gm
Max-Margin Markov Graph Models for WordNet (EMNLP 2018)
Stars: ✭ 40 (+60%)
Mutual labels:  relation-extraction
airy
💬 Open source conversational platform to power conversations with an open source Live Chat, Messengers like Facebook Messenger, WhatsApp and more - 💎 UI from Inbox to dashboards - 🤖 Integrations to Conversational AI / NLP tools and standard enterprise software - ⚡ APIs, WebSocket, Webhook - 🔧 Create any conversational experience
Stars: ✭ 299 (+1096%)
Mutual labels:  spacy
CogIE
CogIE: An Information Extraction Toolkit for Bridging Text and CogNet. ACL 2021
Stars: ✭ 47 (+88%)
Mutual labels:  relation-extraction
spacy-french-models
French models for spacy
Stars: ✭ 22 (-12%)
Mutual labels:  spacy
relation-extraction-rnn
Bi-directional LSTM model for relation extraction
Stars: ✭ 22 (-12%)
Mutual labels:  relation-extraction
alter-nlu
Natural language understanding library for chatbots with intent recognition and entity extraction.
Stars: ✭ 45 (+80%)
Mutual labels:  spacy
DaCy
DaCy: The State of the Art Danish NLP pipeline using SpaCy
Stars: ✭ 66 (+164%)
Mutual labels:  spacy
spacy hunspell
✏️ Hunspell extension for spaCy 2.0.
Stars: ✭ 94 (+276%)
Mutual labels:  spacy
agile
🌌 Global State and Logic Library for JavaScript/Typescript applications
Stars: ✭ 90 (+260%)
Mutual labels:  spacy

Pynsett: A programmable relation extraction tool

Installation

Before installing the package, you need to install the tools required to compile python-igraph:

sudo apt-get install build-essential python-dev python3-dev

The basic version can be installed by typing

virtualenv --python=/usr/bin/python3 .env
source .env/bin/activate
pip install pynsett

The system is now installed; however, the parser requires additional models from spaCy and AllenNLP. You will need to type

python3 -m spacy download en_core_web_lg
python3 -m pynsett download
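
To check that everything is in place, a quick smoke test (assuming the virtualenv is still active; spacy validate is spaCy's standard model check):

python3 -c "import pynsett"
python3 -m spacy validate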

A working Docker image can be found here.

What is Pynsett

Pynsett is a programmable relation extractor. The user defines a set of rules, which are applied to any English text; Pynsett returns the list of triplets specified by those rules. A short paper describing the system was published at SEMAPRO 2020.

Example usage

Let's assume we want to extract Wikidata relations from a file named 'test.txt'. An example script would be

from pynsett.discourse import Discourse
from pynsett.extractor import Extractor
from pynsett.auxiliary.prior_knowedge import get_wikidata_knowledge


# Read the input and parse it into a discourse
text = open('test.txt').read()
discourse = Discourse(text)

# Apply the bundled Wikidata rules and collect the triplets
extractor = Extractor(discourse, get_wikidata_knowledge())
triplets = extractor.extract()

for triplet in triplets:
    print(triplet)
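
Each extracted triplet is a (subject, relation, object) sequence. For instance, with the Wikidata rules a sentence such as "John is a writer." prints something along these lines (the same output is shown over HTTP in the API section below):

['John', 'JOB_TITLE', 'writer']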

The distribution comes with two sets of rules: the generic knowledge, accessible via pynsett.auxiliary.prior_knowedge.get_generic_knowledge(), and the Wikidata knowledge, which can be loaded via pynsett.auxiliary.prior_knowedge.get_wikidata_knowledge().

Create new rules for extraction

Let's assume we are writing a new file called "my_own_rules.rules". An example of a new set of rules could be the following:

MATCH "Jane#1 is an engineer#2"
CREATE (HAS_ROLE 1 2);

Here the symbol #1 assigns a label to 'Jane' and #2 assigns a label to 'engineer'. These labels can then be used when creating the relation '(HAS_ROLE 1 2)'.

A more generic rule uses the entity types (Jane is a PERSON):

MATCH "{PERSON}#1 is an engineer#2"
CREATE (HAS_ROLE 1 2);

This rule matches every sentence whose subject is a person (as recognized by the internal NER). The name of the person is associated with the node.

There are 18 entity types that you can type within brackets: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART

There you go: a person is now connected with a role. Node 1 is whatever matched the {PERSON} tag, and the profession is "engineer". The properties of the matched words are attached to nodes 1 and 2.

This seems a little limiting, because the previous rule only works for engineers. Let us define a word cloud and call it "ROLE".

DEFINE ROLE AS [engineer, architect, physicist, doctor];

MATCH "{PERSON}#1 is a ROLE#2"
CREATE (HAS_ROLE 1 2);

As a final touch, let us make the rules a little easier on the eyes by using PERSON instead of {PERSON}:

DEFINE PERSON AS {PERSON};
DEFINE ROLE AS [engineer, architect, physicist, doctor];

MATCH "PERSON#1 is a ROLE#2"
CREATE (HAS_ROLE 1 2);
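
Entity types other than PERSON work the same way. As an illustration, a hypothetical rule connecting a person to an organization could look like this (WORKS_AT is an arbitrary relation name chosen here, not one shipped with pynsett):

DEFINE PERSON AS {PERSON};
DEFINE ORG AS {ORG};

MATCH "PERSON#1 works at ORG#2"
CREATE (WORKS_AT 1 2);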

A working example of pynsett's rules is in this file.

Use the extraction rules

If you have a specific file with the extraction rules, you can load it by creating a new Knowledge object:

from pynsett.discourse import Discourse
from pynsett.extractor import Extractor
from pynsett.knowledge import Knowledge


text = open('test.txt').read()
discourse = Discourse(text)

knowledge = Knowledge()
knowledge.add_rules(open('./my_own_rules.rules').read())

extractor = Extractor(discourse, knowledge)
triplets = extractor.extract()

for triplet in triplets:
    print(triplet)
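
Assuming 'test.txt' contains the sentence "Jane is an engineer.", the rules defined above should print something like

['Jane', 'HAS_ROLE', 'engineer']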

Import the triplets into Neo4j

The triplets can be imported into a proper graph database. As an example, let us do it for Neo4j.
You will need to install Neo4j on your machine, as well as the Python package 'py2neo'. After everything is set up, you can run the following script.

from py2neo import Graph
from pynsett.discourse import Discourse
from pynsett.extractor import Extractor
from pynsett.auxiliary.prior_knowedge import get_wikidata_knowledge

knowledge = get_wikidata_knowledge()
text = open('sample_wikipedia.txt').read()

discourse = Discourse(text)
extractor = Extractor(discourse, knowledge)
triplets = extractor.extract()

graph = Graph('http://localhost:7474/db/data/')
# MERGE the subject and object nodes, then create an edge typed by the relation
for triplet in triplets:
    graph.run('MERGE (a {text: "%s"}) MERGE (b {text: "%s"}) CREATE (a)-[:%s]->(b)'
              % (triplet[0],
                 triplet[2],
                 triplet[1]))

This script works on an example page called 'sample_wikipedia.txt' that you will have to provide.
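
Note that the string interpolation above breaks if a triplet contains quotes. A slightly safer variant (a sketch, assuming the same py2neo Graph object) passes the node texts as query parameters; the relationship type still has to be interpolated, since Cypher cannot parameterize it:

for subject, relation, obj in triplets:
    # Node properties go through query parameters to avoid quoting issues
    graph.run('MERGE (a {text: $subj}) MERGE (b {text: $obj}) '
              'CREATE (a)-[:%s]->(b)' % relation,
              subj=subject, obj=obj)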

Using the internal Web Server

To start the internal web server you can write the following

from pynsett.server import pynsett_app
pynsett_app.run(debug=True, port=4001, host='0.0.0.0', use_reloader=False)

which will start a Flask app at localhost:4001.

Web interface

The server provides three web interfaces:

A Wikidata relation extractor at http://localhost:4001/wikidata

Image: Asimov's Wikipedia page

A Programmable relation extractor at http://localhost:4001/relations

Image: a programmable rule

A Neo-Davidsonian representation of a text at http://localhost:4001

Image: a Neo-Davidsonian representation

API

The wikidata relation extractor API can be called with

import json
import requests

text = "John is a writer."
triplets = json.loads(requests.post('http://localhost:4001/api/wikidata', json={'text': text}).text)
print(triplets)

with output:

[['John', 'JOB_TITLE', 'writer']]

The rules can be programmed by posting them, as in the following:

import json
import requests

rules = """
DEFINE PERSON AS {PERSON};
DEFINE ORG AS {ORG};
DEFINE ROLE AS [engineer, author, doctor, researcher];

MATCH "PERSON#1 was ROLE at ORG#2"
CREATE (WORKED_AT 1 2);
"""

triplets = json.loads(requests.post('http://localhost:4001/api/set_rules', json={'text': rules}).text)

These rules are then used by the following API endpoint:

import json
import requests

text = "Isaac Asimov was an American writer and professor of biochemistry at Boston University."
triplets = json.loads(requests.post('http://localhost:4001/api/relations', json={'text': text}).text)
print(triplets)

The Neo-Davidsonian representation API can be called with

import json
import requests

text = "John is tall."
graph = json.loads(requests.post('http://localhost:4001/api/drt', json={'text': text}).text)
print(graph)

with output:

{'edges': [{'arrows': 'to', 'from': 'v1', 'label': 'AGENT', 'to': 'v0'},
           {'arrows': 'to', 'from': 'v1', 'label': 'ADJECTIVE', 'to': 'v2'}],
 'nodes': [{'id': 'v1', 'label': 'is'},
           {'id': 'v0', 'label': 'John'},
           {'id': 'v2', 'label': 'tall'}]}

Pre-Formatting of the Text

The text must be submitted respecting the following rules (a pre-processing sketch follows the example below):

  • No parentheses (...) or brackets [...]; the parser is confused by them.
  • Paragraphs must be separated by one empty line. Dividing a text into paragraphs helps with anaphora.
    This is paragraph 1.
    
    This is paragraph 2.
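
A minimal pre-processing sketch that enforces both rules (illustrative only, not part of pynsett):

import re

def preprocess(text):
    # Drop parenthesized and bracketed spans, which confuse the parser
    text = re.sub(r'\([^)]*\)|\[[^\]]*\]', ' ', text)
    # Collapse runs of blank lines into the single empty line
    # that pynsett expects between paragraphs
    text = re.sub(r'\n\s*\n+', '\n\n', text)
    return text.strip()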

Known issues and shortcomings

  • Speed! Parsing is done one sentence at a time (a per-paragraph sketch follows this list)
  • Anaphora only works inside paragraphs
  • Anaphora is done through AllenNLP, which can be slow-ish without a GPU
  • The text needs to be cleaned and pre-formatted. This is not an issue per se, but it must be kept in mind
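
Since anaphora does not cross paragraph boundaries, a long document can be processed one paragraph at a time without losing information. A minimal sketch, reusing the Discourse/Extractor API from the examples above:

from pynsett.discourse import Discourse
from pynsett.extractor import Extractor
from pynsett.auxiliary.prior_knowedge import get_wikidata_knowledge

knowledge = get_wikidata_knowledge()
text = open('test.txt').read()

triplets = []
for paragraph in text.split('\n\n'):
    if paragraph.strip():
        # Each paragraph is parsed independently; anaphora is
        # intra-paragraph anyway, so nothing is lost
        extractor = Extractor(Discourse(paragraph), knowledge)
        triplets.extend(extractor.extract())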

Citation

Please cite the paper as

@INPROCEEDINGS{Cetoli2020-Pynsett,
  title           = "Pynsett: A programmable relation extractor",
  booktitle       = "The Fourteenth International Conference on Advances in Semantic Processing (SEMAPRO 2020)",
  author          = "Cetoli, Alberto",
  editor          = "{Tim vor der Br{\"u}ck}",
  publisher       = "ThinkMind Digital Library",
  pages           = "45--48",
  month           =  oct,
  year            =  2020,
  address         = "Nice, France",
  language        = "en",
  isbn            = "978-1-61208-813-6",
  issn            = "2308-4510",
  howpublished    = "\url{https://www.thinkmind.org/index.php?view=article&articleid=semapro_2020_2_40_30017}"
}