All Projects → jakelever → VERSE

jakelever / VERSE

Licence: MIT license
Vancouver Event and Relation System for Extraction

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to VERSE

civicmine
Text mining cancer biomarkers for the CIVIC database
Stars: ✭ 19 (+46.15%)
Mutual labels:  text-mining, bionlp
CREST
A Causal Relation Schema for Text
Stars: ✭ 19 (+46.15%)
Mutual labels:  relation-extraction, bionlp
m3gm
Max-Margin Markov Graph Models for WordNet (EMNLP 2018)
Stars: ✭ 40 (+207.69%)
Mutual labels:  relation-extraction
rosette-elasticsearch-plugin
Document Enrichment plugin for Elasticsearch
Stars: ✭ 25 (+92.31%)
Mutual labels:  text-mining
textreadr
Tools to uniformly read in text data including semi-structured transcripts
Stars: ✭ 65 (+400%)
Mutual labels:  text-mining
Relation-Classification
Relation Classification - SEMEVAL 2010 task 8 dataset
Stars: ✭ 46 (+253.85%)
Mutual labels:  relation-extraction
InformationExtractionSystem
Information Extraction System can perform NLP tasks like Named Entity Recognition, Sentence Simplification, Relation Extraction etc.
Stars: ✭ 27 (+107.69%)
Mutual labels:  relation-extraction
deduce
Deduce: de-identification method for Dutch medical text
Stars: ✭ 40 (+207.69%)
Mutual labels:  text-mining
knodle
A PyTorch-based open-source framework that provides methods for improving the weakly annotated data and allows researchers to efficiently develop and compare their own methods.
Stars: ✭ 76 (+484.62%)
Mutual labels:  relation-extraction
extractnet
A Dragnet that also extract author, headline, date, keywords from context
Stars: ✭ 52 (+300%)
Mutual labels:  text-mining
pynsett
A programmable relation extraction tool
Stars: ✭ 25 (+92.31%)
Mutual labels:  relation-extraction
JoSH
[KDD 2020] Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding
Stars: ✭ 55 (+323.08%)
Mutual labels:  text-mining
Shukongdashi
使用知识图谱,自然语言处理,卷积神经网络等技术,基于python语言,设计了一个数控领域故障诊断专家系统
Stars: ✭ 109 (+738.46%)
Mutual labels:  relation-extraction
cometa
Corpus of Online Medical EnTities: the cometA corpus
Stars: ✭ 31 (+138.46%)
Mutual labels:  bionlp
Twitter-Sentiment-Analyzer
Twitter Sentiment Analyzer
Stars: ✭ 13 (+0%)
Mutual labels:  text-mining
neji
Flexible and powerful platform for biomedical information extraction from text
Stars: ✭ 37 (+184.62%)
Mutual labels:  text-mining
Udacity-Data-Analyst-Nanodegree
Repository for the projects needed to complete the Data Analyst Nanodegree.
Stars: ✭ 31 (+138.46%)
Mutual labels:  text-mining
SEDTWik-Event-Detection-from-Tweets
Segmentation based event detection from Tweets. Published at NAACL SRW 2019
Stars: ✭ 58 (+346.15%)
Mutual labels:  text-mining
textlearnR
A simple collection of well working NLP models (Keras, H2O, StarSpace) tuned and benchmarked on a variety of datasets.
Stars: ✭ 16 (+23.08%)
Mutual labels:  text-mining
reader
Distant Reader, a tool for using & understanding a corpus
Stars: ✭ 18 (+38.46%)
Mutual labels:  text-mining

The VERSE project has been superceded by the kindred python package which has a cleaner interface for relation extraction. We suggest that you use that instead.

VERSE v1.0

This is the repo for the Vancouver Event and Relation System for Extraction (VERSE) project. It is a biomedical event and relation extractor designed for knowledge base construction purposes. It competed in the BioNLP'16 Shared Task and placed first in the Bacteria Biotope event subtask and third in the Seed Development binary event subtask.

How does it work (briefly)?

VERSE uses the Stanford CoreNLP tools to parse a corpus of text in order to get tokens, parts-of-speech and dependency graph information. With a suitable training set, the resulting parsed data is then vectorised and fed into scikit-learn classifiers (either SVM or LogisticRegression). The predictions can then be exported out to a JSON format or the standard format used in BioNLP Shared Tasks (known as ST format).

What types of things can it predict?

VERSE can predict entities, relations and modifiers. The full event extraction pipeline uses these three prediction systems in sequence (as used in the Genia Extraction subtask of BioNLP'16 ST). The predictors can be used individually as well.

What dependencies does VERSE have?

VERSE relies on the following things

What's in this repo?

  • core - The main tools for text processing and prediction
  • utils - Utilities to filter the input and output of VERSE (e.g. for invalid relations) and also to export to various formats.
  • evaluation - Scripts to evaluate the results of VERSE
  • BioNLP16-ST - Scripts to run VERSE with the correct settings for the entries to the different subtasks in the BioNLP'16 Shared Task.

Example Usage

The basic usage of VERSE involves first parsing text, then predicting entities, relations and modifications separately, and finally filtering and exporting. Here's the breakdown and explanation for running the GE4 dataset. The code below assumes that $python points towards an appropriate version of Python 2.7, $jython points towards jython and $verseDir points towards the root of VERSE. Note that these examples aren't used the parameters used for the competition. Smaller parameter sets are used for explanation. Check the BioNLP16-ST/GE4_auxFiles directory for the actual parameter sets used.

This shows the full pipeline but entity/modification extraction could be skipped if only relation extraction is required. Each step uses the output file from the previous stage.

Getting the data

First we'll download the appropriate files (the training set and the test set)

wget http://pubannotation.org/projects/bionlp-st-ge-2016-reference/annotations.tgz
mv annotations.tgz bionlp-st-ge-2016-reference.tgz
tar xvf bionlp-st-ge-2016-reference.tgz

wget http://pubannotation.org/projects/bionlp-st-ge-2016-test-proteins/annotations.tgz
mv annotations.tgz bionlp-st-ge-2016-test-proteins.tgz
tar xvf bionlp-st-ge-2016-test-proteins.tgz

Some preprocessing

This is only really needed for the GE4 dataset and involves removing duplicates from the official data and also correcting the types of some entities (specifically, "Protein_domain" & "DNA" to "Entity")

find bionlp-st-ge-2016-* -name '*.json' | xargs -I FILE $python $verseDir/utils/RemoveDuplicatesInJSON.py FILE
find bionlp-st-ge-2016-* -name '*.json' | xargs -I FILE $python $verseDir/utils/CleanupGE4Data.py FILE

Text Processing

Next we run the text through the Stanford CoreNLP pipeline using the TextProcessor script. In this case we're also using the --splitTokensForGE4 to deal with tokens that contain subparts, likek "ALK-dependent". And the --knownEntities parameter keeps track of which entities won't need to be predicted for this project (i.e. we should always have Proteins annotated).

$jython $verseDir/core/TextProcessor.py --format JSON --inDir bionlp-st-ge-2016-reference --outFile trainOrig.verse --splitTokensForGE4 --knownEntities "Protein"
$jython $verseDir/core/TextProcessor.py --format JSON --inDir bionlp-st-ge-2016-test-proteins --outFile test.verse --splitTokensForGE4 --knownEntities "Protein"

Clear Predicted

This is unnecessary here, but is important when working with fully annotated data for parameter optimisation. This basically removes any predictions currently in a VERSE file

$python $verseDir/utils/RemovePredicted.py --inFile test.verse --outFile start.verse

Entity Extraction

This takes in the annotated training VERSE file, vectorises it and trains a classifier. It then vectorises the test VERSE file and predicts entities on it. We'll use some basic parameters here as examples.

$python $verseDir/core/EntityExtractor.py --trainingFile trainOrig.verse --testingFile start.verse --outFile entities.verse --parameters "classifier:SVM ; doFeatureSelection:False ; featureChoice:ngrams,bigrams_entityWindowRight_4" --entityDescriptions $verseDir/BioNLP16-ST/GE4_auxFiles/entity.descriptions

The --parameters option takes a semi-colon delimited list of parameters for the vectorizer and classifier. The example above is telling it to use an SVM, not do feature selection and use ngrams as well as a bigrams window (including 4 tokens to the right of each entity).

The --entityDescriptions option is a file that simply contains the names of all entities that should be predicted.

Relation Extraction

Now we do basically the same thing for relations: vectorize and train using the training file, and then predict on the test file (notice that we've used the output from the last extractor).

$python $verseDir/core/RelationExtractor.py --trainingFile trainOrig.verse --testingFile entities.verse --outFile relations.verse --parameters "classifier:LogisticRegression ; doFeatureSelection:True ; featureSelectPerc:30 ; tfidf:True ; doFiltering:True ; sentenceRange:1; featureChoice:ngrams" --relationDescriptions $verseDir/BioNLP16-ST/GE4_auxFiles/rel.filters

The --parameters option works exactly the same as the entity extraction. This time it instructs the extractor to use a logistic regression classifier, do feature selection (keeping the top 30%). It also tells the vectorizer to normalise using TFIDF. Importantly it filters the relations by argument types. It also has a sentence range of 1, meaning that it attempts to learn and predict relations that cross one sentence boundary.

The argument types are specified in the file passed to --relationDescriptions. This is a three column file with the first column being the name of relation, and the next two being acceptable argument types. This filtering works only on the input used to train and not necessarily on the output stage. Therefore final filtering should still be applied.

Modification Extraction

We do the final extraction stage for modification that works similarly to the other extractors.

$python $verseDir/core/ModificationExtractor.py --trainingFile trainOrig.orig --testingFile relations.verse --outFile modSpeculation.verse --parameters "classifier:LogisticRegression ; threshold:0.2 ; featureChoice:ngrams" --modificationDescriptions $verseDir/BioNLP16-ST/GE4_auxFiles/mod.speculation

Again the --parameters option works in the same way to the other extractors. This time we use a logistic regression classifier with a specific threshold of 0.2. Using a lower threshold reduces the required probability for a prediction to be made, likely lowering false negatives but increasing false positives. The modificationDescriptions file is a simple file that names the modification(s) to be predicted.

If there is the possibility of an event having more than one modification (i.e. negation AND speculation), then multiple rounds of modification extractor should be run. Further ones (like below) should use the --mergeWithExisting flag so that existing modifications are not removed.

$python $verseDir/core/ModificationExtractor.py --trainingFile $trainOrig --testingFile modSpeculation.verse --outFile modSpeculationNegation.verse --parameters "classifier:LogisticRegression ; threshold:0.2; featureChoice:ngrams" --modificationDescriptions $verseDir/BioNLP16-ST/GE4_auxFiles/mod.negation --mergeWithExisting

Filtering

The resulting predictions should be filtered to remove any relations of incorrect types, entities that aren't associated with relations and modifications of removed entities.

$python $verseDir/utils/Filter.py --relationFilters $verseDir/BioNLP16-ST/GE4_auxFiles/rel.filters --modificationFilters $verseDir/BioNLP16-ST/GE4_auxFiles/mod.filters --inFile modSpeculationNegation.verse --outFile filtered.verse

The filter files give the expected types for relations and modifications

Export

With the predictions complete, we can now export back to the JSON format.

triggerTypes="Acetylation,Binding,Deacetylation,Gene_expression,Localization,Negative_regulation,Phosphorylation,Positive_regulation,Protein_catabolism,Protein_modification,Regulation,Transcription,Ubiquitination"
mkdir json
$python $verseDir/utils/ExportToJSON.py --inFile modSpeculationNegation.verse --outDir json --triggerTypes "$triggerTypes" --origDir bionlp-st-ge-2016-test-proteins

This takes in the original directory (--origDir) for the raw text and the entity types that should be used as event triggers (--triggerTypes as a comma-delimited list)

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].