
davidsbatista / Annotated Semantic Relationships Datasets

A collection of public and free annotated datasets of relationships between entities/nominals (Portuguese and English)

Labels

nlp datasets information-extraction supervised-learning


Datasets of Annotated Semantic Relationships

This repository contains annotated datasets that can be used to train supervised models for the task of semantic relationship extraction. If you know of other datasets and want to contribute, please notify me or submit a PR.

The datasets are divided into 3 groups:

Traditional Information Extraction: relationships are manually annotated and belong to pre-determined types, i.e. a closed set of classes.

Open Information Extraction: relationships are manually annotated, but don't have any specific type.

Distantly Supervised: relationships are annotated by applying some Distant Supervision technique, and their types are pre-determined.

Dataset	Nr. Classes	Language	Year	Cite
aimed.tar.gz	2	English	2005	Subsequence Kernels for Relation Extraction
wikipedia_datav1.0.tar.gz	53	English	2006	Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text
SemEval2007-Task4.tar.gz	7	English	2007	SemEval-2007 Task 04: Classification of Semantic Relations between Nominals
hlt-naacl08-data.txt	2	English	2007	Learning to Extract Relations from the Web using Minimal Supervision
ReRelEM.tar.gz	4	Portuguese	2009	Relation detection between named entities: report of a shared task
SemEval2010_task8_all_data.tar.gz	10 / 19 (directional)	English	2010	SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals
BioNLP.tar.gz	2	English	2011	Overview of BioNLP Shared Task 2011
DDICorpus2013.zip	4	English	2012	The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions
ADE-Corpus-V2.zip	2	English	2013	Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports
DBpediaRelations-PT-0.2.txt.bz2	10	Portuguese	2013	Exploring DBpedia and Wikipedia for Portuguese Semantic Relationship Extraction
kbp37-master.zip	37 directional	English	2015	Relation Classification via Recurrent Neural Network

Dataset	Nr. Classes	Language	Year	Cite
DataSet-IJCNLP2011.tar.gz	Open	English	2011	Extracting Relation descriptors with Conditional Random Fields
reverb_emnlp2011_data.tar.gz	Open	English	2011	Identifying Relations for Open Information Extraction
ClausIE-datasets.tar.gz	Open	English	2013	ClausIE: Clause-Based Open Information Extraction
emnlp13_ualberta_experiments_v2.zip	Open	English	2013	Effectiveness and Efficiency of Open Relation Extraction

Dataset	Nr. Classes	Language	Year	Cite
http://iesl.cs.umass.edu/riedel/ecml/	Distant	English	2010	Modeling Relations and Their Mentions without Labeled Text
https://github.com/google-research-datasets/relation-extraction-corpus	Distant	English	2013	https://research.googleblog.com/2013/04/50000-lessons-on-how-to-read-relation.html

Traditional Information Extraction

DBpediaRelations-PT

Dataset: DBpediaRelations-PT-0.2.txt.bz2

Cite: Exploring DBpedia and Wikipedia for Portuguese Semantic Relationship Extraction

Description: A collection of sentences in Portuguese that express semantic relationships between pairs of entities extracted from DBpedia. The sentences were collected by distant supervision and were then manually revised.

AImed

Dataset: aimed.tar.gz

Cite: Subsequence Kernels for Relation Extraction

Description: It consists of 225 Medline abstracts, of which 200 are known to describe interactions between human proteins, while the other 25 do not refer to any interaction. There are 4084 protein references and around 1000 tagged interactions in this dataset.

SemEval 2007

Dataset: SemEval2007-Task4.tar.gz

Cite: SemEval-2007 Task 04: Classification of Semantic Relations between Nominals

Description: Small data set, containing 7 relationship types and a total of 1,529 annotated examples.

SemEval 2010

Dataset: SemEval2010_task8_all_data.tar.gz

Cite: SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals

Description: SemEval-2010 Task 8 is a multi-way classification task in which the label for each example must be chosen from the complete set of ten relations, and the mapping from nouns to argument slots is not provided in advance. We also provide more data: 10,717 annotated examples, compared to 1,529 in SemEval-1 Task 4.
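
The "10 / 19 (directional)" class count listed for this dataset can be reproduced by expanding each relation into its two argument orders. The sketch below assumes the nine named relations of the task plus an undirected Other class; the function name `directional_labels` is hypothetical, and the label spelling follows the `Relation(e1,e2)` convention.

```python
# Nine directed relations of SemEval-2010 Task 8; "Other" is added separately.
RELATIONS = [
    "Cause-Effect", "Instrument-Agency", "Product-Producer",
    "Content-Container", "Entity-Origin", "Entity-Destination",
    "Component-Whole", "Member-Collection", "Message-Topic",
]


def directional_labels(relations, other="Other"):
    """Expand each relation into both argument orders; keep `other` undirected."""
    labels = []
    for name in relations:
        labels.append(f"{name}(e1,e2)")
        labels.append(f"{name}(e2,e1)")
    labels.append(other)
    return labels


labels = directional_labels(RELATIONS)
print(len(RELATIONS) + 1)  # number of classes when direction is ignored
print(len(labels))         # number of classes when direction is kept
```

Whether a model is trained on the directed or undirected label set is a design choice; the directed set forces it to decide which nominal fills which argument slot.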

ReRelEM

Dataset: ReRelEM.tar.gz

Cite: Relation detection between named entities: report of a shared task

Description: ReRelEM was the first evaluation contest (track) for Portuguese whose goal was to detect and classify relations between named entities in running text. Given a collection annotated with named entities belonging to ten different semantic categories, we marked all relationships between them within each document. We used the following fourfold relationship classification: identity, included-in, located-in, and other (which was later explicitly detailed into twenty different relations).

Wikipedia

Dataset: wikipedia_datav1.0.tar.gz

Cite: Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text

Description: We sampled 1127 paragraphs from 271 articles from the online encyclopedia Wikipedia and labeled a total of 4701 relation instances. In addition to a large set of person-to-person relations, we also included links between people and organizations, as well as biographical facts such as birthday and jobTitle. In all, there are 53 labels in the training data.

Web

Dataset: hlt-naacl08-data.txt

Cite: Learning to Extract Relations from the Web using Minimal Supervision

Description: Corporate Acquisition Pairs and Person-Birthplace Pairs taken from the web. The corporate acquisition test set has a total of 995 instances, out of which 156 are positive. The person-birthplace test set has a total of 601 instances, and only 45 of them are positive.
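
Both test sets are heavily skewed toward negative instances, which matters when choosing an evaluation metric. A minimal sketch of the arithmetic, using only the counts quoted above; the names `TEST_SETS` and `positive_rate` are hypothetical.

```python
# (positive instances, total instances) as quoted in the description above.
TEST_SETS = {
    "corporate acquisition": (156, 995),
    "person-birthplace": (45, 601),
}


def positive_rate(positives, total):
    """Fraction of instances that are labeled positive."""
    return positives / total


for name, (positives, total) in TEST_SETS.items():
    rate = positive_rate(positives, total)
    # A classifier that always answers "negative" reaches accuracy 1 - rate.
    print(f"{name}: {rate:.1%} positive, all-negative accuracy {1 - rate:.1%}")
```

With this little positive data, precision and recall on the positive class are usually more informative than plain accuracy.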

BioNLP Shared Task

Dataset: BioNLP.tar.gz

Cite: Overview of BioNLP Shared Task 2011

Description: The task involves the recognition of two binary part-of relations between entities: PROTEIN-COMPONENT and SUBUNIT-COMPLEX. The task is motivated by specific challenges: the identification of the components of proteins in text is relevant e.g. to the recognition of Site arguments (cf. GE, EPI and ID tasks), and relations between proteins and their complexes are relevant to any task involving them. The REL setup is informed by recent semantic relation tasks (Hendrickx et al., 2010). The task data, consisting of new annotations for GE data, extends a previously introduced resource (Pyysalo et al., 2009; Ohta et al., 2010a).

The DDI corpus

Dataset: DDICorpus2013.zip

Cite: The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions

Description: The DDI corpus contains MedLine abstracts on drug-drug interactions as well as documents describing drug-drug interactions from the DrugBank database. This task is designed to address the extraction of drug-drug interactions as a whole, but divided into two subtasks to allow separate evaluation of the performance for different aspects of the problem. The task includes two subtasks:

Task 1: Recognition and classification of drug names.
Task 2: Extraction of drug-drug interactions. The extraction of drug-drug interactions is a specific relation extraction task in biomedical literature. This task could be very appealing to groups studying PPI (protein-protein interaction) extraction because they could adapt their systems to extract drug-drug interactions.

Four types of DDIs are proposed:

mechanism: This type is used to annotate DDIs that are described by their PK mechanism (e.g. Grepafloxacin may inhibit the metabolism of theobromine).
effect: This type is used to annotate DDIs describing an effect (e.g. In uninfected volunteers, 46% developed rash while receiving SUSTIVA and clarithromycin) or a PD mechanism (e.g. Chlorthalidone may potentiate the action of other antihypertensive drugs).
advice: This type is used when a recommendation or advice regarding a drug interaction is given (e.g. UROXATRAL should not be used in combination with other alpha-blockers).
int: This type is used when a DDI appears in the text without providing any additional information (e.g. The interaction of omeprazole and ketoconazole has been established).

ADE-V2

Dataset: ADE-Corpus-V2.zip

Cite: Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports

Description: The work presented here aims at generating a systematically annotated corpus that can support the development and validation of methods for the automatic extraction of drug-related adverse effects from medical case reports. The documents are systematically double annotated in various rounds to ensure consistent annotations. The annotated documents are finally harmonized to generate representative consensus annotations. In order to demonstrate an example use case scenario, the corpus was employed to train and validate models for the classification of informative against the non-informative sentences. A Maximum Entropy classifier trained with simple features and evaluated by 10-fold cross-validation resulted in the F1 score of 0.70 indicating a potential useful application of the corpus.

KBP-37

Dataset: kbp37-master.zip

Cite: Relation Classification via Recurrent Neural Network

Description: This dataset is a revision of the MIML-RE annotation dataset provided by Gabor Angeli et al. (2014). They use both the 2010 and 2013 KBP official document collections, as well as a July 2013 dump of Wikipedia, as the text corpus for annotation; 33,811 sentences have been annotated. To make the dataset more suitable for our task, we made several refinements:

First, we add direction to the relation names, such that ‘per:employee of’ is split into two relations, ‘per:employee of(e1,e2)’ and ‘per:employee of(e2,e1)’, except for ‘no relation’. According to the description of the KBP task, we replace ‘org:parents’ with ‘org:subsidiaries’ and replace ‘org:member of’ with ‘org:member’ (by their reverse directions). This leads to 76 relations in the dataset.
Then, we count the frequency of each relation in its two directions separately. Relations with low frequency are discarded so that both directions of each relation occur more than 100 times in the dataset. To better balance the dataset, 80% of the ‘no relation’ sentences are also randomly discarded.
After that, the dataset is randomly shuffled and the sentences under each relation are split into three groups: 70% for training, 10% for development, and 20% for test. Finally, we remove those sentences in the development and test sets whose entity pair and relation both appear in a training sentence.
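
The filtering, splitting and leak-removal steps described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the record layout (dicts with `e1`, `e2`, `label`), the `no_relation` spelling, and all function names are hypothetical, and the random discarding of 80% of the negative sentences is omitted.

```python
import random
from collections import Counter, defaultdict

NO_RELATION = "no_relation"  # hypothetical spelling of the negative label


def base_name(label):
    """'per:employee_of(e1,e2)' -> 'per:employee_of'."""
    return label.split("(")[0]


def drop_rare_relations(examples, min_count=100):
    """Keep a relation only if both directions occur more than min_count times."""
    counts = Counter(ex["label"] for ex in examples)
    kept = []
    for ex in examples:
        if ex["label"] == NO_RELATION:
            kept.append(ex)
            continue
        base = base_name(ex["label"])
        if all(counts[f"{base}({d})"] > min_count for d in ("e1,e2", "e2,e1")):
            kept.append(ex)
    return kept


def split_per_relation(examples, seed=0):
    """Shuffle, then split each relation's sentences 70/10/20."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    by_label = defaultdict(list)
    for ex in shuffled:
        by_label[ex["label"]].append(ex)
    train, dev, test = [], [], []
    for group in by_label.values():
        n_train = len(group) * 7 // 10  # integer arithmetic avoids float rounding
        n_dev = len(group) // 10
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test


def remove_leaks(train, held_out):
    """Drop held-out sentences whose (e1, e2, label) triple also occurs in train."""
    seen = {(ex["e1"], ex["e2"], ex["label"]) for ex in train}
    return [ex for ex in held_out
            if (ex["e1"], ex["e2"], ex["label"]) not in seen]
```

Splitting per relation rather than over the whole shuffled list keeps the label distribution roughly the same in all three sets, and the last step keeps a model from scoring well simply by memorizing entity pairs seen in training.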

Open Information Extraction

ReVerb

Dataset: reverb_emnlp2011_data.tar.gz

Cite: Identifying Relations for Open Information Extraction

Description: 500 sentences sampled from the Web, using Yahoo’s random link service.

ClausIE

Dataset: ClausIE-datasets.tar.gz

Cite: ClausIE: Clause-Based Open Information Extraction

Description: Three different datasets. First, the Reverb dataset consists of 500 sentences with manually labeled extractions. The sentences have been obtained via the random-link service of Yahoo and are generally very noisy. Second, 200 random sentences from Wikipedia pages. These sentences are shorter, simpler, and less noisy than those of the Reverb dataset. Since some Wikipedia articles are written by non-native speakers, however, the Wikipedia sentences do contain some incorrect grammatical constructions. Third, 200 random sentences from the New York Times collection; these sentences are generally very clean but tend to be long and complex.

Effectiveness and Efficiency of Open Relation Extraction

Dataset: emnlp13_ualberta_experiments_v2.zip

Cite: Effectiveness and Efficiency of Open Relation Extraction

Description: WEB-500 is a commonly used dataset, developed for the TextRunner experiments (Banko and Etzioni, 2008). These sentences are often incomplete and grammatically unsound, representing the challenges of dealing with web text. NYT-500 represents the other end of the spectrum with formal, well-written news stories from the New York Times Corpus (Sandhaus, 2008). PENN-100 contains sentences from the Penn Treebank recently used in an evaluation of the TreeKernel method (Xu et al., 2013). We manually annotated the relations for WEB-500 and NYT-500 and use the PENN-100 annotations provided by TreeKernel’s authors (Xu et al., 2013).

Extracting Relation descriptors with Conditional Random Fields

Dataset: DataSet-IJCNLP2011.tar.gz

Cite: Extracting Relation descriptors with Conditional Random Fields

Description: The New York Times data set contains 150 business articles from the New York Times. The articles were crawled from the NYT website between November 2009 and January 2010. After sentence splitting and tokenization, we used the Stanford NER tagger (URL: http://nlp.stanford.edu/ner/index.shtml) to identify PER and ORG named entities from each sentence. For named entities that contain multiple tokens we concatenated them into a single token. We then took each pair of (PER, ORG) entities that occur in the same sentence as a single candidate relation instance, where the PER entity is treated as ARG-1 and the ORG entity is treated as ARG-2.

The Wikipedia data was previously created by Aron Culotta et al. Since the original data set did not contain the annotation information we need, we re-annotated it. Similarly, we performed sentence splitting, tokenization and NER tagging, and took pairs of (PER, PER) entities occurring in the same sentence as a candidate relation instance. We always treat the first PER entity as ARG-1 and the second PER entity as ARG-2.
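
The candidate-generation procedure described for both data sets can be sketched as below. The input format (a list of `(token, tag)` pairs with `O` for non-entities), the `_` joiner, and all function names are assumptions made for illustration; the description does not say how the tokens were concatenated.

```python
from itertools import combinations, groupby


def merge_entities(tagged_tokens, joiner="_"):
    """Concatenate each run of same-tag entity tokens into a single token.

    Caveat of this simple sketch: two adjacent entities of the same type
    would be merged into one.
    """
    merged = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        tokens = [token for token, _ in group]
        if tag == "O":
            merged.extend((token, tag) for token in tokens)
        else:
            merged.append((joiner.join(tokens), tag))
    return merged


def per_org_candidates(entities):
    """Every (PER, ORG) pair in a sentence: PER is ARG-1, ORG is ARG-2."""
    persons = [token for token, tag in entities if tag == "PER"]
    orgs = [token for token, tag in entities if tag == "ORG"]
    return [(person, org) for person in persons for org in orgs]


def per_per_candidates(entities):
    """Every (PER, PER) pair: the earlier mention is ARG-1, the later ARG-2."""
    persons = [token for token, tag in entities if tag == "PER"]
    return list(combinations(persons, 2))


sentence = [("Tim", "PER"), ("Cook", "PER"), ("leads", "O"), ("Apple", "ORG"),
            ("with", "O"), ("Jeff", "PER"), ("Williams", "PER")]
entities = merge_entities(sentence)
print(per_org_candidates(entities))
print(per_per_candidates(entities))
```

`combinations` preserves input order, so the first PER mention in the sentence always ends up in the ARG-1 slot, matching the convention stated above.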

Distant Supervision for Relation Extraction

NYT dataset

Dataset: http://iesl.cs.umass.edu/riedel/ecml/

Cite: Modeling Relations and Their Mentions without Labeled Text

Description: The NYT dataset is widely used for the distantly supervised relation extraction task. It was generated by aligning Freebase relations with the New York Times (NYT) corpus, with sentences from the years 2005-2006 used as the training corpus and sentences from 2007 used as the testing corpus.
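
The alignment idea can be illustrated with a toy example. Everything in the sketch is an assumption made for illustration: the knowledge base triple, the sentences, the relation name, and the naive substring match stand in for Freebase, the NYT corpus, and proper entity recognition.

```python
# Toy knowledge base of (entity1, relation, entity2) triples (illustrative).
KB = [("Barack Obama", "place_of_birth", "Honolulu")]

# Toy corpus of (year, sentence) pairs standing in for NYT sentences.
CORPUS = [
    (2005, "Barack Obama was born in Honolulu."),
    (2006, "Honolulu hosted a surfing contest."),
    (2007, "Barack Obama gave a speech in Honolulu."),
]


def align(sentences, kb):
    """Label a sentence with a KB relation whenever it mentions both entities."""
    examples = []
    for year, text in sentences:
        for e1, relation, e2 in kb:
            if e1 in text and e2 in text:  # naive substring match (assumption)
                examples.append({"year": year, "text": text,
                                 "e1": e1, "e2": e2, "label": relation})
    return examples


def split_by_year(examples, train_years=(2005, 2006), test_years=(2007,)):
    """Use earlier years for training and the final year for testing."""
    train = [ex for ex in examples if ex["year"] in train_years]
    test = [ex for ex in examples if ex["year"] in test_years]
    return train, test


train, test = split_by_year(align(CORPUS, KB))
print(len(train), len(test))
```

The 2007 sentence receives the `place_of_birth` label even though it does not express that relation, which is the kind of label noise that distantly supervised datasets are known for.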

Google's relation-extraction-corpus

Dataset: https://github.com/google-research-datasets/relation-extraction-corpus

Cite: https://research.googleblog.com/2013/04/50000-lessons-on-how-to-read-relation.html

Description: https://research.googleblog.com/2013/04/50000-lessons-on-how-to-read-relation.html
