
INK-USC / PLE

License: GPL-3.0
Label Noise Reduction in Entity Typing (KDD'16)

Programming Languages

C++, Fortran, CMake, Python, C, Shell

Heterogeneous Partial-Label Embedding

Source code and data for SIGKDD'16 paper Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding.

Given a text corpus with entity mentions detected and heuristically labeled by distant supervision, this code performs (1) label noise reduction over distant supervision, and (2) learning type classifiers over de-noised training data. For example, check out PLE's output on Tech news.
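To make the notion of label noise concrete, here is a minimal Python 3 sketch of context-agnostic distant supervision. The toy knowledge base, the exact-string linking, and the function name `distant_labels` are illustrative assumptions, not part of this repository, which uses an external entity linker instead.

```python
# Toy knowledge base (assumption for illustration only): entity name -> all known types.
TOY_KB = {
    "Arnold Schwarzenegger": {
        "/person",
        "/person/actor",
        "/person/politician",
        "/person/athlete",
    },
}


def distant_labels(mention, kb):
    """Return every KB type of the mention, ignoring the sentence it appears in.

    This is the source of label noise: a sentence about an election still
    receives the actor and athlete types. Unlinked mentions get an empty set.
    """
    return set(kb.get(mention, set()))


sentence = "Arnold Schwarzenegger was elected Governor of California."
candidate_types = distant_labels("Arnold Schwarzenegger", TOY_KB)
# Only a subset of candidate_types fits this sentence; choosing that subset
# is the label noise reduction step described above.
```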

An end-to-end tool (corpus to typed entities) is under development. Please keep track of our updates.

Performance

Performance of fine-grained entity type classification on the Wiki (Ling & Weld, 2012) dataset. We applied PLE to clean the training data, then ran FIGER (Ling & Weld, 2012) over the de-noised labeled data to train type classifiers; our final system is therefore named FIGER + PLE.

Method                                      Accuracy  Macro-F1  Micro-F1
HYENA (Yosef et al., 2012)                  0.288     0.528     0.506
WSABIE (Yogatama et al., 2015)              0.480     0.679     0.657
FIGER (Ling & Weld, 2012)                   0.474     0.692     0.655
FIGER + All Filter (Gillick et al., 2014)   0.453     0.648     0.582
FIGER + PLE (Ren et al., 2016)              0.599     0.763     0.749
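The three metrics in the table are commonly defined over per-mention sets of types. The following Python 3 sketch uses the strict accuracy, loose macro-F1, and loose micro-F1 definitions that are standard in fine-grained entity typing; it assumes these are the intended definitions and is not taken from Evaluation/evaluation.py, which may differ in detail. All function names are hypothetical.

```python
def _f1(precision, recall):
    """Harmonic mean of precision and recall, 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def strict_accuracy(gold, pred):
    """Fraction of mentions whose predicted type set equals the gold set exactly."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def loose_macro_f1(gold, pred):
    """Average per-mention precision and recall, then combine into F1."""
    pairs = list(zip(gold, pred))
    with_pred = [(g, p) for g, p in pairs if p]
    with_gold = [(g, p) for g, p in pairs if g]
    precision = (sum(len(g & p) / len(p) for g, p in with_pred) / len(with_pred)
                 if with_pred else 0.0)
    recall = (sum(len(g & p) / len(g) for g, p in with_gold) / len(with_gold)
              if with_gold else 0.0)
    return _f1(precision, recall)


def loose_micro_f1(gold, pred):
    """Pool type counts over all mentions before computing precision and recall."""
    overlap = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    return _f1(precision, recall)
```

Each argument is a list with one set of type strings per mention, in the same order for gold and predicted labels.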

System Output

The output on the BBN dataset can be found here. Each line is a sentence from the BBN test data, with entity mentions and their fine-grained entity types identified.

Dependencies

  • python 2.7, g++
  • Python library dependencies
$ pip install pexpect unidecode six requests protobuf
$ cd DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip
$ rm stanford-corenlp-full-2016-10-31.zip

Data

We processed three public datasets into our JSON format using our data pipeline. We ran Stanford NER on the training set to detect entity mentions, and performed distant supervision using DBpedia Spotlight to assign type labels:

  • Wiki (Ling & Weld, 2012): 1.5M sentences sampled from 780k Wikipedia articles. 434 news sentences are manually annotated for evaluation. 113 entity types are organized into a 2-level hierarchy. (download JSON)
  • OntoNotes (Weischedel et al., 2011): 13k news articles, 77 of which are manually labeled for evaluation. 89 entity types are organized into a 3-level hierarchy. (download JSON)
  • BBN (Weischedel et al., 2005): 2,311 WSJ articles that are manually annotated using 93 types in a 2-level hierarchy. (download JSON)
  • Type hierarchies for each dataset are included.
  • Please put the data files in the corresponding subdirectories under PLE/Data/.
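As an illustration of what a 2-level or 3-level type hierarchy means, the Python 3 sketch below expands a type into its ancestors. It assumes types are written as slash-separated paths such as "/person/artist", a convention common in fine-grained typing label sets; the hierarchy files shipped with the datasets may use a different encoding, and both function names are hypothetical.

```python
def type_ancestors(type_path):
    """Return the type and all of its ancestors, coarsest first.

    Assumes a slash-separated path, e.g. "/person/artist".
    """
    parts = [part for part in type_path.split("/") if part]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def type_depth(type_path):
    """Level of the type in the hierarchy (1 for top-level types)."""
    return len(type_ancestors(type_path))
```

Under this assumption, a 2-level hierarchy contains only types of depth 1 or 2.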

Makefile

We have included compiled binaries. If you need to re-compile hple.cpp under your own g++ environment:

$ cd PLE/Model/ple/; make

Default Run

Run PLE to reduce label noise on the BBN dataset. Start the Stanford CoreNLP server first, then launch the pipeline:

$ java -mx4g -cp "DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
$ ./run.sh  
  • run.sh contains parameters for running on the three datasets.
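Because the CoreNLP server command keeps running in the foreground, the two commands above are typically issued from separate terminals. If you script them instead, a small helper can wait until the server accepts connections before run.sh starts. The Python 3 sketch below assumes the server listens on localhost port 9000, the CoreNLP server default when no port option is given; the helper name `wait_for_port` is hypothetical and not part of this repository.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to (host, port) succeeds.

    Retries every `interval` seconds and returns False after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Server not ready yet (connection refused or timed out); retry.
            time.sleep(interval)
    return False


# Example usage (assumes the default CoreNLP server port):
#   import subprocess
#   if wait_for_port("localhost", 9000):
#       subprocess.check_call(["./run.sh"])
```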

Parameters - run.sh

Dataset to run on.

Data="BBN"

Evaluation

Evaluate the predictions of the classifier trained on de-noised data over the test data:

python Evaluation/evaluation.py BBN hple hete_feature
  • python Evaluation/evaluation.py -DATA(BBN/ontonotes/FIGER) -METHOD(hple/...) -EMB_MODE(hete_feature)

Reference

Please cite the following paper if you find the code or datasets useful:

@inproceedings{ren2016label,
  title={Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding},
  author={Ren, Xiang and He, Wenqi and Qu, Meng and Voss, Clare R and Ji, Heng and Han, Jiawei},
  booktitle={Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  pages={1825--1834},
  year={2016},
  organization={ACM}
}