jind11 / PubMed-PICO-Detection

Licence: other
PubMed PICO Element Detection Dataset

This dataset was introduced in: Jin, Di, and Peter Szolovits. "PICO Element Detection in Medical Text via Long Short-Term Memory Neural Networks." Proceedings of the BioNLP 2018 Workshop. 2018.

Abstract

Successful evidence-based medicine (EBM) applications rely on answering clinical questions by analyzing large medical literature databases. In order to formulate a well-defined, focused clinical question, a framework called PICO is widely used, which identifies the sentences in a given medical text that belong to the four components: Participants/Problem (P), Intervention (I), Comparison (C) and Outcome (O). In this work, we present a Long Short-Term Memory (LSTM) neural network based model to automatically detect PICO elements. By jointly classifying subsequent sentences in the given text, we achieve state-of-the-art results on PICO element classification compared to several strong baseline models. We also make our curated data public as a benchmarking dataset so that the community can benefit from it.

Some miscellaneous information:

  • structured_abstracts_PICO contains the original abstracts. Each line that starts with ### gives the PMID of an abstract. Each line after it contains the original section heading, the assigned gold label for training and testing, and the section content, separated by the symbol |. The gold label is derived by checking keywords in the section heading; the mapping rules are described in the paper cited above.
  • structured_abstracts_sentences_PICO is almost the same as structured_abstracts_PICO, except that each section's content has been split into sentences with the Stanford CoreNLP toolkit, so that each line holds exactly one sentence, and all numbers have been replaced by @.
  • The folder splitted contains the training, validation and test sets, randomly split from the file structured_abstracts_sentences_PICO at a ratio of 8:1:1.
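
The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' loader: the function names parse_abstracts and split_8_1_1 are hypothetical, the sample PMID and label letters are made up for illustration, and splitting at the abstract level with a fixed seed is an assumption (the README only states that the split is random at a ratio of 8:1:1).

```python
import random
from typing import Dict, Iterable, List, Tuple

# One parsed line: (section heading, gold label, content).
Section = Tuple[str, str, str]


def parse_abstracts(lines: Iterable[str]) -> Dict[str, List[Section]]:
    """Group 'heading|label|content' lines under the PMID of the preceding ### line."""
    abstracts: Dict[str, List[Section]] = {}
    pmid = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between abstracts
        if line.startswith("###"):
            pmid = line[3:].strip()
            abstracts[pmid] = []
        elif pmid is not None:
            # maxsplit=2 keeps any '|' that occurs inside the content intact.
            parts = line.split("|", 2)
            if len(parts) == 3:
                abstracts[pmid].append((parts[0], parts[1], parts[2]))
    return abstracts


def split_8_1_1(pmids: Iterable[str], seed: int = 0):
    """Shuffle PMIDs and cut them 8:1:1 (assumed to be done per abstract)."""
    ids = sorted(pmids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * 0.8)
    n_val = int(len(ids) * 0.1)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


if __name__ == "__main__":
    # Hypothetical sample in the documented layout; not taken from the dataset.
    sample = [
        "###12345678",
        "BACKGROUND|A|Some background text .",
        "METHODS|I|Patients received drug X | placebo .",
    ]
    for pmid, sections in parse_abstracts(sample).items():
        print(pmid, sections)
```

To read the real file, pass an open file handle instead of the sample list, for example parse_abstracts(open(path, encoding="utf-8")), where path points to structured_abstracts_PICO or structured_abstracts_sentences_PICO.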

You are most welcome to share with us your analyses or work using this dataset by citing our paper!
