
Franck-Dernoncourt / Pubmed Rct

PubMed 200k RCT dataset: a large dataset for sequential sentence classification.

Labels

machine-learning nlp corpus medical

Projects that are alternatives to, or similar to, Pubmed Rct

Medical-Names-Corpus

Medical corpus. Corpus of medical institution names. Drug standard codes.

Stars: ✭ 26 (-74.26%)

Mutual labels: corpus, medical

Fastai

R interface to fast.ai

Stars: ✭ 85 (-15.84%)

Mutual labels: medical

Typing Assistant

Typing Assistant provides the ability to autocomplete words and suggests predictions for the next word. This makes typing faster, more intelligent and reduces effort.

Stars: ✭ 32 (-68.32%)

Mutual labels: corpus

Blacklab

A corpus retrieval engine based on Apache Lucene

Stars: ✭ 69 (-31.68%)

Mutual labels: corpus

Segment Open

Segment Source Distribution

Stars: ✭ 34 (-66.34%)

Mutual labels: medical

Ja.text8

Japanese text8 corpus for word embedding.

Stars: ✭ 79 (-21.78%)

Mutual labels: corpus

Lyrics Corpora

An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and billboard charts

Stars: ✭ 13 (-87.13%)

Mutual labels: corpus

Chi Corpus

Mr. Chi's corpus

Stars: ✭ 96 (-4.95%)

Mutual labels: corpus

Dataset List

lists of text corpus and more (mainly Japanese)

Stars: ✭ 84 (-16.83%)

Mutual labels: corpus

Medical Question Answer Data

Medical question and answer dataset gathered from the web.

Stars: ✭ 65 (-35.64%)

Mutual labels: medical

Coarij

Corpus of Annual Reports in Japan

Stars: ✭ 55 (-45.54%)

Mutual labels: corpus

Mitie chinese wikipedia corpus

Pre-trained Wikipedia corpus by MITIE

Stars: ✭ 43 (-57.43%)

Mutual labels: corpus

Dltk

Deep Learning Toolkit for Medical Image Analysis

Stars: ✭ 1,249 (+1136.63%)

Mutual labels: medical

Chatterbot Corpus

A multilingual dialog corpus

Stars: ✭ 964 (+854.46%)

Mutual labels: corpus

Pyclue

Python toolkit for Chinese Language Understanding(CLUE) Evaluation benchmark

Stars: ✭ 91 (-9.9%)

Mutual labels: corpus

Wh covid19 app

Volunteer developed app containing information for frontline medical staff around COVID-19

Stars: ✭ 30 (-70.3%)

Mutual labels: medical

Bisweb

This is the repository for the BioImage Suite Web Project

Stars: ✭ 54 (-46.53%)

Mutual labels: medical

Russian news corpus

Russian mass media stemmed texts corpus (a corpus of lemmatized, i.e. morphologically normalized, texts from Russian mass media)

Stars: ✭ 76 (-24.75%)

Mutual labels: corpus

Lexicon Thai

Thai language lexicon

Stars: ✭ 96 (-4.95%)

Mutual labels: corpus

Heartypatch

A single lead ECG heart-rate variability monitoring patch with ESP32

Stars: ✭ 92 (-8.91%)

Mutual labels: medical


PubMed 200k RCT dataset

The PubMed 200k RCT dataset is described in: Franck Dernoncourt, Ji Young Lee. PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. International Joint Conference on Natural Language Processing (IJCNLP). 2017.

Abstract:

PubMed 200k RCT is a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more efficiently, especially in fields where abstracts may be long, such as the medical field.
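To make the per-sentence labeling concrete, here is a minimal Python sketch that reads labeled abstracts into (label, sentence) pairs. The file layout it expects is an assumption, not something stated above: a line starting with "###" followed by an abstract identifier, then one "LABEL<TAB>sentence" line per sentence, with a blank line between abstracts. The function name parse_abstracts and the sample text are hypothetical and made up for illustration; check the actual files before relying on this.

```python
from collections import Counter


def parse_abstracts(text):
    """Return a list of (abstract_id, [(label, sentence), ...]) tuples.

    Assumed layout (not confirmed by the README): "###<id>" opens an
    abstract and each following non-empty line is "<LABEL>\t<sentence>".
    """
    abstracts = []
    current_id, sentences = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("###"):
            # A new abstract begins; store the previous one, if any.
            if current_id is not None:
                abstracts.append((current_id, sentences))
            current_id, sentences = line[3:], []
        elif line and current_id is not None:
            # Split only on the first tab so the sentence stays intact.
            label, _, sentence = line.partition("\t")
            sentences.append((label, sentence))
    if current_id is not None:
        abstracts.append((current_id, sentences))
    return abstracts


# Made-up sample in the assumed layout (not taken from the dataset).
sample = (
    "###0001\n"
    "OBJECTIVE\tTo test whether drug A lowers blood pressure .\n"
    "METHODS\tPatients were randomized to drug A or placebo .\n"
    "RESULTS\tBlood pressure fell in the drug A group .\n"
    "\n"
    "###0002\n"
    "BACKGROUND\tLittle is known about drug B .\n"
    "CONCLUSIONS\tDrug B was well tolerated .\n"
)

abstracts = parse_abstracts(sample)
# Count how often each label occurs across all abstracts.
label_counts = Counter(label for _, sents in abstracts for label, _ in sents)
```

Keeping the sentences of each abstract together and in order matters here, since the task is sequential: a classifier can use the neighbouring sentences of the same abstract as context.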

Some miscellaneous information:

PubMed 20k is a subset of PubMed 200k, i.e., any abstract present in PubMed 20k is also present in PubMed 200k.
PubMed_200k_RCT is the same as PubMed_200k_RCT_numbers_replaced_with_at_sign, except that in the latter all numbers have been replaced by @. The same holds for PubMed_20k_RCT vs. PubMed_20k_RCT_numbers_replaced_with_at_sign.
Since GitHub's file size limit is 100 MiB, we had to compress PubMed_200k_RCT\train.7z and PubMed_200k_RCT_numbers_replaced_with_at_sign\train.zip. To uncompress train.7z, you may use 7-Zip on Windows, Keka on Mac OS X, or p7zip on Linux.
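As an illustration of how the two variants relate, the following Python sketch replaces numbers in a sentence with @. The exact rule used to build the released files is not given here, so the regular expression below (one @ per integer or decimal number) is an assumption, and the function name and example sentence are hypothetical.

```python
import re

# Assumption: a "number" is a run of digits with an optional decimal part.
# The rule actually used to produce the released files may differ.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def replace_numbers_with_at_sign(sentence):
    """Return the sentence with every matched number replaced by '@'."""
    return NUMBER.sub("@", sentence)


# Made-up example sentence (not taken from the dataset).
example = "A total of 125 patients received 7.5 mg/day for 6 weeks ."
masked = replace_numbers_with_at_sign(example)
print(masked)  # A total of @ patients received @ mg/day for @ weeks .
```

Masking numbers this way shrinks the vocabulary, since every distinct number otherwise becomes its own token.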

You are most welcome to share with us your analyses or work using this dataset!

Projects using the PubMed 200k RCT dataset

Titipat Achakulvisut, Chandra Bhagavatula, Daniel E Acuna, Konrad P Kording. Claim Extraction for Scientific Publications. 2018
