Systematic Review Datasets

This repository provides an overview of labeled datasets used for Systematic Reviews. The datasets are available under an open licence and can be used for text mining and machine learning purposes. This repository contains scripts to collect, preprocess and clean the systematic review datasets.

Datasets

The datasets are alphabetically ordered. See index.csv for all available properties.

id	topic	n_papers	n_included	license
Appenzeller-Herzog_2020	Wilson disease	3453	29	CC-BY Attribution 4.0 International
Bannach-Brown_2019	Animal Model of Depression	1993	280	CC-BY Attribution 4.0 International
Bos_2018	Dementia	5746	11	CC-BY Attribution 4.0 International
Cohen_2006_ACEInhibitors	ACEInhibitors	2544	41	custom open license
Cohen_2006_ADHD	ADHD	851	20	custom open license
Cohen_2006_Antihistamines	Antihistamines	310	16	custom open license
Cohen_2006_AtypicalAntipsychotics	Atypical Antipsychotics	1120	146	custom open license
Cohen_2006_BetaBlockers	Beta Blockers	2072	42	custom open license
Cohen_2006_CalciumChannelBlockers	Calcium Channel Blockers	1218	100	custom open license
Cohen_2006_Estrogens	Estrogens	368	80	custom open license
Cohen_2006_NSAIDS	NSAIDS	393	41	custom open license
Cohen_2006_Opiods	Opiods	1915	15	custom open license
Cohen_2006_OralHypoglycemics	Oral Hypoglycemics	503	136	custom open license
Cohen_2006_ProtonPumpInhibitors	Proton Pump Inhibitors	1333	51	custom open license
Cohen_2006_SkeletalMuscleRelaxants	Skeletal Muscle Relaxants	1643	9	custom open license
Cohen_2006_Statins	Statins	3465	85	custom open license
Cohen_2006_Triptans	Triptans	671	24	custom open license
Cohen_2006_UrinaryIncontinence	Urinary Incontinence	327	40	custom open license
Hall_2012	Software Fault Prediction	8911	104	CC-BY Attribution 4.0 International
Kitchenham_2010	Software Engineering	1704	45	CC-BY Attribution 4.0 International
Kwok_2020	Virus Metagenomics	2481	120	CC-BY Attribution 4.0 International
Nagtegaal_2019	Nudging	2019	101	CC0
Radjenovic_2013	Software Fault Prediction	6000	48	CC-BY Attribution 4.0 International
Wahono_2015	Software Defect Detection	7002	62	CC-BY Attribution 4.0 International
Wolters_2018	Dementia	5019	19	CC-BY Attribution 4.0 International
van_Dis_2020	Anxiety-Related Disorders	10953	73	CC-BY Attribution 4.0 International
van_de_Schoot_2017	PTSD Trajectories	6189	43	CC-BY Attribution 4.0 International
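The share of included records differs a lot between datasets, which matters when comparing screening results. The sketch below computes it as n_included / n_papers, using a few rows copied from the table above. It assumes a comma-separated file with the column names shown in the table header; the function name `inclusion_rates` and the inline sample are illustrative only and not part of this repository.

```python
import csv
import io

# A few rows copied from the table above (license column omitted).
# Assumption: index.csv is comma-separated and uses these column names.
SAMPLE_INDEX = """id,topic,n_papers,n_included
Appenzeller-Herzog_2020,Wilson disease,3453,29
Bannach-Brown_2019,Animal Model of Depression,1993,280
Bos_2018,Dementia,5746,11
"""


def inclusion_rates(csv_text):
    """Return {dataset id: n_included / n_papers} for every row (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["id"]: int(row["n_included"]) / int(row["n_papers"])
        for row in reader
    }


if __name__ == "__main__":
    for dataset_id, rate in inclusion_rates(SAMPLE_INDEX).items():
        print(f"{dataset_id}: {rate:.2%}")
```

To run this against the real index, read the contents of index.csv and pass the text to the function instead of the inline sample.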

Publishing your data

For publishing your data, we recommend using the Open Science Framework (OSF). OSF is part of the Center for Open Science (COS), which aims to increase the openness, integrity, and reproducibility of research (OSF, 2020). For instructions, see the step-by-step guide on how to share your data using OSF.

Zenodo is another platform for publishing your data open access. It encourages scientists to share all materials (including data) that are necessary to understand the scholarly process (Zenodo, 2020).

When uploading your dataset to OSF or Zenodo, make sure to provide all relevant information about the dataset, by filling out all available fields. The data to be put on Zenodo or OSF can be documented as extensively as you would like (flowcharts, explanation of certain decisions, etc.). This can include a link to the systematic review itself, if it has been published elsewhere.

License

When sharing your dataset or a link to your already published systematic review, we recommend using a CC-BY or CC0 license for both Zenodo and OSF. Creative Commons licenses give everybody, from individual creators to large institutions, a standardized way to allow use of their creative work under copyright law (Creative Commons, 2020).

In short, the CC-BY license means that reusers are allowed to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. The CC0 license releases data into the public domain, allowing reuse in any form without any conditions. This can be appropriate when sharing (meta)data only. With both OSF (see the step-by-step guide) and Zenodo, you can easily add the license after creating a project on the platform.

File format

The folder datasets/ has a subfolder for each systematic review dataset. In each subfolder, the .ipynb script retrieves a dataset from OSF or Zenodo and preprocesses it by adding customized labels and marking duplicates. The script also reports the inclusion rate, the patterns of missing data, and word clouds of the titles and abstracts. After preprocessing, an ASReview-compatible dataset in .csv format is generated in the output/ folder.

Supported file extensions are .csv, .xlsx, and .xls. CSV files should be comma-separated and UTF-8 encoded. To indicate labeling decisions, use a column named "included" or "label_included". This column should contain only 0s and 1s, where 0 means that the record is not included and 1 means that it is included.
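As a rough illustration of the labeling rules above, the following sketch checks that a comma-separated file has an "included" or "label_included" column containing only 0s and 1s, and returns the inclusion rate. It is not the repository's preprocessing script: the function name `check_labels`, the "title" and "abstract" column names, and the sample records are assumptions made up for this example.

```python
import csv
import io

# Label column names accepted according to the text above.
LABEL_COLUMNS = ("included", "label_included")

# Hypothetical records for illustration; not taken from any real dataset.
SAMPLE_DATASET = """title,abstract,included
First record,Some abstract,1
Second record,Another abstract,0
Third record,,0
Fourth record,Yet another abstract,1
"""


def check_labels(csv_text):
    """Return (label column, inclusion rate); raise ValueError if labels are invalid."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    label_col = next((c for c in LABEL_COLUMNS if c in fieldnames), None)
    if label_col is None:
        raise ValueError(f"no label column found; expected one of {LABEL_COLUMNS}")

    labels = []
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        value = (row[label_col] or "").strip()
        if value not in ("0", "1"):
            raise ValueError(f"line {line_no}: label must be 0 or 1, got {value!r}")
        labels.append(int(value))

    if not labels:
        raise ValueError("dataset contains no records")
    return label_col, sum(labels) / len(labels)


if __name__ == "__main__":
    column, rate = check_labels(SAMPLE_DATASET)
    print(f"label column: {column}, inclusion rate: {rate:.2%}")
```

For a file on disk, open it with encoding="utf-8" and pass its contents to the function, in line with the UTF-8 requirement above.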

License

The scripts in the current project are MIT licensed. The datasets (should) have a permissive license.

Contact

Contact details can be found at the ASReview project page.
