blengerich / CompBioDatasetsForMachineLearning

Licence: other
A Curated List of Computational Biology Datasets Suitable for Machine Learning

Computational Biology Datasets Suitable For Machine Learning

This is a curated list of computational biology datasets that have been pre-processed for machine learning. The list is a work in progress; please submit a pull request for any dataset you would like to advertise!

Genotyping

| Name | Description | Comments |
| --- | --- | --- |
| The Cancer Genome Atlas | Variety of Cancer Data | most cancer types have 100-1000 samples |
| NIH GDC | Cancer, many types of genomic data | |
| UK Biobank | | |
| European Genome-Phenome Archive | | |
| METABRIC | The genomic profiles (somatic mutations [targeted sequencing], copy number alterations, and gene expression) of 2509 breast cancers. | |
| HapMap | | |
| 23andMe | 2280 Public Domain Curated Genotypes | |
| Mice | SNPs, 2000+ samples | 4 generations. It might be possible to learn a family structure out of the data. |
| Arabidopsis | SNPs, 100+ phenotypes | |
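
The comment on the mouse SNP data suggests that family structure might be learnable from genotypes. One simple way to probe this is pairwise identity-by-state (IBS) similarity, sketched below. The sketch assumes genotypes are coded as minor-allele counts (0, 1, or 2); that coding, the toy data, and the function name `ibs_similarity` are illustrative assumptions and are not taken from any dataset listed here.

```python
from itertools import combinations


def ibs_similarity(a, b):
    """Mean identity-by-state similarity between two genotype vectors.

    Genotypes are assumed to be minor-allele counts (0, 1, or 2) per SNP.
    Each SNP contributes 1 - |a_i - b_i| / 2, so identical genotypes score 1
    and opposite homozygotes score 0.
    """
    if len(a) != len(b) or not a:
        raise ValueError("genotype vectors must be non-empty and equal length")
    return sum(1 - abs(x - y) / 2 for x, y in zip(a, b)) / len(a)


# Toy genotypes (hypothetical, for illustration only).
samples = {
    "parent": [0, 1, 2, 1, 0, 2, 1, 0],
    "child": [0, 1, 1, 1, 0, 2, 2, 0],
    "unrelated": [2, 0, 0, 2, 1, 0, 0, 2],
}

# Pairwise similarity; pairs with higher scores are candidate relatives.
pairwise = {
    (u, v): ibs_similarity(samples[u], samples[v])
    for u, v in combinations(samples, 2)
}

for pair, score in sorted(pairwise.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

On real data, a matrix of such scores could feed a clustering step to propose family groups; proper kinship estimators also account for allele frequencies, which this sketch does not.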

Promoter-Enhancer Pairs

| Name | Description | Comments |
| --- | --- | --- |
| TargetFinder | ~100,000 DNA-DNA interaction pairs | |

Gene/Protein Expression

| Name | Description | Comments |
| --- | --- | --- |
| GEO | Main place for NCBI data | |
| ENCODE | Variety of assays to identify functional elements | |
| ArrayExpress | DNA sequencing, gene/protein expression, epigenetics | |
| Cytometry | Continuous flow cytometry data of 11 proteins and phospholipids; discretized and cleaned data available offline | Classical benchmark dataset for learning graphical models; contains known errors |
| Transcription factor binding | ChIP-Seq data on 12 TFs | |
| GTEx | Landmark study for eQTL analysis | |
| PharmacoGenomics DB | | |
| ProteomeXchange | | |
| BeatAML | Whole-exome sequencing, RNA sequencing and analyses of ex vivo drug sensitivity | 672 tumour specimens collected from 562 patients |

Single-cell Data

| Name | Description | Comments |
| --- | --- | --- |
| Single-cell expression atlas | | |

Regulatory Networks

| Name | Description | Comments |
| --- | --- | --- |
| TRRUST | Manually curated database of human transcriptional regulatory networks | |
| Yeast Network | 23 million yeast 2-hybrid experiments to investigate genetic interactions | |
| Perturb-Seq | Integrated model of perturbations, single-cell phenotypes, and epistatic interactions | |
| KEGG Metabolic Regulatory Network (Undirected) | 65554 instances, 29 attributes each | |
| KEGG Metabolic Regulatory Network (Directed) | 53414 instances, 24 attributes each | |

Images

| Name | Description | Comments |
| --- | --- | --- |
| The Cancer Imaging Archive | Extracts the images from the TCGA data | |
| Multiple Myeloma DREAM Challenge | Challenge to identify Multiple Myeloma Patients | |
| Breast Cancer Wisconsin (Diagnostic) Data Set | Predict whether the cancer is benign or malignant | |
| DDSM | Mammogram Database | |
| Kaggle Soft Tissue Sarcomas | Preprocessed subset of the TCIA study "Soft Tissue Sarcoma" | segmentation task |
| Kaggle Cervical Cancer Screening | Classify cervix type from images | |
| CAMELYON17 | Pathology challenge: automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections | |
| Grand Challenges | Datasets from biomedical image analysis competitions | |
| Breast Cancer MRI Dataset | Demographic, clinical, pathology, treatment, outcomes, and genomic data + MRI images | |

fMRI

| Name | Description | Comments |
| --- | --- | --- |
| ENIGMA Cerebellum | Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction | |
| Seizure Prediction | Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure). | |

Electronic Medical Records

| Name | Description | Comments |
| --- | --- | --- |
| MIMIC | 59,000 EHRs | |
| UCI Diabetes | Data from 130 US hospitals, 1999-2008 | |
| i2b2 | Clinical notes only, designed for NLP tasks | |
| PhysioNet | | |
| Metadata Acquired from Clinical Case Reports (MACCRs) | 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases | |
| eICU | 200k EHRs | |

Radiographs

| Name | Description | Comments |
| --- | --- | --- |
| CheXpert | 200k chest radiographs | Competition and leaderboard associated |
| MIMIC-CXR | ~400k chest x-rays, 14 labels | Data on PhysioNet |
| PadChest | 160k chest x-rays, 174 different findings | |

Protein-Protein Interactions

| Name | Description | Comments |
| --- | --- | --- |
| HINT (High-quality INTeractomes) | Curated compilation of high-quality protein-protein interactions from 8 interactome resources | |

Longitudinal Studies

| Name | Description | Comments |
| --- | --- | --- |
| National Population Health Survey | Longitudinal survey that collects health information every two years. | |

Protein Structure

| Name | Description | Comments |
| --- | --- | --- |
| ProteinNet | Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits. | |

Natural Language Data

| Name | Description | Comments |
| --- | --- | --- |
| BioASQ | Abstracts of medical articles (from PubMed); ontologies of medical concepts. | Tasks: MLC, QA. |
| Cases | Articles from medical case studies. | |
| UPMC Pathology | UPMC Pathology case studies. | |

Therapeutics

| Name | Description | Comments |
| --- | --- | --- |
| Therapeutic Data Commons | Many preprocessed datasets for therapeutic discovery, including target discovery, activity modeling, efficacy and safety, and manufacturing. | Available as Python modules. |