Therapeutics Data Commons: Machine Learning Datasets and Tasks for Therapeutics



Project Website | Paper | TDC Mailing List | Twitter

Therapeutics Data Commons (TDC) is the first unifying framework to systematically access and evaluate machine learning across the entire range of therapeutics.

The collection of curated datasets, learning tasks, and benchmarks in TDC serves as a meeting point for domain and machine learning scientists. We envision that TDC can considerably accelerate machine-learning model development, validation and transition into biomedical and clinical implementation.

TDC is an open-source initiative. To get involved, join the Slack Workspace and check out the Contribution Guide!

Invited talk at the Harvard Symposium on Drugs for Future Pandemics (#futuretx20) [Slides] [Video]

Updates

  • 0.1.8: Streamlined and simplified the leaderboard programming framework! You can now submit a result for a single dataset! Check it out here!
  • The TDC white paper is live on arXiv!
  • 0.1.6: Released the second leaderboard, on drug combination screening prediction! Check it out here!
  • 0.1.5: Added four realistic oracles based on docking scores and synthetic accessibility! Check them out here!
  • 0.1.4: Added the first version of the MolConvert class, which can map among ~15 molecular formats in 2 lines of code (for 2D: from SMILES/SELFIES to SELFIES/SMILES, Graph2D, PyG, DGL, ECFP2-6, MACCS, Daylight, RDKit2D, Morgan, PubChem; for 3D: from XYZ and SDF files to Graph3D and Coulomb Matrix). Also added a quality check on DTI datasets, with IDs added.
  • Check out the Contribution Guide to add a new dataset, task, or function!
  • 0.1.3: Added a new therapeutics task on CRISPR Repair Outcome Prediction! Added a data function to map molecules to popular cheminformatics fingerprints.
  • 0.1.2: The first TDC Leaderboard is released! Check out the leaderboard guide here and the ADMET Leaderboard here.
  • 0.1.1: Replaced the VD, Half Life, and Clearance datasets with versions from new, higher-quality sources. Added LD50 to Tox.
  • 0.1.0: Ran a molecule quality check on ADME, Toxicity, and HTS datasets (canonicalized molecules and removed erroneous ones).
  • 0.0.9: Added DrugComb NCI-60, CYP2C9/2D6/3A4 substrates, and Carcinogens toxicity!
  • 0.0.8: Added hERG, DILI, Skin Reaction, Ames Mutagenicity, and PPBR from AstraZeneca; added meta oracles!

Features

  • Diverse areas of therapeutics development: TDC covers a wide range of learning tasks, including target discovery, activity screening, efficacy, safety, and manufacturing across biomedical products, including small molecules, antibodies, and vaccines.
  • Ready-to-use datasets: TDC is minimally dependent on external packages. Any TDC dataset can be retrieved using only 3 lines of code.
  • Data functions: TDC provides extensive data functions, including data evaluators, meaningful data splits, data processors, and molecule generation oracles.
  • Leaderboards: TDC provides benchmarks for fair model comparison and for systematic model development and evaluation.
  • Open-source initiative: TDC is an open-source initiative. If you want to get involved, let us know.


Installation

Using pip

To install the core environment dependencies of TDC, use pip:

pip install PyTDC

Note: TDC is in beta. Please update your local copy regularly with:

pip install PyTDC --upgrade

The core data loaders are lightweight and depend on only a few external packages:

numpy, pandas, tqdm, scikit-learn, fuzzywuzzy, seaborn

For utilities requiring extra dependencies, TDC prints installation instructions. To install full dependencies, please use the following conda-forge solution.

Using conda

Data functions for molecule oracles, scaffold split, etc., require certain packages like RDKit. To install those packages, use the following conda installation:

conda install -c conda-forge pytdc

Cite Us

If you find our work useful, please cite us:

@article{tdc,
  title={Therapeutics Data Commons: Machine Learning Datasets and Tasks for Therapeutics},
  author={Huang, Kexin and Fu, Tianfan and Gao, Wenhao and Zhao, Yue and Roohani, Yusuf and Leskovec, Jure and Coley, Connor W and Xiao, Cao and Sun, Jimeng and Zitnik, Marinka},
  journal={arXiv preprint arXiv:2102.09548},
  year={2021}
}

Design of TDC

TDC has a unique three-tiered hierarchical structure, which, to our knowledge, is the first attempt at systematically organizing machine learning for therapeutics. We organize TDC into three distinct problems. For each problem, we give a collection of learning tasks. Finally, for each task, we provide a series of datasets.

In the first tier, after observing a large set of therapeutics tasks, we categorize and abstract out three major areas (i.e., problems) where machine learning can facilitate scientific advances, namely, single-instance prediction, multi-instance prediction, and generation:

  • Single-instance prediction (single_pred): Prediction of a property given an individual biomedical entity.
  • Multi-instance prediction (multi_pred): Prediction of a property given multiple biomedical entities.
  • Generation (generation): Generation of new, desirable biomedical entities.


The second tier in the TDC structure is organized into learning tasks. Improvement on these tasks can result in numerous applications, including identifying personalized combinatorial therapies, designing novel classes of antibodies, improving disease diagnosis, and finding new cures for emerging diseases.

Finally, in the third tier of TDC, each task is instantiated via multiple datasets. For each dataset, we provide several splits of the dataset into training, validation, and test sets to simulate the type of understanding and generalization (e.g., the model's ability to generalize to entirely unseen compounds or to granularly resolve patient response to a polytherapy) needed for transition into production and clinical implementation.
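
As a rough illustration of the three tiers described above, the structure can be pictured as a nested mapping from problem to learning task to datasets. This is only a sketch: `TDC_HIERARCHY` and `find_dataset` are hypothetical names that are not part of the TDC package, and only names mentioned in this README are filled in.

```python
# Hypothetical illustration of the three tiers; not part of the TDC package.
# Only names that appear in this README are used; the real catalog is far larger.
TDC_HIERARCHY = {
    "single_pred": {          # tier 1: problem
        "ADME": ["HIA_Hou"],  # tier 2: learning task -> tier 3: datasets
    },
    "multi_pred": {},
    "generation": {},
}


def find_dataset(hierarchy, dataset_name):
    """Return (problem, task) for a dataset name, or None if it is not listed."""
    for problem, tasks in hierarchy.items():
        for task, datasets in tasks.items():
            if dataset_name in datasets:
                return problem, task
    return None


print(find_dataset(TDC_HIERARCHY, "HIA_Hou"))  # ('single_pred', 'ADME')
```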

TDC Data Loaders

TDC provides a collection of workflows with intuitive, high-level APIs for both beginners and experts to create machine learning models in Python. Building off the modularized "Problem--Learning Task--Data Set" structure (see above) in TDC, we provide a three-layer API to access any learning task and dataset. This hierarchical API design allows us to easily incorporate new tasks and datasets.

As a concrete example, to obtain the HIA dataset from the ADME therapeutic learning task in the single-instance prediction problem:

from tdc.single_pred import ADME
data = ADME(name = 'HIA_Hou')
# split into train/val/test sets using the scaffold split method
split = data.get_split(method = 'scaffold')
# get the entire dataset in the requested format (here, a DataFrame)
data.get_data(format = 'df')

You can see all the datasets belonging to a task via:

from tdc.utils import retrieve_dataset_names
retrieve_dataset_names('ADME')

See all therapeutic tasks and datasets on the TDC website!

TDC Data Functions

Data Split

To retrieve the training/validation/test split of a dataset, type

# X is any TDC data loader class (e.g., ADME); Y is a dataset name (e.g., 'HIA_Hou')
data = X(name = Y)
data.get_split(seed = 42)
# returns {'train': df_train, 'val': df_val, 'test': df_test}

You can specify the splitting method, random seed, and split fractions, e.g., data.get_split(method = 'scaffold', seed = 1, frac = [0.7, 0.1, 0.2]). Check out the data split page on the website for details.
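
To make the roles of the seed and the split fractions concrete, here is a minimal standard-library sketch of a seeded random split. It is not TDC's implementation; `random_split` is a hypothetical helper, and it works on plain lists rather than DataFrames.

```python
import random


def random_split(records, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle records reproducibly and cut them into train/val/test lists."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # same seed -> same shuffle order
    n_train = int(frac[0] * len(shuffled))
    n_val = int(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # the test set takes the remainder
    }


split = random_split(range(100), frac=(0.7, 0.1, 0.2), seed=1)
print({name: len(part) for name, part in split.items()})
# {'train': 70, 'val': 10, 'test': 20}
```

A scaffold split follows the same interface but groups molecules by chemical scaffold before assigning them to sets, which is why it requires RDKit.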

Model Evaluation

We provide various evaluation metrics for the tasks in TDC, which are described on the model evaluation page of the website. For example, to use the ROC-AUC metric, type

from tdc import Evaluator
evaluator = Evaluator(name = 'ROC-AUC')
score = evaluator(y_true, y_pred)
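
For readers unfamiliar with the metric, the sketch below shows what ROC-AUC measures: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. It is a plain-Python illustration, not the code behind Evaluator; `roc_auc` is a hypothetical helper, and the labels and scores are made up.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a random positive outranks a random negative."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```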

Data Processing

We provide numerous data processing helper functions, such as label transformation, data balancing, converting pair data to PyG/DGL graphs, negative sampling, and database querying. For individual function usage, please check out the data processing page on the website.
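
As one example of what a label transformation does, the sketch below turns continuous labels into binary ones around a threshold. It is a hypothetical stand-alone helper for illustration only; the name `binarize_labels`, its parameters, and the sample values are assumptions, not TDC's API.

```python
def binarize_labels(values, threshold, positive_above=True):
    """Map continuous labels to 0/1 around a threshold.

    With positive_above=True, values strictly above the threshold become 1;
    otherwise values strictly below the threshold become 1.
    """
    if positive_above:
        return [1 if v > threshold else 0 for v in values]
    return [1 if v < threshold else 0 for v in values]


print(binarize_labels([12.0, 45.5, 30.0, 7.2], threshold=30))  # [0, 1, 0, 0]
```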

Molecule Generation Oracles

For molecule generation tasks, we provide 10+ oracles for both goal-oriented and distribution learning. For detailed usage of each oracle, please check out the oracle page on the website. For example, to retrieve the GSK3Beta oracle:

from tdc import Oracle
oracle = Oracle(name = 'GSK3B')
oracle(['CC(C)(C)....',
        'C[C@@H]1....',
        'CCNC(=O)....',
        'C[C@@H]1....'])

# [0.03, 0.02, 0.0, 0.1]

TDC Leaderboards

TDC hosts a series of leaderboards for researchers to keep abreast of the state-of-the-art models on therapeutics tasks.

Each dataset in TDC is a benchmark. But for a machine learning model to be useful for a specific downstream therapeutic use, it has to achieve consistently good performance across a set of datasets or tasks. Motivated by this, TDC intentionally groups individual benchmarks into benchmark groups. Datasets in a benchmark group are centered around a theme and are all carefully selected. The dataset splits and evaluation metrics are also carefully selected to reflect real-world challenges.

TDC provides a programming framework to access the data in a benchmark group. We use the ADMET group as an example.

from tdc import BenchmarkGroup
group = BenchmarkGroup(name = 'ADMET_Group', path = 'data/')
predictions_list = []

for seed in [1, 2, 3, 4, 5]:
    predictions = {}
    for benchmark in group:
        name = benchmark['name']
        train_val, test = benchmark['train_val'], benchmark['test']
        train, valid = group.get_train_valid_split(benchmark = name, split_type = 'default', seed = seed)
        ## --- train your model --- ##
        y_pred = [1] * len(test)
        predictions[name] = y_pred
    predictions_list.append(predictions)

group.evaluate_many(predictions_list)

For more functions of the BenchmarkGroup class, please visit here.
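
Running over several seeds, as in the loop above, lets each benchmark be reported as an average with a spread rather than as a single number. The sketch below shows that aggregation step on its own using only the standard library. `summarize_runs` is a hypothetical helper, the scores are made up, and the use of the sample standard deviation is an assumption rather than a statement of what evaluate_many computes.

```python
from statistics import mean, stdev


def summarize_runs(scores_per_seed):
    """Collapse a list of {benchmark: score} dicts (one per seed) into (mean, std)."""
    summary = {}
    for name in scores_per_seed[0]:
        values = [run[name] for run in scores_per_seed]
        summary[name] = (round(mean(values), 3), round(stdev(values), 3))
    return summary


# Made-up scores for three seeds, purely for illustration.
runs = [{"benchmark_a": 0.80}, {"benchmark_a": 0.82}, {"benchmark_a": 0.84}]
print(summarize_runs(runs))  # {'benchmark_a': (0.82, 0.02)}
```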

Tutorials

We provide a series of tutorials for you to get started using TDC:

Name     Description
101      Introduce TDC Data Loaders
102      Introduce TDC Data Functions
103.1    Walk through TDC Small Molecule Datasets
103.2    Walk through TDC Biologics Datasets
104      Generate 21 ADME ML Predictors with 15 Lines of Code
105      Molecule Generation Oracles

Contribute

TDC is an open-source, community-driven effort. If you want to get involved, join the Slack Workspace and check out the contribution guide!

Contact

Send emails to us or open an issue.

Data Server Maintenance Issues

TDC is hosted on Harvard Dataverse. When Dataverse is under maintenance, TDC will not be able to retrieve datasets. This is rare, but when it happens, please come back in a couple of hours or check the status on the Dataverse website.

License

The TDC codebase is under the MIT license. For individual dataset usage, please refer to the dataset license on the website.
