
samoturk / Mol2vec

License: BSD-3-Clause
Mol2vec - an unsupervised machine learning approach to learn vector representations of molecular substructures

Programming Languages

python

Projects that are alternatives of or similar to Mol2vec

rdkit-pypi
⚛️ RDKit Python Wheels on PyPi. 💻 pip install rdkit-pypi
Stars: ✭ 62 (-54.07%)
Mutual labels:  cheminformatics
Tdc
Therapeutics Data Commons: Machine Learning Datasets and Tasks for Therapeutics
Stars: ✭ 291 (+115.56%)
Mutual labels:  cheminformatics
Lstm chem
Implementation of the paper - Generative Recurrent Networks for De Novo Drug Design.
Stars: ✭ 87 (-35.56%)
Mutual labels:  cheminformatics
awesome-small-molecule-ml
A curated list of resources for machine learning for small-molecule drug discovery
Stars: ✭ 54 (-60%)
Mutual labels:  cheminformatics
Thermo
Thermodynamics and Phase Equilibrium component of Chemical Engineering Design Library (ChEDL)
Stars: ✭ 279 (+106.67%)
Mutual labels:  cheminformatics
Scipipe
Robust, flexible and resource-efficient pipelines using Go and the commandline
Stars: ✭ 826 (+511.85%)
Mutual labels:  cheminformatics
sirius
SIRIUS is software for de-novo identification of metabolites using tandem mass spectrometry. This repository contains the code of the SIRIUS software (GUI and CLI)
Stars: ✭ 32 (-76.3%)
Mutual labels:  cheminformatics
Stk
A Python library which allows construction and manipulation of complex molecules, as well as automatic molecular design and the creation of molecular databases.
Stars: ✭ 99 (-26.67%)
Mutual labels:  cheminformatics
Cdk
The Chemistry Development Kit
Stars: ✭ 283 (+109.63%)
Mutual labels:  cheminformatics
Molvs
Molecule Validation and Standardization
Stars: ✭ 76 (-43.7%)
Mutual labels:  cheminformatics
Rcpi
Molecular informatics toolkit with a comprehensive integration of bioinformatics and cheminformatics tools for drug discovery.
Stars: ✭ 22 (-83.7%)
Mutual labels:  cheminformatics
molecules
chemical graph theory library for JavaScript
Stars: ✭ 83 (-38.52%)
Mutual labels:  cheminformatics
Cirpy
Python wrapper for the NCI Chemical Identifier Resolver (CIR)
Stars: ✭ 55 (-59.26%)
Mutual labels:  cheminformatics
AMPL
The ATOM Modeling PipeLine (AMPL) is an open-source, modular, extensible software pipeline for building and sharing models to advance in silico drug discovery.
Stars: ✭ 85 (-37.04%)
Mutual labels:  cheminformatics
Smiles Transformer
Original implementation of the paper "SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery" by Shion Honda et al.
Stars: ✭ 86 (-36.3%)
Mutual labels:  cheminformatics
molml
A library to interface molecules and machine learning.
Stars: ✭ 57 (-57.78%)
Mutual labels:  cheminformatics
Openbabel
Open Babel is a chemical toolbox designed to speak the many languages of chemical data.
Stars: ✭ 492 (+264.44%)
Mutual labels:  cheminformatics
Aizynthfinder
A tool for retrosynthetic planning
Stars: ✭ 122 (-9.63%)
Mutual labels:  cheminformatics
Chemfiles
Library for reading and writing chemistry files
Stars: ✭ 95 (-29.63%)
Mutual labels:  cheminformatics
Rdkit
The official sources for the RDKit library
Stars: ✭ 1,164 (+762.22%)
Mutual labels:  cheminformatics

Mol2vec

Mol2vec - an unsupervised machine learning approach to learn vector representations of molecular substructures


Requirements

RDKit and gensim are needed for corpus generation and model training; see the repository for the complete list of dependencies.

Installation

pip install git+https://github.com/samoturk/mol2vec

Documentation

Read the documentation on Read the Docs.

To build the documentation, install sphinx, numpydoc and sphinx_rtd_theme, then run make html in the docs directory.

Usage

As python module

from mol2vec import features
from mol2vec import helpers

The first line imports functions to generate "sentences" from molecules and train the model; the second imports helper functions for depictions. Check the examples directory for more details and the Mol2vec notebooks repository for visualisations designed to run easily in Binder.
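A minimal end-to-end sketch of module usage, assuming a model trained with the command line tool below is available as model.pkl (the file name and the SMILES are placeholders) and a gensim version compatible with the package:

from gensim.models import word2vec
from rdkit import Chem
from mol2vec.features import MolSentence, mol2alt_sentence, sentences2vec

# Load a Mol2vec model previously trained with the command line tool
model = word2vec.Word2Vec.load('model.pkl')

# Turn a molecule into a "sentence" of Morgan identifiers up to radius 1
mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin, placeholder example
sentence = MolSentence(mol2alt_sentence(mol, 1))

# Aggregate the word vectors into one molecular vector; identifiers not seen
# during training are mapped to the 'UNK' vector
vec = sentences2vec([sentence], model, unseen='UNK')
print(vec[0][:5])  # first few dimensions of the molecular vector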

Command line tool

The command line application has subcommands to prepare a corpus from molecular data (SDF or SMILES), train a Mol2vec model, and featurize new samples.

Subcommand 'corpus'

Generates a corpus to train a Mol2vec model. Morgan identifiers (up to the selected radius) are the "words" and molecules are the "sentences". Words are ordered within a sentence according to the atom order in the canonical SMILES (generated during corpus generation), with each atom contributing identifiers starting at radius 0; the sketch below illustrates the resulting sentence structure.
The corpus subcommand can optionally replace rare identifiers with a selected string (e.g. UNK), which can later be used to represent completely new substructures (i.e. at the featurization step). NOTE: the corpus with replaced uncommon identifiers is saved in a separate file ending in "_{selected string to replace uncommon}". Since this is an unsupervised method, we recommend using as many molecules as possible (e.g. the complete ZINC database).
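For illustration only (not part of the CLI), a minimal sketch using mol2alt_sentence from the python module, with a placeholder SMILES; it should print one sentence with one identifier per atom and radius, in canonical atom order, alternating radius 0 and radius 1:

from rdkit import Chem
from mol2vec.features import mol2alt_sentence

# Atoms follow the canonical SMILES order; each atom contributes its
# radius-0 identifier followed by its radius-1 identifier
mol = Chem.MolFromSmiles('CCO')  # ethanol, placeholder example
print(mol2alt_sentence(mol, 1))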

Performance:

Corpus generation using 20M compounds with replacement of uncommon identifiers takes 6 hours on 4 cores.

Example:

To prepare a corpus using radius 1 and 4 cores, replacing uncommon identifiers that appear <= 3 times with 'UNK', run: mol2vec corpus -i mols.smi -o mols.cp -r 1 -j 4 --uncommon UNK --threshold 3

Subcommand 'train'

Trains a Mol2vec model on a previously prepared corpus.

Performance:

Training the model on 20M sentences takes ~2 hours on 4 cores.

Example:

To train a Mol2vec model on the corpus with replaced uncommon identifiers, using skip-gram, a window size of 10, 300-dimensional vectors, and 4 cores, run: mol2vec train -i mols.cp_UNK -o model.pkl -d 300 -w 10 -m skip-gram --threshold 3 -j 4
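For reference, the training step corresponds roughly to the following gensim sketch, assuming gensim >= 4 parameter names and that the corpus file stores one molecule per line as whitespace-separated identifiers:

from gensim.models import word2vec

# Stream the corpus (one sentence per line, whitespace-separated identifiers)
corpus = word2vec.LineSentence('mols.cp_UNK')

# Skip-gram (sg=1), 300-dimensional vectors, window of 10, 4 worker threads;
# min_count plays the role of the --threshold option
model = word2vec.Word2Vec(corpus, vector_size=300, window=10, sg=1, min_count=3, workers=4)
model.save('model.pkl')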

Subcommand 'featurize'

Featurizes new samples using a pre-trained Mol2vec model. The result is saved as a CSV file with columns for molecule identifiers, canonical SMILES (generated during featurization), and any SD fields from an input SDF file, followed by columns mol2vec-{0 to n-1}, where n is the dimensionality of the embeddings in the model.

Example:

To featurize new samples using pre-trained embeddings, with the vector trained on uncommon samples representing new substructures, run: mol2vec featurize -i new.smi -o new.csv -m model.pkl -r 1 --uncommon UNK
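The resulting CSV can then be used directly for downstream modelling; a small sketch with pandas, using the placeholder file name from the example above:

import pandas as pd

# Load the featurized output and separate the embedding columns
df = pd.read_csv('new.csv')
embedding_cols = [c for c in df.columns if c.startswith('mol2vec-')]
X = df[embedding_cols].values  # shape: (number of molecules, embedding dimensionality)
print(X.shape)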

For more details on an individual subcommand, run: mol2vec $sub-command --help

How to cite?

@article{doi:10.1021/acs.jcim.7b00616,
    author = {Jaeger, Sabrina and Fulle, Simone and Turk, Samo},
    title = {Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition},
    journal = {Journal of Chemical Information and Modeling},
    volume = {58},
    number = {1},
    pages = {27--35},
    year = {2018},
    doi = {10.1021/acs.jcim.7b00616},
    URL = {http://dx.doi.org/10.1021/acs.jcim.7b00616},
    eprint = {http://dx.doi.org/10.1021/acs.jcim.7b00616}
}

Sponsor info

Initial development was supported by BioMed X Innovation Center, Heidelberg.
