
samoturk / Mol2vec

License: BSD-3-Clause
Mol2vec - an unsupervised machine learning approach to learn vector representations of molecular substructures

Programming Languages

python

Projects that are alternatives of or similar to Mol2vec

rdkit-pypi
⚛️ RDKit Python Wheels on PyPi. 💻 pip install rdkit-pypi
Stars: ✭ 62 (-54.07%)
Mutual labels:  cheminformatics
Tdc
Therapeutics Data Commons: Machine Learning Datasets and Tasks for Therapeutics
Stars: ✭ 291 (+115.56%)
Mutual labels:  cheminformatics
Lstm chem
Implementation of the paper - Generative Recurrent Networks for De Novo Drug Design.
Stars: ✭ 87 (-35.56%)
Mutual labels:  cheminformatics
awesome-small-molecule-ml
A curated list of resources for machine learning for small-molecule drug discovery
Stars: ✭ 54 (-60%)
Mutual labels:  cheminformatics
Thermo
Thermodynamics and Phase Equilibrium component of Chemical Engineering Design Library (ChEDL)
Stars: ✭ 279 (+106.67%)
Mutual labels:  cheminformatics
Scipipe
Robust, flexible and resource-efficient pipelines using Go and the commandline
Stars: ✭ 826 (+511.85%)
Mutual labels:  cheminformatics
sirius
SIRIUS is software for de-novo identification of metabolites using tandem mass spectrometry. This repository contains the code of the SIRIUS software (GUI and CLI)
Stars: ✭ 32 (-76.3%)
Mutual labels:  cheminformatics
Stk
A Python library which allows construction and manipulation of complex molecules, as well as automatic molecular design and the creation of molecular databases.
Stars: ✭ 99 (-26.67%)
Mutual labels:  cheminformatics
Cdk
The Chemistry Development Kit
Stars: ✭ 283 (+109.63%)
Mutual labels:  cheminformatics
Molvs
Molecule Validation and Standardization
Stars: ✭ 76 (-43.7%)
Mutual labels:  cheminformatics
Rcpi
Molecular informatics toolkit with a comprehensive integration of bioinformatics and cheminformatics tools for drug discovery.
Stars: ✭ 22 (-83.7%)
Mutual labels:  cheminformatics
molecules
chemical graph theory library for JavaScript
Stars: ✭ 83 (-38.52%)
Mutual labels:  cheminformatics
Cirpy
Python wrapper for the NCI Chemical Identifier Resolver (CIR)
Stars: ✭ 55 (-59.26%)
Mutual labels:  cheminformatics
AMPL
The ATOM Modeling PipeLine (AMPL) is an open-source, modular, extensible software pipeline for building and sharing models to advance in silico drug discovery.
Stars: ✭ 85 (-37.04%)
Mutual labels:  cheminformatics
Smiles Transformer
Original implementation of the paper "SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery" by Shion Honda et al.
Stars: ✭ 86 (-36.3%)
Mutual labels:  cheminformatics
molml
A library to interface molecules and machine learning.
Stars: ✭ 57 (-57.78%)
Mutual labels:  cheminformatics
Openbabel
Open Babel is a chemical toolbox designed to speak the many languages of chemical data.
Stars: ✭ 492 (+264.44%)
Mutual labels:  cheminformatics
Aizynthfinder
A tool for retrosynthetic planning
Stars: ✭ 122 (-9.63%)
Mutual labels:  cheminformatics
Chemfiles
Library for reading and writing chemistry files
Stars: ✭ 95 (-29.63%)
Mutual labels:  cheminformatics
Rdkit
The official sources for the RDKit library
Stars: ✭ 1,164 (+762.22%)
Mutual labels:  cheminformatics

Mol2vec

Mol2vec - an unsupervised machine learning approach to learn vector representations of molecular substructures


Requirements

RDKit and gensim are needed for corpus generation and model training; see the repository for the complete list of dependencies.

Installation

pip install git+https://github.com/samoturk/mol2vec

Documentation

Read the documentation on Read the Docs.

To build the documentation, install sphinx, numpydoc and sphinx_rtd_theme, then run make html in the docs directory.

Usage

As python module

from mol2vec import features
from mol2vec import helpers

The first line imports functions to generate "sentences" from molecules and train the model; the second imports helper functions for depictions. Check the examples directory for more details and the Mol2vec notebooks repository for visualisations designed to run easily in Binder.
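A minimal end-to-end sketch of module usage, assuming a model trained with the command line tool below is available as model.pkl (the file name and the SMILES are placeholders) and a gensim version compatible with the package:

from gensim.models import word2vec
from rdkit import Chem
from mol2vec.features import MolSentence, mol2alt_sentence, sentences2vec

# Load a Mol2vec model previously trained with the command line tool
model = word2vec.Word2Vec.load('model.pkl')

# Turn a molecule into a "sentence" of Morgan identifiers up to radius 1
mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin, placeholder example
sentence = MolSentence(mol2alt_sentence(mol, 1))

# Aggregate the word vectors into one molecular vector; identifiers not seen
# during training are mapped to the 'UNK' vector
vec = sentences2vec([sentence], model, unseen='UNK')
print(vec[0][:5])  # first few dimensions of the molecular vector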

Command line tool

The command line application has subcommands to prepare a corpus from molecular data (SDF or SMILES), train a Mol2vec model, and featurize new samples.

Subcommand 'corpus'

Generates a corpus to train a Mol2vec model. Morgan identifiers (up to the selected radius) are the "words" and molecules are the "sentences". Words are ordered within a sentence according to the atom order in the canonical SMILES (generated during corpus generation), with each atom contributing identifiers starting at radius 0; the sketch below illustrates the resulting sentence structure.
The corpus subcommand can optionally replace rare identifiers with a selected string (e.g. UNK), which can later be used to represent completely new substructures (i.e. at the featurization step). NOTE: the corpus with replaced uncommon identifiers is saved in a separate file ending in "_{selected string to replace uncommon}". Since this is an unsupervised method, we recommend using as many molecules as possible (e.g. the complete ZINC database).
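For illustration only (not part of the CLI), a minimal sketch using mol2alt_sentence from the python module, with a placeholder SMILES; it should print one sentence with one identifier per atom and radius, in canonical atom order, alternating radius 0 and radius 1:

from rdkit import Chem
from mol2vec.features import mol2alt_sentence

# Atoms follow the canonical SMILES order; each atom contributes its
# radius-0 identifier followed by its radius-1 identifier
mol = Chem.MolFromSmiles('CCO')  # ethanol, placeholder example
print(mol2alt_sentence(mol, 1))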

Performance:

Corpus generation using 20M compounds with replacement of uncommon identifiers takes 6 hours on 4 cores.

Example:

To prepare a corpus using radius 1 and 4 cores, replacing uncommon identifiers that appear <= 3 times with 'UNK', run: mol2vec corpus -i mols.smi -o mols.cp -r 1 -j 4 --uncommon UNK --threshold 3

Subcommand 'train'

Trains a Mol2vec model on a previously prepared corpus.

Performance:

Training the model on 20M sentences takes ~2 hours on 4 cores.

Example:

To train a Mol2vec model on the corpus with replaced uncommon identifiers, using skip-gram, a window size of 10, 300-dimensional vectors, and 4 cores, run: mol2vec train -i mols.cp_UNK -o model.pkl -d 300 -w 10 -m skip-gram --threshold 3 -j 4
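For reference, the training step corresponds roughly to the following gensim sketch, assuming gensim >= 4 parameter names and that the corpus file stores one molecule per line as whitespace-separated identifiers:

from gensim.models import word2vec

# Stream the corpus (one sentence per line, whitespace-separated identifiers)
corpus = word2vec.LineSentence('mols.cp_UNK')

# Skip-gram (sg=1), 300-dimensional vectors, window of 10, 4 worker threads;
# min_count plays the role of the --threshold option
model = word2vec.Word2Vec(corpus, vector_size=300, window=10, sg=1, min_count=3, workers=4)
model.save('model.pkl')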

Subcommand 'featurize'

Featurizes new samples using a pre-trained Mol2vec model. The result is saved as a CSV file with columns for molecule identifiers, canonical SMILES (generated during featurization), and any SD fields from an input SDF file, followed by columns mol2vec-{0 to n-1}, where n is the dimensionality of the embeddings in the model.

Example:

To featurize new samples using pre-trained embeddings, with the vector trained on uncommon samples representing new substructures, run: mol2vec featurize -i new.smi -o new.csv -m model.pkl -r 1 --uncommon UNK
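The resulting CSV can then be used directly for downstream modelling; a small sketch with pandas, using the placeholder file name from the example above:

import pandas as pd

# Load the featurized output and separate the embedding columns
df = pd.read_csv('new.csv')
embedding_cols = [c for c in df.columns if c.startswith('mol2vec-')]
X = df[embedding_cols].values  # shape: (number of molecules, embedding dimensionality)
print(X.shape)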

For more details on an individual subcommand, run: mol2vec $sub-command --help

How to cite?

@article{doi:10.1021/acs.jcim.7b00616,
    author = {Jaeger, Sabrina and Fulle, Simone and Turk, Samo},
    title = {Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition},
    journal = {Journal of Chemical Information and Modeling},
    volume = {58},
    number = {1},
    pages = {27--35},
    year = {2018},
    doi = {10.1021/acs.jcim.7b00616},
    URL = {http://dx.doi.org/10.1021/acs.jcim.7b00616},
    eprint = {http://dx.doi.org/10.1021/acs.jcim.7b00616}
}

Sponsor info

Initial development was supported by BioMed X Innovation Center, Heidelberg.
