cthoyt / chembl-downloader

Licence: MIT License
Write reproducible code for getting and processing ChEMBL

chembl_downloader

Don't worry about downloading/extracting ChEMBL or versioning - just use chembl_downloader to write code that knows how to download it and use it automatically.

Installation

$ pip install chembl-downloader

Database Usage

Download A Specific Version

import chembl_downloader

path = chembl_downloader.download_extract_sqlite(version='28')

After the first download and extraction, later calls reuse the local copy instead of downloading again. The files are stored automatically with pystow in the ~/.data/chembl directory.

We'd like to implement something such that it could load directly into SQLite from the archive, but it appears this is a paid feature.

Download the Latest Version

First, install bioversions with pip install bioversions. Its job is to look up the latest version of many databases. Then, you can modify the previous code slightly by omitting the version keyword argument:

import chembl_downloader

path = chembl_downloader.download_extract_sqlite()

The version keyword argument is available for all functions in this package (including connect(), cursor(), and query()), but will be omitted below for brevity.

Automate Connection

Inside the archive is a single SQLite database file. Normally, people manually untar the archive and then do something with the resulting file. Don't do this: it's not reproducible! Instead, the file can be downloaded and a connection can be opened automatically with:

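from contextlib import closing

import chembl_downloader

with chembl_downloader.connect() as conn:
    # sqlite3 cursors are not context managers, so close them explicitly
    with closing(conn.cursor()) as cursor:
        cursor.execute(...)  # run your query string
        rows = cursor.fetchall()  # get your results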

The cursor() function provides a convenient wrapper around this operation:

import chembl_downloader

with chembl_downloader.cursor() as cursor:
    cursor.execute(...)  # run your query string
    rows = cursor.fetchall()  # get your results
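
As a rough illustration of how such a wrapper can be put together, the sketch below builds a cursor context manager from the standard library alone. The name open_cursor is hypothetical, and an in-memory database stands in for the ChEMBL file; this is not the package's actual implementation.

```python
import sqlite3
from contextlib import closing, contextmanager
from typing import Iterator


@contextmanager
def open_cursor(database: str = ":memory:") -> Iterator[sqlite3.Cursor]:
    """Yield a cursor, then close both the cursor and its connection."""
    with closing(sqlite3.connect(database)) as conn:
        with closing(conn.cursor()) as cur:
            yield cur


# Hypothetical usage against an in-memory database instead of ChEMBL
with open_cursor() as cur:
    cur.execute("SELECT 1 + 1")
    rows = cur.fetchall()
```

Wrapping both objects in closing() means the caller only has to manage a single with block.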

Run a query and get a pandas DataFrame

The most powerful function is query(), which builds on the previous connect() function and uses pandas.read_sql to run a query and load the results into a pandas DataFrame for any downstream use.

import chembl_downloader

sql = """
SELECT
    MOLECULE_DICTIONARY.chembl_id,
    MOLECULE_DICTIONARY.pref_name
FROM MOLECULE_DICTIONARY
JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno = COMPOUND_STRUCTURES.molregno
WHERE MOLECULE_DICTIONARY.pref_name IS NOT NULL
LIMIT 5
"""

df = chembl_downloader.query(sql)
df.to_csv(..., sep='\t', index=False)

Suggestion 1: use pystow to make a reproducible file path that's portable to other people's machines (e.g., it doesn't have your username in the path).

Suggestion 2: RDKit is now pip-installable with pip install rdkit-pypi, which means most users don't have to muck around with complicated conda environments and configurations. One of the powerful but understated tools in RDKit is the rdkit.Chem.PandasTools module.
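
To see what the join and filter in the query above select without downloading ChEMBL, the same SQL can be run against a toy in-memory database. This is only a sketch: the two tables contain just the columns the query needs, the rows are made up, and plain sqlite3 is used in place of query() and pandas.

```python
import sqlite3
from contextlib import closing

# Toy tables with made-up rows; the real ChEMBL tables have many more columns.
schema = """
CREATE TABLE MOLECULE_DICTIONARY (molregno INTEGER, chembl_id TEXT, pref_name TEXT);
CREATE TABLE COMPOUND_STRUCTURES (molregno INTEGER, canonical_smiles TEXT);
INSERT INTO MOLECULE_DICTIONARY VALUES
    (1, 'TOY1', 'NAMED WITH STRUCTURE'),
    (2, 'TOY2', NULL),
    (3, 'TOY3', 'NAMED WITHOUT STRUCTURE');
INSERT INTO COMPOUND_STRUCTURES VALUES (1, 'CCO'), (2, 'C');
"""

sql = """
SELECT
    MOLECULE_DICTIONARY.chembl_id,
    MOLECULE_DICTIONARY.pref_name
FROM MOLECULE_DICTIONARY
JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno = COMPOUND_STRUCTURES.molregno
WHERE MOLECULE_DICTIONARY.pref_name IS NOT NULL
LIMIT 5
"""

with closing(sqlite3.connect(":memory:")) as conn:
    conn.executescript(schema)
    rows = conn.execute(sql).fetchall()
```

Only the first toy molecule survives: the second has no preferred name, and the third has no entry in the structures table.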

Access an RDKit supplier over entries in the SDF dump

This example is a bit more fit-for-purpose than the last two. The supplier() function makes sure that the latest SDF dump is downloaded and loads it from the gzip file into an rdkit.Chem.ForwardSDMolSupplier, using a context manager to make sure the file doesn't get closed until after parsing is done. Like the previous examples, it can also explicitly take a version.

from rdkit import Chem

import chembl_downloader

with chembl_downloader.supplier() as suppl:
    data = []
    for i, mol in enumerate(suppl):
        if mol is None or mol.GetNumAtoms() > 50:
            continue
        fp = Chem.PatternFingerprint(mol, fpSize=1024, tautomerFingerprints=True)
        smi = Chem.MolToSmiles(mol)
        data.append((smi, fp))

This example was adapted from Greg Landrum's RDKit blog post on generalized substructure search.

SDF Usage

Get an RDKit substructure library

Building on the supplier() function, get_substructure_library() makes the preparation of a substructure library automated and reproducible. It also caches the result: the build takes on the order of tens of minutes but only has to be done once, and subsequent loads from the pickled object take on the order of seconds.

The implementation was inspired by Greg Landrum's RDKit blog post, Some new features in the SubstructLibrary. The following example shows how it can be used to accomplish some of the first tasks presented in the post:

from rdkit import Chem

import chembl_downloader

library = chembl_downloader.get_substructure_library()
query = Chem.MolFromSmarts('[O,N]=C-c:1:c:c:n:c:c:1')
matches = library.GetMatches(query)
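
The build-once, load-from-pickle behavior described above follows a common caching pattern, sketched below with the standard library only. The helper cached_build and the stand-in slow_build are hypothetical names for illustration; the package's own caching code may differ.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def cached_build(path: Path, build: Callable[[], Any]) -> Any:
    """Load a pickled object from ``path``, building and saving it first if missing."""
    if path.is_file():
        with path.open("rb") as file:
            return pickle.load(file)
    result = build()
    with path.open("wb") as file:
        pickle.dump(result, file, protocol=pickle.HIGHEST_PROTOCOL)
    return result


build_calls = []  # records every time the expensive step actually runs


def slow_build() -> dict:
    """Stand-in for an expensive construction step (hypothetical)."""
    build_calls.append(1)
    return {"n_molecules": 3}


with tempfile.TemporaryDirectory() as directory:
    cache_path = Path(directory) / "library.pkl"
    first = cached_build(cache_path, slow_build)   # builds and writes the pickle
    second = cached_build(cache_path, slow_build)  # loads the pickle instead
```

Pickle files should only be loaded from locations you trust, since unpickling can execute arbitrary code.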

Morgan Fingerprints Usage

Get the Morgan Fingerprint file

ChEMBL publishes a file of pre-computed 2048-bit, radius-2 Morgan fingerprints for each molecule. It can be downloaded using:

import chembl_downloader

path = chembl_downloader.download_fps()

The version and other keyword arguments are also valid for this function.

Load fingerprints with chemfp

The following wraps the download_fps function with chemfp's fingerprint loader:

import chembl_downloader

arena = chembl_downloader.chemfp_load_fps()

The version and other keyword arguments are also valid for this function. More information on working with the arena object can be found in the chemfp documentation.

Extras

Store in a Different Place

If you want to store the data elsewhere using pystow (e.g., in pyobo I also keep a copy of this file), you can use the prefix argument.

import chembl_downloader

# It gets downloaded/extracted to 
# ~/.data/pyobo/raw/chembl/29/chembl_29/chembl_29_sqlite/chembl_29.db
path = chembl_downloader.download_extract_sqlite(prefix=['pyobo', 'raw', 'chembl'])

See the pystow documentation on configuring the storage location further.

The prefix keyword argument is available for all functions in this package (including connect(), cursor(), and query()).
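
To make the relationship between the prefix and the resulting location concrete, the sketch below rebuilds the example path from the comment above with pathlib. The function storage_path is hypothetical and simply mirrors that one path; it assumes the same directory layout for other versions and is not how pystow or this package computes locations.

```python
from pathlib import Path
from typing import Sequence


def storage_path(prefix: Sequence[str], version: str, base: Path) -> Path:
    """Compose a database path from a base directory, prefix parts, and a version."""
    name = f"chembl_{version}"
    # Assumed layout: <base>/<prefix...>/<version>/<name>/<name>_sqlite/<name>.db
    return base.joinpath(*prefix, version, name, f"{name}_sqlite", f"{name}.db")


path = storage_path(["pyobo", "raw", "chembl"], "29", base=Path("~/.data"))
print(path.as_posix())
```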

Download via CLI

After installing, run the following CLI command to make sure ChEMBL has been downloaded and to send its path to stdout:

$ chembl_downloader

Use --test to show two example queries:

$ chembl_downloader --test

Contributing

If you'd like to contribute, there's a submodule called chembl_downloader.queries where you can add an SQL query along with a description of what it does for easy importing.
