All Projects → PengNi → deepsignal-plant

PengNi / deepsignal-plant

Licence: GPL-3.0 License
Detecting methylation using signal-level features from Nanopore sequencing reads of plants

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to deepsignal-plant

deepsignal
Detecting methylation using signal-level features from Nanopore sequencing reads
Stars: ✭ 90 (+328.57%)
Mutual labels:  methylation, nanopore-sequencing
Packages
Aim to be the bioinformatics repository with more and newer packages
Stars: ✭ 13 (-38.1%)
Mutual labels:  bioinformatics
ctdna-pipeline
A simplified pipeline for ctDNA sequencing data analysis
Stars: ✭ 29 (+38.1%)
Mutual labels:  bioinformatics
redbiom
Sample search by metadata and features
Stars: ✭ 27 (+28.57%)
Mutual labels:  bioinformatics
mitre
The Microbiome Interpretable Temporal Rule Engine
Stars: ✭ 37 (+76.19%)
Mutual labels:  bioinformatics
sapporo
A standard implementation conforming to the Global Alliance for Genomics and Health (GA4GH) Workflow Execution Service (WES) API specification and a web application for managing and executing those WES services.
Stars: ✭ 17 (-19.05%)
Mutual labels:  bioinformatics
checklister
Checklist generator
Stars: ✭ 18 (-14.29%)
Mutual labels:  plant
searchgui
Highly adaptable common interface for proteomics search and de novo engines
Stars: ✭ 22 (+4.76%)
Mutual labels:  bioinformatics
qmflows
This library tackles the construction and efficient execution of computational chemistry workflows
Stars: ✭ 35 (+66.67%)
Mutual labels:  bioinformatics
BioD
A D library for computational biology and bioinformatics
Stars: ✭ 45 (+114.29%)
Mutual labels:  bioinformatics
weblogo
WebLogo 3: Sequence Logos redrawn
Stars: ✭ 106 (+404.76%)
Mutual labels:  bioinformatics
docker-builds
📦 🐳 Dockerfiles and documentation on tools for public health bioinformatics
Stars: ✭ 84 (+300%)
Mutual labels:  bioinformatics
gff3toembl
Converts Prokka GFF3 files to EMBL files for uploading annotated assemblies to EBI
Stars: ✭ 27 (+28.57%)
Mutual labels:  bioinformatics
lexicon-mono-seq
DOM Text Based Multiple Sequence Alignment Library
Stars: ✭ 15 (-28.57%)
Mutual labels:  bioinformatics
platon
Identification & characterization of bacterial plasmid-borne contigs from short-read draft assemblies.
Stars: ✭ 52 (+147.62%)
Mutual labels:  bioinformatics
MMseqs2-App
MMseqs2 app to run on your workstation or servers
Stars: ✭ 16 (-23.81%)
Mutual labels:  bioinformatics
immunedb
ImmuneDB - A system for the analysis and exploration of high-throughput adaptive immune receptor sequencing data
Stars: ✭ 13 (-38.1%)
Mutual labels:  bioinformatics
snakefmt
The uncompromising Snakemake code formatter
Stars: ✭ 78 (+271.43%)
Mutual labels:  bioinformatics
varsome-api-client-python
Example client programs for Saphetor's VarSome annotation API
Stars: ✭ 21 (+0%)
Mutual labels:  bioinformatics
streamformatics
Real-time species-typing visualisation for nanopore data.
Stars: ✭ 13 (-38.1%)
Mutual labels:  bioinformatics

DeepSignal-plant

Python GitHub-License

PyPI-version PyPI-DownloadsConda-version Conda-Downloads

A deep-learning method for detecting methylation state from Oxford Nanopore sequencing reads of plants.

deepsignal-plant applies BiLSTM to detect methylation from Nanopore reads. It is built on Python3 and PyTorch.

Known issues

  • The VBZ compression issue is not completely solved yet. Please try the commands listed below, normally it will work after setting HDF5_PLUGIN_PATH:
# 1. install hdf5/hdf5-tools
# ubuntu
sudo apt-get install libhdf5-serial-dev hdf5-tools
# centos
sudo yum install hdf5-devel

# 2. download ont-vbz-hdf-plugin-1.0.1-Linux-x86_64.tar.gz (or newer version) and set HDF5_PLUGIN_PATH
# https://github.com/nanoporetech/vbz_compression/releases
wget https://github.com/nanoporetech/vbz_compression/releases/download/v1.0.1/ont-vbz-hdf-plugin-1.0.1-Linux-x86_64.tar.gz
tar zxvf ont-vbz-hdf-plugin-1.0.1-Linux-x86_64.tar.gz
export HDF5_PLUGIN_PATH=/abslolute/path/to/ont-vbz-hdf-plugin-1.0.1-Linux/usr/local/hdf5/lib/plugin

References: issue #8, tombo issue #254, and vbz_compression issue #5.

Contents

Installation

deepsignal-plant is built on Python3 and PyTorch. Guppy and tombo are required to basecall and re-squiggle the raw signals from nanopore reads before running deepsignal-plant.

1. Create an environment

We highly recommend using a virtual environment for the installation of deepsignal-plant and its dependencies. A virtual environment can be created and (de)activated as follows using conda:

# create
conda create -n deepsignalpenv python=3.6

# activate
conda activate deepsignalpenv

# deactivate
conda deactivate

The virtual environment can also be created using virtualenv.

2. Install deepsignal-plant

  • After the environment being created and activated, deepsignal-plant can be installed using conda/pip, or from github directly:
# install using conda
conda install -c bioconda deepsignal-plant

# or install using pip
pip install deepsignal-plant

# or install from github (latest version)
git clone https://github.com/PengNi/deepsignal-plant.git
cd deepsignal-plant
python setup.py install
  • PyTorch can be automatically installed during the installation of deepsignal-plant. However, if the version of PyTorch installed is not appropriate for your OS, an appropriate version should be re-installed in the same environment as the instructions:
# install using conda
conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch

# or install using pip
pip install torch==1.4.0 torchvision==0.5.0
# install using conda
conda install -c bioconda ont-tombo

# or install using pip
pip install ont-tombo

Trained models

Currently, we have trained the following models:

Example data

Quick start

To call modifications, the raw fast5 files should be basecalled by Guppy (version>=3.6.1) and then be re-squiggled by tombo (version 1.5.1). At last, modifications of specified motifs can be called by deepsignal. Belows are commands to call 5mC in CG, CHG, and CHH contexts:

# Download and unzip the example data and pre-trained models.
# 1. guppy basecall using GPU
guppy_basecaller -i fast5s/ -r -s fast5s_guppy \
  --config dna_r9.4.1_450bps_hac_prom.cfg \
  --device CUDA:0

# 2. tombo resquiggle
cat fast5s_guppy/*.fastq > fast5s_guppy.fastq
tombo preprocess annotate_raw_with_fastqs --fast5-basedir fast5s/ \
  --fastq-filenames fast5s_guppy.fastq \
  --sequencing-summary-filenames fast5s_guppy/sequencing_summary.txt \
  --basecall-group Basecall_1D_000 --basecall-subgroup BaseCalled_template \
  --overwrite --processes 10
tombo resquiggle fast5s/ GCF_000001735.4_TAIR10.1_genomic.fna \
  --processes 10 --corrected-group RawGenomeCorrected_000 \
  --basecall-group Basecall_1D_000 --overwrite

# 3. deepsignal-plant call_mods
# 5mCs in all contexts (CG, CHG, and CHH) can be called at one time
CUDA_VISIBLE_DEVICES=0 deepsignal_plant call_mods --input_path fast5s/ \
  --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt \
  --result_file fast5s.C.call_mods.tsv \
  --corrected_group RawGenomeCorrected_000 \
  --motifs C --nproc 30 --nproc_gpu 6
deepsignal_plant call_freq --input_path fast5s.C.call_mods.tsv \
  --result_file fast5s.C.call_mods.frequency.tsv
# split 5mC call_freq file into CG/CHG/CHH call_freq files
python /path/to/deepsignal_plant/scripts/split_freq_file_by_5mC_motif.py \
  --freqfile fast5s.C.call_mods.frequency.tsv

Usage

1. Basecall and re-squiggle

Before running deepsignal, the raw reads should be basecalled by Guppy (version>=3.6.1) and then be processed by the re-squiggle module of tombo (version 1.5.1).

Note:

  • If the fast5 files are in multi-read FAST5 format, please use multi_to_single_fast5 command from the ont_fast5_api package to convert the fast5 files before using tombo (Ref to issue #173 in tombo).
multi_to_single_fast5 -i $multi_read_fast5_dir -s $single_read_fast5_dir -t 30 --recursive

For the example data:

# 1. basecall using GPU
guppy_basecaller -i fast5s/ -r -s fast5s_guppy \
  --config dna_r9.4.1_450bps_hac_prom.cfg \
  --device CUDA:0
# or using CPU
guppy_basecaller -i fast5s/ -r -s fast5s_guppy \
  --config dna_r9.4.1_450bps_hac_prom.cfg

# 2. proprecess fast5 if basecall results are saved in fastq format
cat fast5s_guppy/*.fastq > fast5s_guppy.fastq
tombo preprocess annotate_raw_with_fastqs --fast5-basedir fast5s/ \
  --fastq-filenames fast5s_guppy.fastq \
  --sequencing-summary-filenames fast5s_guppy/sequencing_summary.txt \
  --basecall-group Basecall_1D_000 --basecall-subgroup BaseCalled_template \
  --overwrite --processes 10

# 3. resquiggle, cmd: tombo resquiggle $fast5_dir $reference_fa
tombo resquiggle fast5s/ GCF_000001735.4_TAIR10.1_genomic.fna \
  --processes 10 --corrected-group RawGenomeCorrected_000 \
  --basecall-group Basecall_1D_000 --overwrite

2. extract features

Features of targeted sites can be extracted for training or testing.

For the example data (By default, deepsignal-plant extracts 13-mer-seq and 13*16-signal features of each CpG motif in reads. Note that the value of --corrected_group must be the same as that of --corrected-group in tombo.):

# extract features of all Cs
deepsignal_plant extract -i fast5s \
  -o fast5s.C.features.tsv --corrected_group RawGenomeCorrected_000 \
  --nproc 30 --motifs C

The extracted_features file is a tab-delimited text file in the following format:

  • chrom: the chromosome name
  • pos: 0-based position of the targeted base in the chromosome
  • strand: +/-, the aligned strand of the read to the reference
  • pos_in_strand: 0-based position of the targeted base in the aligned strand of the chromosome (legacy column, not necessary for downstream analysis)
  • readname: the read name
  • read_strand: t/c, template or complement
  • k_mer: the sequence around the targeted base
  • signal_means: signal means of each base in the kmer
  • signal_stds: signal stds of each base in the kmer
  • signal_lens: lens of each base in the kmer
  • raw_signals: signal values for each base of the kmer, splited by ';'
  • methy_label: 0/1, the label of the targeted base, for training

3. call modifications

To call modifications, either the extracted-feature file or the raw fast5 files (recommended) can be used as input.

For the example data:

# call 5mCs for instance

# extracted-feature file as input, use CPU
CUDA_VISIBLE_DEVICES=-1 deepsignal_plant call_mods --input_path fast5s.C.features.tsv \
  --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt \
  --result_file fast5s.C.call_mods.tsv \
  --nproc 30
# extracted-feature file as input, use GPU
CUDA_VISIBLE_DEVICES=0 deepsignal_plant call_mods --input_path fast5s.C.features.tsv \
  --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt \
  --result_file fast5s.C.call_mods.tsv \
  --nproc 30 --nproc_gpu 6

# fast5 files as input, use CPU
CUDA_VISIBLE_DEVICES=-1 deepsignal_plant call_mods --input_path fast5s/ \
  --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt \
  --result_file fast5s.C.call_mods.tsv \
  --corrected_group RawGenomeCorrected_000 \
  --motifs C --nproc 30
# fast5 files as input, use GPU
CUDA_VISIBLE_DEVICES=0 deepsignal_plant call_mods --input_path fast5s/ \
  --model_path model.dp2.CNN.arabnrice2-1_120m_R9.4plus_tem.bn13_sn16.both_bilstm.epoch6.ckpt \
  --result_file fast5s.C.call_mods.tsv \
  --corrected_group RawGenomeCorrected_000 \
  --motifs C --nproc 30 --nproc_gpu 6

The modification_call file is a tab-delimited text file in the following format:

  • chrom: the chromosome name
  • pos: 0-based position of the targeted base in the chromosome
  • strand: +/-, the aligned strand of the read to the reference
  • pos_in_strand: 0-based position of the targeted base in the aligned strand of the chromosome (legacy column, not necessary for downstream analysis)
  • readname: the read name
  • read_strand: t/c, template or complement
  • prob_0: [0, 1], the probability of the targeted base predicted as 0 (unmethylated)
  • prob_1: [0, 1], the probability of the targeted base predicted as 1 (methylated)
  • called_label: 0/1, unmethylated/methylated
  • k_mer: the kmer around the targeted base

4. call frequency of modifications

A modification-frequency file can be generated by call_freq function with the call_mods file as input:

# call 5mCs for instance

# output in tsv format
deepsignal_plant call_freq --input_path fast5s.C.call_mods.tsv \
  --result_file fast5s.C.call_mods.frequency.tsv
# output in bedMethyl format
deepsignal_plant call_freq --input_path fast5s.C.call_mods.tsv \
  --result_file fast5s.C.call_mods.frequency.bed --bed
# use --sort to sort the results
deepsignal_plant call_freq --input_path fast5s.C.call_mods.tsv \
  --result_file fast5s.C.call_mods.frequency.bed --bed --sort

The modification_frequency file can be either saved in bedMethyl format (by setting --bed as above), or saved as a tab-delimited text file in the following format by default:

  • chrom: the chromosome name
  • pos: 0-based position of the targeted base in the chromosome
  • strand: +/-, the aligned strand of the read to the reference
  • pos_in_strand: 0-based position of the targeted base in the aligned strand of the chromosome (legacy column, not necessary for downstream analysis)
  • prob_0_sum: sum of the probabilities of the targeted base predicted as 0 (unmethylated)
  • prob_1_sum: sum of the probabilities of the targeted base predicted as 1 (methylated)
  • count_modified: number of reads in which the targeted base counted as modified
  • count_unmodified: number of reads in which the targeted base counted as unmodified
  • coverage: number of reads aligned to the targeted base
  • modification_frequency: modification frequency
  • k_mer: the kmer around the targeted base

5. denoise training samples

# please use deepsignal_plant denoise -h/--help for instructions
deepsignal_plant denoise --train_file /path/to/train/file

6. train new models

A new model can be trained as follows:

# need to split training samples to two independent datasets for training and validating
# please use deepsignal_plant train -h/--help for instructions
deepsignal_plant train --train_file /path/to/train/file \
  --valid_file /path/to/valid/file \
  --model_dir /dir/to/save/the/new/model

License

Copyright (C) 2020 Jianxin Wang, Feng Luo, Peng Ni

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Jianxin Wang, Peng Ni, School of Computer Science and Engineering, Central South University, Changsha 410083, China

Feng Luo, School of Computing, Clemson University, Clemson, SC 29634, USA

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].