lh3 / Dna Nn

Model and predict short DNA sequence features with neural networks

Introduction

Dna-nn implements a proof-of-concept deep-learning model that learns relatively simple features of DNA sequences. So far it has been trained to identify (ATTCC)n and alpha satellite repeats, which occupy over 5% of the human genome. Taking the RepeatMasker annotations as the ground truth, dna-nn achieves high classification accuracy. It can annotate a human PacBio assembly in less than 1.5 hours using 16 CPUs, while RepeatMasker may take several days on 32 CPUs. Dna-nn is a practical tool for finding these two types of satellites.

Dna-nn may have the potential to learn other types of sequence features: it can also identify Alu repeats accurately. However, it has low sensitivity to beta satellites and fails to learn L1 repeats, even with more hidden neurons.

Installation

Dna-nn is implemented in C and includes the source code of the KANN deep-learning framework. The only external dependency is zlib. To compile,

git clone https://github.com/lh3/dna-nn
cd dna-nn
make

Usage

Applying a trained model

To find (ATTCC)n and alpha satellites for long contigs,

./dna-brnn -Ai models/attcc-alpha.knm -t16 seq.fa > seq.bed

The output is a BED file. A label 1 in the 4th column indicates that the interval is a region of (ATTCC)n; label 2 indicates a region of alpha satellites.
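As a quick way to inspect such output, the following minimal Python sketch totals the annotated bases per label. It assumes only what is stated above: tab-separated lines with the label in the 4th column. The function name `summarize_bed` and the sample intervals are made up for illustration and are not part of dna-nn.

```python
from collections import defaultdict

# Label meanings for the attcc-alpha.knm model, as described in this README.
LABEL_NAMES = {"1": "(ATTCC)n", "2": "alpha satellite"}

def summarize_bed(lines):
    """Return total annotated bases per label from 4-column BED lines."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        start, end, label = int(fields[1]), int(fields[2]), fields[3]
        totals[label] += end - start  # BED intervals are 0-based, half-open
    return dict(totals)

# Hypothetical example intervals, not real dna-brnn output.
example = [
    "ctg1\t100\t600\t1",
    "ctg1\t2000\t5000\t2",
    "ctg2\t0\t250\t1",
]
totals = summarize_bed(example)
for label in sorted(totals):
    print(LABEL_NAMES.get(label, label), totals[label])
```

To run it on real output, pass an open file handle, for example `summarize_bed(open("seq.bed"))`.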

Training

Training dna-nn requires sequences in the FASTQ format, where each "base quality" indicates the label of the corresponding base.
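To make the idea of labels stored as base qualities concrete, here is a small Python sketch that builds one such FASTQ record. It is only an illustration: the ASCII offset of 33, the use of label 0 for unannotated bases, and the function name `make_labeled_fastq` are assumptions, not details taken from dna-nn. In practice the bundled gen-fq program produces the training file.

```python
def make_labeled_fastq(name, seq, intervals, offset=33):
    """Build one FASTQ record whose quality string encodes per-base labels.

    intervals: iterable of (start, end, label), 0-based and half-open.
    offset: ASCII offset added to each label (an assumption, not from dna-nn).
    """
    labels = [0] * len(seq)  # assumed background label for unannotated bases
    for start, end, label in intervals:
        for i in range(max(start, 0), min(end, len(seq))):
            labels[i] = label
    qual = "".join(chr(offset + lb) for lb in labels)
    return f"@{name}\n{seq}\n+\n{qual}\n"

# Hypothetical 16-base sequence with bases 4..13 marked as label 1.
record = make_labeled_fastq("ctg1", "ACGTATTCCATTCCGG", [(4, 14, 1)])
print(record, end="")
```

The sequence line and the quality line always have the same length, so every base carries exactly one label.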

The following commands show how we generated the pre-trained model attcc-alpha.knm.

# Install the k8 javascript shell (if this has not been done)
curl -L https://github.com/attractivechaos/k8/releases/download/v0.2.4/k8-0.2.4.tar.bz2 | tar -jxf -
cp k8-0.2.4/k8-`uname -s` k8              # or copy it to a directory on your $PATH
# Run RepeatMasker to generate truth data
RepeatMasker -species human -pa 16 -e ncbi -xsmall -small -dir . train.fa
# The last column indicates the label of each region in the output BED
./k8 parse-rm.js train.fa.out > train.rm.bed
# Generate training data in FASTQ. Base qualities indicate labels.
./gen-fq -m2 train.fa train.rm.bed > train.lb2.fq
# Training. We trained 10 models with different random seeds
./dna-brnn -t8 -n32 -b5m -m50 -s14 -o attcc-alpha.knm train.lb2.fq

Evaluation

With truth annotations in the FASTQ format, we can evaluate the accuracy of a model with

./dna-brnn -Ei models/attcc-alpha.knm -t16 seq.fa > /dev/null

The stderr output gives the accuracy for each label.
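For readers who want to reproduce a per-label comparison outside dna-brnn, the sketch below computes sensitivity and precision for each label from two per-base label sequences. The metrics and the report format that dna-brnn itself prints are not described here, so treat this as an independent illustration; the function name `per_label_stats` and the toy label arrays are hypothetical.

```python
def per_label_stats(truth, pred):
    """Return {label: (sensitivity, precision)} for two equal-length label sequences."""
    if len(truth) != len(pred):
        raise ValueError("truth and prediction must have the same length")
    stats = {}
    for label in sorted(set(truth) | set(pred)):
        # Bases where truth and prediction both carry this label.
        tp = sum(1 for t, p in zip(truth, pred) if t == label and p == label)
        n_truth = sum(1 for t in truth if t == label)
        n_pred = sum(1 for p in pred if p == label)
        sens = tp / n_truth if n_truth else float("nan")
        prec = tp / n_pred if n_pred else float("nan")
        stats[label] = (sens, prec)
    return stats

# Toy per-base labels: 0 = background, 1 and 2 = the two satellite classes.
truth = [0, 0, 1, 1, 1, 1, 2, 2, 0, 0]
pred = [0, 1, 1, 1, 1, 0, 2, 2, 2, 0]
stats = per_label_stats(truth, pred)
for label, (sens, prec) in stats.items():
    print(f"label {label}: sensitivity={sens:.3f} precision={prec:.3f}")
```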

Citing dna-nn

If you use dna-nn in your work, please cite its paper:

Li, H (2019) Identifying centromeric satellites with dna-brnn, Bioinformatics, doi:10.1093/bioinformatics/btz264
