lh3 / dna-nn

Model and predict short DNA sequence features with neural networks

Introduction

Dna-nn implements a proof-of-concept deep-learning model to learn relatively simple features on DNA sequences. So far it has been trained to identify the (ATTCC)n and alpha satellite repeats, which occupy over 5% of the human genome. Taking the RepeatMasker annotations as the ground truth, dna-nn is able to achieve high classification accuracy. It can annotate a human PacBio assembly in less than 1.5 hours using 16 CPUs, while RepeatMasker may take several days on 32 CPUs. Dna-nn is a practical tool to find these two types of satellites.

Dna-nn may have the potential to learn other types of sequence features. It can accurately identify Alu repeats as well. However, it has low sensitivity to beta satellites and fails to learn L1 repeats even with more hidden neurons.

Installation

Dna-nn is implemented in C and includes the source code of the KANN deep-learning framework. The only external dependency is zlib. To compile,

git clone https://github.com/lh3/dna-nn
cd dna-nn
make

Usage

Applying a trained model

To find (ATTCC)n and alpha satellites for long contigs,

./dna-brnn -Ai models/attcc-alpha.knm -t16 seq.fa > seq.bed

The output is a BED file. A label 1 in the 4th column indicates that the interval is a region of (ATTCC)n; label 2 indicates a region of alpha satellites.
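As a rough illustration of how the BED output might be consumed downstream, the sketch below sums the number of bases assigned to each label. It assumes a tab-delimited file whose first four columns are contig, start, end and label; the function name `summarize_bed` and the toy intervals are hypothetical and not part of dna-nn.

```python
import io
from collections import defaultdict

# Label meanings as described above: 1 = (ATTCC)n, 2 = alpha satellite.
LABEL_NAMES = {"1": "(ATTCC)n", "2": "alpha satellite"}


def summarize_bed(handle):
    """Return {label: total bases covered} from a 4-column BED stream."""
    totals = defaultdict(int)
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not enough columns to carry a label
        start, end, label = int(fields[1]), int(fields[2]), fields[3]
        totals[label] += end - start  # BED intervals are 0-based, half-open
    return dict(totals)


# Made-up intervals, used only to show the assumed layout.
example = io.StringIO(
    "ctg1\t100\t600\t1\n"
    "ctg1\t1000\t4000\t2\n"
    "ctg2\t0\t250\t2\n"
)
totals = summarize_bed(example)
for label, bases in sorted(totals.items()):
    print(LABEL_NAMES.get(label, label), bases)
```

For a real run, the same function could be applied to an open file handle on seq.bed instead of the in-memory example.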

Training

Training dna-nn requires sequences in the FASTQ format, where each "base quality" indicates the label of the corresponding base.
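To make the idea of labels stored as base qualities concrete, here is a minimal sketch that writes one quality character per base. The ASCII offset of 33 is an assumption borrowed from standard FASTQ conventions; the encoding actually produced by gen-fq may differ and should be checked against its source. The helper names `label_qualities` and `fastq_record` are hypothetical.

```python
def label_qualities(seq_len, intervals, offset=33):
    """Build a quality string in which each character encodes a base label.

    intervals: iterable of (start, end, label), 0-based and half-open.
    offset: ASSUMED ASCII offset; verify against gen-fq before relying on it.
    Bases outside every interval receive label 0.
    """
    labels = [0] * seq_len
    for start, end, label in intervals:
        # Clip each interval to the sequence bounds before labeling.
        for i in range(max(start, 0), min(end, seq_len)):
            labels[i] = label
    return "".join(chr(offset + x) for x in labels)


def fastq_record(name, seq, intervals, offset=33):
    """Format a four-line FASTQ record with labels in the quality line."""
    qual = label_qualities(len(seq), intervals, offset)
    return f"@{name}\n{seq}\n+\n{qual}\n"


# Toy sequence: 4 unlabeled bases followed by two ATTCC units with label 1.
record = fastq_record("toy", "ACGTATTCCATTCC", [(4, 14, 1)])
print(record, end="")
```

The point of the layout is that the label track stays aligned with the sequence base by base, so any standard FASTQ parser can read both together.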

The following command lines show how we generated the pre-trained model attcc-alpha.knm.

# Install the k8 JavaScript shell (if this has not been done)
curl -L https://github.com/attractivechaos/k8/releases/download/v0.2.4/k8-0.2.4.tar.bz2 | tar -jxf -
cp k8-0.2.4/k8-`uname -s` k8              # or copy it to a directory on your $PATH
# Run RepeatMasker to generate truth data
RepeatMasker -species human -pa 16 -e ncbi -xsmall -small -dir . train.fa
# The last column indicates the label of each region in the output BED
./k8 parse-rm.js train.fa.out > train.rm.bed
# Generate training data in FASTQ. Base qualities indicate labels.
./gen-fq -m2 train.fa train.rm.bed > train.lb2.fq
# Training. We trained 10 models with different random seeds
./dna-brnn -t8 -n32 -b5m -m50 -s14 -o attcc-alpha.knm train.lb2.fq

Evaluation

With truth annotations in the FASTQ format, we can evaluate the accuracy of a model with

./dna-brnn -Ei models/attcc-alpha.knm -t16 seq.fq > /dev/null

The stderr output gives the accuracy for each label.
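The exact statistics that dna-brnn prints are not described here. As a generic sketch of what a per-label accuracy can mean, the code below compares per-base truth and predicted labels and reports sensitivity and precision for each label. The function name `per_label_metrics` and the toy label vectors are hypothetical.

```python
from collections import Counter


def per_label_metrics(truth, pred):
    """Return {label: (sensitivity, precision)} for per-base label sequences."""
    if len(truth) != len(pred):
        raise ValueError("truth and prediction must have the same length")
    tp, n_truth, n_pred = Counter(), Counter(), Counter()
    for t, p in zip(truth, pred):
        n_truth[t] += 1
        n_pred[p] += 1
        if t == p:
            tp[t] += 1  # base labeled correctly
    metrics = {}
    for label in sorted(set(n_truth) | set(n_pred)):
        # Sensitivity: fraction of true bases of this label that were recovered.
        sens = tp[label] / n_truth[label] if n_truth[label] else float("nan")
        # Precision: fraction of predicted bases of this label that are correct.
        prec = tp[label] / n_pred[label] if n_pred[label] else float("nan")
        metrics[label] = (sens, prec)
    return metrics


# Toy per-base labels: 0 = background, 1 and 2 = the two satellite classes.
truth = [0, 0, 1, 1, 1, 2, 2, 0]
pred = [0, 1, 1, 1, 0, 2, 2, 0]
metrics = per_label_metrics(truth, pred)
for label, (sens, prec) in metrics.items():
    print(label, round(sens, 3), round(prec, 3))
```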

Citing dna-nn

If you use dna-nn in your work, please cite its paper:

Li, H (2019) Identifying centromeric satellites with dna-brnn, Bioinformatics, doi:10.1093/bioinformatics/btz264
