SchulzLab / ORNA

Licence: MIT license
Fast in-silico normalization algorithm for NGS data

About

The de Bruijn graph (DBG) is one of the most commonly used data structures for assembling sequencing data. Reads from the sequencer are chopped into small words of size k (k-mers), which form the nodes of the DBG. Two nodes are connected by an edge if their k-mers overlap by k-1 bases. Each edge can be labelled with the (k+1)-mer formed by merging the k-mers of its two nodes. For instance, an edge connecting the nodes ATCG and TCGT is labelled ATCGT. An assembly is generated by traversing paths in this graph.

With the advances in deep sequencing technologies, assembling high-coverage datasets has become a challenge in terms of memory and runtime requirements. Hence read normalization, a lossy read filtering approach, is gaining a lot of attention. Although current normalization algorithms are efficient, they provide no guarantee of preserving the important k-mers that form connections between different genomic regions in the graph, so the resulting assembly may be fragmented. In this work, normalization is phrased as a set multicover problem on reads, and a linear-time heuristic algorithm named ORNA (Optimized Read Normalization Algorithm) is proposed. ORNA normalizes to the minimum number of reads required to retain all labels ((k+1)-mers), and in turn all k-mers, as well as the relative label abundances from the original dataset. Hence no connections from the original graph are lost and coverage information is preserved.
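The node and edge-label definitions above can be illustrated with a minimal Python sketch. The function names `kmers` and `edge_labels` are hypothetical and are not part of ORNA; the sketch also ignores reverse complements, which real k-mer counters usually take into account.

```python
def kmers(read, k):
    """Return all k-mers of a read, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def edge_labels(read, k):
    """Return the (k+1)-mer edge labels spanned by a read.

    Each label joins two consecutive k-mers that overlap by k-1 bases,
    so the labels are simply the (k+1)-mers of the read.
    """
    return kmers(read, k + 1)


read = "ATCGT"
nodes = kmers(read, 4)          # node k-mers for k = 4
labels = edge_labels(read, 4)   # edge labels, i.e. 5-mers
print(nodes)
print(labels)
```

For the read ATCGT and k = 4 this yields the two nodes from the example in the text and the single label connecting them.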

When to use ORNA

ORNA is a read normalization tool developed in the spirit of Diginorm. ORNA is computationally inexpensive and guarantees the preservation of all kmers from the original dataset. Because it removes redundancy from the data, it is useful when you have a high-coverage dataset but not enough computational resources (memory in particular, but also time) to run a de novo assembly. It can also be used to merge many sequencing datasets. Be aware that using ORNA (or, for that matter, any normalization software) may have a significant impact on the resulting assemblies, and that this impact is highly dependent on the dataset.

Enhancements to ORNA

We have implemented two additional options in ORNA to improve the reduction performance using either abundance values of kmers in reads or base quality scores.

ORNA-Q (parameter: -sorting 1):

In this mode, apart from preserving all labels from the original dataset, ORNA also maximizes the total read quality score of the normalized dataset. The read quality score of a read is defined as the sum of the Phred qualities of its bases. Before reduction, ORNA-Q sorts the input dataset by read quality score using a counting sort.
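As a rough illustration of the read quality score and the counting sort, here is a minimal Python sketch. It is not ORNA's implementation: the function names are hypothetical, the Phred+33 quality encoding is an assumption, and so is the choice to place the highest-scoring reads first.

```python
def read_quality_score(quality_string, offset=33):
    """Sum of the Phred qualities of a read (assumes Phred+33 encoding)."""
    return sum(ord(c) - offset for c in quality_string)


def counting_sort_by_quality(reads):
    """Sort (sequence, quality_string) pairs by descending quality score.

    Scores are non-negative integers, so each read can be dropped into a
    bucket indexed by its score instead of being compared pairwise.
    Descending order is an assumption made for this sketch.
    """
    if not reads:
        return []
    scores = [read_quality_score(qual) for _, qual in reads]
    buckets = [[] for _ in range(max(scores) + 1)]
    for read, score in zip(reads, scores):
        buckets[score].append(read)
    # Walk the buckets from the highest score to the lowest.
    return [read for bucket in reversed(buckets) for read in bucket]


reads = [("ACGT", "!!!!"), ("ACGA", "IIII"), ("ACGC", "5555")]
sorted_reads = counting_sort_by_quality(reads)
print([seq for seq, _ in sorted_reads])
```

Counting sort fits here because the scores are bounded integers, which keeps the sorting step linear in the number of reads plus the score range.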

ORNA-K (parameter: -ksorting 1)

In this mode, the normalization algorithm maximizes the total read abundance score of the normalized dataset (apart from preserving all labels from the original dataset). The read abundance score of a read is defined as the median of abundances of kmers present in the read. ORNA-K sorts the input dataset using the median kmer abundances of the reads in the dataset and then uses ORNA for reduction.
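The read abundance score can be sketched in Python as follows. This is an illustration under stated assumptions rather than ORNA's code: the helper names are hypothetical, k-mers are counted without merging reverse complements, and reads are ordered from highest to lowest score.

```python
from collections import Counter
from statistics import median


def kmer_abundances(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def read_abundance_score(read, counts, k):
    """Median abundance of the k-mers in a read (0 if the read is too short)."""
    values = [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
    return median(values) if values else 0


def sort_by_abundance(reads, k):
    """Order reads by descending abundance score (order is an assumption)."""
    counts = kmer_abundances(reads, k)
    return sorted(reads,
                  key=lambda r: read_abundance_score(r, counts, k),
                  reverse=True)


reads = ["GGGT", "AAAC", "AAAC"]
ordered = sort_by_abundance(reads, k=3)
print(ordered)
```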

ORNA Algorithm

Input: Read set R, LogBase b, kmer size k
Initialization: k' = k+1
                n = NumberOfDistinctK'mers(R)
                counter(0,...,n) = 0
                Rout = null
Steps:
    for r in R:
        flag = 0
        V' = ObtainK'mers(r)
        for v in V':
            if (counter(v) < min(abundance(v), log_b(abundance(v)))) then:
                counter(v)++
                flag = 1
            end if
        end for
        if flag != 0 then:
            Rout = Rout U r
        end if
    end for
Output: Rout

  • ORNA uses GATB version 1.2.2 to store the kmer information.
  • It reduces the abundance of each kmer to the logarithm of its original abundance. The base b of the logarithm is provided by the user.
  • ORNA was tested with two de Bruijn graph based assemblers, Oases and TransABySS, and also worked for the assembly of metagenomics data.
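The pseudocode can be turned into a small runnable Python sketch. This is an illustration, not ORNA's C++/GATB implementation. Two details are assumptions made here and are not spelled out in the pseudocode: the logarithm is rounded up to an integer, and the threshold is clamped to at least 1 so that labels seen only once are still retained, in line with the stated guarantee that all labels are kept. The names `labels_of`, `threshold` and `normalize` are hypothetical.

```python
import math
from collections import Counter


def labels_of(read, k_prime):
    """All k'-mers (edge labels) of a read."""
    return [read[i:i + k_prime] for i in range(len(read) - k_prime + 1)]


def threshold(abundance, base):
    """Target count for a label: log_b(abundance), capped by the abundance.

    Rounding up and the lower bound of 1 are assumptions of this sketch.
    """
    return max(1, min(abundance, math.ceil(math.log(abundance, base))))


def normalize(reads, k=21, base=1.7):
    """Keep a read if it contributes at least one label still below its target."""
    k_prime = k + 1
    abundance = Counter(l for r in reads for l in labels_of(r, k_prime))
    counter = Counter()
    kept = []
    for read in reads:
        keep = False
        for label in labels_of(read, k_prime):
            if counter[label] < threshold(abundance[label], base):
                counter[label] += 1
                keep = True
        if keep:
            kept.append(read)
    return kept


reads = ["ACGTA"] * 10 + ["TTGCA"]
kept = normalize(reads, k=4, base=2)
print(len(reads), "->", len(kept))
```

In this toy input the label ACGTA occurs ten times and is reduced to four copies (log base 2 of 10, rounded up), while the singleton TTGCA is kept.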

Points to be noted

  • Because ORNA retains all kmers from the original dataset, it also retains erroneous kmers. Like any other read reduction tool, ORNA therefore removes more reads when the data has been error corrected. For RNA-seq or other non-uniform data we suggest using the SEECER algorithm, which proved to work well with ORNA.
  • ORNA-Q, ORNA-K and ORNA's paired-end mode currently do not support multithreading. Work on this is in progress and will be included in future versions of ORNA.

Version

Version 0.4

Contact

For questions or suggestions regarding ORNA contact

  • Dilip A Durai (ddurai_at_mmci.uni-saarland.de)
  • Marcel H Schulz (marcel.schulz_at_em.uni-frankfurt.de)

Download

There are two ways to obtain ORNA: download it from GitHub or install it through bioconda.

If you use bioconda then installation is as easy as:

  conda install ORNA

Alternatively, the software can be downloaded by using the following command

	git clone https://github.com/SchulzLab/ORNA

The downloaded folder should contain the following files and folders:

  • install.sh
  • gatb-core (initially empty; files are copied in when the install script is run)
  • src (folder containing the source code for ORNA)

Pre-requisite

Linux operating system with gcc version >= 4.7.
All analyses for the manuscript were performed on the Debian 8 operating system.

Installation

  • If you downloaded ORNA from GitHub, run the following command to install it.
  bash install.sh
  • The above command should create a build folder. The executable of ORNA will be in build/bin

ORNA parameters

./bin/ORNA -help

Parameter   Explanation                                             Note
-help       shows the help message
-sorting    (0 or 1) quality based sorting of input data            Default 0
-ksorting   (0 or 1) kmer abundance based sorting of input data     Default 0
-base       Base value for the logarithmic function                 Default 1.7
-kmer       the value of k for kmer size                            Default 21
-input      Input fasta file (for single end mode)
-pair1      First mate of the pair (for paired-end mode)
-pair2      Second mate of the pair (for paired-end mode)
-output     Prefix of the output file                               Default "Normalized"
-nb-cores   number of cores (does not work for paired end mode)     Default 1
-type       type of the output file (fasta/fastq)                   Default fasta

kmer value:
This parameter represents the kmer size to be used for reduction. As we aim at preserving all the edge labels ((k+1)-mers) from the original dataset, the kmer size given by the user is incremented by 1 internally. For instance, if the user provides a kmer size of 21, ORNA uses a kmer size of 22 for all its calculations. All analyses in the paper were done using a kmer size of 21 for reads of length 50 bp and 76 bp. If you are running a DBG assembly afterwards, we recommend using the smallest k-mer size used by the assembler. Memory and runtime requirements vary with both the dataset and k.

base:
This parameter represents the base of the logarithm function used to decide the new abundance of a kmer. For instance, if the original abundance of a kmer is 1000 and a base of 10 is selected, the new abundance is set to log_10(1000) = 3. The higher the base, the more the read set is reduced. According to the analysis in the ORNA paper, a base of 1.7 seems to be a good compromise between data reduction and a small loss in assembly quality. More examples can be found in this answer.
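The effect of the base can be checked with a few lines of Python. This is only a worked calculation of the logarithm described above; the function name `target_abundance` is hypothetical, and any rounding ORNA applies internally is not modelled.

```python
import math


def target_abundance(abundance, base):
    """Logarithm of the original abundance in the chosen base."""
    return math.log(abundance, base)


# Compare a few bases for a kmer that occurs 1000 times.
for base in (1.3, 1.7, 10):
    print(f"base {base}: {target_abundance(1000, base):.2f}")
```

A smaller base leaves a larger target abundance and therefore removes fewer reads, while a larger base reduces the data more aggressively.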

Running ORNA

  • To run ORNA, execute the following command from the installation directory
  ./build/bin/ORNA -input Dataset_name -output Output -base LogBase -kmer kmerSize -nb-cores NumberOfThreads -type fasta
  • Run ORNA in paired-end mode from the installation directory
  ./build/bin/ORNA -pair1 first_pair -pair2 second_pair -output Output -base LogBase -kmer kmerSize -type fasta
  • For instance, if the dataset to be normalized is named input.fa, the following command normalizes it using a log base of 1.7 and a kmer size of 21
  ./build/bin/ORNA -input input.fa -output output.fa -base 1.7 -kmer 21 -nb-cores 1
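When ORNA is called from a pipeline script, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not something shipped with ORNA. It only uses the flags listed in the parameter section and assumes the binary sits at ./build/bin/ORNA relative to the working directory.

```python
def orna_command(input_path, output_prefix, base=1.7, kmer=21, cores=1,
                 out_type="fasta", binary="./build/bin/ORNA"):
    """Build the argument list for a single-end ORNA run.

    The default binary path is an assumption; adjust it to your installation.
    """
    return [
        binary,
        "-input", input_path,
        "-output", output_prefix,
        "-base", str(base),
        "-kmer", str(kmer),
        "-nb-cores", str(cores),
        "-type", out_type,
    ]


cmd = orna_command("input.fa", "output")
print(" ".join(cmd))

# To actually run it (requires a built ORNA binary):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.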

Citation

If you use ORNA in the normal mode (without quality or kmer abundance based sorting) in your work, please cite:

Durai DA, Schulz MH. In-silico read normalization with set multicover optimization. Bioinformatics 2018 full text

If you use ORNA-Q or ORNA-K (with quality or kmer abundance based sorting), please cite:

Durai DA, Schulz MH. Improving in-silico normalization using read weights. Scientific Reports 2019 full text

Acknowledgement

ORNA uses the GATB library for graph building and k-mer counting. We are thankful for their support.
