shubhamchandak94 / Spring

Licence: other
FASTQ compression

Programming languages: C, C++, CUDA, Python, CMake, Shell

SPRING

Bioinformatics publication

For compressing nanopore long reads, check out the specialized tool NanoSpring: https://github.com/qm2/NanoSpring

SPRING is a compression tool for FASTQ files (containing up to 4.29 billion reads):

  • Near-optimal compression ratios for single-end and paired-end datasets
  • Fast and memory-efficient decompression
  • Supports variable length short reads of length up to 511 bases (without the -l flag)
  • Supports variable length long reads of arbitrary length (up to 4.29 billion bases) (with the -l flag). This mode applies general purpose compression (BSC) directly to the reads, so compression gains might be lower than those without the -l flag.
  • Supports lossless compression of reads, quality scores and read identifiers
  • Supports reordering of reads (while preserving read pairing information) to boost compression
  • Supports quantization of quality values using QVZ, Illumina 8-level binning and binary thresholding
  • Supports decompression of a subset of reads (random access)
  • Supports gzipped FASTQ files as input during compression and as output during decompression
  • Tested on Linux and macOS
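
The 511-base limit for short-read mode can be checked before choosing between the default mode and -l. The following is a minimal Python sketch, not part of SPRING: the function names are hypothetical, and it assumes a well-formed FASTQ file with exactly four lines per record (no wrapped sequences).

```python
import gzip
from pathlib import Path

SHORT_READ_MAX = 511  # short-read limit stated in the feature list above


def max_read_length(fastq_path):
    """Return the longest sequence length in a 4-line-per-record FASTQ file."""
    path = Path(fastq_path)
    # Assumption: a .gz suffix means the file is gzip-compressed.
    opener = gzip.open if path.suffix == ".gz" else open
    longest = 0
    with opener(path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 1:  # second line of each record is the sequence
                longest = max(longest, len(line.rstrip("\r\n")))
    return longest


def needs_long_mode(fastq_path):
    """True if any read exceeds the short-read limit, so -l would be needed."""
    return max_read_length(fastq_path) > SHORT_READ_MAX
```

If needs_long_mode returns True for an input file, the -l flag described in the Usage section applies; otherwise the default short-read mode is expected to compress better.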

Note: If you want to use SPRING only as a tool for reordering reads (approximately according to genome position), take a look at the reorder-only branch.

Install with conda on Linux

To install directly from source or to install on macOS, follow the instructions in the next section.

Spring is now available on conda via the bioconda channel. See this page for installation instructions for conda. Once conda is installed, do the following to install spring.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install spring

Note that if spring is installed this way, it should be invoked with the command spring rather than ./spring. The bioconda help page shows the commands to use if you wish to install spring in an environment. Also note that the bioconda version is compiled using the SSE4.1 instruction set to allow portability across machines. You might get slightly better performance by compiling from source using the instructions below, which use all instructions available on the target machine. On older processors that don't support SSE4.1 instructions, you might get an Illegal instruction error; in such cases, please build from source using the instructions below.
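
To tell in advance whether the bioconda build is likely to hit the Illegal instruction error, one can look for the sse4_1 CPU flag. The sketch below is an illustration only, with a hypothetical function name; it assumes a Linux x86 system where /proc/cpuinfo has a "flags" line.

```python
def cpu_supports_sse41(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo text lists sse4_1."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Each flags line looks like "flags : fpu vme ... sse4_1 sse4_2 ..."
        if key.strip() == "flags" and "sse4_1" in value.split():
            return True
    return False
```

On Linux, the function can be fed the contents of /proc/cpuinfo, for example cpu_supports_sse41(open("/proc/cpuinfo").read()). macOS and non-x86 systems expose CPU features differently and are not covered by this sketch.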

Download

git clone https://github.com/shubhamchandak94/SPRING.git

Install

The instructions below will create the spring executable in the build directory inside SPRING. If you plan to build and run SPRING on separate architectures, you might need to remove or comment out the line set(FLAGS "${FLAGS} -march=native") in CMakeLists.txt (or use flags based on the target architecture). You can also pass the -Dspring_optimize_for_portability=ON option to cmake, which enables only the SSE4.1 instructions and should work on most processors.

On Linux with cmake installed and version at least 3.9 (check using cmake --version):

cd SPRING
mkdir build
cd build
cmake ..
make
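
The cmake version check mentioned above can be automated. This is a small Python sketch with hypothetical helper names; it assumes the first line of `cmake --version` has the usual form "cmake version X.Y.Z".

```python
import re

MIN_CMAKE = (3, 9)  # minimum version stated in the instructions above


def parse_cmake_version(version_output):
    """Extract (major, minor, patch) from `cmake --version` output, or None."""
    match = re.search(r"cmake version (\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def cmake_is_recent_enough(version_output, minimum=MIN_CMAKE):
    """True if the reported cmake version is at least `minimum`."""
    version = parse_cmake_version(version_output)
    return version is not None and version[:2] >= minimum
```

The text passed in would typically come from subprocess.run(["cmake", "--version"], capture_output=True, text=True).stdout. An empty string (cmake not installed) is reported as not recent enough.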

On Linux with cmake not installed or with version older than 3.9 (the steps below build cmake 3.12.4 locally):

cd SPRING
mkdir build
cd build
wget https://cmake.org/files/v3.12/cmake-3.12.4.tar.gz
tar -xzf cmake-3.12.4.tar.gz
cd cmake-3.12.4
./configure
make
cd ..
./cmake-3.12.4/bin/cmake ..
make

On macOS, install the GCC compiler, since Clang has issues with the OpenMP library:

  • Install HomeBrew (https://brew.sh/)
  • Install GCC (this step will be faster if Xcode command line tools are already installed using xcode-select --install):
brew update
brew install gcc@9
  • Set environment variables:
export CC=gcc-9
export CXX=g++-9
  • Delete CMakeCache.txt (if present) from the build directory
  • Follow the steps above for Linux

Usage

Run the spring executable /PATH/TO/spring (or just spring if installed with conda) with the options below:

Allowed options:
  -h [ --help ]                   produce help message
  -c [ --compress ]               compress
  -d [ --decompress ]             decompress
  --decompress-range arg          --decompress-range start end
                                  (optional) decompress only reads (or read
                                  pairs for PE datasets) from start to end
                                  (both inclusive) (1 <= start <= end <=
                                  num_reads (or num_read_pairs for PE)). If -r
                                  was specified during compression, the range
                                  of reads does not correspond to the original
                                  order of reads in the FASTQ file.
  -i [ --input-file ] arg         input file name (two files for paired end)
  -o [ --output-file ] arg        output file name (for paired end
                                  decompression, if only one file is specified,
                                  two output files will be created by suffixing
                                  .1 and .2.)
  -w [ --working-dir ] arg (=.)   directory to create temporary files (default
                                  current directory)
  -t [ --num-threads ] arg (=8)   number of threads (default 8)
  -r [ --allow-read-reordering ]  do not retain read order during compression
                                  (paired reads still remain paired)
  --no-quality                    do not retain quality values during
                                  compression
  --no-ids                        do not retain read identifiers during
                                  compression
  -q [ --quality-opts ] arg       quality mode: possible modes are
                                  1. -q lossless (default)
                                  2. -q qvz qv_ratio (QVZ lossy compression,
                                  parameter qv_ratio roughly corresponds to
                                  bits used per quality value)
                                  3. -q ill_bin (Illumina 8-level binning)
                                  4. -q binary thr high low (binary (2-level)
                                  thresholding, quality binned to high if >=
                                  thr and to low if < thr)
  -l [ --long ]                   Use for compression of arbitrarily long read
                                  lengths. Can also provide better compression
                                  for reads with significant number of indels.
                                  -r disabled in this mode. For Illumina short
                                  reads, compression is better without -l flag.
  -g [ --gzipped_fastq ]          enable if compression input is gzipped fastq
                                  or to output gzipped fastq during
                                  decompression
  --gzip-level arg (=6)           gzip level (0-9) to use during decompression 
                                  if -g flag is specified (default: 6)
  --fasta-input                   enable if compression input is fasta file
                                  (i.e., no qualities)                                

Note that the SPRING compressed files are tar archives consisting of the different compressed streams, although we recommend using the .spring extension as in the examples shown below.
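
Because the compressed file is a tar archive, its member streams can be listed with standard tools. The Python sketch below uses only the standard-library tarfile module; the function name is hypothetical, and no assumption is made about what SPRING names its streams.

```python
import tarfile


def list_streams(archive_path):
    """Return (name, size_in_bytes) for each regular file in a tar archive."""
    with tarfile.open(archive_path, "r") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]
```

For example, list_streams("file.spring") would return one tuple per stream stored in the archive. This only inspects the container; the streams themselves still need spring -d to be decoded.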

Resource usage

For the memory and CPU performance of SPRING, please see the paper and the associated supplementary material. Note that SPRING uses some temporary disk space and can fail if the disk space is not sufficient. Assuming that qualities and ids are not being discarded and SPRING is operating in the short read mode, the additional temporary disk usage is around 10-30% of the original uncompressed file size when the -r flag is not specified (i.e., default lossless mode); it is on the lower end when the quality values are from newer Illumina machines and are more compressible. When the -r flag is specified, SPRING writes all the quality values and read ids to a temporary file, leading to significantly higher temporary disk usage, closer to 70-80% of the original file size. Note that these figures are approximate and include the space needed for the final compressed file.
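
The approximate figures above can be turned into a rough pre-flight check of the working directory. This is a hedged sketch with hypothetical function names: the percentages are copied from the paragraph above and apply only to short-read mode with qualities and ids retained.

```python
import shutil

# Approximate percentages of the uncompressed input size, from the text above.
TEMP_PERCENT_DEFAULT = (10, 30)  # without -r
TEMP_PERCENT_REORDER = (70, 80)  # with -r


def estimate_temp_bytes(uncompressed_bytes, reorder=False):
    """Return a (low, high) estimate of temporary disk usage in bytes."""
    low, high = TEMP_PERCENT_REORDER if reorder else TEMP_PERCENT_DEFAULT
    return uncompressed_bytes * low // 100, uncompressed_bytes * high // 100


def has_enough_space(working_dir, uncompressed_bytes, reorder=False):
    """True if free space in working_dir covers the high-end estimate."""
    _, worst_case = estimate_temp_bytes(uncompressed_bytes, reorder)
    return shutil.disk_usage(working_dir).free >= worst_case
```

For a 100 GB uncompressed input, the estimate is roughly 10-30 GB without -r and 70-80 GB with -r. The directory to check is the one passed to -w (the current directory by default).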

Example Usage of SPRING

This section contains several examples for SPRING compression and decompression with various modes and options. The compressed SPRING file uses the .spring extension as a convention. If installed using conda, use the command spring instead of ./spring.

For compressing file_1.fastq and file_2.fastq losslessly using default 8 threads (Lossless).

./spring -c -i file_1.fastq file_2.fastq -o file.spring

For compressing file_1.fastq.gz and file_2.fastq.gz (gzipped fastq files) losslessly using default 8 threads (Lossless).

./spring -c -i file_1.fastq.gz file_2.fastq.gz -o file.spring -g

Using 16 threads (Lossless).

./spring -c -i file_1.fastq file_2.fastq -o file.spring -t 16

Compressing with only paired end info preserved, ids not stored, qualities compressed after Illumina binning (recommended lossy mode for older Illumina machines; for NovaSeq files, lossless quality compression is recommended).

./spring -c -i file_1.fastq file_2.fastq -r --no-ids -q ill_bin -o file.spring

Compressing with only paired end info preserved, ids not stored, qualities binary thresholded (qv < 20 binned to 6 and qv >= 20 binned to 40).

./spring -c -i file_1.fastq file_2.fastq -r --no-ids -q binary 20 40 6 -o file.spring
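
To make the effect of -q binary thr high low concrete, here is an illustrative Python sketch of the thresholding rule from the help text (high if >= thr, low otherwise). It is not SPRING's implementation; the function name is hypothetical, and it assumes Phred+33 quality encoding.

```python
def binary_threshold(quality_string, thr, high, low, offset=33):
    """Map each quality value to `high` if it is >= thr, otherwise to `low`."""
    binned = []
    for ch in quality_string:
        q = ord(ch) - offset  # decode the character to a Phred score
        binned.append(chr((high if q >= thr else low) + offset))
    return "".join(binned)
```

With the parameters from the example above (thr 20, high 40, low 6), binary_threshold("I5#", 20, 40, 6) maps scores 40 and 20 to 40 and score 2 to 6, giving the quality string "II'".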

Compressing with only paired end info preserved, ids not stored, qualities quantized using qvz with approximately 1 bit used per quality value.

./spring -c -i file_1.fastq file_2.fastq -r --no-ids -q qvz 1.0 -o file.spring

Compressing only reads and ids.

./spring -c -i file_1.fastq file_2.fastq --no-quality -o file.spring

Compressing single-end long read FASTQ losslessly.

./spring -c -l -i file.fastq  -o file.spring

For single end file, compressing without order preserved.

./spring -c -i file.fastq -r -o file.spring

For single end file, compressing with order preserved (lossless).

./spring -c -i file.fastq -o file.spring
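
When compression is driven from a script, the command lines above can be assembled programmatically. The following Python sketch uses only flags from the help text; the function and parameter names are hypothetical, and it assumes that the order of options on the command line does not matter.

```python
def build_compress_cmd(inputs, output, threads=None, reorder=False,
                       no_ids=False, no_quality=False, quality_opts=None,
                       long_mode=False, gzipped=False, fasta=False,
                       spring="./spring"):
    """Assemble an argument list for `spring -c` from documented flags."""
    if not 1 <= len(inputs) <= 2:
        raise ValueError("expected one input file (single end) or two (paired end)")
    if long_mode and reorder:
        raise ValueError("-r is disabled in -l (long read) mode")
    cmd = [spring, "-c", "-i", *inputs, "-o", output]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if long_mode:
        cmd.append("-l")
    if reorder:
        cmd.append("-r")
    if no_ids:
        cmd.append("--no-ids")
    if no_quality:
        cmd.append("--no-quality")
    if quality_opts:
        # e.g. ["ill_bin"], ["qvz", "1.0"] or ["binary", "20", "40", "6"]
        cmd += ["-q", *quality_opts]
    if gzipped:
        cmd.append("-g")
    if fasta:
        cmd.append("--fasta-input")
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Use spring="spring" if the tool was installed with conda.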

Decompressing (single end) to file.fastq.

./spring -d -i file.spring -o file.fastq

Decompressing (single end) to file.fastq, only decompress reads from 400 to 1000000.

./spring -d -i file.spring -o file.fastq --decompress-range 400 1000000

Decompressing (paired end) to file.fastq.1 and file.fastq.2.

./spring -d -i file.spring -o file.fastq
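
The output naming rule for paired-end decompression (one name given, .1 and .2 suffixed) can be sketched as below. This mirrors the description in the help text only; the function name is hypothetical.

```python
def decompressed_output_names(output_files, paired_end):
    """Return the file names that decompression is expected to write."""
    if paired_end:
        if len(output_files) == 1:
            base = output_files[0]
            return [base + ".1", base + ".2"]  # suffixes from the help text
        if len(output_files) == 2:
            return list(output_files)
        raise ValueError("paired-end decompression takes one or two output names")
    if len(output_files) != 1:
        raise ValueError("single-end decompression takes exactly one output name")
    return list(output_files)
```

For the command above, decompressed_output_names(["file.fastq"], paired_end=True) gives file.fastq.1 and file.fastq.2.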

Decompressing (paired end) to file_1.fastq and file_2.fastq.

./spring -d -i file.spring -o file_1.fastq file_2.fastq

Decompressing (paired end) to file_1.fastq.gz and file_2.fastq.gz.

./spring -d -i file.spring -o file_1.fastq.gz file_2.fastq.gz -g

Decompressing (paired end) to file_1.fastq and file_2.fastq, only decompress pairs from 4000000 to 8000000.

./spring -d -i file.spring -o file_1.fastq file_2.fastq --decompress-range 4000000 8000000
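
The constraint on --decompress-range (1 <= start <= end <= num_reads, both ends inclusive) can be checked before calling spring. A minimal Python sketch follows, with hypothetical function names; num_reads (or the number of read pairs for paired-end data) is assumed to be known to the caller.

```python
def decompress_range_args(start, end, num_reads):
    """Validate a 1-based inclusive range and return the matching arguments."""
    if not 1 <= start <= end <= num_reads:
        raise ValueError("need 1 <= start <= end <= num_reads")
    return ["--decompress-range", str(start), str(end)]


def reads_in_range(start, end):
    """Number of reads (or read pairs) in an inclusive range."""
    return end - start + 1
```

Because both ends are inclusive, the range 4000000 to 8000000 used above covers 4000001 read pairs. If -r was used during compression, these positions refer to the reordered reads, not the original file order.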

Compressing file_1.fasta and file_2.fasta (fasta files without qualities) losslessly using default 8 threads (Lossless).

./spring -c -i file_1.fasta file_2.fasta -o file.spring --fasta-input

Decompressing (paired end) to file_1.fasta and file_2.fasta (previous example contd.).

./spring -d -i file.spring -o file_1.fasta file_2.fasta