lh3 / Readfq

Fast multi-line FASTA/Q reader in several programming languages

Projects that are alternatives of or similar to Readfq

Ugene
UGENE is free open-source cross-platform bioinformatics software
Stars: ✭ 112 (-12.5%)
Mutual labels:  bioinformatics
Blacklist
Application for making ENCODE Blacklists
Stars: ✭ 119 (-7.03%)
Mutual labels:  bioinformatics
Sarek
Detect germline or somatic variants from normal or tumour/normal whole-genome or targeted sequencing
Stars: ✭ 124 (-3.12%)
Mutual labels:  bioinformatics
Fqtools
An efficient FASTQ manipulation suite
Stars: ✭ 114 (-10.94%)
Mutual labels:  bioinformatics
Dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
Stars: ✭ 117 (-8.59%)
Mutual labels:  bioinformatics
Scgen
Single cell perturbation prediction
Stars: ✭ 122 (-4.69%)
Mutual labels:  bioinformatics
Biofast
Benchmarking programming languages/implementations for common tasks in Bioinformatics
Stars: ✭ 112 (-12.5%)
Mutual labels:  bioinformatics
Masurca
Stars: ✭ 128 (+0%)
Mutual labels:  bioinformatics
Hicexplorer
HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.
Stars: ✭ 116 (-9.37%)
Mutual labels:  bioinformatics
Deepecg
ECG classification programs based on ML/DL methods
Stars: ✭ 124 (-3.12%)
Mutual labels:  bioinformatics
Cooler
A cool place to store your Hi-C
Stars: ✭ 112 (-12.5%)
Mutual labels:  bioinformatics
Ngless
NGLess: NGS with less work
Stars: ✭ 115 (-10.16%)
Mutual labels:  bioinformatics
Kmer Cnt
Code examples of fast and simple k-mer counters for tutorial purposes
Stars: ✭ 124 (-3.12%)
Mutual labels:  bioinformatics
Bio4j
Bio4j abstract model and general entry point to the project
Stars: ✭ 113 (-11.72%)
Mutual labels:  bioinformatics
Plip
Protein-Ligand Interaction Profiler - Analyze and visualize non-covalent protein-ligand interactions in PDB files according to 📝 Salentin et al. (2015), https://www.doi.org/10.1093/nar/gkv315
Stars: ✭ 123 (-3.91%)
Mutual labels:  bioinformatics
Bioconvert
Bioconvert is a collaborative project to facilitate the interconversion of life science data from one format to another.
Stars: ✭ 112 (-12.5%)
Mutual labels:  bioinformatics
Circlator
A tool to circularize genome assemblies
Stars: ✭ 121 (-5.47%)
Mutual labels:  bioinformatics
Splatter
Simple simulation of single-cell RNA sequencing data
Stars: ✭ 128 (+0%)
Mutual labels:  bioinformatics
Somalier
fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
Stars: ✭ 128 (+0%)
Mutual labels:  bioinformatics
Krakenuniq
🐙 KrakenUniq: Metagenomics classifier with unique k-mer counting for more specific results
Stars: ✭ 123 (-3.91%)
Mutual labels:  bioinformatics

Readfq is a collection of routines for parsing the FASTA/FASTQ format. It seamlessly parses both FASTA and multi-line FASTQ with a simple interface.
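To illustrate what "multi-line FASTQ" involves, here is a minimal Python sketch of a reader that handles both formats. It is not the readfq code itself: the name `parse_fastx` and its structure are assumptions made for this example. The key idea it shows is that quality lines are consumed until their total length matches the sequence length, so a quality line that happens to start with `@` is not mistaken for a new header.

```python
import io


def parse_fastx(handle):
    """Yield (name, seq, qual) tuples; qual is None for FASTA records.

    Hypothetical illustration only, not the readfq implementation.
    """
    pending = None  # header line read ahead while scanning the previous record
    while True:
        # Find the next header line, unless one was already read ahead.
        header, pending = pending, None
        while header is None:
            line = handle.readline()
            if not line:
                return  # end of input
            if line[0] in ">@":
                header = line.rstrip("\r\n")
        fields = header[1:].split(None, 1)
        name = fields[0] if fields else ""

        # Collect sequence lines up to a '+' separator or the next header.
        seq_parts, separator_seen = [], False
        while True:
            line = handle.readline()
            if not line:
                break
            if line[0] == "+":
                separator_seen = True
                break
            if line[0] in ">@":
                pending = line.rstrip("\r\n")
                break
            seq_parts.append(line.rstrip("\r\n"))
        seq = "".join(seq_parts)

        if not separator_seen:  # FASTA record
            yield name, seq, None
            continue

        # FASTQ record: read quality lines until they cover the sequence.
        qual_parts, qual_len = [], 0
        while qual_len < len(seq):
            line = handle.readline()
            if not line:
                break
            text = line.rstrip("\r\n")
            qual_parts.append(text)
            qual_len += len(text)
        qual = "".join(qual_parts) if qual_len >= len(seq) else None
        yield name, seq, qual  # qual is None if the input was truncated


if __name__ == "__main__":
    demo = ">chr1 desc\nACGT\nAC\n@r1\nACGT\nACGT\n+\n@III\nIIII\n"
    for name, seq, qual in parse_fastx(io.StringIO(demo)):
        print(name, seq, qual)
```

In the demo input, the quality string of `r1` is split over two lines and its first line begins with `@`; the length check keeps the two lines together as one quality string.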

Readfq was first implemented in a single C header file and then ported to Lua, Perl and Python as a single function of fewer than 50 lines. For users of scripting languages, I encourage you to copy and paste the function instead of using readfq as a library. It is always good to avoid unnecessary library dependencies.

Readfq also strives for efficiency. The C implementation is among the fastest (if not the fastest). The Python and Perl implementations are several to tens of times faster than the official Bio* implementations. If you can speed up readfq further, please let me know. I am not good at optimizing programs in scripting languages. Thank you.

As to licensing, the C implementation is distributed under the MIT license. Implementations in other languages are released without a license. Just copy and paste. You do not need to acknowledge me. The following shows a brief example for each programming language:

Perl

my @aux = undef; # this is for keeping intermediate data
while (my ($name, $seq, $qual) = readfq(\*STDIN, \@aux)) {
    print "$seq\n";
}

Python: generator function

for name, seq, qual in readfq(sys.stdin):
    print(seq)

-- Lua: closure
for name, seq, qual in readfq(io.stdin) do
    print(seq)
end

/* Go */
package main

import (
    "bufio"
    "fmt"
    "os" // needed for os.Stdin

    "github.com/drio/drio.go/bio/fasta"
)

func main() {
    var fqr fasta.FqReader
    fqr.Reader = bufio.NewReader(os.Stdin)
    for r, done := fqr.Iter(); !done; r, done = fqr.Iter() {
        fmt.Println(r.Seq)
    }
}

/* C */
#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

int main() {
    gzFile fp;
    kseq_t *seq;
    fp = gzdopen(fileno(stdin), "r");
    seq = kseq_init(fp);
    while (kseq_read(seq) >= 0)
        puts(seq->seq.s);
    kseq_destroy(seq);
    gzclose(fp);
    return 0;
}

Some naive benchmarks: to convert a FASTQ file containing 25 million 100bp reads to FASTA, FASTX-Toolkit (which parses 4-line FASTQ only) takes 325.0 CPU seconds and EMBOSS' seqret takes 247.8 seconds. My seqtk, which uses the kseq.h library, finishes the task in 24.6 seconds, about 10X faster than seqret and 13X faster than FASTX-Toolkit. For retrieving 25k sequences by name from the same FASTQ, BioPython takes 963 seconds, while readfq.py takes 136 seconds; BioPerl takes more than 40 minutes (killed), while readfq.pl takes 273 seconds. Seqtk takes 29 seconds.
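For reference, the two benchmark tasks (FASTQ-to-FASTA conversion and retrieval by name) can be sketched in Python on top of any iterator of `(name, seq, qual)` tuples, such as the one returned by `readfq(sys.stdin)`. This is not the code that was benchmarked: the helper names `select_by_name` and `fasta_lines` and the `width` parameter are assumptions made for this illustration.

```python
def select_by_name(records, wanted):
    """Yield only the records whose name is in `wanted` (input order is kept)."""
    wanted = set(wanted)  # set lookup keeps each membership test cheap
    for name, seq, qual in records:
        if name in wanted:
            yield name, seq, qual


def fasta_lines(records, width=60):
    """Yield FASTA-formatted lines, wrapping each sequence at `width` columns."""
    for name, seq, _qual in records:  # quality is dropped in FASTA output
        yield ">" + name + "\n"
        for start in range(0, len(seq), width):
            yield seq[start:start + width] + "\n"


if __name__ == "__main__":
    import sys

    # Hypothetical in-memory records standing in for readfq(sys.stdin).
    records = [
        ("r1", "ACGTACGT", "IIIIIIII"),
        ("r2", "AC", "II"),
        ("r3", "GGG", None),
    ]
    picked = select_by_name(records, ["r3", "r1"])
    sys.stdout.writelines(fasta_lines(picked, width=4))
```

Both helpers are generators, so records are streamed one at a time and memory use does not grow with the size of the input file.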