lh3 / Bioawk

BWK awk modified for biological data

Programming Languages

50402 projects - #5 most used programming language

Labels

bioinformatics

Projects that are alternatives of or similar to Bioawk

Cutadapt

Cutadapt removes adapter sequences from sequencing reads

Stars: ✭ 340 (-26.41%)

Mutual labels: bioinformatics

Bwa Mem2

The next version of bwa-mem

Stars: ✭ 408 (-11.69%)

Mutual labels: bioinformatics

Wdl

Workflow Description Language - Specification and Implementations

Stars: ✭ 438 (-5.19%)

Mutual labels: bioinformatics

Deeppurpose

A Deep Learning Toolkit for DTI, Drug Property, PPI, DDI, Protein Function Prediction (Bioinformatics)

Stars: ✭ 342 (-25.97%)

Mutual labels: bioinformatics

Jbrowse

A modern genome browser built with JavaScript and HTML5.

Stars: ✭ 393 (-14.94%)

Mutual labels: bioinformatics

Rush

A cross-platform command-line tool for executing jobs in parallel

Stars: ✭ 421 (-8.87%)

Mutual labels: bioinformatics

Biopandas

Working with molecular structures in pandas DataFrames

Stars: ✭ 329 (-28.79%)

Mutual labels: bioinformatics

Deeptools

Tools to process and analyze deep sequencing data.

Stars: ✭ 448 (-3.03%)

Mutual labels: bioinformatics

Jcvi

Python library to facilitate genome assembly, annotation, and comparative genomics

Stars: ✭ 404 (-12.55%)

Mutual labels: bioinformatics

Circosjs

d3 library to build circular graphs

Stars: ✭ 436 (-5.63%)

Mutual labels: bioinformatics

Plantcv

Plant image analysis using OpenCV

Stars: ✭ 352 (-23.81%)

Mutual labels: bioinformatics

Bowtie2

A fast and sensitive gapped read aligner

Stars: ✭ 365 (-21%)

Mutual labels: bioinformatics

Containers

Bioinformatics containers

Stars: ✭ 435 (-5.84%)

Mutual labels: bioinformatics

Megahit

Ultra-fast and memory-efficient (meta-)genome assembler

Stars: ✭ 343 (-25.76%)

Mutual labels: bioinformatics

Mmseqs2

MMseqs2: ultra fast and sensitive search and clustering suite

Stars: ✭ 441 (-4.55%)

Mutual labels: bioinformatics

Grakel

A scikit-learn compatible library for graph kernels

Stars: ✭ 330 (-28.57%)

Mutual labels: bioinformatics

Sambamba

Tools for working with SAM/BAM data

Stars: ✭ 409 (-11.47%)

Mutual labels: bioinformatics

Salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment

Stars: ✭ 456 (-1.3%)

Mutual labels: bioinformatics

Vsearch

Versatile open-source tool for microbiome analysis

Stars: ✭ 444 (-3.9%)

Mutual labels: bioinformatics

Biojava

📖🔬☕️ BioJava is an open-source project dedicated to providing a Java library for processing biological data.

Stars: ✭ 434 (-6.06%)

Mutual labels: bioinformatics

Introduction

Bioawk is an extension to Brian Kernighan's awk that adds support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names. It also adds a few built-in functions and a command-line option to use TAB as the input/output delimiter. When the new functionality is not used, bioawk is intended to behave exactly the same as the original BWK awk.

Building the original awk requires a YACC-compatible parser generator (e.g. Byacc or Bison). Bioawk further depends on zlib in order to work with gzip'ed files.

New functionality

Command line option `-t`

Using this option is equivalent to

bioawk -F'\t' -v OFS="\t"
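
As a quick illustration of what these two settings do, the sketch below uses plain `awk` with the same flags (assuming any POSIX awk on the PATH). The helper name `tab_swap` and the sample data are made up for this example.

```shell
# With -F'\t' the space inside "chr 1" does not split the field,
# and with OFS set to TAB the comma in print emits a TAB between fields.
tab_swap() {
    awk -F'\t' -v OFS='\t' '{ print $2, $1 }'
}

# Hypothetical two-column TAB-delimited input; the first field contains a space.
printf 'chr 1\t100\nchr 2\t200\n' | tab_swap
```

Without `-F'\t'`, awk would split "chr 1" on the space and the columns would shift.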

Command line option `-c arg`

This option specifies the input format. When this option is in use, bioawk will seamlessly add variables that name the fields, based on either the format or the first line of the input, depending on arg. This option also enables bioawk to read gzip'ed files. The argument arg may take the following values:

help. List the supported formats and the naming variables.
hdr or header. Name each column based on the first line of the input. Special characters in the first line are converted to underscores. For example:
```
  grep -v ^## in.vcf | bioawk -tc hdr '{print $_CHROM,$POS}'
```
prints the CHROM and POS columns of the input VCF file.
sam, vcf, bed and gff. SAM, VCF, BED and GFF formats.
fastx. This option treats a FASTA or FASTQ file as a TAB-delimited file with four columns: sequence name, sequence, quality and FASTA/Q comment, so that the fields can be retrieved by column name. See also the reverse-complement example below.
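
To make the hdr behaviour more concrete, here is a rough plain-awk emulation of the idea: read the first line, map each column name to its index, then look fields up by name. This is only a sketch, not how bioawk implements it. The helper name `by_header` and the sample data are hypothetical, "special characters" is assumed to mean anything other than letters, digits and underscore, and the sketch drops the header line, which may differ from what bioawk does.

```shell
by_header() {
    # $1, $2: names of the two columns to print (hypothetical helper, plain awk only)
    awk -F'\t' -v OFS='\t' -v a="$1" -v b="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                name = $i
                gsub(/[^A-Za-z0-9_]/, "_", name)   # assumed rule: special characters -> underscore
                col[name] = i                      # remember which field carries this name
            }
            next
        }
        { print $col[a], $col[b] }
    '
}

# "#CHROM" becomes "_CHROM", matching the $_CHROM variable used above.
printf '#CHROM\tPOS\tID\nchr1\t100\trs1\n' | by_header _CHROM POS
```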

New built-in functions

See awk.1.

Examples

List the supported formats:
```
 bioawk -c help
```

Extract unmapped reads without header:

 bioawk -c sam 'and($flag,4)' aln.sam.gz

Extract mapped reads with header:
```
 bioawk -Hc sam '!and($flag,4)'
```
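
Both filters above test the 0x4 bit of the SAM FLAG field, which the SAM specification defines as "segment unmapped". As an illustration, the same test can be sketched in plain awk without a bitwise `and` function by using integer arithmetic. The helper name `unmapped_only` and the sample records are made up.

```shell
unmapped_only() {
    # FLAG is column 2 of a SAM record; int(flag / 4) % 2 isolates the 0x4 bit.
    # Header lines (starting with "@") are skipped.
    awk -F'\t' '$1 !~ /^@/ && int($2 / 4) % 2 == 1'
}

# r1 (FLAG 4) and r3 (FLAG 77 = 64 + 8 + 4 + 1) have the bit set; r2 (FLAG 16) does not.
printf '@HD\tVN:1.6\nr1\t4\t*\nr2\t16\tchr1\nr3\t77\t*\n' | unmapped_only
```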

Reverse complement FASTA:

 bioawk -c fastx '{print ">"$name;print revcomp($seq)}' seq.fa.gz

Create FASTA from SAM (uses revcomp if FLAG & 16):

 samtools view aln.bam | \
     bioawk -c sam '{s=$seq; if(and($flag, 16)) {s=revcomp($seq)} print ">"$qname"\n"s}'
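
For readers unfamiliar with reverse complementing, the following plain-awk sketch shows what the operation does to a sequence. It is an illustration only, not bioawk's `revcomp` implementation: the helper name `revcomp_awk` is hypothetical, and it handles just A, C, G, T and N in either case, passing any other character through unchanged.

```shell
revcomp_awk() {
    awk '
        function rc(s,    i, pos, c, out) {
            out = ""
            # Walk the sequence from the last base to the first.
            for (i = length(s); i >= 1; i--) {
                c = substr(s, i, 1)
                pos = index("ACGTNacgtn", c)
                # Look up the complement at the same position, or keep c if unknown.
                out = out (pos > 0 ? substr("TGCANtgcan", pos, 1) : c)
            }
            return out
        }
        { print rc($0) }
    '
}

printf 'AACGT\n' | revcomp_awk   # prints ACGTT
```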

Print the genotypes of sample foo and bar from a VCF:

 grep -v ^## in.vcf | bioawk -tc hdr '{print $foo,$bar}'

Potential limitations

When option -c is in use, bioawk replaces the line reading module of awk. The new line reading function parses FASTA and FASTQ files and seamlessly reads gzip'ed files. However, the new code does not fully mimic the original code and may fail in corner cases (though this has not happened yet). For this reason, when -c is not specified, bioawk falls back to the original line reading code and does not support gzip'ed input.
When -c is in use, several strings allocated in the new line reading module are not freed at exit. valgrind reports these as "still reachable"; to some extent, these are not memory leaks.
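
One practical consequence of the first limitation is that gzip'ed input has to be decompressed upstream when -c is not used. A minimal sketch with `gzip -dc` and plain awk follows. Since bioawk without -c is intended to behave like BWK awk, substituting `bioawk` for `awk` should give the same pipeline shape, though that is an assumption and is not tested here. The helper name `count_lines_gz` and the sample file are hypothetical.

```shell
count_lines_gz() {
    # Decompress on the fly, then hand plain text to awk on stdin.
    gzip -dc < "$1" | awk 'END { print NR }'
}

# Build a small gzip'ed sample file, count its lines, then clean up.
tmp=$(mktemp)
printf 'a\nb\nc\n' | gzip -c > "$tmp"
count_lines_gz "$tmp"
rm -f "$tmp"
```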

Stars: ✭ 462
