bac-genomics-scripts

A collection of scripts intended for bacterial genomics (some might also be useful for eukaryotes) from high-throughput sequencing (aka next-generation sequencing).

Summary

  • Basic stats for bases and reads in FASTQ files: calc_fastq-stats
  • Concatenate multi-sequence files (RichSeq EMBL or GENBANK format, or FASTA format) to a single artificial file: cat_seq
  • COG (cluster of orthologous groups) classification of proteins: cdd2cog
  • Extraction of protein/nucleotide sequences from CDSs: cds_extractor
  • MLST (multilocus sequence typing) assignment and allele extraction for Escherichia coli (Achtman scheme): ecoli_mlst
  • Create a feature table for all annotated primary features in RichSeq (EMBL or GENBANK format) files: genomes_feature_table
  • Deprecated! Batch downloading of sequences from NCBI's FTP server: ncbi_ftp_download
  • Order sequence entries in FASTA/FASTQ files according to an ID list: order_fastx
  • Create an ortholog/paralog annotation comparison matrix from Proteinortho5 output: po2anno
  • Calculate stats and plot venn diagrams for genome groups according to orthologs/paralogs from Proteinortho5 output, i.e. overall presence/absence statistics for groups of genomes and not simply single genomes: po2group_stats
  • Strain panel query protein search with BLASTP plus concise hit summary, optional alignment, and presence/absence matrix. Also included, scripts to transpose the matrix and calculate overall presence/absence statistics for groups of columns in the matrix: prot_finder
  • Rename FASTA ID lines and optionally numerate them: rename_fasta_id
  • Reverse complement (multi-)sequence files (RichSeq EMBL or GENBANK format, or FASTA format): revcom_seq
  • Regions of difference (ROD) detection in genomes with BLASTN: rod_finder
  • NGS paired-end library insert size estimation from BAM/SAM: sam_insert-size
  • Randomly subsample FASTA, FASTQ, or TEXT files with reservoir sampling: sample_fastx-txt
  • Convert a sequence file to another format with BioPerl: seq_format-converter
  • Manual curation of annotation in NCBI's TBL format (e.g. from Prokka automatic annotation) in a spreadsheet software: tbl2tab
  • Truncate sequence files (RichSeq EMBL or GENBANK format, or FASTA format) according to given coordinates: trunc_seq
  • An assortment of smaller scripts (not yet uploaded to GitHub) for tasks such as alignment format conversion, dnadiff, GC% calculation etc.
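
The summary says sample_fastx-txt uses reservoir sampling. As a rough illustration of that technique only, and not the script's actual code, the sketch below applies the classic "Algorithm R" to plain text lines with awk. The function name reservoir_sample and the default seed are assumptions made for this example. FASTA and FASTQ records span several lines and would need record-aware handling, which is not shown.

```shell
# Hypothetical helper for illustration, not part of the repository.
# Usage: reservoir_sample <k> [seed] < input.txt
# Prints k lines drawn uniformly at random from standard input,
# holding only k lines in memory at any time.
reservoir_sample() {
    k=$1
    seed=${2:-42}   # fixed default seed (an assumption) for reproducible runs
    awk -v k="$k" -v seed="$seed" '
        BEGIN { srand(seed) }
        # Fill the reservoir with the first k lines.
        NR <= k { reservoir[NR] = $0; next }
        {
            # Line number NR replaces a random slot with probability k/NR.
            j = int(rand() * NR) + 1
            if (j <= k) reservoir[j] = $0
        }
        END {
            n = (NR < k) ? NR : k
            for (i = 1; i <= n; i++) print reservoir[i]
        }
    '
}
```

For example, `reservoir_sample 100 < ids.txt` would print 100 randomly chosen lines of a hypothetical ids.txt, or all of them if the file has fewer than 100 lines.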

Introduction

All the scripts here are written in Perl (some include bash shell wrappers).

Each script is hosted in its own folder, so that a separate README.md can be included for more information. In addition, all of the Perl scripts include a usage/help text or a comprehensive POD (Plain Old Documentation), which is shown by calling the script either without arguments/options or with option -h|-help.

The scripts are only tested under UNIX, and some won't run in a Windows environment (because they call UNIX commands). If you are on Windows, an alternative might be Cygwin.

Installation recommendations

To download the repository, use either the 'Download ZIP' link after clicking the green 'Clone or download' button at the top or clone the repository with git:

git clone https://github.com/aleimba/bac-genomics-scripts.git

If there is an update to this GitHub repository (see above commits and releases), you can refresh your local repository by using the following command inside the local folder:

git pull

To install the scripts, copy them to a folder in your PATH, e.g. ~/bin, and make them executable:

$ find . \( -name '*.pl' -o -name '*.sh' -o -name '*.fas' -o -name '*.txt' \) -exec cp {} ~/bin \;
$ chmod u+x ~/bin/*.pl

The scripts can then be run from anywhere on your system. Alternatively, you can call them directly by prefixing the command with perl, or with './' for bash wrappers:

$ perl /path/to/script/script.pl <options>

or

$ ./script.sh <arguments>
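
Running the scripts from anywhere only works if the folder they were copied to is actually in PATH. The following sketch checks this for ~/bin and, if needed, appends it for the current shell session. The helper name dir_in_path is made up for this example. To make the change permanent you would add the export line to your shell start-up file.

```shell
# Hypothetical helper for illustration, not part of the repository.
# Succeeds if the directory given as $1 is listed in PATH.
dir_in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Append ~/bin to PATH for this session if it is missing.
if ! dir_in_path "$HOME/bin"; then
    PATH="$PATH:$HOME/bin"
    export PATH
fi
```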

Single scripts can be downloaded as well. Click on the folder you're interested in and then on the link of the script. There, click the Raw button and save the page to a file (without Raw you'll get an unusable HTML file). The same applies to other files (e.g. PDFs).

Dependencies

All scripts are tested with Perl v5.22.1.

Most of the Perl scripts use modules from BioPerl, as stated in their respective README.md or POD, so BioPerl has to be installed on your system. For BioPerl installation instructions see the website (Installation).

Some scripts need additional Perl modules, which will be stated in the associated README.md or POD. If they're not installed yet on your system get them from CPAN (installation instructions can be found on the website, see e.g. Getting Started...Installing Perl Modules or FAQ).

Furthermore, some scripts call the statistical computing language R and dependent packages for plotting (again, see the respective README.md or POD).

UNIX loops

A very handy tip: if you want to run a script on all files of a given type in the current working directory (here *.fasta), you can use a loop in UNIX, e.g.:

$ for file in *.fasta; do perl script.pl "$file"; done
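
The loop can be made a little more robust by wrapping it in a function. The sketch below skips the unmatched pattern when no files exist and reports failures without stopping the batch. The name run_on_all is made up for this example, and script.pl stands for whichever script you want to run.

```shell
# Hypothetical helper for illustration, not part of the repository.
# Usage: run_on_all <extension> <command> [command arguments...]
# Runs the command once per matching file, with the file name appended.
run_on_all() {
    ext=$1
    shift
    for file in *."$ext"; do
        # With no matching files the pattern stays literal, so skip it.
        [ -e "$file" ] || continue
        echo "Processing $file" >&2
        "$@" "$file" || echo "Failed on $file" >&2
    done
}
```

For example, `run_on_all fasta perl script.pl` behaves like the one-line loop above.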

Windows - UNIX linebreak problems

Finally, some of the scripts can't handle Windows-formatted line breaks. Consider running such input files through the UNIX utility dos2unix:

$ dos2unix input
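
If you are unsure whether a file has Windows line breaks, you can test for carriage returns at line ends first. This is a minimal sketch. The helper names has_crlf and convert_if_crlf are made up for this example.

```shell
# Hypothetical helpers for illustration, not part of the repository.
# Succeeds if at least one line of file $1 ends in a carriage return.
has_crlf() {
    cr=$(printf '\r')
    grep -q "${cr}\$" "$1"
}

# Run dos2unix only on files that need it.
convert_if_crlf() {
    if has_crlf "$1"; then
        dos2unix "$1"
    fi
}
```

Calling `convert_if_crlf input` then leaves files with UNIX line breaks untouched.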

Citation

For now cite the latest major release (tag: bovine_ecoli_mastitis) hosted on Zenodo:

Leimbach A. 2016. bac-genomics-scripts: Bovine E. coli mastitis comparative genomics edition. Zenodo. http://dx.doi.org/10.5281/zenodo.215824.

Also, all scripts have a version number (see option -v), which might be included in a materials and methods section.

License

All scripts are licensed under GPLv3 which is contained in the file LICENSE.

Author - contact

For help, suggestions, bugs etc. use the GitHub issues or write an email to aleimba [at] gmx [dot] de.

Andreas Leimbach (Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
