fgvieira / Ngsdist

Licence: gpl-3.0
Estimation of pairwise distances under a probabilistic framework

ngsDist

ngsDist is a program to estimate pairwise genetic distances directly, taking the uncertainty of genotype assignment into account. It does so by avoiding genotype calling and using genotype likelihoods or posterior probabilities instead.

Citation

ngsDist was published in 2015 in the Biological Journal of the Linnean Society; please cite it if you use it in your work:

Vieira FG, Lassalle F, Korneliussen TS, Fumagalli M
Improving the estimation of genetic distances from Next-Generation Sequencing data
Biological Journal of the Linnean Society (2015) doi: 10.1111/bij.12511

Installation

ngsDist can be easily installed but has some external dependencies:

  • Mandatory:
    • gcc: >= 4.9.2 tested on Debian 7.8 (wheezy)
    • zlib: v1.2.7 tested on Debian 7.8 (wheezy)
    • gsl: v1.15 tested on Debian 7.8 (wheezy)
  • Optional (only needed for testing or auxiliary scripts):
    • md5sum

To install the entire package just download the source code:

% git clone https://github.com/fgvieira/ngsDist.git

and run:

% cd ngsDist
% make

To run the tests (only if installed through ngsTools):

% make test

Executables are built into the main directory. If you wish to clean all binaries and intermediate files:

% make clean

Usage

% ./ngsDist [options] --geno /path/to/input/file --n_ind INT --n_sites INT --out /path/to/output/file

Parameters

  • --geno FILE: input file with genotypes, genotype likelihoods or genotype posterior probabilities.
  • --n_ind INT: sample size (number of individuals).
  • --n_sites INT: number of sites in input file.
  • --tot_sites INT: total number of sites in dataset.
  • --labels FILE: labels, one per line, of the input sequences.
  • --probs: are the input values genotype probabilities (likelihoods or posteriors)?
  • --log_scale: is the input in log-scale?
  • --call_geno: call genotypes before running analyses.
  • --N_thresh DOUBLE: minimum threshold to consider a site; treated as missing data otherwise (assumes --call_geno).
  • --call_thresh DOUBLE: minimum threshold to call a genotype; left as is otherwise (assumes --call_geno).
  • --pairwise_del: pairwise deletion of missing data.
  • --avg_nuc_dist: use average number of nucleotide differences as distance (by default, ngsDist uses genotype distances based on allele frequency differences). Only pairs of heterozygous positions are actually affected when using this option, with their distance being 0.5 (instead of 0 by default).
  • --indep_geno: assume independence between genotypes?
  • --n_boot_rep INT: number of bootstrap replicates [0].
  • --boot_block_size INT: block size (in alignment positions) for bootstrapping [1].
  • --out FILE: output file name.
  • --n_threads INT: number of threads to use. [1]
  • --verbose INT: selects verbosity level. [1]
  • --seed INT: random number generator seed (only for the bootstrap analysis).
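
As a rough illustration of how the options above combine, the following Python sketch assembles an ngsDist command line. The helper name `build_ngsdist_cmd` and the file names are hypothetical and used only for illustration; only flags documented in the list above are emitted, and the command is printed rather than executed.

```python
import shlex


def build_ngsdist_cmd(geno, n_ind, n_sites, out, probs=False, log_scale=False,
                      n_boot_rep=0, boot_block_size=1, n_threads=1, seed=None,
                      exe="./ngsDist"):
    """Assemble an argument list using only options documented in this README."""
    cmd = [exe, "--geno", str(geno), "--n_ind", str(n_ind),
           "--n_sites", str(n_sites), "--out", str(out)]
    if probs:
        cmd.append("--probs")
    if log_scale:
        cmd.append("--log_scale")
    if n_boot_rep > 0:
        # Bootstrap options are only added when replicates are requested.
        cmd += ["--n_boot_rep", str(n_boot_rep),
                "--boot_block_size", str(boot_block_size)]
        if seed is not None:
            cmd += ["--seed", str(seed)]
    if n_threads != 1:
        cmd += ["--n_threads", str(n_threads)]
    return cmd


# Hypothetical input and output file names.
cmd = build_ngsdist_cmd("testA.beagle.gz", n_ind=10, n_sites=1000,
                        out="testA.dist", probs=True,
                        n_boot_rep=5, boot_block_size=20, seed=12345)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the ngsDist executable has been built.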

Input data

As input, ngsDist accepts genotypes, genotype likelihoods (GL) or genotype posterior probabilities (GP). Genotypes must be input as gzipped TSV with one row per site and one column per individual (n_sites × n_ind values), coded as [-1, 0, 1, 2]. The file can have a header and an arbitrary number of columns preceding the actual data (all of which will be ignored), much like the Beagle file format. As for GL and GP, ngsDist accepts both gzipped TSV and binary formats, but with 3 columns per individual (3 × n_sites × n_ind values) and, in the case of binary, the GL/GP coded as doubles.
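
To make the genotype layout concrete, here is a minimal Python sketch that writes a toy genotype matrix as gzipped TSV, one row per site and one column per individual, and reads it back. The helper `write_geno_tsv` and the file name are hypothetical; the sketch writes no header or leading columns, and it makes no claim about how ngsDist interprets each code beyond the [-1, 0, 1, 2] coding stated above.

```python
import gzip
import os
import tempfile


def write_geno_tsv(path, genotypes):
    """Write a sites x individuals genotype matrix as gzipped TSV."""
    allowed = {-1, 0, 1, 2}
    n_ind = len(genotypes[0])
    with gzip.open(path, "wt") as fh:
        for row in genotypes:
            if len(row) != n_ind or not set(row) <= allowed:
                raise ValueError("rows need equal length and codes in [-1, 0, 1, 2]")
            fh.write("\t".join(str(g) for g in row) + "\n")
    return len(genotypes), n_ind


# Toy data: 4 sites (rows) x 3 individuals (columns).
genos = [[0, 1, 2],
         [2, -1, 0],
         [1, 1, 0],
         [0, 0, 2]]
geno_path = os.path.join(tempfile.mkdtemp(), "toy.geno.gz")
n_sites, n_ind = write_geno_tsv(geno_path, genos)

# Read the file back to check the layout.
with gzip.open(geno_path, "rt") as fh:
    read_back = [[int(x) for x in line.rstrip("\n").split("\t")] for line in fh]
```

The returned `n_sites` and `n_ind` are the values that would be passed to --n_sites and --n_ind for such a file.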

Evolutionary models

ngsDist calculates a "p-distance"; its biggest strength is the ability to take genotype uncertainty (from genotype likelihoods) into account. It currently does not use any evolutionary model (e.g. JC, K2P), although one could be added in the future.
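
One way to picture a p-distance that accounts for genotype uncertainty is as an expected per-site distance, weighted by the genotype probabilities of the two individuals. The Python sketch below is an illustration under stated assumptions, not ngsDist's actual code: the off-diagonal scores (0.5 between a homozygote and a heterozygote, 1 between opposite homozygotes) and the product of the two individuals' probabilities (independence) are assumptions. Only the heterozygote-heterozygote score (0 by default, 0.5 with --avg_nuc_dist) comes from the parameter description above.

```python
def score_matrix(avg_nuc_dist=False):
    """Distance between genotypes 0, 1, 2 (assumed values, see lead-in)."""
    het_het = 0.5 if avg_nuc_dist else 0.0  # the only entry the option changes
    return [[0.0, 0.5, 1.0],
            [0.5, het_het, 0.5],
            [1.0, 0.5, 0.0]]


def expected_p_distance(probs_a, probs_b, avg_nuc_dist=False):
    """Average expected genotype distance over sites.

    probs_a, probs_b: one [P(g=0), P(g=1), P(g=2)] triple per site.
    """
    score = score_matrix(avg_nuc_dist)
    total = 0.0
    for pa, pb in zip(probs_a, probs_b):
        total += sum(pa[i] * pb[j] * score[i][j]
                     for i in range(3) for j in range(3))
    return total / len(probs_a)


# Two sites: opposite homozygotes, then two certain heterozygotes.
ind_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
ind_b = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
d_default = expected_p_distance(ind_a, ind_b)
d_avg_nuc = expected_p_distance(ind_a, ind_b, avg_nuc_dist=True)

# One site where individual A is equally likely to be genotype 0 or 1.
d_uncertain = expected_p_distance([[0.5, 0.5, 0.0]], [[1.0, 0.0, 0.0]])
```

With certain genotypes the expectation reduces to a plain lookup in the score matrix; with uncertain ones, every genotype pair contributes in proportion to its probability.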

Bootstrap Trees

If you want branch support values on your tree, you can use ngsDist with the options --n_boot_rep and --boot_block_size to bootstrap the input data. ngsDist will output one distance matrix (the first) for the full input dataset, plus one matrix for each of the --n_boot_rep bootstrap replicates. Afterwards, infer a tree for each of the matrices using the program of your choice and plot them. For example, using FastME on a dataset with 5 bootstrap replicates (6 matrices in total, hence -D 6):

fastme -T 20 -i testA_8B.dist -s -D 6 -o testA_8B.nwk

split the input dataset tree from the bootstrapped ones:

head -n 1 testA_8B.nwk > testA_8B.main.nwk
tail -n +2 testA_8B.nwk | awk 'NF' > testA_8B.boot.nwk

and, to place supports on the main tree, use RAxML:

raxmlHPC -f b -t testA_8B.main.nwk -z testA_8B.boot.nwk -m GTRCAT -n testA_8B

or RAxML-NG:

raxml-ng --support --tree testA_8B.main.nwk --bs-trees testA_8B.boot.nwk --prefix testA_8B
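
To picture what --n_boot_rep and --boot_block_size control, the Python sketch below draws site indices for bootstrap replicates by resampling blocks of consecutive alignment positions with replacement. It is a generic block bootstrap under stated assumptions (non-overlapping blocks starting at multiples of the block size, replicates truncated to the original number of sites); this README does not describe ngsDist's exact resampling scheme, and the function name is hypothetical.

```python
import random


def block_bootstrap_sites(n_sites, block_size, rng):
    """Resample site indices in blocks of consecutive positions."""
    n_blocks = -(-n_sites // block_size)  # ceiling division
    resampled = []
    while len(resampled) < n_sites:
        # Pick a block at random (with replacement) and keep its sites in order.
        start = rng.randrange(n_blocks) * block_size
        resampled.extend(range(start, min(start + block_size, n_sites)))
    return resampled[:n_sites]


rng = random.Random(12345)  # fixed seed, analogous in spirit to --seed
replicates = [block_bootstrap_sites(12, 4, rng) for _ in range(5)]
for rep in replicates:
    print(rep)
```

A block size of 1 reduces this to the ordinary site-wise bootstrap, while larger blocks keep neighbouring sites together within each replicate.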

Thread pool

The thread pool implementation was adapted from Mathias Brossard's, which is freely available from: https://github.com/mbrossard/threadpool
