ngsF

ngsF is a program to estimate per-individual inbreeding coefficients under a probabilistic framework that takes the uncertainty of genotype assignment into account. It avoids calling genotypes by using genotype likelihoods or posterior probabilities.
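To illustrate what "taking genotype uncertainty into account" can look like, here is a minimal Python sketch. It assumes the standard population-genetics model in which the inbreeding coefficient F shifts genotype frequencies away from Hardy-Weinberg proportions; the function names are hypothetical and this is not ngsF's own code.

```python
def genotype_prior(maf, f):
    """Prior for 0, 1 or 2 copies of the minor allele, given MAF and inbreeding F.

    Assumed model: Hardy-Weinberg proportions with an excess of homozygotes
    proportional to F (hypothetical helper, not taken from the ngsF sources).
    """
    p, q = maf, 1.0 - maf
    return [
        q * q + p * q * f,        # homozygous for the major allele
        2.0 * p * q * (1.0 - f),  # heterozygous
        p * p + p * q * f,        # homozygous for the minor allele
    ]


def genotype_posterior(likelihoods, maf, f):
    """Weight raw (non-log) genotype likelihoods by the prior and normalise."""
    prior = genotype_prior(maf, f)
    weighted = [lk * pr for lk, pr in zip(likelihoods, prior)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

With F = 0 the prior reduces to Hardy-Weinberg proportions, and with F = 1 the heterozygote class has zero prior probability, so no single genotype ever has to be called.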

Citation

ngsF was published in 2013 in Genome Research, so please cite it if you use it in your work:

Vieira FG, Fumagalli M, Albrechtsen A, Nielsen R
Estimating inbreeding coefficients from NGS data: Impact on genotype calling and allele frequency estimation.
Genome Research (2013) 23: 1852-1861

Installation

ngsF can be easily installed but has some external dependencies:

  • Mandatory:
    • gcc: >= 4.9.2 tested on Debian 7.8 (wheezy)
    • zlib: v1.2.7 tested on Debian 7.8 (wheezy)
    • gsl : v1.15 tested on Debian 7.8 (wheezy)
  • Optional (only needed for testing or auxiliary scripts):
    • md5sum

To install the entire package just download the source code:

% git clone https://github.com/fgvieira/ngsF.git

and run:

% cd ngsF
% make

To run the tests (only if installed through ngsTools):

% make test

Executables are built into the main directory. If you wish to clean all binaries and intermediate files:

% make clean

Usage

% ./ngsF [options] --n_ind INT --n_sites INT --glf glf/in/file --out output/file

Parameters

  • --glf FILE: Input GL file.
  • --init_values CHAR or FILE: Initial values of individual F and site frequency. Can be (r)andom, (e)stimated from data assuming a uniform prior, (u)niform at 0.01, or read from a FILE.
  • --calc_LRT: estimate MAFs and calculate the likelihood assuming F=0 (H0; null hypothesis) for a Likelihood Ratio Test (LRT); if parameters from a previous run (H1; alternative hypothesis) are provided (through --init_values), checks whether the estimates of F are significantly different from 0 through an LRT assuming a chi-square distribution with one degree of freedom.
  • --freq_fixed: assume initial MAF as fixed parameters (only estimates F)
  • --out FILE: Output file name.
  • --n_ind INT: Sample size (number of individuals).
  • --n_sites INT: Total number of sites.
  • --chunk_size INT: Size of each analysis chunk. [100000]
  • --approx_EM: Use the faster approximated EM ML algorithm
  • --call_geno: Call genotypes before running analyses.
  • --max_iters INT: Maximum number of EM iterations. [1500]
  • --min_iters INT: Minimum number of EM iterations. [10]
  • --min_epsilon FLOAT: Maximum RMSD between iterations to assume convergence. [1e-5]
  • --n_threads INT: Number of threads to use. [1]
  • --seed: Set seed for random number generator.
  • --quick: Quick run.
  • --verbose INT: Selects verbosity level. [1]
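The likelihood ratio test mentioned for --calc_LRT can be reproduced by hand. The sketch below assumes that both values are natural-log likelihoods and that the test statistic is twice their difference; the function name is hypothetical and the exact computation inside ngsF may differ.

```python
import math


def lrt_pvalue(loglik_h1, loglik_h0):
    """P-value of a likelihood ratio test against chi-square with 1 d.f.

    Assumes natural-log likelihoods for H1 (F estimated) and H0 (F = 0).
    """
    # Test statistic; clamp at zero in case of small numerical noise.
    stat = max(0.0, 2.0 * (loglik_h1 - loglik_h0))
    # Survival function of chi-square with one degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))
```

A small p-value suggests that F differs from 0 for that individual (or globally, when applied to the global likelihoods).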

Input data

As input ngsF needs a Genotype Likelihood (GL) file, formatted as 3*n_ind*n_sites doubles in binary. It can be uncompressed [default] or in BGZIP format. If "-", reads uncompressed stream from STDIN. Currently, all sites in the file must be variable, so a previous SNP calling step is needed.
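As a rough illustration of the layout, the following Python sketch packs and unpacks a GL array as raw doubles. It assumes native byte order and a site-major ordering (for each site, for each individual, three values); neither is stated above, so check them against the tool that produced your file. Whether the values should be raw or log-scaled likelihoods is also not stated here, so the sketch stores them as given. The function names are hypothetical.

```python
import array


def pack_gl(gl):
    """Serialise gl[site][individual] = (gl0, gl1, gl2) as raw doubles.

    Assumed ordering: sites outermost, then individuals, then genotypes.
    """
    flat = array.array("d")
    for site in gl:
        for triple in site:
            if len(triple) != 3:
                raise ValueError("each individual needs exactly 3 values")
            flat.extend(triple)
    return flat.tobytes()


def unpack_gl(raw, n_ind, n_sites):
    """Inverse of pack_gl; checks that the size is 3*n_ind*n_sites doubles."""
    flat = array.array("d")
    flat.frombytes(raw)
    if len(flat) != 3 * n_ind * n_sites:
        raise ValueError("unexpected number of doubles in GL data")
    return [
        [tuple(flat[3 * (s * n_ind + i): 3 * (s * n_ind + i) + 3])
         for i in range(n_ind)]
        for s in range(n_sites)
    ]
```

The size check is a quick sanity test for --n_ind and --n_sites: an uncompressed file should be exactly 3*n_ind*n_sites*8 bytes.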

Output files

ngsF prints out two (or three) output files: the output file (specified with option --out), the parameters file (same name plus the suffix .pars), and (if --calc_LRT and --init_values have been specified) the LRT file (same name plus the suffix .lrt). The output file is a text file with the per-individual inbreeding coefficients, one per line. The parameters file is a binary file storing, as doubles, the final parameters, namely global log-likelihood (1), per-individual log-likelihood (N_IND), per-individual inbreeding coefficients (N_IND), and per-site minor allele frequencies (N_SITES). The LRT file is a text file with the global and per-individual likelihoods for H1 (alternative hypothesis; 1st column), H0 (null hypothesis; 2nd column), and the p-value for rejection of H0 (following a chi2 distribution with 1 degree of freedom; 3rd column).
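The parameters file can be read back with a few lines of Python. This sketch follows the field order given above and assumes native byte order with 8-byte doubles and no header; the function name and dictionary keys are hypothetical.

```python
import struct


def parse_pars(raw, n_ind, n_sites):
    """Split the bytes of a .pars file into its four blocks of doubles."""
    n_values = 1 + 2 * n_ind + n_sites
    if len(raw) != 8 * n_values:
        raise ValueError("file size does not match n_ind and n_sites")
    values = struct.unpack("=%dd" % n_values, raw)
    return {
        "global_lkl": values[0],                          # 1 value
        "ind_lkl": list(values[1:1 + n_ind]),             # N_IND values
        "ind_f": list(values[1 + n_ind:1 + 2 * n_ind]),   # N_IND values
        "site_maf": list(values[1 + 2 * n_ind:]),         # N_SITES values
    }
```

For example, `parse_pars(open("output/file.pars", "rb").read(), n_ind, n_sites)` would return the values from a finished run, given the same --n_ind and --n_sites used for that run.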

Stopping Criteria

A common issue with iterative algorithms is choosing the stopping criteria. ngsF implements a dual-condition threshold: the relative difference in log-likelihood and the RMSD of the estimates (F and freq). As for which threshold to use, simulations show that 1e-5 is a reasonable value. However, if you're dealing with low coverage data (2x-3x), it might be worth using lower thresholds (between 1e-6 and 1e-9).
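A dual-condition check of this kind could be sketched as follows. This is an illustration only: it assumes that both quantities must fall below the same threshold and that the relative difference is taken with respect to the previous log-likelihood. The function names are hypothetical and ngsF's exact formulas may differ.

```python
import math


def rmsd(old, new):
    """Root-mean-square deviation between two equally long estimate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(old, new)) / len(new))


def has_converged(old_lkl, new_lkl, old_est, new_est, min_epsilon=1e-5):
    """True when both the log-likelihood and the estimates have stabilised.

    old_est/new_est hold the concatenated F and frequency estimates from two
    consecutive iterations; old_lkl must be non-zero.
    """
    rel_lkl = abs((new_lkl - old_lkl) / old_lkl)
    return rel_lkl < min_epsilon and rmsd(old_est, new_est) < min_epsilon
```

In an EM loop, such a check would only be evaluated after the minimum number of iterations and abandoned once the maximum is reached.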

Debug

Some available options are intended for debugging purposes only and should not be used in any real analysis!

  • --verbose: verbose values above 4
  • --quick: Only computes initial "freq" and "indF" values with no EM optimization.

Hints

  • Dataset: as a rule of thumb, use at least 1000 high confidence independent SNP sites.

  • Low coverage data: since the initial estimates are not reliable, it is recommended to use random starting points and stricter stopping criteria (e.g. --init_values r --min_epsilon 1e-9).

  • High coverage data: although F is not really useful in the prior, lower initial values seem to perform better (--init_values u).

  • Memory usage: by default, ngsF loads the entire file into memory. If the file is too big for the available memory, ngsF can instead load chunks as they are needed. This is implemented with the BGZF library (from the SAMtools package), which allows fast random access to BGZIP-compressed files through an internal virtual index. The library can only deal with BGZIP files, but a binary to compress them is provided. To use this library, add -D_USE_BGZF to the FLAGS in the Makefile.
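To judge whether a dataset fits in memory, the size of the GL array can be estimated from the input format (3 doubles per individual per site). The helper below is a hypothetical sketch that counts only the GL values themselves, not any other memory the program uses.

```python
def gl_bytes(n_ind, n_sites):
    """Bytes needed for 3 * n_ind * n_sites doubles of 8 bytes each."""
    return 3 * n_ind * n_sites * 8
```

For example, 100 individuals and 1,000,000 sites need about 2.4 GB. Assuming --chunk_size is counted in sites, a single chunk of 100,000 sites for the same sample would need about 240 MB.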

Contact

For questions on the usage of ngsF please check the tutorial or contact Dr Filipe G. Vieira.
