
bpucker / MGSE

Licence: GPL-3.0 license
Mapping-based Genome Size Estimation (MGSE) estimates genome size from a read mapping against an existing genome sequence assembly.

Programming Languages

python

Projects that are alternatives of or similar to MGSE

CliqueSNV
No description or website provided.
Stars: ✭ 13 (-40.91%)
Mutual labels:  ngs, pacbio, illumina
Gatk
Official code repository for GATK versions 4 and up
Stars: ✭ 1,002 (+4454.55%)
Mutual labels:  genomics, genome, ngs
catch
A package for designing compact and comprehensive capture probe sets.
Stars: ✭ 55 (+150%)
Mutual labels:  genomics, genome, ngs
fast-sg
Fast-SG: An alignment-free algorithm for ultrafast scaffolding graph construction from short or long reads.
Stars: ✭ 22 (+0%)
Mutual labels:  pacbio, illumina, genome-assembly
ngs-preprocess
A pipeline for preprocessing NGS data from Illumina, Nanopore and PacBio technologies
Stars: ✭ 22 (+0%)
Mutual labels:  ngs, pacbio, illumina
haslr
A fast tool for hybrid genome assembly of long and short reads
Stars: ✭ 68 (+209.09%)
Mutual labels:  genomics, pacbio, genome-assembly
Deepvariant
DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
Stars: ✭ 2,404 (+10827.27%)
Mutual labels:  genomics, genome, ngs
GenomeAnalysisModule
Welcome to the website and github repository for the Genome Analysis Module. This website will guide the learning experience for trainees in the UBC MSc Genetic Counselling Training Program, as they embark on a journey to learn about analyzing genomes.
Stars: ✭ 19 (-13.64%)
Mutual labels:  genomics, genome
mccortex
De novo genome assembly and multisample variant calling
Stars: ✭ 105 (+377.27%)
Mutual labels:  genomics, genome-assembly
Jvarkit
Java utilities for Bioinformatics
Stars: ✭ 313 (+1322.73%)
Mutual labels:  genomics, ngs
Galaxy
Data intensive science for everyone.
Stars: ✭ 812 (+3590.91%)
Mutual labels:  genomics, ngs
tiptoft
Predict plasmids from uncorrected long read data
Stars: ✭ 27 (+22.73%)
Mutual labels:  genomics, pacbio
CAMSA
CAMSA: a tool for Comparative Analysis and Merging of Scaffold Assemblies
Stars: ✭ 18 (-18.18%)
Mutual labels:  genomics, genome-assembly
Pygeno
Personalized Genomics and Proteomics. Main diet: Ensembl, side dishes: SNPs
Stars: ✭ 261 (+1086.36%)
Mutual labels:  genomics, genome
reg-gen
Regulatory Genomics Toolbox: Python library and set of tools for the integrative analysis of high throughput regulatory genomics data.
Stars: ✭ 64 (+190.91%)
Mutual labels:  genomics, ngs
companion
This repository has been archived, currently maintained version is at https://github.com/iii-companion/companion
Stars: ✭ 21 (-4.55%)
Mutual labels:  genomics, genome
Deeptools
Tools to process and analyze deep sequencing data.
Stars: ✭ 448 (+1936.36%)
Mutual labels:  genomics, ngs
Ngless
NGLess: NGS with less work
Stars: ✭ 115 (+422.73%)
Mutual labels:  genomics, ngs
Ribbon
A genome browser that shows long reads and complex variants better
Stars: ✭ 184 (+736.36%)
Mutual labels:  genomics, genome
Viral Ngs
Viral genomics analysis pipelines
Stars: ✭ 150 (+581.82%)
Mutual labels:  genomics, genome


Mapping-based Genome Size Estimation (MGSE)

MGSE uses files that are routinely generated in genome sequencing projects to predict the genome size. It requires a FASTA file containing a high-continuity assembly and a BAM file with all available reads mapped to this assembly. The script construct_cov_file.py (https://doi.org/10.1186/s12864-018-5360-z) generates a COV file from the (sorted) BAM file; MGSE can also do this directly via --bam. MGSE then uses this COV file to calculate the coverage in the provided reference regions and the total number of mapped bases, and estimates the genome size from these two values. Accurate reference regions are crucial for this estimation. Several alternatives were evaluated, and actual single-copy BUSCOs (https://busco.ezlab.org/) appear to be the best choice. Running BUSCO prior to MGSE generates all necessary files.
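
The calculation described above can be sketched in a few lines of Python 3. This is a minimal illustration, not MGSE's actual code: it assumes the estimate is the total number of mapped bases divided by the mean coverage of the reference regions, and that a COV file holds TAB-separated sequence ID, position, and coverage per line. The function names `parse_cov` and `estimate_genome_size` are hypothetical.

```python
import io

def parse_cov(handle):
    """Yield (sequence_id, position, coverage) tuples.

    Assumed layout: one TAB-separated line per position.
    """
    for line in handle:
        line = line.strip()
        if not line:
            continue
        seq_id, pos, cov = line.split("\t")[:3]
        yield seq_id, int(pos), float(cov)

def estimate_genome_size(cov_records, regions):
    """Divide total mapped bases by the mean coverage in the reference regions.

    regions: dict mapping sequence ID -> list of (start, end), 1-based inclusive.
    """
    total_mapped_bases = 0.0
    ref_cov_sum = 0.0
    ref_positions = 0
    for seq_id, pos, cov in cov_records:
        total_mapped_bases += cov  # every position contributes its depth
        if any(start <= pos <= end for start, end in regions.get(seq_id, [])):
            ref_cov_sum += cov
            ref_positions += 1
    if ref_positions == 0 or ref_cov_sum == 0:
        raise ValueError("no covered reference positions found")
    mean_ref_cov = ref_cov_sum / ref_positions
    return total_mapped_bases / mean_ref_cov

# Toy data: positions 3-4 have doubled coverage, as a collapsed repeat would.
toy_cov = io.StringIO("chr1\t1\t10\nchr1\t2\t10\nchr1\t3\t20\nchr1\t4\t20\n")
toy_regions = {"chr1": [(1, 2)]}
genome_size = estimate_genome_size(parse_cov(toy_cov), toy_regions)
print(genome_size)
```

In the toy data the assembly spans four positions, but the doubled coverage at two of them raises the estimate to six, which is the effect the method relies on to account for collapsed regions.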

MGSE workflow (Pucker, 2021; doi:10.1101/607390)
Usage:
  python MGSE.py [--cov <COV_FILE_OR_DIR> | --bam <BAM_FILE_OR_DIR>] --out <DIR>
                 [--ref <TSV> | --gff <GFF> | --busco <FULL_TABLE.TSV> | --all]

Mandatory:
  Coverage data (choose one)
  --cov STR          Coverage file (COV) created by construct_cov_file.py or directory containing
                     multiple coverage files
  --bam STR          BAM file to automatically create the coverage file
  
  Output directory
  --out STR          Output directory

  Reference regions to calculate average coverage (choose one)
  --ref STR          File containing TAB-separated chromosome, start, and end
  --gff STR          GFF3 file containing genes
  --busco STR        BUSCO annotation file (full_table_busco_run.tsv)
  --all              Use all positions of the assembly
		
Optional:
  --black STR            Sequence ID list for exclusion
  --gzip                 Search for files "*cov.gz" in --cov if this is a directory
  --bam_is_sorted        Do not sort BAM file
  --samtools STR         Full path to samtools (if not in your $PATH)
  --bedtools STR         Full path to bedtools (if not in your $PATH)
  --name STR             Prefix for output files []
  --m INT                Samtools sort memory [5000000000]
  --threads INT          Samtools sort threads [4]
  --plot TRUE|FALSE      Activate or deactivate generation of figures via matplotlib [FALSE]
  --blackoff TRUE|FALSE  Deactivate the blacklisting of contigs with high coverage values [FALSE]
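
As an illustration of how these options combine, the following Python 3 sketch assembles an MGSE command line with all paths converted to absolute ones. The helper `build_mgse_command` and the example file names are hypothetical; only the flags --cov, --out, and --busco are taken from the usage text above.

```python
import os

def build_mgse_command(cov, out_dir, busco_table, mgse_script="MGSE.py"):
    """Return an argument list for MGSE with every path made absolute.

    "python" must resolve to an interpreter that can run MGSE.
    """
    return [
        "python", os.path.abspath(mgse_script),
        "--cov", os.path.abspath(cov),
        "--out", os.path.abspath(out_dir),
        "--busco", os.path.abspath(busco_table),
    ]

# Hypothetical input names, for illustration only.
command = build_mgse_command("sample.cov", "mgse_out", "busco/full_table_busco_run.tsv")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`; building a list rather than a single string avoids shell quoting problems with paths that contain spaces.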

WARNING:

  • If --busco is used, the BUSCO GFF3 files need to be in the default folder relative to the provided TSV file.
  • MGSE expects absolute paths; at the very least, using absolute paths is recommended.
  • Python 2.7.x is required to run MGSE (a port to Python 3 is planned).
  • By default, contigs with very high coverage values are put on a blacklist to prevent inflation of the genome size prediction by plastome contigs (in plants). This function can be disabled via --blackoff to estimate genome sizes with more fragmented assemblies.
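
The idea behind the blacklist can be sketched as follows (Python 3). This is not MGSE's implementation: the cutoff rule (a multiple of the median per-contig coverage), the default `factor`, and the function name `flag_high_coverage_contigs` are assumptions chosen only to illustrate the principle.

```python
from statistics import median

def flag_high_coverage_contigs(mean_cov_per_contig, factor=1.5):
    """Return contig IDs whose mean coverage exceeds factor x the median.

    factor is a hypothetical cutoff, not MGSE's actual threshold.
    """
    cutoff = factor * median(mean_cov_per_contig.values())
    return sorted(c for c, cov in mean_cov_per_contig.items() if cov > cutoff)

# Toy per-contig mean coverage; the plastome contig stands out clearly.
toy = {"chr1": 30.0, "chr2": 32.0, "chr3": 28.0, "plastome_contig": 900.0}
blacklist = flag_high_coverage_contigs(toy)
print(blacklist)
```

In a highly fragmented assembly, many short contigs can carry unusual coverage for legitimate reasons, which is one reason to switch such a filter off in that case.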

Possible reference regions:

  1. --ref A very simple TAB-separated text file with information about chromosome, start, and end of regions which should be used as a reference set for the coverage calculation.

  2. --gff A GFF3 file with genes which should serve as reference regions.

  3. --busco Extracts the single-copy BUSCOs from the provided TSV file and retrieves the corresponding annotation from the GFF3 files generated while running BUSCO.

  4. --all All positions of the assembly will be included in the average coverage calculation.
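
To make the first and third options more concrete, here is a hedged Python 3 sketch of reading both input types. The --ref layout (chromosome, start, end) is taken from the description above. The BUSCO table layout (ID in the first column, status in the second, "Complete" marking single-copy hits, header lines starting with "#") is an assumption about BUSCO's output and may differ between BUSCO versions. Skipping "#" lines in the --ref file is also an assumption, and both function names are hypothetical.

```python
import io

def load_ref_regions(handle):
    """Parse TAB-separated chromosome, start, end lines into a dict of region lists."""
    regions = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # assumption: "#" marks comments
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions

def single_copy_busco_ids(handle):
    """Collect IDs of BUSCOs whose status column reads 'Complete' (assumed layout)."""
    ids = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1 and fields[1] == "Complete":
            ids.append(fields[0])
    return ids

# Toy inputs; all identifiers and coordinates are made up.
ref_regions = load_ref_regions(
    io.StringIO("chr1\t100\t500\nchr1\t900\t1200\nchr2\t1\t50\n")
)
busco_ids = single_copy_busco_ids(io.StringIO(
    "# Busco id\tStatus\tContig\tStart\tEnd\tScore\tLength\n"
    "BUSCO_A\tComplete\tchr1\t100\t500\t321.0\t120\n"
    "BUSCO_B\tDuplicated\tchr1\t900\t1200\t250.0\t98\n"
    "BUSCO_C\tMissing\n"
))
print(ref_regions)
print(busco_ids)
```

Duplicated, fragmented, and missing BUSCOs are dropped because only single-copy genes are expected to reflect the baseline coverage of the genome.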

Usage:
  python construct_cov_file.py --in <BAM_FILE> --out <OUT_FILE>

Mandatory:
  --in STR          BAM file
  --out STR         Output file

Optional:
  --bam_is_sorted   Don't sort bam file
  --m INT           Samtools sort memory [5000000000]
  --threads INT     Samtools sort threads [4]
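
A COV file of this kind could be produced with a sort step followed by a per-base depth calculation. The Python 3 sketch below only builds the two command lines. It assumes `samtools sort` and `bedtools genomecov -d -split -ibam` are the tools involved, which may differ from what construct_cov_file.py actually runs; the function name and the `.sorted.bam` suffix are hypothetical.

```python
def build_cov_commands(bam, cov_out, is_sorted=False, memory=5000000000,
                       threads=4, samtools="samtools", bedtools="bedtools"):
    """Return the argument lists needed to turn a BAM file into per-base coverage.

    The bedtools command writes to stdout; the caller redirects it to cov_out.
    """
    commands = []
    if not is_sorted:
        sorted_bam = cov_out + ".sorted.bam"  # hypothetical intermediate name
        commands.append([samtools, "sort", "-@", str(threads), "-m", str(memory),
                         "-o", sorted_bam, bam])
        bam = sorted_bam
    # -d reports depth at every position; -split handles spliced alignments.
    commands.append([bedtools, "genomecov", "-d", "-split", "-ibam", bam])
    return commands

cov_commands = build_cov_commands("reads.bam", "reads.cov")
for cmd in cov_commands:
    print(" ".join(cmd))
```

To execute the commands, the sort step can be run with `subprocess.run(cmd, check=True)` and the coverage step with `subprocess.run(cmd, stdout=open_file, check=True)`, where `open_file` is the output file opened for writing. Note that samtools applies its `-m` value per thread.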

Reference:

Pucker B. Mapping-based genome size estimation. bioRxiv 607390; doi: https://doi.org/10.1101/607390
