
clemgoub / TypeTE

Licence: other
Genotyping of segregating mobile elements insertions

Programming Languages

perl
6916 projects
python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to TypeTE

Hail
Scalable genomic data analysis.
Stars: ✭ 706 (+4606.67%)
Mutual labels:  bioinformatics, vcf
Svtyper
Bayesian genotyper for structural variants
Stars: ✭ 79 (+426.67%)
Mutual labels:  bioinformatics, vcf
Helmsman
highly-efficient & lightweight mutation signature matrix aggregation
Stars: ✭ 19 (+26.67%)
Mutual labels:  bioinformatics, vcf
Vcfanno
annotate a VCF with other VCFs/BEDs/tabixed files
Stars: ✭ 259 (+1626.67%)
Mutual labels:  bioinformatics, vcf
Survivor
Toolset for SV simulation, comparison and filtering
Stars: ✭ 180 (+1100%)
Mutual labels:  bioinformatics, vcf
Pygeno
Personalized Genomics and Proteomics. Main diet: Ensembl, side dishes: SNPs
Stars: ✭ 261 (+1640%)
Mutual labels:  bioinformatics, vcf
16gt
Simultaneous detection of SNPs and Indels using a 16-genotype probabilistic model
Stars: ✭ 26 (+73.33%)
Mutual labels:  bioinformatics, vcf
Tiledb Vcf
Efficient variant-call data storage and retrieval library using the TileDB storage library.
Stars: ✭ 26 (+73.33%)
Mutual labels:  bioinformatics, vcf
Biosyntax
Syntax highlighting for computational biology
Stars: ✭ 164 (+993.33%)
Mutual labels:  bioinformatics, vcf
Genomics
A collection of scripts and notes related to genomics and bioinformatics
Stars: ✭ 101 (+573.33%)
Mutual labels:  bioinformatics, vcf
EarlGrey
Earl Grey: A fully automated TE curation and annotation pipeline
Stars: ✭ 25 (+66.67%)
Mutual labels:  bioinformatics, transposable-elements
Hap.py
Haplotype VCF comparison tools
Stars: ✭ 249 (+1560%)
Mutual labels:  bioinformatics, vcf
polyRAD
Genotype Calling with Uncertainty from Sequencing Data in Polyploids 🍌🍓🥔🍠🥝
Stars: ✭ 16 (+6.67%)
Mutual labels:  bioinformatics, genotype-likelihoods
Htslib
C library for high-throughput sequencing data formats
Stars: ✭ 529 (+3426.67%)
Mutual labels:  bioinformatics, vcf
Truvari
Structural variant toolkit for VCFs
Stars: ✭ 85 (+466.67%)
Mutual labels:  bioinformatics, vcf
Cyvcf2
cython + htslib == fast VCF and BCF processing
Stars: ✭ 243 (+1520%)
Mutual labels:  bioinformatics, vcf
SVCollector
Method to optimally select samples for validation and resequencing
Stars: ✭ 20 (+33.33%)
Mutual labels:  bioinformatics, vcf
SumStatsRehab
GWAS summary statistics files QC tool
Stars: ✭ 19 (+26.67%)
Mutual labels:  bioinformatics
geneview
Genomics data visualization in Python by using matplotlib.
Stars: ✭ 38 (+153.33%)
Mutual labels:  bioinformatics
ngstools
My own tools code for NGS data analysis (Next Generation Sequencing)
Stars: ✭ 28 (+86.67%)
Mutual labels:  bioinformatics

TypeTE v1.1

changelog v1.0 --> v1.1

  • Output vcf:
    • Remove irrelevant INFO fields from the header of the output vcfs
    • Reference genotypes are now printed in the traditional (REF/ALT) format, with REF = TE present = 0 and ALT = TE absent (deletion) = 1.
  • Hard-code python2.7 in the assembly script to match SPAdes requirements
  • Improve the Non-Reference allele reconstruction script at the TSD
  • Fix bugs and silence harmless error messages
  • Rename parameterfile_NoRef.init to parameterfile_NRef.init to match the regular script naming
  • Create tutorial section (upcoming manuscript)

see the TypeTE paper in NAR (2020)

Purpose

TypeTE is a pipeline dedicated to genotyping segregating Mobile Element Insertions (MEIs) previously scored with an MEI detection tool such as MELT (Mobile Element Locator Tool, Gardner et al., 2017). TypeTE extracts the reads at each detected polymorphic MEI and accurately reconstructs both the presence and absence alleles. Remapping the reads of each individual to these alleles then allows the genotype of the MEI to be scored using a modified version of the genotype likelihood of Li et al. This method dramatically improves the quality of the genotypes of reported MEIs and can be used directly after a MELT run on both non-reference and reference insertions.
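To give an idea of what a read-based genotype likelihood looks like, here is a minimal Python sketch of the standard diploid formulation. It is not TypeTE's code: the modifications made by TypeTE are not described in this README, and the function names and the input format (one `(allele, error_prob)` pair per read) are assumptions made for illustration.

```python
def genotype_likelihoods(read_support):
    """Return (L(0/0), L(0/1), L(1/1)) for one locus in one individual.

    read_support: list of (allele, error_prob) tuples, one per read
    (hypothetical input format). allele is 0 or 1 depending on which
    allele the read maps to; error_prob is the probability that the
    read was assigned to the wrong allele.
    """
    likelihoods = []
    for n_alt in (0, 1, 2):  # copies of allele 1 carried by the genotype
        likelihood = 1.0
        for allele, err in read_support:
            # Probability of observing this read if it came from allele 0 / allele 1
            p_from_0 = (1.0 - err) if allele == 0 else err
            p_from_1 = (1.0 - err) if allele == 1 else err
            # Each chromosome copy is sampled with probability 1/2
            likelihood *= ((2 - n_alt) * p_from_0 + n_alt * p_from_1) / 2.0
        likelihoods.append(likelihood)
    return tuple(likelihoods)


def call_genotype(read_support):
    """Pick the genotype with the highest likelihood."""
    names = ("0/0", "0/1", "1/1")
    likelihoods = genotype_likelihoods(read_support)
    return names[likelihoods.index(max(likelihoods))]
```

With five reads on each allele and a 1% error rate, `call_genotype([(0, 0.01)] * 5 + [(1, 0.01)] * 5)` favours the heterozygous genotype, since every read has probability 0.5 under 0/1 but half of the reads would be errors under either homozygous genotype.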


TypeTE is divided into two modules: "Non-reference" to genotype insertions absent from the reference genome and "Reference" to genotype TE copies present in the reference genome.

Currently, TypeTE works only with Alu insertions in the human genome, but it will soon be available for L1 and SVA, as well as virtually any retrotransposon in any organism with a reference genome.

This pipeline is developed by Jainy Thomas (University of Utah) and Clement Goubert (Cornell University), in collaboration with Jeffrey M. Kidd (University of Michigan).

Please address all your questions and comments through the "issue" tab of the repository. This allows the community to search for and directly find answers to their issues. If help is not forthcoming, you can email your questions to goubert.clement[at]gmail.com

Installation

Dependencies

A docker container is coming for TypeTE! Stay tuned to get the latest version as soon as it comes out!

TypeTE relies on popular software that is often already in the toolbox of computational biologists. The programs listed in the "DEPENDENCIES PATH" section of the parameter file need to be installed and their paths reported in the file "parameterfile_[N]Ref.init". The Perl executable must be in the user's PATH.

Download and install

  1. Clone from git repository:
git clone --recurse-submodules https://github.com/clemgoub/TypeTE.git
cd TypeTE
  2. Fill in the path of each dependency in the files "parameterfile_Ref.init" and "parameterfile_NRef.init"

  3. And that's it!


Files preparation

You will need:

  1. A vcf/vcf.gz file (VCF) such as the one generated by the MELT discovery workflow. Examples are available in the folder "test_data". The vcf file must contain only Reference or only Non-reference loci, according to the module chosen. Loci/individuals must be subsetted from the original vcf/vcf.gz with the vcftools flag --recode-INFO-all so that the subsetted vcf is compatible with TypeTE. If a new vcf is created specifically for TypeTE, the following tags must be present in the "INFO" field (column) for non-reference loci only:
  • MEINFO= with the predicted subfamily (Repbase name) and orientation of the TE (ex: MEINFO=AluYa5,.,.,+ | if the subfamily is unknown: MEINFO=AluUndef,.,.,+)
  • TSD= to indicate the predicted TSD (ex: TSD=AATAGAATTAGCAATTTTG | if no TSD is detected: TSD=null)

example:

##fileformat=VCFv4.1
##<HEADER OF THE VCF FILE>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA07056	NA11830	NA12144
1	72639020	ALU_umary_ALU_244	C	<INS:ME:ALU>	.	.	MEINFO=AluUndef,4,281,-;TSD=AGCAATCTTATTTTC	GT	0|1	0|0	0|1
10	69994906	ALU_umary_ALU_8067	G	<INS:ME:ALU>	.	.	MEINFO=AluUndef,8,280,+;TSD=AATAGAATTAGCAATTTTG	GT	0|0	0|1	0|1

The "TSD=" and "MEINFO=" tags may appear in any order in the "INFO" column (8) of the vcf. These fields are not required for the Reference module, where they are extracted from the reference genome.
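Before launching the Non-reference module on a hand-made vcf, it can help to check that every INFO column carries both tags. The following Python sketch is one way to do so; the function names are hypothetical, and the checks (four comma-separated MEINFO values ending with the orientation, a TSD made of nucleotides or "null") are assumptions inferred from the examples above, not rules taken from the TypeTE code.

```python
import re


def parse_info(info_field):
    """Split a VCF INFO string ('KEY=VALUE;KEY2=VALUE2;FLAG') into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, _, value = entry.partition("=")
        info[key] = value
    return info


def check_nonref_info(info_field):
    """Return a list of problems for a non-reference locus (empty if none)."""
    info = parse_info(info_field)
    problems = []

    meinfo = info.get("MEINFO")
    if meinfo is None:
        problems.append("missing MEINFO")
    else:
        parts = meinfo.split(",")
        # Assumption: subfamily, two further values, then orientation
        if len(parts) != 4:
            problems.append("MEINFO should have 4 comma-separated values")
        elif parts[3] not in ("+", "-"):
            problems.append("MEINFO orientation should be + or -")

    tsd = info.get("TSD")
    if tsd is None:
        problems.append("missing TSD")
    elif tsd != "null" and not re.fullmatch(r"[ACGTNacgtn]+", tsd):
        problems.append("TSD should be a nucleotide sequence or 'null'")

    return problems
```

For instance, `check_nonref_info("MEINFO=AluUndef,4,281,-;TSD=AGCAATCTTATTTTC")` returns an empty list, and the order of the two tags does not matter because the INFO column is parsed into a dictionary first.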

  2. bam files for each individual found in the vcf file

  3. A two-column, tab-separated table with the sample name and the corresponding bam name (BAMFILE):

sample1 sample1-xxx-file.bam
sample2 sample2-yyy-file.bam
sample3 sample3-zzz-file.bam
  4. Reference genome (GENOME) in fasta format (tested to date with hg19 and hg38). If another reference genome is used, you will need to update the RepeatMasker track corresponding to your reference as well as the repeat you want to genotype.

  5. RepeatMasker track (RM_TRACK): a .bed file reporting each reference MEI masked by RepeatMasker in the reference sequence provided. The family names must match the names of the consensus sequences given in the RM_FASTA field (provided by default for Alu on hg19 and hg38).

  6. RepeatMasker consensus (RM_FASTA): a .fasta file with the consensus sequences of the repeats analysed (provided by default for Alu).

  7. Edit the file "parameterfile_NRef.init" or "parameterfile_Ref.init" following the indications within:

### MAIN PARAMETERS

# user data
VCF="/workdir/cg629/bin/TypeTE/test_data/test_data_nonref.vcf" #Path to MELT vcf (.vcf or .vcf.gz) must contain INFO field with TSD and MEI type
BAMPATH="/workdir/cg629/Projects/TypeTE_tutorial/test_data/" # Path to the bams folder
BAMFILE="/workdir/cg629/bin/TypeTE/test_data/input_table.txt" # <indiv_name> <bam_name> (2 fields tab separated table)

# genome data
RM_TRACK="/workdir/cg629/bin/TypeTE/Ressources/RepeatMasker_Alu_hg19.bed" # set by default for hg19
RM_FASTA="/workdir/cg629/bin/TypeTE/Ressources/refinelib" # set by default to be compatible with the Repeat Masker track included in the package
GENOME="/workdir/cg629/Projects/testTypeTE/hs37d5.fa" # Path to the reference genome sequence

# output
OUTDIR="/workdir/cg629/Projects/TypeTE_tutorial" # Path to place the output directory (will be named after PROJECT); OUTDIR must exist
PROJECT="OUTPUTS_NRef_testdata" # Name of the project (name of the folder)

# multi-threading
individual_nb="1" # number of individuals per job (try to minimize that number)
CPU="3" # number of CPUs (try to maximize that number) # CPU x individual_nb >= total # of individuals

## non-mandatory parameters
MAP="NO" # YES or NO (experimental)

### DEPENDENCIES PATH
# /!\ PERL MUST BE IN PATH /!\
PARALLEL="/programs/parallel/bin/parallel" #Path to the GNU Parallel program
PICARD="/programs/picard-tools-2.9.0" #Path to Picard Tools
BEDTOOLS="/programs/bedtools-2.27.1/bin/bedtools" #Path to bedtools executable
SEQTK="/programs/seqtk" #Path to seqtk executable
BAMUTILS="/programs/bamUtil" #Path to bamUtil
SPADES="/programs/spades-3.5.0/bin" #Path to spades bin directory (to locate spades.py and dispades.py)
MINIA="/workdir/cg629/bin/minia/build/bin" #Path to minia bin directory
CAP3="/workdir/cg629/bin/CAP3" #Path to CAP3 directory
BLAST="/programs/ncbi-blast-2.7.1+/bin" #Path to blast bin directory
BWA="/programs/bwa-0.5.9/bwa" #Path to bwa executable
BGZIP="bgzip" #Path to bgzip executable
TABIX="tabix" #Path to tabix executable
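The parameter file and the sample table can be checked before starting a long run. The Python sketch below reads the `KEY="value"` assignments, parses the two-column sample table, and tests the rule stated in the comments above (CPU x individual_nb >= total number of individuals). The function names are hypothetical and this is only an illustration, not part of TypeTE.

```python
import re


def read_parameters(text):
    """Collect KEY="value" assignments, ignoring comments and blank lines."""
    params = {}
    for line in text.splitlines():
        match = re.match(r'\s*([A-Za-z_]\w*)="([^"]*)"', line)
        if match:
            params[match.group(1)] = match.group(2)
    return params


def read_sample_table(text):
    """Parse the two-column table: sample name <TAB> bam file name."""
    samples = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"expected 2 tab-separated fields, got: {line!r}")
        samples[fields[0]] = fields[1]
    return samples


def enough_slots(params, n_individuals):
    """Check that CPU x individual_nb covers all the individuals."""
    return int(params["CPU"]) * int(params["individual_nb"]) >= n_individuals
```

In practice the text would come from `open("parameterfile_NRef.init").read()` and from the file given in BAMFILE. With the values shown above (CPU="3", individual_nb="1"), `enough_slots` returns True for the three individuals of the test data and False for four.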

Running TypeTE

  1. Fill in the appropriate parameterfile_[N]Ref.init according to your local paths and files
  2. Run the following command in the TypeTE folder:
nohup ./run_TypeTE_[N]Ref.sh &> TypeTE.log &

Use ./run_TypeTE_Ref.sh for reference insertions and ./run_TypeTE_NRef.sh for non-reference insertions.


Output

TypeTE outputs a vcf.gz file containing the genotypes of all individuals along with their genotype likelihoods. The vcf convention reports genotypes relative to the allele present in the reference genome. TypeTE therefore reports Reference insertions as 0/0 (homozygous for the presence of the TE) or 0/1 (heterozygous), with 1/1 genotypes being homozygous for the absence of the TE. The pattern is reversed for Non-Reference insertions: 0/0 is homozygous for the absence of the TE and 1/1 is homozygous for its presence.
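Because the meaning of 0 and 1 depends on the module, it can be convenient to convert genotypes into a number of TE copies before comparing Reference and Non-reference loci. A minimal Python sketch, where the function name and the "Ref"/"NRef" labels are hypothetical choices for this example:

```python
def te_copies(genotype, module):
    """Return the number of TE-carrying alleles (0, 1 or 2) for a genotype.

    genotype: a diploid VCF genotype such as "0/1" or "0|1".
    module:   "Ref"  (allele 0 = TE present, allele 1 = TE absent) or
              "NRef" (allele 0 = TE absent,  allele 1 = TE present).
    """
    if module not in ("Ref", "NRef"):
        raise ValueError(f"unknown module: {module!r}")
    alleles = genotype.replace("|", "/").split("/")
    if len(alleles) != 2 or not set(alleles) <= {"0", "1"}:
        raise ValueError(f"unsupported genotype: {genotype!r}")
    te_allele = "0" if module == "Ref" else "1"
    return alleles.count(te_allele)
```

For example, `te_copies("0/0", "Ref")` and `te_copies("1/1", "NRef")` both return 2, since each describes an individual homozygous for the presence of the TE.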

Test runs

Non-reference insertions

We have prepared a small tutorial/test run to check that all the components of TypeTE work properly.

We are going to run the pipeline on 2 loci in 3 individuals from the 1000 Genomes Project.

  1. Download the bam and bam.bai files

Within the TypeTE folder, type:

cd test_data
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA07056/alignment/NA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA07056/alignment/NA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam.bai
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA11830/alignment/NA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA11830/alignment/NA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam.bai
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12144/alignment/NA12144.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12144/alignment/NA12144.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam.bai

The corresponding bam/bam.bai files will be downloaded into /TypeTE/test_data

  2. Copy the parameterfile_NRef.init template present in /TypeTE/test_data to the main folder
cp parameterfile_NRef.init ../
cd ../
  3. Edit the parameterfile_NRef.init according to your dependencies and local paths.

  4. Run TypeTE

nohup ./run_TypeTE_NRef.sh &> TypeTE_TESTRUN.log &
  5. Expected results

The genotypes from the original vcf (<>/TypeTE/test_data/test_data_nonref.vcf) are the following

NA07056 NA11830 NA12144
1_72639020 0/1 0/0 0/1
10_69994906 0/0 0/1 0/1

The new genotypes should be

NA07056 NA11830 NA12144
1_72639020 1/1 0/1 0/1
10_69994906 0/0 1/1 0/1
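To see at a glance which genotypes were changed by the test run, the two tables can be compared programmatically. This Python sketch assumes the genotypes have already been written out as whitespace-separated tables like the ones above (a header of sample names, then one row per locus); the function names are hypothetical.

```python
def parse_genotype_table(text):
    """Parse a table whose header lists samples and whose rows start with a locus ID."""
    rows = [line.split() for line in text.strip().splitlines()]
    samples = rows[0]
    table = {}
    for row in rows[1:]:
        locus, genotypes = row[0], row[1:]
        table[locus] = dict(zip(samples, genotypes))
    return table


def changed_genotypes(before, after):
    """List (locus, sample, old genotype, new genotype) for every difference."""
    changes = []
    for locus, genotypes in before.items():
        for sample, old in genotypes.items():
            new = after[locus][sample]
            if new != old:
                changes.append((locus, sample, old, new))
    return changes


original = parse_genotype_table("""
NA07056 NA11830 NA12144
1_72639020 0/1 0/0 0/1
10_69994906 0/0 0/1 0/1
""")
expected = parse_genotype_table("""
NA07056 NA11830 NA12144
1_72639020 1/1 0/1 0/1
10_69994906 0/0 1/1 0/1
""")

for change in changed_genotypes(original, expected):
    print(*change)
```

Applied to the two tables above, this reports three re-genotyped calls: NA07056 and NA11830 at 1_72639020, and NA11830 at 10_69994906.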

Reference insertions

We will here genotype two reference loci in the same three individuals:

  1. Copy the parameterfile_Ref.init present in /TypeTE/test_data to the main folder
cp test_data/parameterfile_Ref.init .
  2. Edit the parameterfile_Ref.init according to your dependencies and local paths (but do not change anything else!)

  3. Run TypeTE

nohup ./run_TypeTE_Ref.sh &> TypeTE_TESTRUN_ref.log &
  4. Expected results

The genotypes from the original vcf (<>/TypeTE/test_data/test_data_ref.vcf) are the following

NA07056 NA11830 NA12144
5_88043130 0/1 1/1 0/1
6_7717368 0/1 0/1 0/1

The new genotypes should be

NA07056 NA11830 NA12144
5_88043130 1/1 0/1 0/1
6_7717368 1/1 1/1 0/1