marieBvr / TEs_genes_relationship_pipeline

Licence: GPL-3.0 license
Distribution of TEs and their relationship to genes in host genome

TEGRiP: Transposable Elements Genes RelationshIps Pipeline

TEGRiP is recommended by PCI Genomics (10.24072/pci.genomics.100010). If you use it, please cite 10.1101/2021.02.25.432867.

Context

The main goal of this project is to find the positional relationships between Transposable Elements (TEs) and genes along a genome.

The Apricot genome annotation was used to validate our strategy (raw data available on ENA under PRJEB42606). The pipeline can be used with a custom TE annotation as well as with a de novo assembled genome of any species.

You can learn more about this subject in the Contribution Section.

Introduction

Transposable elements are DNA fragments capable of moving from one place to another throughout the genome via a mechanism called transposition. There are different categories/classes of transposons. This project focuses on LTRs (long terminal repeats). Learn more by clicking here

Requirement

To use this programme, clone the repository and make sure that Python 3 and RStudio/R are installed on your machine. If they are not, check out the following links:

  • python 3 : link
  • Rstudio/R >= 4.0.2 : link

Example

Before running the programme on your own data, please use the testing data to check that everything works.

The data folder contains 2 files: one describing each gene and one describing each transposon displayed in the diagram above. For each transposable element, the Python scripts find the nearest gene before it and the nearest gene after it, and also look for overlaps. The result file for this test data is located in the result folder.

Usage


Python Scripts

Input files

The programme relies on column names, so before running the scripts please verify that each file has the following columns:

  • Genome files
chromosome source feature start end score strand phase ID Attributes
  • TE files
Chromosome Length_Chr Type X Start End X X Strand X Attribute X Class TE_name X
  • LTR files
species ID dfam_target_name X X X X chromosome start end strand X annotation X X score
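Because the scripts depend on column names, it can help to check a file's header before launching a run. The sketch below is not part of TEGRiP: the helper `missing_columns` is hypothetical, and it assumes that the input is tab-separated with the column names on the first row. The column list is the one given above for genome files.

```python
import csv
import io

# Column names copied from the "Genome files" entry above.
GENE_COLUMNS = ["chromosome", "source", "feature", "start", "end",
                "score", "strand", "phase", "ID", "Attributes"]


def missing_columns(handle, required, delimiter="\t"):
    """Return the required column names absent from the header row of `handle`.

    Hypothetical helper: assumes the first row of the file holds the column names.
    """
    header = next(csv.reader(handle, delimiter=delimiter), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Demo on an in-memory file whose header lacks the "phase" column.
sample = io.StringIO(
    "chromosome\tsource\tfeature\tstart\tend\tscore\tstrand\tID\tAttributes\n"
    "chr1\tmaker\tgene\t100\t900\t.\t+\tg1\tID=g1\n"
)
print(missing_columns(sample, GENE_COLUMNS))  # ['phase']
```

For a real file, open it with `open(path, newline="")` and pass the handle in place of `sample`; an empty list means every expected column is present.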

Pipeline

The scripts are optimized for speed with multiprocessing: the work is spread across multiple central processing units (CPUs), which makes the scripts run faster.

To run a script, type the following command, replacing each file name with the name of your own file.

To analyze general TEs:
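As an illustration of the idea only, the sketch below spreads a per-TE computation over several worker processes with Python's standard `multiprocessing.Pool`. It does not reproduce the TEGRiP scripts: the function names, the dictionary keys and the toy task (counting overlapping genes on a single chromosome) are assumptions made for the example.

```python
from multiprocessing import Pool


def count_overlapping_genes(task):
    """Worker: count the genes that overlap one TE.

    `task` is a (te, genes) pair; all records are assumed to lie on one chromosome.
    """
    te, genes = task
    return sum(1 for g in genes
               if g["start"] <= te["end"] and g["end"] >= te["start"])


def count_all(tes, genes, processes=2):
    """Distribute the TEs over `processes` workers; results keep the input order."""
    tasks = [(te, genes) for te in tes]
    with Pool(processes=processes) as pool:
        return pool.map(count_overlapping_genes, tasks)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the test files.
    genes = [{"start": 100, "end": 500}, {"start": 900, "end": 1500}]
    tes = [{"start": 450, "end": 950}, {"start": 2000, "end": 2300}]
    print(count_all(tes, genes))
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the script, since it prevents each worker from launching a pool of its own.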

python3 Multiprocessing/Create_Data_multipro.py \
	-g data/Gene_testing_data.tsv \
	-te data/Transposon_testing_data.tsv \
	-o result/output_TE.tsv

To analyze LTRs:

python3 Multiprocessing/Create_Data_LTR_multiprocessing.py \
	-g data/Gene_testing_data.tsv \
	-te data/LTR_testing_data.tsv \
	-o result/output_LTR.tsv

The script reads each file and stores its records in lists of dictionaries. Then, for each TE, it finds the nearest gene and determines whether the TE is a subset or a superset of that gene, lies upstream or downstream of it, or overlaps it on the upstream or downstream side.
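The comparison described above can be sketched as follows. This is a minimal illustration, not the code of the TEGRiP scripts. It assumes inclusive coordinates, ignores strand, and calls a TE "upstream" when it sits at lower coordinates than the gene; the actual scripts may define these terms differently. The function names and dictionary keys are hypothetical.

```python
def relationship(te_start, te_end, gene_start, gene_end):
    """Classify one TE against one gene (inclusive coordinates, strand ignored)."""
    if te_end < gene_start:
        return "upstream"
    if te_start > gene_end:
        return "downstream"
    # From here on, the two intervals share at least one position.
    if te_start >= gene_start and te_end <= gene_end:
        return "subset"      # TE lies entirely inside the gene
    if te_start <= gene_start and te_end >= gene_end:
        return "superset"    # TE contains the whole gene
    if te_start < gene_start:
        return "upstream_overlap"
    return "downstream_overlap"


def nearest_genes(te, genes):
    """Find the closest gene before and after a TE, plus any overlapping genes."""
    same_chr = [g for g in genes if g["chromosome"] == te["chromosome"]]
    before = [g for g in same_chr if g["end"] < te["start"]]
    after = [g for g in same_chr if g["start"] > te["end"]]
    overlapping = [g for g in same_chr
                   if g["start"] <= te["end"] and g["end"] >= te["start"]]
    return {
        "before": max(before, key=lambda g: g["end"], default=None),
        "after": min(after, key=lambda g: g["start"], default=None),
        "overlaps": [(g["ID"], relationship(te["start"], te["end"],
                                            g["start"], g["end"]))
                     for g in overlapping],
    }


# Hypothetical records stored as a list of dictionaries.
genes = [
    {"ID": "g1", "chromosome": "chr1", "start": 100, "end": 500},
    {"ID": "g2", "chromosome": "chr1", "start": 900, "end": 1500},
    {"ID": "g3", "chromosome": "chr1", "start": 3000, "end": 4000},
]
te = {"chromosome": "chr1", "start": 1400, "end": 1800}
print(nearest_genes(te, genes))
```

In this toy example the TE overlaps the end of g2, so g2 is reported as a downstream overlap, while g1 and g3 are the nearest non-overlapping genes before and after the TE.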

Output file

The script creates a new output file (.tsv), which is then used for the statistical analysis.

A prettier, easier-to-read table can be generated with:

python3 Multiprocessing/Create_Data_multipro_reformatted.py \
	-g data/Gene_testing_data.tsv \
	-te data/Transposon_testing_data.tsv \
	-o result/output_TE.tsv

python3 Multiprocessing/Create_Data_LTR_multiprocessing_reformatted.py \
	-g data/Gene_testing_data.tsv \
	-te data/LTR_testing_data.tsv \
	-o result/output_LTR.tsv

R scripts

Four R scripts report different kinds of information:

  • Count the number of TEs
  • Count the number of TEs with associated genes within a certain distance
  • Overlap statistics, which show how many TEs overlap a gene, both upstream and downstream.
  • Distance statistics, which show, for each TE superfamily, the number of TEs that overlap the closest gene (-i -1 -x 0) or lie within 0-500 bp (-i 0 -x 500), 500-1000 bp (-i 501 -x 1000), 1000-2000 bp (-i 1001 -x 2000) or more than 2000 bp of it. Subsets and supersets are not included in these counts.

The input file is the result file obtained with the Python script.

Rscript Rscript/number_te.r \
	-f result/output_TE.tsv \
	-o result/count_TE_transposons.pdf

Rscript Rscript/count_gene_associated_te.r \
	-f result/output_TE.tsv \
	-o result/count_TE_transposons.pdf \
	-i 0 \
	-x 2000

Rscript Rscript/Overlap_counting.r \
	-f result/output_TE.tsv \
	-p result/overlap_TE_results.pdf \
	-o result/overlap_TE_results.csv

Rscript Rscript/Distance_counting.r \
	-f result/output_TE.tsv \
	-p result/distance_TE_results.pdf \
	-o result/distance_TE_results.csv
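The distance classes reported by the distance statistics can be illustrated with a short Python sketch. It is not a translation of Distance_counting.r. The boundary handling is an assumption: a distance of 0 or less is counted as an overlap, distances are whole base pairs, and each upper bound is inclusive. The function names are hypothetical.

```python
from collections import Counter


def distance_class(distance):
    """Map a TE-to-gene distance in bp to one of the classes listed above.

    Assumption: distance <= 0 means overlap; upper bounds are inclusive.
    """
    if distance <= 0:
        return "overlap"
    if distance <= 500:
        return "0-500 bp"
    if distance <= 1000:
        return "500-1000 bp"
    if distance <= 2000:
        return "1000-2000 bp"
    return ">2000 bp"


def count_classes(distances):
    """Count how many distances fall into each class."""
    return Counter(distance_class(d) for d in distances)


# Hypothetical distances, for illustration only.
print(count_classes([0, 120, 500, 501, 1500, 2000, 2500]))
```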

Please note that the graph's legend will also need to be changed according to the file and species.

Contribution

This programme was developed by Caroline Meguerditchian and Ayse Ergun under the supervision of Marie Lefebvre and Quynh Trang-Bui.
