All Projects → vpc-ccg → sedef

vpc-ccg / sedef

Licence: MIT license
Identification of segmental duplications in the genome

Programming Languages

C++
36643 projects - #6 most used programming language
Jupyter Notebook
11667 projects
python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to sedef

svtools
Tools for processing and analyzing structural variants.
Stars: ✭ 118 (+436.36%)
Mutual labels:  structural-variation
svict
Structural Variation and fusion detection using targeted sequencing data from circulating cell free DNA
Stars: ✭ 21 (-4.55%)
Mutual labels:  structural-variation
MA
The Modular Aligner and The Modular SV Caller
Stars: ✭ 39 (+77.27%)
Mutual labels:  structural-variation
pipeline-structural-variation
Pipeline for calling structural variations in whole genomes sequencing Oxford Nanopore data
Stars: ✭ 104 (+372.73%)
Mutual labels:  structural-variation
Circle-Map
A method for circular DNA detection based on probabilistic mapping of ultrashort reads
Stars: ✭ 45 (+104.55%)
Mutual labels:  structural-variation
arriba
Fast and accurate gene fusion detection from RNA-Seq data
Stars: ✭ 162 (+636.36%)
Mutual labels:  structural-variation
PopDel
Population-wide Deletion Calling
Stars: ✭ 31 (+40.91%)
Mutual labels:  structural-variation
simuG
simuG: a general-purpose genome simulator
Stars: ✭ 68 (+209.09%)
Mutual labels:  structural-variation
MINTIE
Method for Identifying Novel Transcripts and Isoforms using Equivalence classes, in cancer and rare disease.
Stars: ✭ 24 (+9.09%)
Mutual labels:  structural-variation
witty.er
What is true, thank you, ernestly. A large variant benchmarking tool analogous to hap.py for small variants.
Stars: ✭ 22 (+0%)
Mutual labels:  structural-variation
ACEseqWorkflow
Allele-specific copy number estimation with whole genome sequencing
Stars: ✭ 19 (-13.64%)
Mutual labels:  structural-variation
SVCollector
Method to optimally select samples for validation and resequencing
Stars: ✭ 20 (-9.09%)
Mutual labels:  structural-variation
arcsv
Complex structural variant detection from WGS data
Stars: ✭ 16 (-27.27%)
Mutual labels:  structural-variation
dysgu
dysgu-SV is a collection of tools for calling structural variants using short or long reads
Stars: ✭ 47 (+113.64%)
Mutual labels:  structural-variation
MUMandCo
MUM&Co is a simple bash script that uses Whole Genome Alignment information provided by MUMmer (only v4) to detect Structural Variation
Stars: ✭ 36 (+63.64%)
Mutual labels:  structural-variation
squid
SQUID detects both fusion-gene and non-fusion-gene structural variations from RNA-seq data
Stars: ✭ 37 (+68.18%)
Mutual labels:  structural-variation

🔴 ⚠️ SEDEF has been deprecated. Please use BISER (SEDEF's successor) instead. ⚠️ 🔴

SEDEF: Segmental Duplication Evaluation Framework

SEDEF is a tool for quick detection of segmental duplications in a genome.

Paper

SEDEF has been presented at ECCB 2018 (DOI 10.1093/bioinformatics/bty586). Preprint is available here. Get the final paper here.

Results

👨🎨 Human (hg38) 👨‍🎨 Human (hg19) 🐭 Mouse (mm8)
Final calls Final calls Final calls

The experiment pipeline from the paper is described in this Jupyter notebook.

How to compile

Simple! Do this:

git clone https://github.com/vpc-ccg/sedef
cd sedef
make -j release

By default, SEDEF uses Intel C++ compiler. If you are using g++, build with:

make -j release CXX=g++

If you are using Clang on macOS, compile as

brew install libomp
make -j release OPENMP="-Xpreprocessor -fopenmp" CXX=clang++

You need at least g++ 5.1.0 (C++14) to compile SEDEF. Clang should work fine as well.

SEDEF requires Boost libraries in order to compile. In case you installed Boost in a non-standard directory, you can still compile as follows:

CPATH={path_to_boost} make -j release

How to run

The genome assembly must be soft-masked (i.e. all common and tandem repeats should be converted to lower-case letters) and indexed. Suppose that our genome is hg19.fa (we use UCSC hg19 genome with 24 standard chromosomes that does not contain patches (unGl) or random strains (chrXX_random)).

Automatic transmission

Just go to sedef directory and run

./sedef.sh -o <output> -j <jobs> <genome> 

For example, to run hg19.fa on 80 cores type:

./sedef.sh -o sedef_hg19 -j 80 hg19.fa

You can add -f if sedef_hg19 already exists (it will overwrite the existing content though). The final results will be located in sedef_hg19/final.bed.

Please note that sedef.sh depends on Samtools and GNU Parallel. If you want to experiment with different parameters, run sedef help for parameter documentation.

Output

Output will be located in <out_dir>/final.bed.

The fields of BEDPE file are as follows:

First 6 fields are standard BEDPE fields describing the coordinates of SD mates:

  • chr1, start1 and end1
  • chr2, start2 and end2

Other fields are (in the order of appearance):

Field Description
name SD name
score Total alignment error
strand1 1st SD mate strand
strand2 2nd SD mate strand
max_len Length of longer mate
aln_len Alignment length (length with gaps)
cigar Empty string
comment Comment: currently shows mismatch base error (m) and gap base error (g)
indel_a Number of gap bases in the 1st mate
indel_b Number of gap bases in the 2nd mate
alnB Aligned base count (matches and mismatches without gaps)
matchB Match base count
mismatchB Mismatch base count
transitionsB Transition count (A <-> G and C <-> T)
transversions Transversion count (all mismatches that are not transitions)
fracMatch matchB / alnB
fracMatchIndel matchB / aln_len
jck Jaccard score: where w = mismatchB / alnB
k2K Kimura score: where p = transitionsB / alnB and q = transitionsB / alnB
aln_gaps Number of gaps in the alignment
uppercaseA Number of non-masked (uppercase) bases in the 1st mate
uppercaseB Number of non-masked (uppercase) bases in the 2nd mate
uppercaseMatches Non-masked match count
aln_matches Match base count
aln_mismatches Mismatch base count
aln_gaps Number of gaps in the alignment
aln_gap_bases Number of gap bases (indel_a + indel_b)
cigar CIGAR string of the SD mate alignment
filter_score (aln_gaps + aln_mismatches) / aln_len (should be ≥ 0.5)

All errors are expressed as ratios (0.0--1.0) of the alignment length unless otherwise noted.

Warning: as per WGAC, when calculating the similarity and error rates (fields score, fracMatch, fracMatchIndel and filter_score) SEDEF counts a gap as a single error (so the hypothetical alignment of A-----GC and AT-----C will have error 4 and NOT 8). This might lead to SDs with rather large gap contents. For more filtering, consult comment field that provides the percentage of match/mismatch and gap bases.

Manual transmission

First make sure to index the genome:

samtools faidx hg19.fa

Then run the sedef-search in parallel (in this example, we will use GNU parallel) to get the initial seeds:

mkdir -p out # For the output
mkdir -p out/log # For the logs

for i in `seq 1 22` X Y; do 
for j in `seq 1 22` X Y; do  
	SI=`awk '$1=="chr'$i'" {print $2}' hg19.fa.fai`; 
	SJ=`awk '$1=="chr'$j'" {print $2}' hg19.fa.fai`; 
	if [ "$SI" -le "$SJ" ] ; 
	then 
		for m in y n ; do
		[ "$m" == "y" ] && rc="-r" || rc="";
		echo "sedef search $rc hg19.fa chr$i chr$j >out/${i}_${j}_${m}.bed 2>out/log/${i}_${j}_${m}.log"
		done; 
	fi
done
done | time parallel --will-cite -j 80 --eta

# Now make sure that all runs completed successfully 
grep Total out/log/*.log | wc -l
# You should see here 600 (or n(n+1) if you have n chromosomes in your file)

# Get the single-core running time
grep Wall out/log/*.log | tr -d '(' | awk '{s+=$4}END{print s}'

# Get the maximum meory usage as well
grep Memory out/log/*.log | awk '{if($3>m)m=$3}END{print m}'

Then use sedef-align to bucket the files for the optimal parallel alignment. Afterwards, start the whole alignment:

# First bucket the reads into 1000 bins
mkdir -p out/bins
mkdir -p out/log/bins
time sedef align bucket -n 1000 out out/bins

# Now run the alignment
for j in out/bins/bucket_???? ; do
	k=$(basename $j);
	echo "sedef align generate -k 11 hg19.fa $j >${j}.bed 2>out/log/bins/${k}.log"
done | time parallel --will-cite -j 80 --eta

# Make sure that all runs finished nicely
grep Finished out/log/bins/*.log | wc -l
# Should be number of bins (in our case, 1000)

# Get again the total running time
grep Wall out/log/bins/*.log | tr -d '(' | awk '{s+=$4}END{print s}'

# And the memory
grep Memory out/log/bins/*.log | awk '{if($3>m)m=$3}END{print m}'

Finally, run sedef-stats to produce the final output:

# Concatenate the files 
cat out/*.bed > out.bed # seed SDs
cat out/bins/bucket_???? > out.init.bed # potential SD regions
cat out/bins/*.bed | sort -k1,1V -k9,9r -k10,10r -k4,4V -k2,2n -k3,3n -k5,5n -k6,6n |\
	uniq > out.final.bed # final chains

# Now get the final calls
sedef stats generate hg19.fa out.final.bed |\
		sort -k1,1V -k9,9r -k10,10r -k4,4V -k2,2n -k3,3n -k5,5n -k6,6n |\
		uniq > out.hg19.bed

Acknowledgments

SEDEF uses {fmt}, argh and the modified version of Heng Li's ksw2.

Support

Questions, bugs? Open a GitHub issue or drop me an e-mail at inumanag at mit dot edu.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].