mcveanlab / mccortex

Licence: MIT License
De novo genome assembly and multisample variant calling

McCortex: Population De Novo Assembly and Variant Calling

Multi-sample de novo assembly and variant calling using linked de Bruijn graphs. Variant calling with and without a reference genome. Between closely related samples or highly diverged ones. From bacterial to mammalian genomes. Minimal configuration. And it's free.

Isaac Turner's rewrite of cortex_var, to handle larger populations with better genome assembly, as a set of modular commands. PhD supervisor: Prof Gil McVean. Collaborators: Zam Iqbal, Kiran Garimella. Based at the Wellcome Trust Centre for Human Genetics, University of Oxford.

27 May 2018

Build

McCortex compiles with clang and gcc and is tested on Mac OS X and Linux. It requires zlib. Download with:

git clone --recursive https://github.com/mcveanlab/mccortex

Install dependencies (for htslib) on Mac:

brew update
brew install xz

Or on Linux:

sudo apt install liblzma-dev libbz2-dev
sudo apt install r-base-core  # if you want to plot with R

To compile for a maximum kmer size of 31:

make all

To compile for a maximum kmer size of 63:

make MAXK=63 all

Executables appear in the bin/ directory.
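
As a rough illustration of how the MAXK build relates to the kmer size you plan to run, the Python sketch below picks a build for a given k. It assumes that kmer sizes must be odd, that the lower bound is 3 (taken from the k=3..31 range printed in the usage text), and that the MAXK=63 build is named mccortex63 by analogy with mccortex31. The helper names maxk_for and binary_for are hypothetical and not part of McCortex.

```python
from pathlib import Path


def maxk_for(k: int) -> int:
    """Return the MAXK build (31 or 63) able to handle kmer size k.

    Assumption: kmer sizes are odd and at least 3. Only the two builds
    described in this README are covered.
    """
    if k < 3 or k % 2 == 0:
        raise ValueError(f"kmer size must be an odd number >= 3, got {k}")
    if k <= 31:
        return 31
    if k <= 63:
        return 63
    raise ValueError(f"k={k} needs a larger MAXK than this sketch covers")


def binary_for(k: int, ctxdir: str = "~/mccortex") -> Path:
    """Path of the executable assumed to handle kmer size k."""
    # Assumed naming: bin/mccortex<MAXK>, e.g. bin/mccortex31
    return Path(ctxdir).expanduser() / "bin" / f"mccortex{maxk_for(k)}"


if __name__ == "__main__":
    for k in (31, 61):
        print(k, binary_for(k))
```

Running at k=31 and k=61, as in the quickstart, would under these assumptions need both the MAXK=31 and MAXK=63 builds.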

Quickstart: Variant calling

Download and compile McCortex. It can live in any directory; the commands below assume it is in ~/mccortex/:

git clone --recursive https://github.com/mcveanlab/mccortex
cd mccortex
make all MAXK=31
make all MAXK=63

Now write a file detailing your samples and their data. Columns are separated by one or more spaces/tabs. Multiple files within a column are separated by commas. The two files of a paired-end read set are joined by a colon ':'. As in the example, a lone '.' marks an empty column. File paths can be relative to the current directory or absolute. Most file formats are supported:

cd /path/to/your/data
echo "#sample_name  SE_files   PE_files                     interleaved_files" >  samples.txt
echo "Mickey        a.fa,b.fa  reads.1.fq.gz:reads.2.fq.gz  ."                 >> samples.txt
echo "Minney        .          reads.1.fq.gz:reads.2.fq.gz  in.bam"            >> samples.txt
echo "Pluto         seq.fq     .                            pluto.cram"        >> samples.txt
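
To sanity-check a sample file before building the job file, a small parser along the following lines may help. It is a sketch of the column rules described above, not McCortex's own parser. It assumes exactly four columns per line, that lines starting with '#' are comments, and that '.' means "no files". The Sample class and parse_samples function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Sample:
    name: str
    se_files: List[str] = field(default_factory=list)
    pe_files: List[Tuple[str, str]] = field(default_factory=list)
    interleaved_files: List[str] = field(default_factory=list)


def _split(column: str) -> List[str]:
    # Assumption: a lone "." marks an empty column
    return [] if column == "." else column.split(",")


def parse_samples(text: str) -> List[Sample]:
    """Parse sample-file text into Sample records."""
    samples = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header/comment line
        cols = line.split()  # one or more spaces/tabs between columns
        if len(cols) != 4:
            raise ValueError(f"line {lineno}: expected 4 columns, got {len(cols)}")
        name, se, pe, interleaved = cols
        pairs = []
        for entry in _split(pe):
            first, sep, second = entry.partition(":")
            if not sep or not first or not second:
                raise ValueError(f"line {lineno}: bad paired-end entry {entry!r}")
            pairs.append((first, second))
        samples.append(Sample(name, _split(se), pairs, _split(interleaved)))
    return samples


if __name__ == "__main__":
    with open("samples.txt") as handle:
        for sample in parse_samples(handle.read()):
            print(sample)
```

The parser only checks the layout; it does not test whether the listed files exist.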

Create a job file from your sample file (samples.txt). All output will go into the directory we specify (mc_calls). We also specify the kmer(s) to use. We'll run at k=31 and k=61 and merge the results.

If your data are haploid, we set --ploidy 1:

~/mccortex/scripts/make-pipeline.pl -r /path/to/ref.fa --ploidy 1 31,61 mc_calls samples.txt > job.k31.k61.mk

If your samples are human, you have a mix of haploid and diploid chromosomes. Therefore you need to specify which samples have only one copy of chrX and one of chrY. The format is -P <sample>:<chr>:<ploidy> where <sample> and <chr> can be comma-separated lists. Ploidy arguments are read in order.

~/mccortex/scripts/make-pipeline.pl -r /path/to/ref.fa --ploidy "-P .:.:2 -P .:chrY:1 -P Mickey:chrX:1" 31,61 mc_calls samples.txt > job.k31.k61.mk
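
The ploidy rules can be easier to reason about when written as data. The Python sketch below builds the quoted --ploidy string from a list of rules and shows one plausible reading of "read in order": a later matching rule overrides an earlier one. That override behaviour, and the treatment of '.' as a wildcard, are assumptions inferred from the example above rather than documented semantics. The function names are hypothetical.

```python
from typing import List, Tuple

# (sample pattern, chromosome pattern, ploidy); patterns may be
# comma-separated lists, and "." is assumed to match anything.
Rule = Tuple[str, str, int]


def ploidy_option(rules: List[Rule]) -> str:
    """Format rules as the string passed to --ploidy."""
    return " ".join(f"-P {sample}:{chrom}:{ploidy}" for sample, chrom, ploidy in rules)


def _matches(pattern: str, value: str) -> bool:
    return pattern == "." or value in pattern.split(",")


def resolve_ploidy(rules: List[Rule], sample: str, chrom: str) -> int:
    """Ploidy for one sample/chromosome, assuming later rules override earlier ones."""
    result = None
    for sample_pat, chrom_pat, ploidy in rules:
        if _matches(sample_pat, sample) and _matches(chrom_pat, chrom):
            result = ploidy  # assumed: the last matching rule wins
    if result is None:
        raise LookupError(f"no ploidy rule matches {sample}:{chrom}")
    return result


if __name__ == "__main__":
    rules = [(".", ".", 2), (".", "chrY", 1), ("Mickey", "chrX", 1)]
    print(ploidy_option(rules))
    for sample in ("Mickey", "Minney"):
        for chrom in ("chr1", "chrX", "chrY"):
            print(sample, chrom, resolve_ploidy(rules, sample, chrom))
```

Under these assumptions the general rule (.:.:2) goes first and the exceptions follow it, which matches the ordering used in the command above.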

Now you're ready to run. You'll need to pass:

  • path to McCortex CTXDIR=
  • how much memory to use MEM= (2GB for ten E. coli, 70GB for a human)
  • number of threads to use NTHREADS=

Run the job file:

make -f job.k31.k61.mk CTXDIR=~/mccortex MEM=70GB NTHREADS=8 \
                       JOINT_CALLING=yes USE_LINKS=no brk-geno-vcf
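
If you launch runs from a script, the same make invocation can be assembled programmatically. The following Python sketch only builds the argument list; the variable names (CTXDIR, MEM, NTHREADS, JOINT_CALLING, USE_LINKS) and the brk-geno-vcf target are those shown above, while the make_command helper itself is hypothetical. It assumes JOINT_CALLING and USE_LINKS each accept yes or no.

```python
import os
import shlex
from typing import List


def make_command(job_file: str, ctxdir: str, mem: str, nthreads: int,
                 target: str = "brk-geno-vcf", joint_calling: bool = True,
                 use_links: bool = False, dry_run: bool = False) -> List[str]:
    """Build the argument list for running a McCortex job file with make."""
    def yes_no(flag: bool) -> str:
        return "yes" if flag else "no"

    cmd = [
        "make", "-f", job_file,
        # No shell is involved when running via subprocess, so expand "~" here
        f"CTXDIR={os.path.expanduser(ctxdir)}",
        f"MEM={mem}",
        f"NTHREADS={nthreads}",
        f"JOINT_CALLING={yes_no(joint_calling)}",
        f"USE_LINKS={yes_no(use_links)}",
    ]
    if dry_run:
        cmd.append("--dry-run")  # print the commands without running them
    cmd.append(target)
    return cmd


if __name__ == "__main__":
    cmd = make_command("job.k31.k61.mk", "~/mccortex", "70GB", 8, dry_run=True)
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Starting with dry_run=True lets you inspect the planned commands before committing hours of compute.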

For a human genome, a single sample takes about 8 hours to run and uses about 70GB RAM. For small numbers of similar samples, peak memory usage stays close to that of a single sample; as more samples are added it should increase roughly logarithmically with the number of samples.

Job finished? Your results are in: mc_calls/vcfs/breakpoints.joint.plain.k31.k61.geno.vcf.gz.
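
For a quick look at the result without extra tools, a gzipped VCF can be summarised with the Python standard library alone. The sketch below reads the sample names from the #CHROM header line and counts records per chromosome. It relies only on the general tab-separated VCF layout and makes no assumptions about McCortex-specific INFO or FORMAT fields; summarise_vcf is a hypothetical helper name.

```python
import gzip
from collections import Counter
from typing import List, Tuple


def summarise_vcf(path: str) -> Tuple[List[str], Counter]:
    """Return (sample names, per-chromosome record counts) for a gzipped VCF."""
    samples: List[str] = []
    per_chrom: Counter = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample columns follow FORMAT
            elif line.strip():
                per_chrom[fields[0]] += 1
    return samples, per_chrom


if __name__ == "__main__":
    vcf = "mc_calls/vcfs/breakpoints.joint.plain.k31.k61.geno.vcf.gz"
    names, counts = summarise_vcf(vcf)
    print("samples:", ", ".join(names))
    for chrom, n in sorted(counts.items()):
        print(chrom, n)
```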

Something go wrong? Take a look at the log file of the last command that ran. You may need to increase memory or compile for a different MAXK= value. Once you've fixed the issue, just rerun the make -f job... command. Add --dry-run to the make command to see which commands are going to be run without running them.

De novo genotyping: once de Bruijn graphs have been constructed, they can be used to genotype existing call sets (VCF+ref) without using mapped reads. See the wiki.

Commands

usage: mccortex31 <command> [options] <args>
version: ctx=XXXX zlib=1.2.5 htslib=1.2.1 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31

Commands:   breakpoints  use a trusted assembled genome to call large events
            bubbles      find bubbles in graph which are potential variants
            build        construct cortex graph from FASTA/FASTQ/BAM
            calls2vcf    convert bubble/breakpoint calls to VCF
            check        load and check graph (.ctx) and path (.ctp) files
            clean        clean errors from a graph
            contigs      assemble contigs for a sample
            correct      error correct reads
            coverage     print contig coverage
            dist         make colour kmer distance matrix
            index        index a sorted cortex graph file
            inferedges   infer graph edges between kmers before calling `thread`
            join         combine graphs, filter graph intersections
            links        clean and plot link files (.ctp)
            pjoin        merge link files (.ctp)
            popbubbles   pop bubbles in the population graph
            pview        text view of a cortex link file (.ctp)
            reads        filter reads against a graph
            rmsubstr     reduce set of strings to remove substrings
            server       interactively query the graph
            sort         sort the kmers in a graph file
            subgraph     filter a subgraph using seed kmers
            thread       thread reads through cleaned graph to make links
            uniqkmers    generate random unique kmers
            unitigs      pull out unitigs in FASTA, DOT or GFA format
            vcfcov       coverage of a VCF against cortex graphs
            vcfgeno      genotype a VCF after running vcfcov
            view         text view of a cortex graph file (.ctx)


  Type a command with no arguments to see help.

Common Options:
  -h, --help            Help message
  -q, --quiet           Silence status output normally printed to STDERR
  -f, --force           Overwrite output files if they already exist
  -m, --memory <M>      Memory e.g. 1GB [default: 1GB]
  -n, --nkmers <H>      Hash entries [default: 4M, ~4 million]
  -t, --threads <T>     Limit on processing threads [default: 2]
  -o, --out <file>      Output file
  -p, --paths <in.ctp>  Assembly file to load (can specify multiple times)

Getting Help

Type a command with no arguments to see usage.

Code And Contributing

Issues can be submitted on GitHub. Pull requests are welcome. Please add your name to the AUTHORS file. Code should compile on Mac/Linux with clang/gcc without errors or warnings.

More on the wiki

Unit tests are run with make test and integration tests with cd tests; ./run. Both of these test suites are run automatically with Travis CI when commits are pushed to GitHub.

Static analysis can be run with cppcheck:

cppcheck src

or with clang:

rm -rf bin/mccortex31
scan-build make RECOMPILE=1

Occasionally we also run Coverity Scan. This is done by pushing to the coverity_scan branch on GitHub, which triggers Travis CI to upload the latest code to Coverity.

git checkout coverity_scan
git merge develop
git checkout --ours .travis.yml

License: MIT

Bundled libraries may have different licenses:

Used in testing:

Citing

If you find McCortex useful, please cite our paper:

Other Cortex papers:
