tanghaibao / mcscan

Licence: other
Command-line program to wrap dagchainer and combine pairwise results into multi-alignments in column format

Welcome to MCscan's documentation!

The MCscan download page can be accessed here.

Usage

This software provides a clustering module for viewing the relationship of colinear segments in multiple genomes (or heavily redundant genomes). It takes the pairwise segments predicted by dynamic programming (DAGchainer in particular) and then tries to build consensus segments from a set of related, overlapping segments.

Part of this package (dagchainer.cc) is based on the TIGR software DAGchainer. MCscan uses it as an initial step to generate pairwise segments.

In keeping with the DAGchainer guidelines, all code may be copied, distributed, modified, and used without any restrictions.

Installation

Note

MCscan currently runs only on Linux or Cygwin, as it depends on GNU functions.

Simply put mcscan_version.tar.gz in any directory, then unpack and build it:

$ tar zxf mcscan_version.tar.gz
$ cd mcscan_version/ && make

The compiled binaries are placed in the same directory as the source.

Then put a copy of the MCL executable in the same folder as MCscan (MCL program downloadable here).

Inputs and outputs

MCscan reads at least two sources of data: a .blast file and a .bed file. This may seem daunting at first, but both are very easy to obtain. Have a look at at_at.blast and at_at.bed in the folder. During the actual execution, MCL is used to generate an .mcl file (at_at.mcl), which is used in the multiple synteny construction.

Here is how to generate the files.

The .blast file has the following tab-delimited format:

gene1    gene2    e-value

This is easily generated from BLAST output in m8 format:

$ cut -f 1,2,11 xyz.m8 > xyz.blast.unfiltered

First, please ensure that only one e-value is reported for each gene pair. BLAST output normally contains multiple HSPs per pair, so a convenience script is included to filter out the redundant pairs:

$ python filter_blast.py xyz.blast.unfiltered xyz.blast
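The filtering step can be sketched in a few lines of Python. This is only an illustration of the idea, not the bundled filter_blast.py: the function name filter_best_hits is hypothetical, and it assumes that the smallest e-value should be kept for each ordered (gene1, gene2) pair, which may differ from what the bundled script does.

```python
import csv


def filter_best_hits(lines):
    """Keep one row per (gene1, gene2) pair: the row with the smallest e-value.

    `lines` is any iterable of tab-delimited "gene1<TAB>gene2<TAB>e-value" strings.
    Returns (gene1, gene2, e-value string) tuples in first-seen order.
    """
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        key = (row[0], row[1])
        evalue = float(row[2])
        # Assumption: pairs are treated as ordered, so (a, b) and (b, a) are distinct.
        if key not in best or evalue < best[key][0]:
            best[key] = (evalue, row[2])  # keep the original text of the e-value
    return [(g1, g2, text) for (g1, g2), (_, text) in best.items()]


if __name__ == "__main__":
    # Toy input: two HSPs for the pair (a, b), one for (a, c).
    demo = ["a\tb\t3e-10\n", "a\tb\t1e-50\n", "a\tc\t2e-5\n"]
    for gene1, gene2, evalue in filter_best_hits(demo):
        print(gene1, gene2, evalue, sep="\t")
```

Passing an open file handle for xyz.blast.unfiltered in place of the toy list would process a real file the same way.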

The .bed file uses the following tab-delimited format (see bed format):

chromosome_id    start    stop    gene_name

Note that when you compare multiple genomes, you should choose your molecule names carefully to avoid duplicated names. The .bed file can usually be generated by parsing the gene annotation file provided by the sequencing group (the sequencing project's FTP site will usually provide a .gff3 file).
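As a rough sketch of that parsing step, the following Python function extracts gene features from GFF3 lines and emits the four .bed columns. It is not part of MCscan. The function name gff3_to_bed and its prefix parameter are hypothetical, it assumes that genes are annotated as "gene" features carrying an ID attribute, and it copies the coordinates unchanged (strict BED is 0-based, and the documentation does not say which convention MCscan expects).

```python
def gff3_to_bed(lines, feature_type="gene", prefix=""):
    """Turn GFF3 lines into (chromosome_id, start, stop, gene_name) tuples.

    `prefix` is prepended to every sequence ID so that molecule names stay
    unique when several genomes are combined (for example a two-letter
    species code).
    """
    rows = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue  # not a well-formed record of the requested type
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        name = attrs.get("ID")
        if name is None:
            continue  # assumption: the gene name is stored in the ID attribute
        # Coordinates are copied as-is; adjust here if 0-based starts are needed.
        rows.append((prefix + cols[0], int(cols[3]), int(cols[4]), name))
    return rows


if __name__ == "__main__":
    # Made-up annotation lines, for illustration only.
    demo = [
        "##gff-version 3\n",
        "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1;Name=demo\n",
        "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=g1.1;Parent=g1\n",
    ]
    for row in gff3_to_bed(demo, prefix="Xx"):
        print(*row, sep="\t")
```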

Once you have everything ready, put the files in the same folder. If this is the first run, the .mcl file needs to be generated (see also the example in run.sh):

$ more xyz.blast | mcl - --abc --abc-neg-log -abc-tf 'mul(0.4343), ceil(200)' -o xyz.mcl
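As we understand the MCL options used here, --abc-neg-log replaces each e-value with its negative natural logarithm, mul(0.4343) rescales that to a base-10 logarithm, and ceil(200) caps the result at 200. The sketch below reproduces that arithmetic in Python for a single e-value. The function name edge_weight is hypothetical, and the handling of an e-value of exactly 0 is an assumption, not a statement about MCL internals.

```python
import math


def edge_weight(evalue, cap=200.0):
    """Approximate the edge weight MCL derives from one BLAST e-value."""
    if evalue <= 0.0:
        # Assumption: -log(0) is unbounded, so the cap applies directly.
        return cap
    neg_ln = -math.log(evalue)    # --abc-neg-log: negative natural log
    neg_log10 = neg_ln * 0.4343   # mul(0.4343): convert ln to log10
    return min(neg_log10, cap)    # ceil(200): nothing above the cap


if __name__ == "__main__":
    for e in (1e-5, 1e-50, 1e-300, 0.0):
        print(e, round(edge_weight(e), 2))
```

Under this reading, an e-value of 1e-50 becomes a weight of about 50, and very small e-values are all treated alike once they reach the cap.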

Some users might encounter a problem executing the mcl command, in which case the mcl binary needs to be rebuilt from here. Once the .mcl file has been generated by the first run, you can simply use:

$ ./mcscan xyz

Parameters (for advanced users)

The help screen:

Usage: mcscan [OPTION...] prefix_fn
MCSCAN -- multiple collinearity scan (compiled Jul  6 2010 15:52:03)

Reference:
 Tang,H., Wang,X., Bowers,J.E., Ming,R., Alam,M., Paterson,A.H.
 Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps
 Genome Research (2008) 18, 1944-1954.

  -a                         only builds the pairwise blocks (.aligns file)
  -A                         use base pair dist instead of gene ranks
  -b                         limit within genome synteny (e.g. Vv-Vv) mapping
  -e, --e_value=E_VALUE      alignment significance
  -g, --gap_score=GAP_SCORE  gap penalty
  -k, --match_score=MATCH_SCORE   final score=MATCH_SCORE+NUM_GAPS*GAP_SCORE
  -p, --pivot=PIVOT          PIVOT is the reference genome, make it two letter
                             prefix inyour .bed file, everything else will be
                             aligned to the reference
  -s, --match_size=MATCH_SIZE   number of genes required to call synteny
  -u, --unit_dist=UNIT_DIST  average intergenic distance
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

Report bugs to <[email protected]>.

The default values are quite generic and should work in many cases. The following gives more detail for users who wish to tune their results.

The pairwise synteny formula is roughly as follows (Haas et al. 2004); note that DIST_X and DIST_Y are in base pairs:

FINAL_SCORE = MATCH_SCORE*num_matches+max(DIST_X, DIST_Y)/UNIT_DIST*GAP_SCORE

The multiple synteny formula is roughly as follows; here DIST_X and DIST_Y are distances in the partial-order graph, measured in gene index units rather than base pairs:

FINAL_SCORE = MATCH_SCORE*num_matches+max(DIST_X, DIST_Y)*GAP_SCORE
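To make the two formulas concrete, here is a small Python sketch that evaluates them exactly as written above. The function names are hypothetical, and the parameter values in the example are made up for illustration; they are not MCscan's defaults. The example assumes a negative GAP_SCORE, so that larger distances lower the score.

```python
def pairwise_score(num_matches, dist_x, dist_y, match_score, gap_score, unit_dist):
    """Pairwise formula: distances are in base pairs, scaled by UNIT_DIST."""
    return match_score * num_matches + max(dist_x, dist_y) / unit_dist * gap_score


def multiple_score(num_matches, dist_x, dist_y, match_score, gap_score):
    """Multiple-synteny formula: distances are already in gene index units."""
    return match_score * num_matches + max(dist_x, dist_y) * gap_score


if __name__ == "__main__":
    # Illustrative values only (not the program defaults).
    match_score, gap_score, unit_dist = 50, -3, 10000

    # 5 matches; the larger of the two spans is 40,000 bp, i.e. 4 unit distances.
    print(pairwise_score(5, 40000, 25000, match_score, gap_score, unit_dist))

    # 5 matches; the larger of the two spans is 4 gene positions.
    print(multiple_score(5, 4, 2, match_score, gap_score))
```

With these values both calls give 238, since 40,000 bp divided by a UNIT_DIST of 10,000 corresponds to the same four penalty units as four gene positions.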

Sometimes you may want to run only the pairwise synteny step on the .blast and .bed files. In that case, try:

$ ./mcscan at_at -a

Note that the .mcl file is not required for this run. The result may differ slightly from that of a full run, since MCscan otherwise uses the .mcl file to filter the BLAST hits.

Walkthrough example

By default there are two sets of files, at_vv and os_sb, which are essentially two different projects.

For the first example, let us compare Os to Sb (rice to sorghum) with the default settings. Run:

$ ./mcscan os_sb

It takes about one minute to run, and the result is best viewed in Excel. The first part of the file lists all the parameters of the program. The results are divided into sections by lines like this:

## View 11: pivot Sb02

This is called a view; each view uses a different chromosome as the reference. The lines following this header are the multiply aligned blocks. The first column is a numerical identifier, and the second column is the actual pivot. The following columns are the regions that are aligned to the pivot. The alignments between rice and sorghum are in fact complicated by one or more shared whole-genome duplications (WGDs), which create several columns; in most cases four regions match each other.
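For users who prefer a script to a spreadsheet, the layout described above can be read with a short Python sketch. It assumes that the columns are tab-delimited and that every view starts with a header of exactly the form shown above; neither detail is confirmed here, so check them against a real .blocks file. The function name parse_blocks and the toy input are hypothetical.

```python
import re

# Header line of a view, e.g. "## View 11: pivot Sb02".
VIEW_RE = re.compile(r"^## View (\d+): pivot (\S+)")


def parse_blocks(lines):
    """Group block rows under their view header.

    Returns a list of dicts with keys "view" (int), "pivot" (str) and
    "rows" (list of column lists). Lines before the first view header,
    such as the parameter listing, are ignored.
    """
    views = []
    current = None
    for line in lines:
        line = line.rstrip("\n")
        match = VIEW_RE.match(line)
        if match:
            current = {"view": int(match.group(1)), "pivot": match.group(2), "rows": []}
            views.append(current)
        elif current is not None and line.strip() and not line.startswith("#"):
            # Assumption: identifier, pivot, then aligned regions, tab-separated.
            current["rows"].append(line.split("\t"))
    return views


if __name__ == "__main__":
    # Made-up content that only mimics the described layout.
    demo = [
        "# parameter listing would appear here\n",
        "## View 11: pivot Sb02\n",
        "0\tpivot_gene_1\tregion_a\tregion_b\n",
        "1\tpivot_gene_2\tregion_c\tregion_d\n",
    ]
    for view in parse_blocks(demo):
        print(view["view"], view["pivot"], len(view["rows"]))
```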

For the second example, we wish to align Arabidopsis to grape and use grape as the reference genome, but we need to do it a little differently. Unlike in the first example, we are not interested in the WGD in grape here; we only wish to see grape used as the pivot. Therefore, we set the pivot:

$ ./mcscan at_vv -p Vv -b

The -b option limits any Vv-Vv matches in the output (these in fact reflect an older duplication called gamma).

There are two outputs: the .aligns file and the .blocks file, corresponding to pairwise and multiple synteny respectively. The .aligns file is often useful too. It is essentially similar to the output of DAGchainer, with a few added statistics and different default parameters.

Changelog

  • May 12, 2007 (version <0.5) initial release.
  • Aug 05, 2007 (version 0.5) add the option of a reference genome
  • Oct 13, 2007 (version 0.6) add convenience python script to streamline the process
  • Mar 07, 2008 (version 0.7) implement statistical test for pairwise syntenic blocks
  • Nov 13, 2008 (version 0.8) partial-order graph for alignment

Contact

Questions, problems, and bug reports are welcome and should be sent to:

Haibao Tang : bao at uga dot edu

Plant Genome Mapping Laboratory, University of Georgia
