GET_PHYLOMARKERS

GET_PHYLOMARKERS (Vinuesa et al. 2018) is a software package designed to identify optimal genomic markers for phylogenomics, population genetics and genomic taxonomy. It implements a pipeline that filters the orthologous gene clusters computed by the companion package GET_HOMOLOGUES, selecting those with optimal attributes for phylogenetic inference. A species tree is estimated with ASTRAL-III from the maximum-likelihood (ML) gene trees inferred from the top-scoring alignments. The selected alignments are also concatenated into a supermatrix, from which a second species tree is estimated under the ML criterion with state-of-the-art fast ML tree-searching algorithms. GET_PHYLOMARKERS can also estimate ML and parsimony trees from the pan-genome matrix, and includes unsupervised learning methods to determine the optimal number of clusters from pan-genome and average genomic distance matrices. A detailed manual and step-by-step tutorials document the software and help users get up and running quickly. HTML and Markdown versions of the documentation are available.

Installation, dependencies and Docker image

For detailed instructions and dependencies please check INSTALL.md.

A GET_PHYLOMARKERS Docker image is available, as well as an image bundling GET_PHYLOMARKERS + GET_HOMOLOGUES, ready to use. Detailed instructions for setting up the Docker environment are provided in INSTALL.md. How to run container instances with the test sequences distributed with GET_PHYLOMARKERS is described in the tutorial.

Aim

GET_PHYLOMARKERS (Vinuesa et al. 2018) implements a series of sequential filters (detailed below) to select, from the homologous gene clusters produced by GET_HOMOLOGUES, the markers with optimal attributes for phylogenomic inference. It estimates gene trees and species trees under the maximum-likelihood (ML) optimality criterion using state-of-the-art fast ML tree-searching algorithms. The species tree is estimated from the supermatrix of concatenated, top-scoring alignments that passed the quality filters outlined in the figures below and explained in detail in the manual and publication.
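To make the supermatrix idea concrete, here is a minimal Python sketch of concatenating per-gene alignments into one matrix. It is an illustration of the general technique only, not the pipeline's own code: the function name `concatenate_alignments`, the dictionary-based input format and the gap-padding of missing taxa are all assumptions made for this example.

```python
def concatenate_alignments(alignments):
    """Concatenate per-gene alignments into a supermatrix.

    alignments: list of dicts mapping taxon name -> aligned sequence.
    Returns (supermatrix, partitions), where partitions holds the
    1-based (start, end) columns of each gene in the supermatrix.
    Hypothetical helper for illustration; not part of GET_PHYLOMARKERS.
    """
    # Collect every taxon seen in any alignment, in a stable order.
    taxa = sorted(set().union(*(aln.keys() for aln in alignments)))
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in an alignment must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            # Assumption: a taxon missing from a gene is padded with gaps.
            pieces[taxon].append(aln.get(taxon, "-" * length))
        partitions.append((start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


if __name__ == "__main__":
    gene_a = {"strain1": "ATG-", "strain2": "ATGC", "strain3": "ATGA"}
    gene_b = {"strain1": "GG", "strain2": "GA", "strain3": "GT"}
    matrix, parts = concatenate_alignments([gene_a, gene_b])
    for taxon, seq in matrix.items():
        print(taxon, seq)
    print(parts)
```

The partition coordinates are kept because ML programs can typically use them to fit a separate model per gene. With single-copy core-genome clusters every taxon is present in every alignment, so the gap-padding branch would not be exercised.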

Figure 1A. Simplified flow-chart of the GET_PHYLOMARKERS pipeline showing only those parts used and described in this work. The left branch, starting at the top of the diagram, is fully under control of the master script run_get_phylomarkers_pipeline.sh. The names of the worker scripts called by the master program are indicated at the relevant points along the flow, as detailed in the manual. The image corresponds to Fig. 1 of Vinuesa et al. 2018.

Figure 1B. Combined filtering actions performed by GET_HOMOLOGUES and GET_PHYLOMARKERS to select top-ranking phylogenetic markers to be concatenated for phylogenomic analyses, and benchmark results of the performance of the FastTree (FT) and IQ-TREE (IQT) maximum-likelihood (ML) phylogeny inference programs. The image corresponds to Fig. 3 of Vinuesa et al. 2018.

GET_HOMOLOGUES is a genome-analysis software package for microbial pan-genomics and comparative genomics originally described in the following publications:

More recently we developed GET_HOMOLOGUES-EST, which can be used to cluster eukaryotic genes and transcripts, as described in Contreras-Moreira et al., Front. Plant Sci. 2017.

If GET_HOMOLOGUES-EST is fed both .fna and .faa files of CDS sequences, it produces output identical to that of GET_HOMOLOGUES, which can therefore be analyzed with GET_PHYLOMARKERS in the same way.

GET_PHYLOMARKERS is primarily tailored towards selecting CDSs (gene markers) to infer DNA-level phylogenies of different species of the same genus or family. It can also select optimal markers for population genetics, when the source genomes belong to the same species (Vinuesa et al. 2018). For more divergent genome sequences, classified in different genera, families, orders or higher taxa, the pipeline should be run using protein instead of DNA sequences.
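The guidance above can be summarised as a small decision helper. The Python sketch below is illustrative only: the function names and the rank-based rule are hypothetical, and the `-R` (run mode) and `-t` (sequence type) options are quoted from memory of the manual, so treat them as assumptions and check them against the script's own help output before use. The text mentions "family" on both sides of the boundary; this sketch arbitrarily treats a shared family as still suitable for DNA.

```python
# Ranks at which DNA-level analysis is suggested in the text above.
# Assumption: "family" is treated as the upper limit for DNA.
DNA_RANKS = {"species", "genus", "family"}


def suggest_sequence_type(lowest_shared_rank):
    """Return "DNA" or "PROT" given the lowest rank shared by all genomes."""
    return "DNA" if lowest_shared_rank.lower() in DNA_RANKS else "PROT"


def build_command(lowest_shared_rank):
    """Build a hypothetical command line as a list of arguments.

    Assumed flags (verify with the manual): -R 1 for phylogenetics,
    -R 2 for population genetics, -t DNA|PROT for the sequence type.
    """
    rank = lowest_shared_rank.lower()
    run_mode = "2" if rank == "species" else "1"
    return [
        "run_get_phylomarkers_pipeline.sh",
        "-R", run_mode,
        "-t", suggest_sequence_type(rank),
    ]


if __name__ == "__main__":
    for rank in ("species", "genus", "order"):
        print(rank, "->", " ".join(build_command(rank)))
```

The command is only assembled, never executed, so the sketch can be run without the pipeline installed.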

Figure 2A. Best maximum-likelihood core-genome phylogeny for the genus Stenotrophomonas found in the IQ-TREE search, based on the supermatrix obtained by concatenation of 55 top-ranking alignments. The image corresponds to Fig. 5 of Vinuesa et al. 2018.

Figure 2B. Maximum-likelihood pan-genome phylogeny estimated with IQ-TREE from the consensus pan-genome clusters displayed in the Venn diagram. Clades of lineages belonging to the S. maltophilia complex are collapsed and labeled as in Figure 2A. Numbers on the internal nodes represent the approximate Bayesian posterior probability/UFBoot2 bipartition support values (see methods). The tabular inset shows the results of fitting either the binary (GTR2) or morphological (MK) models implemented in IQ-TREE, indicating that the former has an overwhelmingly better fit. The scale bar represents the number of expected substitutions per site under the binary GTR2+FO+R4 substitution model. The image corresponds to Fig. 6 of Vinuesa et al. 2018.

Manual and tutorials

Please, follow the links for a detailed manual and tutorials, including a graphical flowchart of the pipeline and explanations of the implementation details.

Citation

Pablo Vinuesa, Luz-Edith Ochoa-Sanchez and Bruno Contreras-Moreira (2018). GET_PHYLOMARKERS, a software package to select optimal orthologous clusters for phylogenomics and inferring pan-genome phylogenies, used for a critical geno-taxonomic revision of the genus Stenotrophomonas. Front. Microbiol. | doi: 10.3389/fmicb.2018.00771

Published in the Research Topic on "Microbial Taxonomy, Phylogeny and Biodiversity" http://journal.frontiersin.org/researchtopic/5493/microbial-taxonomy-phylogeny-and-biodiversity

A preprint version is available on bioRxiv

Code

Source code is freely available from GitHub and released under the GNU GPLv3 license.
Docker images ready to pull
- GET_PHYLOMARKERS Docker image
- GET_HOMOLOGUES+GET_PHYLOMARKERS Docker image

Developers

The code is developed and maintained by Pablo Vinuesa at CCG-UNAM, Mexico and Bruno Contreras-Moreira at EEAD-CSIC, Spain. It is released to the public under the GNU GPLv3 license.

Acknowledgements

Personal

We thank Alfredo J. Hernández and Víctor del Moral at CCG-UNAM for technical support with server administration.

Funding

We gratefully acknowledge the funding provided by DGAPA-PAPIIT/UNAM (grants IN201806-2, IN211814 and IN206318) and CONACyT-Mexico (grants P1-60071, 179133 and A1-S-11242) to Pablo Vinuesa, as well as the Fundación ARAID, Consejo Superior de Investigaciones Científicas (grant 200720I038) and Spanish MINECO (AGL2013-48756-R) to Bruno Contreras-Moreira.
