TOGA

TOGA is a new method that integrates gene annotation, inferring orthologs and classifying genes as intact or lost.

TOGA implements a novel machine learning based paradigm to infer orthologous genes between related species and to accurately distinguish orthologs from paralogs or processed pseudogenes.

This tutorial explains how to get started using TOGA. It shows how to install and execute TOGA, and how to handle possible issues that may occur.

Installation

TOGA supports both Linux and MacOS systems. The package was properly tested on Python version 3.6.5. and 3.7.3.

It is highly recommended to have access to computational cluster, but for small or partial genomes with short genes a desktop PC will be enough.

TOGA requires Nextflow, which in turn requires java >=8. Check your version of java and install nextflow using one of the following commands:

curl -fsSL https://get.nextflow.io | bash
# OR
conda install -c bioconda nextflow

If you've downloaded nextflow using curl, move the nextflow executable to a directory accessible by your $PATH variable.

To get TOGA do the following:

# clone the repository
git clone https://github.com/hillerlab/TOGA.git
cd TOGA
# install necessary python packages:
python3 -m pip install -r requirements.txt --user
# call configure to:
# 1) train xgboost models
# 2) download CESAR2.0
# 3) compile C code
./configure.sh
# run a test, it will take a couple of minutes
./run_test.sh micro

If you see something like this at the very end, then TOGA is almost ready to go:

Orthology class sizes:
one2one: 3
Done! Estimated time: 0:01:02.800084
Program finished with exit code 0
JH567521 299723 336583 ENST00000618101.1169 879 + 299723 336583 0,0,200 7 28,923,130,173,200,179,248, 0,1256,6085,6677,19146,21311,36612,
JH567521 463144 506100 ENST00000262455.1169 711 - 463144 506100 0,200,255 8 102,103,142,112,117,58,116,185, 0,1982,30295,31351,36911,38566,41322,42771,
JH567521 395878 449234 ENST00000259400.1169 942 + 395878 449234 0,0,200 7 123,66,226,116,51,87,240, 0,11871,38544,45802,45994,52305,53116,
Success!

If you experience any problems installing TOGA, please visit the troubleshooting section.

Configuring TOGA for cluster

TOGA uses nextflow to run cluster-dependent steps. To run a pipeline on cluster nextflow requires a configuration file defining "executors" component. This repository contains configuration files for slurm cluster, please find them in the nextflow_config_files directory.

To create configuration files for non-slurm cluster do the following:

Find here what parameters are available for your cluster. Most likely, you can use slurm configuration files as a reference.
Create a separate directory for configuration files, or re-use nextflow_config_files dir.
Create "extract_chain_features_config.nf" file. This file contains configuration for chain features extraction step. These jobs are expected to be short and not memory consuming, so 1 hour of runtime limit and 10Gb of memory would be enough.
Create "call_cesar_config_template.nf" file. This configuration file is for CESAR jobs. These jobs usually take much longer that chain feature extraction, it's recommended to request 24 hors for them. You don't have to provide an exact amount of memory for these jobs, TOGA will compute this itself. Please write a placeholder instead, as follows: process.memory = "${_MEMORY_}G".

Final test

This repository also contains sample data to perform a wide-scale test. To do so, please download genome sequences for human (GRCh38/hg38) and mouse (GRCm38) in the 2bit format. You can download these 2bit files using the following links:

Human 2bit

Mouse 2bit

Then call the following:

./toga.py test_input/hg38.mm10.chr11.chain test_input/hg38.genCode27.chr11.bed ${path_to_human_2bit} ${path_to_mouse_2bit} --kt --pn test -i supply/hg38.wgEncodeGencodeCompV34.isoforms.txt --nc ${path_to_nextflow_config_dir} --cb 3,5 --cjn 500 --u12 supply/hg38.U12sites.tsv --ms

This will take about 20 minutes on 500 cores cluster.

Troubleshooting

TOGA's configure script automatically tries to install all dependencies. If you encounter error messages related to these two dependencies, please see below for help.

XGBoost
Nextflow

Do note that previous TOGA versions used BerkeleyDB. Now it has been replaced by HDF5.

XGBoost

Sometimes xgboost installation with pip doesn't work and shows a message like:

Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/genome/scratch/tmp/pip-install-g6qbjl5j/xgboost/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /genome/scratch/tmp/pip-record-4dhjvr_9/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /genome/scratch/tmp/pip-install-g6qbjl5j/xgboost/

One of solutions is to compile XGBoost from sources, as explained here:

https://xgboost.readthedocs.io/en/latest/build.html

Please note that xgboost requires CMake >=3.13 for build.

Nextflow

There are several nextflow-related issues that you might encounter. If you have any issues with installation most likely this is something related to java version. The simplest solution would be to install nextflow using conda:

conda install -c bioconda nextflow

This will automatically add nextflow executable to $PATH. Alternatively, you can install java using the following command:

sudo apt install openjdk-8-jre-headless

And then install nextflow using curl | bash command.

Nextflow also might show the following error message:

Can't open cache DB: /lustre/projects/project-xxx/.nextflow/cache/a80d212d-5a68-42b0-a8a5-d92665bdc492/db

Nextflow needs to be executed in a shared file system that supports file locks.
Alternatively you can run it in a local directory and specify the shared work
directory by using by `-w` command line option.

In this case nextflow is not able to write temporary files and logs in the current directory. Probably you can find a solution together with your system administrator, however, there could be substantive reasons for disabling file locks. As a workaround, you can do the following:

Find a directory outside the cluster file system that you can access, that could be /home/$username, /tmp/$username or something like this. Then create some directory inside, let's say "nextflow_temp". Nextflow writes quite a lot of hidden files so it could be reasonable.
When you call toga.py, add --nd flag to the command, such as --nd /home/user/nextflow_temp

In this case TOGA will call nextflow from the specified directory.

If something doesn't work and you like to configure managing cluster jobs yourself then please have a look at the following functions in the toga.py script:

__chain_genes_run: this function pushes cluster jobs to extract chain features.
__run_cesar_jobs: this one is responsible for calling CESAR jobs.

To execute a batch of jobs in parallel TOGA creates a temporary text file containing commands that might be executed independently, it looks like this:

./script.py input/part_1.txt output/part_1.txt
./script.py input/part_2.txt output/part_2.txt
./script.py input/part_3.txt output/part_3.txt
./script.py input/part_4.txt output/part_4.txt
...

Then TOGA pushes these jobs to cluster queueing system and waits until they are done. Please note that each line of this file contains a complete and independent command. It means that these commands could be sequentially executed even like this:

bash jobslist.txt

Usage

This section explains TOGA usage, especially toga.py arguments and input files format.

Input files

TOGA is a reference-based genome annotation tool, which means that it needs the following data as input:

Gene annotation of the reference genome
Genome alignment between the reference and query genome(s)
Reference and query genome sequences

Gene annotation of reference genome

Bed-12 file

TOGA accepts a bed12-formatted file as a reference genome annotation. This file is mandatory for running TOGA.

Please find bed12 format specification under: https://genome.ucsc.edu/FAQ/FAQformat.html#format1

Example for human gene MAP1S which has two transcripts

user@user$ grep ENST00000544059 supply/hg38.wgEncodeGencodeCompV34.bed
chr19 17720155 17734490 ENST00000544059 0 + 17720390 17734428 0 7 275,102,83,141,2344,236,218, 0,780,3970,4893,5673,13037,14117,
user@user$ grep ENST00000324096 supply/hg38.wgEncodeGencodeCompV34.bed
chr19 17719479 17734513 ENST00000324096 0 + 17719502 17734428 0 7 141,102,83,141,2344,236,241, 0,1456,4646,5569,6349,13713,14793,

This repository contains examples of bed12 file for human and mouse:

Human genome annotation: supply/hg38.wgEncodeGencodeCompV34.bed
Mouse genome annotation: supply/mm10.wgEncodeGencodeCompVM25.bed

Some advice about your reference annotation:

Please make sure that the length of the CDS of your annotations is divisible by 3. TOGA will skip transcripts that do not satisfy this criteria.
This is highly recommended that CDS of your transcripts start with ATG and end with a canonical stop codon.
Your transcripts are coding, meaning that thickStart and thickEnd are not equal. TOGA would skip non-coding transcripts.
Avoid any pseudogenes in the reference annotations.
Also, try to avoid merged and incomplete transcripts.
Make sure that transcript identifiers are unique, e.g. avoid cases where two or more transcripts have the same identifier.

Optional but highly recommended: Isoform data

One gene can have multiple isoforms. TOGA can handle more than one isoform per gene, meaning it is not necessary to reduce the transcript data to the isoform with the longest CDS. Isoform data is optional, but if available increases annotation completeness and gene loss determination accuracy. If you do not provide isoforms data, TOGA will treat each transcript in the bed12 file as a separate gene.

Isoforms can be provided to TOGA in a single two-column tab-separated file in the following format: GeneIdentifier {tab} TranscriptIdentifier The first line can be a header.

Example: The human gene MAP1S (ENSG00000130479) has 2 isoforms: ENST00000544059 and ENST00000324096.

user@user$ grep ENSG00000130479 supply/hg38.wgEncodeGencodeCompV34.isoforms.txt
ENSG00000130479 ENST00000544059
ENSG00000130479 ENST00000324096

Importantly, all transcripts listed in the bed12 file have to occur in this isoform file, otherwise TOGA throw an error. Examples for human and mouse Gencode annotations:

For human: hg38.wgEncodeGencodeCompV34.isoforms.txt
For mouse: mm10.wgEncodeGencodeCompVM25.isoforms.txt

The simplest way to obtain isoforms file is:

Visit https://www.ensembl.org/biomart/martview
Choose Ensembl Genes N dataset and then [species of interest] genes
Go to Filters tab, select "gene type" - protein coding
Go to Attributes tab, select:
- Gene stable ID
- Transcript stable ID
- Uncheck all other marks!
Download the results as a tsv file

U12 introns data

You also can provide data of U12 exons in the reference genome, it would facilitate gene loss detection process. However, this is not mandatory.

There are examples for U12 data files:

For human: supply/hg38.U12sites.tsv
And for mouse: supply/mm10.U12sites.tsv

There are tab-separated files containing 3 columns:

First column contains transcript identifier
Second one: exon number
Third column contains A/D letter, which means "acceptor" or "donor".

If gene loss pipeline detects mutations of the splice sites listed in this file, it would not consider them as inactivating.

Genome alignment

This section explains what "please provide a genome alignment" actually means.

Chain file

Chain file is a mandatory (more precisely, essential) for running TOGA.

According to its name, chain file is a text file that describes chains. Chains represent co-linear local alignments that occur in the same order on a reference and a query chromosome. Thus, a collection of chains could represent a whole-genome pairwise alignment.

Here is a more detailed explanation of chains:

http://genomewiki.ucsc.edu/index.php/Chains_Nets

You can find chain file format specification here:

https://genome.ucsc.edu/goldenPath/help/chain.html

You can also provide gzipped chain file, then please make sure that the filename ends with ".chain.gz".

Please make sure that each chain you provide has a unique identifier!

Reference and query genome sequences

TOGA accepts reference and query genomes sequences in the 2bit format.

Here is the specification for 2bit file format:

http://genome.ucsc.edu/FAQ/FAQformat.html#format7.

Please make sure that chromosome/scaffold names in the bed12, chain and 2bit files are consistent. For example, if the first chromosome 1 is called "chr1" in the 2bit file, it should also be "chr1" in the bed12 file, not "chromosome_1" or just "1".

Arguments

To call TOGA use toga.py script. It accepts the following arguments:

Positional, mandatory arguments

Chain file containing the genome alignment.
Bed file containing reference genome annotation.
Path to reference genome 2bit file.
Path to query genome 2bit file.

Optional arguments

-h, --help

Show help message and exit. Calling ./toga.py without arguments does the same.

--project_folder PROJECT_FOLDER

Project directory. TOGA will save all intermediate and output files exactly in this directory. If not specified, use CURRENT_DIR/PROJECT_NAME as default (see below).

--project_name PROJECT_NAME

If you don't like to provide a full path to the project directory with --project_folder you can use this parameter. In this case TOGA will create project directory in the current directory as "CURRENT_DIR/PROJECT_NAME". If not provided, TOGA will try to extract the project name from chain filename, which is not recommended.

--min_score MIN_SCORE

Chain score threshold. Exclude chains that have a lower score from the analysis. Default value is 15000.

--no_chain_filter, --ncf

A flag. Do not filter the chain file (make sure you specified a .chain but not .gz file in this case)

--isoforms, -i ISOFORMS_FILE

Path to isoforms data file, details are above.

--keep_temp, --kt

A flag. With this flag TOGA will not remove temporary and intermediate files. Highly recommended for the first times: it will be easier to trace issues.

--limit_to_ref_chrom

Limit analysis to a single reference chromosome. If you provide whole genome alignment for human and mouse but would like to perform the analysis on the human chr11 only, then add "--limit_to_chrom chr11" parameter.

--chain_jobs_num, --chn NUMBER_OF_JOBS

Number of cluster jobs for extracting chain features. Recommended from 20 to 50 jobs.

--cesar_jobs_num, --cjn NUMBER_OF_JOBS

Number of CESAR cluster jobs. Recommended from 300 to 1500, depending on your patience and a number of available cluster cores.

--mask_stops, --ms

A flag. CESAR cannot process coding sequences containing in-frame stop codons. However, sometimes we need to project these genes and they are actually intact, like selenocysteine-coding ones. With this parameter TOGA will mask stop codons and CESAR could process them. Without this parameter TOGA will crash if meet any in-frame stop codon in the reference. Use this flag only if you are sure that your reference sequences might contain in-frame stop codons.

--cesar_mem_limit CESAR_MEM_LIMIT

CESAR2.0 might be very memory-consuming. For most of the genes it requires less than 5Gb of RAM, but some exceptionally long genes might require more than 100Gb.

TOGA will skip CESAR jobs that would require more that CESAR_MEM_LIMIT GB of memory. Of course, you will find such genes in the log.

--cesar_buckets, --cb CESAR_BUCKETS

Where CESAR_BUCKETS is a comma-separated list of integers, such as "5,15,50". This is the evolution of previous parameter. You can split CESAR jobs into different buckets according to their memory requirements. Imagine you need to call CESAR for 1000 genes. 900 of the cluster jobs will require less than 10Gb of RAM and the rest require from 10 to 100Gb. For sure you can split these genes into 100 cluster jobs and require 100Gb of RAM for each of them. But then 90% of the time your jobs will consume just a tiny bit of the requested memory, which is likely a unsustainable use of cluster resources. But if you call TOGA with the following parameter:

--cesar_buckets 10,100

then TOGA will create and push separately two joblists: one for cluster jobs that require less then 10Gb, and the second one for jobs that require from 10 to 100Gb.

It will automatically set --cesar_mem_limit, in case you provided "5,15,50" as --cesar_buckets, the --cesar_mem_limit would be 50Gb.

--cesar_chain_limit CESAR_CHAIN_LIMIT

Skip genes that have more that CESAR_CHAIN_LIMIT orthologous chains. Recommended values are a 50-100.

--u12 U12

Path to U12 introns data.

--stop_at_chain_class, --sac

A flag. If set, TOGA halts after chain classification step. So it doesn't annotate query genome, just produces a list of orthologous chains for each reference gene.

--o2o_only, --o2o

Process only the genes that have a single orthologous chain. Please note that many-2-one orthologs could not be filtered out at this stage!

--no_fpi

A flag. Consider long frame-preserving indels as inactivating mutations.

--nextflow_dir NEXTFLOW_DIR, --nd NEXTFLOW_DIR

Nextflow working directory: from this directory nextflow is executed, also there all nextflow log files are kept

nextflow_config_dir NEXTFLOW_CONFIG_DIR, --nc NEXTFLOW_CONFIG_DIR

Directory containing nextflow configuration files for cluster, pls see nextflow_config_files/readme.txt for details.

Output reading

This section describes TOGA output.

Transcript naming convention

To predict a transcript in the query genome TOGA projects a reference transcript via an orthologous chain. Since each reference transcript might have more than one orthologous chains, TOGA uses the following convention for naming predicted transcripts: "reference_transcript_name"."orthologous_chain_id".

For example, if TOGA identified two orthologous chains (1 and 2) for a transcript A, you will find two transcripts in the query genome: A.1 and A.2. If you see two identical annotations in the query genome named as A.1 and B.2, it means that predicted transcript is an ortholog to reference transcripts A and B.

In the code predicted transcripts are also called "projections".

query_annotation.bed

Bed12 formatted file containing annotation tracks for the query genome. Could be loaded to UCSC genome browser.

Predictions have different colors according to gene loss pipeline classes (please see "loss_summ_data.tsv" section for details). The color code is:

Black - N, no data
Brown - PG, paralogous projection
Grey - M and PM, missing and partially missing
Red - L, clearly lost.
Salmon - "grey", neither intact nor lost.
Light blue - PI, partially intact.
Blue - I, intact.

query_isoforms.tsv

Isoforms file for the query.

Fasta files

TOGA produces 3 fasta files: prot.fasta, codon.fasta nucleotide.fasta. It saves both the reference and predicted query sequences.

prot.fasta contains protein sequences of reference genes and predicted transcripts.
codon.fasta contains codon alignments, corrected for frameshiring insertions and deletions
nucleotide.fasta contains exon nucleotide alignments

orthology_classification.tsv

File containing orthology data. It contains 5 columns:

t_gene - gene name in the reference
t_transcript - transcript identifier in the reference
q_gene - gene identifier in the query
q_transcript - transcript identifier in the query
orthology_class - class of orthology relationships, such as: one2one, one2many, many2one, many2many and one2zero.

loss_summ_data.tsv

This file contains gene loss pipeline classification for each projection, transcript and gene. TOGA identifies the following classes:

N - no data due to technical reasons (such as CESAR memory requirements)
PG - no orthologous chains identified, TOGA projected transcripts via paralogous chains and cannot make any conclusion.
PM - partial & missing. Most of the projection lies outside scaffold borders.
L - clearly lost.
M - missing, assembly gaps mask >50% of the prediction CDS.
G - "grey", there are inactivating mutations but not enough evidence for "clearly lost" class. In other words: neither lost nor intact.
PI - partially intact: some fraction of CDS is missing, but most likely this is intact.
I - clearly intact.

proc_pseudogenes.bed

Bed-formatted file containing annotation of processed pseudogenes in the query.

genes_rejection_reason.tsv

If TOGA skips any gene, transcript or projection, it writes about that in this file. Also this file shows a reason, why this happened.

Inactivating mutations visualization

There is a possibility to visualise inactivating mutations detected in the projected transcript. There are 3 levels available:

Projection level: visualize one projection only, which is named as "transcript_id"."chain_id".
Visualize all projections of a particular transcript. If a transcript has only one orthologous chain, that it's the same with level 1. But if TOGA identified several orthologous chains you can visualize them all at once.
Visualize the entire gene. If a gene has several transcripts, then it makes sense.

To make visualisations you need:

Merge all inactivating mutations data into a single file.

cat ${PROJECT_DIR}/inact_mut_data/* > ${PROJECT_DIR}/inact_mut_data.txt

Since the merged file could be huge you can index it. It would also be helpful if you plan to plot numerous genes. You can do it using "mut_index.py" script in the "supply" directory:

./supply/mut_index.py ${PROJECT_DIR}/inact_mut_data.txt ${PROJECT_DIR}/inact_mut_data.hdf5

Then use "./supply/plot_mutations.py" script to create a visualization. This script requires the following:

Reference bed file.
List of mutations: either text file of an indexed one.
Transcript identifier to plot (or a gene identifier, see below)
Path to output file: script creates a svg figure

For example:

/supply/plot_mutations.py ${REFERENCE_BED_FILE} ${PROJECT_DIR}/inact_mut_data.hdf5 ENST0000011111 test.svg

This will create a plot of all inactivating mutations detected for all projections of the ENST0000011111 transcript.

If you like to visualize all projections of a gene then:

Provide gene name instead of transcript ID
Also provide isoforms file using --isoforms_file parameter.