IsoQuant 2.3 manual

About IsoQuant
1.1. Supported data types
1.2. Supported reference data
Installation
2.1. Installing from conda
2.2. Manual installation and requirements
2.3. Verifying your installation
Running IsoQuant
3.1. IsoQuant input
3.2. Command line options
3.3. IsoQuant output
Citation
Feedback and bug reports

Quick start:

IsoQuant can be downloaded from https://github.com/ablab/IsoQuant or installed via conda:
```
conda create -c conda-forge -c bioconda -n isoquant python=3 isoquant
```
If installing manually, you will need Python 3 (3.7 or higher), gffutils, pysam, pybedtools, biopython and some other common Python libraries. See requirements.txt for details. You will also need minimap2 and samtools in your $PATH variable.

To run IsoQuant on raw FASTQ/FASTA files, use the following command:

  isoquant.py --reference /PATH/TO/reference_genome.fasta \
  --genedb /PATH/TO/gene_annotation.gtf \
  --fastq /PATH/TO/sample1.fastq.gz /PATH/TO/sample2.fastq.gz \
  --data_type (assembly|pacbio_ccs|nanopore) -o OUTPUT_FOLDER

To run IsoQuant on aligned reads (make sure your BAM is sorted and indexed), use the following command:

  isoquant.py --reference /PATH/TO/reference_genome.fasta \
  --genedb /PATH/TO/gene_annotation.gtf \
  --bam /PATH/TO/sample1.sorted.bam /PATH/TO/sample2.sorted.bam \
  --data_type (assembly|pacbio_ccs|nanopore) -o OUTPUT_FOLDER

If using official annotations containing gene and transcript features, use --complete_genedb to save time.
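
For scripted pipelines, the quick-start commands above can be assembled programmatically. The following is a minimal sketch, not part of IsoQuant: the helper name `build_isoquant_command` is hypothetical, and it uses only options documented in this manual.

```python
from typing import List, Optional, Sequence


def build_isoquant_command(
    reference: str,
    genedb: str,
    data_type: str,
    output: str,
    fastq: Optional[Sequence[str]] = None,
    bam: Optional[Sequence[str]] = None,
    complete_genedb: bool = False,
) -> List[str]:
    """Assemble an isoquant.py argument list (hypothetical helper)."""
    if data_type not in ("assembly", "pacbio_ccs", "nanopore"):
        raise ValueError(f"unsupported data type: {data_type}")
    # Exactly one kind of read input is expected: raw reads or aligned reads.
    if bool(fastq) == bool(bam):
        raise ValueError("provide either fastq or bam files, but not both")

    cmd = ["isoquant.py", "--reference", reference, "--genedb", genedb]
    if complete_genedb:
        cmd.append("--complete_genedb")
    cmd += ["--fastq", *fastq] if fastq else ["--bam", *bam]
    cmd += ["--data_type", data_type, "-o", output]
    return cmd


cmd = build_isoquant_command(
    reference="/PATH/TO/reference_genome.fasta",
    genedb="/PATH/TO/gene_annotation.gtf",
    data_type="nanopore",
    output="OUTPUT_FOLDER",
    fastq=["/PATH/TO/sample1.fastq.gz", "/PATH/TO/sample2.fastq.gz"],
    complete_genedb=True,
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the placeholder paths are replaced with real files.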

About IsoQuant

IsoQuant is a tool for reference-based analysis of long RNA reads, such as PacBio or Oxford Nanopore reads. IsoQuant maps reads to the reference genome and assigns them to annotated isoforms based on their intron and exon structure. IsoQuant can also detect various modifications, such as intron retention, alternative splice sites and skipped exons. It further performs gene, isoform, exon and intron quantification. If reads are grouped (e.g. according to cell type), counts are reported according to the provided grouping. In addition, IsoQuant generates discovered transcript models, including novel ones.

IsoQuant version 2.3.0 was released under GPLv2 on May 27th, 2022 and can be downloaded from https://github.com/ablab/IsoQuant.

IsoQuant pipeline

Supported data types

IsoQuant supports all kinds of long RNA data:

PacBio CCS
ONT dRNA / ONT cDNA
Assembled / corrected transcript sequences

Reads must be provided in FASTQ or FASTA format (can be gzipped). If you have already aligned your reads to the reference genome, simply provide sorted and indexed BAM files.

Supported reference data

Reference genome should be provided in multi-FASTA format (can be gzipped). Reference genome is mandatory even when BAM files are provided.

Gene annotation can be provided in GFF/GTF format (can be gzipped). In this case it will be converted to a gffutils database. Information on converted databases will be stored in your ~/.config/IsoQuant/db_config.json to speed up future runs. You can also provide a gffutils database manually. Make sure that chromosome/scaffold names are identical in the FASTA file and the gene annotation.
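
Mismatched chromosome/scaffold names (for example `chr1` in the FASTA file and `1` in the annotation) are easy to miss. The following minimal sketch, which is not part of IsoQuant, compares the sequence names of the two files before a run. All function names are hypothetical, and the sketch assumes that gzipped files carry a `.gz` suffix and that the sequence name is the first word of each FASTA header.

```python
import gzip
from typing import Set


def _open_text(path: str):
    """Open a plain or gzipped (.gz suffix) text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def fasta_sequence_names(path: str) -> Set[str]:
    """Collect sequence names: the first word of each '>' header line."""
    names = set()
    with _open_text(path) as handle:
        for line in handle:
            if line.startswith(">") and line[1:].strip():
                names.add(line[1:].split()[0])
    return names


def annotation_sequence_names(path: str) -> Set[str]:
    """Collect the values of the first (sequence name) column of a GTF/GFF file."""
    names = set()
    with _open_text(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            names.add(line.split("\t", 1)[0])
    return names


def names_missing_from_reference(fasta_path: str, annotation_path: str) -> Set[str]:
    """Return annotation sequence names that do not occur in the FASTA file."""
    return annotation_sequence_names(annotation_path) - fasta_sequence_names(fasta_path)


# Usage (paths are placeholders):
# missing = names_missing_from_reference("reference.fasta", "annotation.gtf")
# if missing:
#     print("Not found in the reference:", sorted(missing))
```

An empty result means every sequence name used in the annotation is present in the reference.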

A pre-constructed aligner index can also be provided to reduce mapping time.

Installation

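IsoQuant requires a 64-bit Linux system or Mac OS with Python (3.7 or higher) pre-installed. You will also need gffutils, pysam, pybedtools, biopython and some other common Python libraries (see requirements.txt), as well as minimap2 and samtools in your $PATH variable.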

Installing from conda

IsoQuant can be installed with conda:

conda install -c bioconda isoquant

If this command does not work, it means that bioconda is not updated yet. Try installing via:

conda create -n isoquant python=3.7
conda activate isoquant
conda install -c bioconda -c conda-forge -c isoquant isoquant

Manual installation and requirements

To obtain IsoQuant, you can download the repository and install the requirements.
Clone the IsoQuant repository and switch to the latest release:

git clone https://github.com/ablab/IsoQuant.git
cd IsoQuant
git checkout latest

Install requirements:

pip install -r requirements.txt

You also need samtools and minimap2 to be in the $PATH variable.
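
Whether both tools are visible in $PATH can be checked before the first run. This is a small sketch using only the Python standard library; the function name `missing_tools` is hypothetical.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("samtools", "minimap2")) -> List[str]:
    """Return the tools that cannot be found in $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found in $PATH:", ", ".join(missing))
else:
    print("samtools and minimap2 are available")
```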

Verifying your installation

To verify IsoQuant installation type

isoquant.py --test

to run on toy dataset.
If the installation is successful, you will find the following information at the end of the log:

=== IsoQuant pipeline finished === 
=== TEST PASSED CORRECTLY ===

Running IsoQuant

IsoQuant input

To run IsoQuant, you should provide:

gene annotation in gffutils database or GTF/GFF format (can be gzipped);
reads in FASTA/FASTQ (can be gzipped) or sorted and indexed BAM;
reference sequence in FASTA format (can be gzipped).

By default, each file with reads is treated as a separate sample. To group multiple files into a single sample, provide a text file with paths to your FASTQ/FASTA/BAM files. Provide each file on a separate line and leave blank lines between samples. See more in examples.

IsoQuant command line options

Basic options

--output (or -o) Output folder, will be created automatically.

Note: if your output folder is located on a shared disk, use --genedb_output for storing annotation database.

--help (or -h) Prints help message.

--full_help Prints all available options (including hidden ones).

--test Runs IsoQuant on the toy data set.

Input options

--data_type or -d Type of data to process, supported types are: assembly, pacbio_ccs, nanopore. This option affects some of the algorithm parameters.

--genedb or -g Gene database in gffutils database format or GTF/GFF format (can be gzipped). If you use official gene annotations, we recommend setting the --complete_genedb option.

--complete_genedb Set this flag if the gene annotation contains transcript and gene meta-features. Use this flag when providing official annotations, e.g. GENCODE. This option sets the disable_infer_transcripts and disable_infer_genes gffutils options, which dramatically speeds up gene database conversion (see the gffutils documentation for details).

--reference or -r Reference genome in FASTA format (can be gzipped), required even when BAM files are provided.

--index Reference genome index for the specified aligner (minimap2 by default), can be provided only when raw reads are used as an input (constructed automatically if not set).

Using mapped reads as input:

To provide aligned reads use one of the following options:

--bam Sorted and indexed BAM file(s); each file will be treated as a separate sample.

--bam_list Text file with list of BAM files, one file per line, leave empty line between samples. You may also give an alias for each file specifying it after a colon (e.g. /PATH/TO/file.bam:replicate1). Use this option to obtain per-replicate expression table (see --read_group option).

Using raw reads as input:

To provide read sequences use one of the following options:

--fastq Input FASTQ/FASTA file(s), can be gzipped; each file will be treated as a separate sample.

--fastq_list Text file with list of FASTQ/FASTA files (can be gzipped), one file per line, leave empty line between samples. You may also give an alias for each file specifying it after a colon (e.g. /PATH/TO/file.fastq:replicate1). Use this option to obtain per-replicate expression table (see --read_group option).

Other input options:

--stranded Read strandedness type, supported values are: forward, reverse, none.

--fl_data Input sequences represent full-length transcripts; both ends of the sequence are considered to be reliable.

--labels or -l Sets space-separated sample names. Make sure that the number of labels is equal to the number of samples. Input file names are used as labels if not set.

--read_group Sets a way to group feature counts (e.g. by cell type or replicate). Available options are:

file_name: groups reads by their original file names (or file name aliases) within a sample. This option makes sense when a sample contains multiple files (see --bam_list and --fastq_list options to learn more). This option is designed for obtaining expression tables with a separate column for each file (replicate).
tag: groups reads by BAM file read tag: set tag:TAG, where TAG is the desired tag name (e.g. tag:RG will use RG values as groups);
read_id: groups reads by read name suffix: set read_id:DELIM where DELIM is the symbol/string by which the read id will be split (e.g. if DELIM is _, for read m54158_180727_042959_59310706_ccs_NEU the group will be set to NEU);
file: uses additional file with group information for every read: file:FILE:READ_COL:GROUP_COL:DELIM, where FILE is the file name, READ_COL is column with read ids (0 if not set), GROUP_COL is column with group ids (1 if not set), DELIM is separator symbol (tab if not set). File can be gzipped.
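
To illustrate the read_id and file grouping modes, here is a minimal sketch of the behaviour described above. It is not IsoQuant's own code: both function names are hypothetical, and taking the part after the last delimiter is an assumption based on the NEU example.

```python
from typing import Dict


def group_from_read_id(read_id: str, delim: str) -> str:
    """Return the read name suffix that follows the last delimiter."""
    return read_id.rsplit(delim, 1)[-1]


def format_group_table(read_to_group: Dict[str, str]) -> str:
    """Build a two-column, tab-separated table: read id, then group id.

    This layout corresponds to the defaults of the file mode
    (READ_COL 0, GROUP_COL 1, tab as DELIM).
    """
    return "".join(f"{read}\t{group}\n" for read, group in read_to_group.items())


read_id = "m54158_180727_042959_59310706_ccs_NEU"
print(group_from_read_id(read_id, "_"))  # NEU
print(format_group_table({read_id: "NEU"}), end="")
```

A table written this way could then be passed as `--read_group file:FILE`, where FILE is the path it was saved to.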

Output options

--sqanti_output Produce SQANTI-like TSV output (requires more time).

--check_canonical Report whether read or constructed transcript model contains non-canonical splice junction (requires more time).

--count_exons Perform exon and intron counting in addition to gene and transcript counting.

Pipeline options

--threads or -t Number of threads to use, 16 by default.

--clean_start Do not use previously generated gene database, genome indices or BAM files, run pipeline from the very beginning (will take more time).

--no_model_construction Do not report transcript models, run read assignment and quantification of reference features only.

--run_aligner_only Align reads to the reference without running IsoQuant itself.

Algorithm parameters

Quantification

--transcript_quantification Transcript quantification strategy, should be one of:

unique_only - only reads that are uniquely assigned and consistent with a transcript are used for quantification (default);
with_ambiguous - ambiguously assigned reads are split between transcripts with equal weights (e.g. 1/2 when a read is assigned to 2 transcripts simultaneously);
with_inconsistent - uniquely assigned reads with non-intronic inconsistencies (i.e. alternative poly-A site, TSS etc) are also included;
all - all of above.

--gene_quantification Gene quantification strategy, should be one of:

unique_only - only reads that are uniquely assigned to a gene and consistent with any of gene's isoforms are used for quantification;
with_ambiguous - ambiguously assigned reads are split between genes with equal weights (e.g. 1/2 when a read is assigned to 2 genes simultaneously);
with_inconsistent - reads that are uniquely assigned to a gene but not necessarily consistent with the gene's isoforms are also used (default);
all - all of above.
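
The difference between unique_only and with_ambiguous counting can be shown on a toy example. This is a simplified sketch and not IsoQuant's implementation: it ignores the consistency checks described above and only demonstrates the equal-weight splitting of ambiguously assigned reads. Function names and feature ids are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def count_unique_only(assignments: Iterable[List[str]]) -> Dict[str, float]:
    """Count only reads assigned to exactly one feature."""
    counts: Dict[str, float] = defaultdict(float)
    for features in assignments:
        if len(features) == 1:
            counts[features[0]] += 1.0
    return dict(counts)


def count_with_ambiguous(assignments: Iterable[List[str]]) -> Dict[str, float]:
    """Split every read equally between all features it is assigned to."""
    counts: Dict[str, float] = defaultdict(float)
    for features in assignments:
        if not features:
            continue
        weight = 1.0 / len(features)  # e.g. 1/2 for a read assigned to 2 features
        for feature in features:
            counts[feature] += weight
    return dict(counts)


# Each inner list holds the features (transcripts or genes) one read is assigned to.
reads = [["T1"], ["T1", "T2"], ["T2"], ["T1", "T2"]]
print(count_unique_only(reads))     # {'T1': 1.0, 'T2': 1.0}
print(count_with_ambiguous(reads))  # {'T1': 2.0, 'T2': 2.0}
```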

Read to isoform matching:

--matching_strategy A preset of parameters for read-to-isoform matching algorithm, should be one of:

exact - delta = 0, all minor errors are treated as inconsistencies;
precise - delta = 4, only minor alignment errors are allowed, default for PacBio data;
default - delta = 6, alignment errors typical for Nanopore reads are allowed, short novel introns are treated as deletions;
loose - delta = 12, even more serious inconsistencies are ignored, ambiguity is resolved based on nucleotide similarity.

Matching strategy is chosen automatically based on specified data type. However, the parameters will be overridden if the matching strategy is set manually.
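
The role of delta can be pictured as a tolerance on splice site coordinates. The sketch below assumes that two introns are treated as equal when both of their boundaries differ by at most delta; the actual matching algorithm is more involved, and the names used here are hypothetical.

```python
from typing import Tuple

# Delta values of the presets listed above.
MATCHING_DELTA = {"exact": 0, "precise": 4, "default": 6, "loose": 12}

Intron = Tuple[int, int]  # (start, end) coordinates


def introns_match(read_intron: Intron, ref_intron: Intron, delta: int) -> bool:
    """Assumed rule: both intron boundaries must agree within delta bases."""
    return (
        abs(read_intron[0] - ref_intron[0]) <= delta
        and abs(read_intron[1] - ref_intron[1]) <= delta
    )


annotated = (1000, 2000)
shifted = (1005, 2000)  # read intron starts 5 bases downstream
for preset, delta in MATCHING_DELTA.items():
    print(preset, introns_match(shifted, annotated, delta))
```

Under this assumption, a 5-base shift is rejected by the exact and precise presets and accepted by default and loose.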

Read alignment correction:

--splice_correction_strategy A preset of parameters for read alignment correction algorithms, should be one of:

none - no correction is applied;
default_pacbio - optimal settings for PacBio CCS reads;
default_ont - optimal settings for ONT reads;
conservative_ont - conservative settings for ONT reads, only incorrect splice junction and skipped exons are fixed;
assembly - optimal settings for a transcriptome assembly;
all - correct all discovered minor inconsistencies, may result in overcorrection.

This option is chosen automatically based on specified data type, but will be overridden if set manually.

Transcript model construction:

--model_construction_strategy A preset of parameters for transcript model construction algorithm, should be one of:

reliable - only the most abundant and reliable transcripts are reported, precise, but not sensitive;
default_pacbio - optimal settings for PacBio CCS reads;
sensitive_pacbio - sensitive settings for PacBio CCS reads, more transcripts are reported possibly at a cost of precision;
fl_pacbio - optimal settings for full-length PacBio CCS reads, will be used if --data_type pacbio_ccs and --fl_data options are set;
default_ont - optimal settings for ONT reads;
sensitive_ont - sensitive settings for ONT reads, more transcripts are reported possibly at a cost of precision;
assembly - optimal settings for a transcriptome assembly: input sequences are considered to be reliable and each transcript to be represented only once, so abundance is not considered;
all - reports almost all novel transcripts, loses precision in favor of recall.

This option is chosen automatically based on specified data type, but will be overridden if set manually.

Hidden options

Options below are shown only with the --full_help option. We recommend not modifying these options unless you are clearly aware of their effect.

--no_secondary Ignore secondary alignments.

--aligner Force to use this alignment method, can be starlong or minimap2; minimap2 is currently used as default. Make sure the specified aligner is in the $PATH variable.

--no_junc_bed Do not use annotation for read mapping.

--junc_bed_file Annotation in BED12 format produced by paftools.js gff2bed (can be found in minimap2), will be created automatically if not given.

--delta Delta for inexact splice junction comparison, chosen automatically based on data type.

--genedb_output If your output folder is located on shared storage (e.g. an NFS share), use this option to set another path for storing the annotation database, because an SQLite database cannot be created on a shared disk. The folder will be created automatically.

Examples

Mapped PacBio CCS reads in BAM format; pre-converted gene annotation:

isoquant.py -d pacbio_ccs --bam mapped_reads.bam --genedb annotation.db --output output_dir

Nanopore dRNA stranded reads; official annotation in GTF format; sample label used instead of file name:

isoquant.py -d nanopore --stranded forward --fastq ONT.raw.fastq.gz --reference reference.fasta --genedb annotation.gtf --complete_genedb --output output_dir --labels My_ONT

PacBio FL reads; custom annotation in GTF format, which contains only exon features:

isoquant.py -d pacbio_ccs --fl_data --fastq CCS.fastq --reference reference.fasta --genedb genes.gtf --output output_dir

ONT cDNA reads; 2 samples with 3 replicates (biological or technical); official annotation in GTF format:

isoquant.py -d nanopore --fastq_list list.txt -l SAMPLE1 SAMPLE2 --reference reference.fasta  --complete_genedb --genedb genes.gtf --output output_dir

list.txt file:

/PATH/TO/SAMPLE1/file1.fastq:S1_REPLICATE1
/PATH/TO/SAMPLE1/file2.fastq:S1_REPLICATE2
/PATH/TO/SAMPLE1/file3.fastq:S1_REPLICATE3

/PATH/TO/SAMPLE2/file1.fastq:S2_REPLICATE1
/PATH/TO/SAMPLE2/file2.fastq:S2_REPLICATE2
/PATH/TO/SAMPLE2/file3.fastq:S2_REPLICATE3

Note that file aliases given after a colon will be used in the expression table header.

ONT cDNA reads; 2 samples with 2 replicates, each replicate has 2 files; official annotation in GTF format:

isoquant.py -d nanopore --fastq_list list.txt -l SAMPLE1 SAMPLE2 --reference reference.fasta  --complete_genedb --genedb genes.gtf --output output_dir

list.txt file:

/PATH/TO/SAMPLE1/r1_1.fastq:S1_REPLICATE1
/PATH/TO/SAMPLE1/r1_2.fastq:S1_REPLICATE1
/PATH/TO/SAMPLE1/r2_1.fastq:S1_REPLICATE2
/PATH/TO/SAMPLE1/r2_2.fastq:S1_REPLICATE2

/PATH/TO/SAMPLE2/r1_1.fastq:S2_REPLICATE1
/PATH/TO/SAMPLE2/r1_2.fastq:S2_REPLICATE1
/PATH/TO/SAMPLE2/r2_1.fastq:S2_REPLICATE2
/PATH/TO/SAMPLE2/r2_2.fastq:S2_REPLICATE2

Note that file aliases given after a colon will be used in the expression table header.
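
List files like the ones above can also be generated from a script. The following sketch is only an illustration (the function name `format_file_list` is hypothetical): it writes one `path:alias` entry per line and separates samples with an empty line.

```python
from typing import List, Tuple

Sample = List[Tuple[str, str]]  # (file path, alias) pairs; alias may be empty


def format_file_list(samples: List[Sample]) -> str:
    """Format samples for --fastq_list or --bam_list."""
    blocks = []
    for sample in samples:
        lines = [f"{path}:{alias}" if alias else path for path, alias in sample]
        blocks.append("\n".join(lines))
    # An empty line separates consecutive samples.
    return "\n\n".join(blocks) + "\n"


samples = [
    [("/PATH/TO/SAMPLE1/r1_1.fastq", "S1_REPLICATE1"),
     ("/PATH/TO/SAMPLE1/r1_2.fastq", "S1_REPLICATE1")],
    [("/PATH/TO/SAMPLE2/r1_1.fastq", "S2_REPLICATE1"),
     ("/PATH/TO/SAMPLE2/r1_2.fastq", "S2_REPLICATE1")],
]
print(format_file_list(samples), end="")
```

Files that share an alias, as in this example, correspond to the case where one replicate is split across several files.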

IsoQuant output

Output files

IsoQuant output files will be stored in <output_dir>, which is set by the user. If the output directory was not specified the files are stored in isoquant_output.
Output directory will contain one folder per sample with the following files:

SAMPLE_ID.read_assignments.tsv - TSV file with read-to-isoform assignments;
SAMPLE_ID.corrected_reads.bed - BED file with corrected read alignments;
SAMPLE_ID.transcript_tpm.tsv - TSV file with isoform expression in TPM;
SAMPLE_ID.transcript_counts.tsv - TSV file with raw isoform counts;
SAMPLE_ID.gene_tpm.tsv - TSV file with gene expression in TPM;
SAMPLE_ID.gene_counts.tsv - TSV file with raw gene counts;
SAMPLE_ID.transcript_models.gtf - GTF file with constructed transcript models (both known and novel transcripts);
SAMPLE_ID.transcript_model_reads.tsv - TSV file indicating which reads contributed to transcript models;
SAMPLE_ID.transcript_model_tpm.tsv - expression of constructed transcript models in TPM;
SAMPLE_ID.transcript_model_counts.tsv - raw counts for constructed transcript models;
SAMPLE_ID.extended_annotation.gtf - GTF file with the entire reference annotation and all discovered novel transcripts.

If --sqanti_output is set, IsoQuant will save read assignments in SQANTI-like format:

SAMPLE_ID.SQANTI-like.tsv

If --count_exons is set, exon and intron counts will be produced:

SAMPLE_ID.exon_counts.tsv
SAMPLE_ID.intron_counts.tsv

If --read_group is set, the per-group counts will also be computed:

SAMPLE_ID.gene_grouped_tpm.tsv
SAMPLE_ID.transcript_grouped_tpm.tsv
SAMPLE_ID.gene_grouped_counts.tsv
SAMPLE_ID.transcript_grouped_counts.tsv
SAMPLE_ID.exon_grouped_counts.tsv
SAMPLE_ID.intron_grouped_counts.tsv

If multiple samples are provided, aggregated expression matrices will be placed in <output_dir>:

combined_gene_counts.tsv
combined_gene_tpm.tsv
combined_transcript_counts.tsv
combined_transcript_tpm.tsv

Additionally, a log file will be saved to the directory.

<output_dir>/isoquant.log

If raw reads were provided, BAM file(s) will be stored in <output_dir>/<SAMPLE_ID>/aux/.
If the --keep_tmp option is specified, this directory will also contain temporary files.

Output file formats

Although most output files include headers that describe the data, a brief explanation of the output files is provided below.

Read to isoform assignment

Tab-separated values, the columns are:

read_id - read id;
chr - chromosome id;
strand - strand of the assigned isoform (not to be confused with read mapping strand);
isoform_id - isoform id to which the read was assigned;
gene_id - gene id to which the read was assigned;
assignment_type - assignment type, can be:
- unique - read was unambiguously assigned to a single known isoform;
- unique_minor_difference - read was assigned uniquely but has alignment artifacts;
- inconsistent - read was matched with inconsistencies, closest match(es) are reported;
- ambiguous - read was assigned to multiple isoforms equally well;
- noninformative - read is intronic/intergenic.
assignment_events - list of detected inconsistencies; for each assigned isoform a list of detected inconsistencies relative to the respective isoform is stored; values in each list are separated by the + symbol, lists are separated by commas, and the number of lists equals the number of assigned isoforms; possible events are (see graphical representation below):
- consistent events:
  - none / . / undefined - no special event detected;
  - mono_exon_match - mono-exonic read matched to mono-exonic transcript;
  - fsm - full splice match;
  - ism_5/3 - incomplete splice match, truncated on 5'/3' side;
  - ism_internal - incomplete splice match, truncated on both sides;
  - mono_exonic - mono-exonic read matching spliced isoform;
- alignment artifacts:
  - intron_shift - intron that seems to be shifted due to misalignment (typical for Nanopores);
  - exon_misalignment - short exon that seems to be missed due to misalignment (typical for Nanopores);
  - fake_terminal_exon_5/3 - short terminal exon at 5'/3' end that looks like an alignment artifact (typical for Nanopores);
  - terminal_exon_misalignment_5/3 - missed reference short terminal exon;
  - exon_elongation_5/3 - minor exon extension at 5'/3' end (not exceeding 30bp);
  - fake_micro_intron_retention - short annotated introns are often missed by the aligners and thus are not considered as intron retention;
- intron retentions:
  - intron_retention - intron retention;
  - unspliced_intron_retention - intron retention by mono-exonic read;
  - incomplete_intron_retention_5/3 - terminal exon at 5'/3' end partially covers adjacent intron;
- significant inconsistencies (each type ends with _known if all resulting read introns are annotated and _novel otherwise):
  - major_exon_elongation_5/3 - significant exon extension at 5'/3' end (exceeding 30bp);
  - extra_intron_5/3 - additional intron on the 5'/3' end of the isoform;
  - extra_intron - read contains additional intron in the middle of exon;
  - alt_donor_site - read contains alternative donor site;
  - alt_acceptor_site - read contains alternative annotated acceptor site;
  - intron_migration - read contains alternative annotated intron of approximately the same length as in the isoform;
  - intron_alternation - read contains alternative intron, which doesn't fall into any of the categories above;
  - mutually_exclusive_exons - read contains different exon(s) of the same total length comparing to the isoform;
  - exon_skipping - read skips exon(s) comparing to the isoform;
  - exon_merge - read skips exon(s) comparing to the isoform, but a sequence of a similar length is attached to a neighboring exon;
  - exon_gain - read contains additional exon(s) comparing to the isoform;
  - exon_detach - read contains additional exon(s) comparing to the isoform, but a neighboring exon loses a sequence of a similar length;
  - terminal_exon_shift - read has alternative terminal exon;
  - alternative_structure - read has a different intron chain that does not fall into any of the categories above;
- alternative transcription start / end (reported when CAGE data / poly-A tails are present):
  - alternative_polya_site - read has alternative polyadenylation site;
  - internal_polya_site - poly-A tail detected but seems to originate from an A-rich intronic region;
  - correct_polya_site - poly-A site matches reference transcript end;
  - aligned_polya_tail - poly-A tail aligns to the reference;
  - alternative_tss - alternative transcription start site.
exons - list of coordinates for normalized read exons (1-based, indels and polyA exons are excluded);
additional - field for supplementary information, which may include:
- PolyA - True if poly-A tail is detected;
- CAGE - True if CAGE peak is found;
- Canonical - True if all read introns are canonical, Unspliced is used for mono-exon reads; (use --check_canonical)

Note that a single read may occur more than once if assigned ambiguously.
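
The column layout above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: columns appear in the order listed above, header lines start with `#`, and the example rows are made up for illustration (the `.` values stand in for the exons and additional fields). The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

COLUMNS = [
    "read_id", "chr", "strand", "isoform_id", "gene_id",
    "assignment_type", "assignment_events", "exons", "additional",
]


def parse_events(field: str) -> List[List[str]]:
    """Split assignment_events into one list of events per assigned isoform."""
    return [events.split("+") for events in field.split(",")]


def read_assignments(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row, keyed by the column names listed above."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # assumption: '#' marks header lines
            continue
        yield dict(zip(COLUMNS, line.split("\t")))


# Made-up rows; a real file would be passed as open(path) instead.
example = [
    "#" + "\t".join(COLUMNS),
    "read_1\tchr1\t+\tT1\tG1\tunique\tfsm\t.\t.",
    "read_2\tchr1\t+\tT1\tG1\tambiguous\tism_5\t.\t.",
    "read_2\tchr1\t+\tT2\tG1\tambiguous\tism_5\t.\t.",
]
records = list(read_assignments(example))
print(Counter(record["assignment_type"] for record in records))
# Ambiguous reads may occupy several rows, so count distinct read ids.
print(len({record["read_id"] for record in records}))
print(parse_events("intron_shift+exon_elongation_5,fsm"))
```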

Expression table format

Tab-separated values, the columns are:

feature_id - genomic feature ID;
TPM or count - expression value (float).

For grouped counts, each column contains expression values of a respective group. If the number of groups exceeds 10, the file will contain 3 columns:

feature_id - genomic feature ID;
group_id - name of the assigned group;
TPM or count - expression value (float).
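
The 3-column layout can be turned back into a per-feature table with a short script. This is a minimal sketch with a hypothetical function name; it assumes that header lines start with `#` and that the columns follow the order listed above. The example values are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable


def pivot_grouped_table(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Convert (feature_id, group_id, value) rows into {feature: {group: value}}."""
    table: Dict[str, Dict[str, float]] = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # assumption: '#' marks header lines
            continue
        feature_id, group_id, value = line.split("\t")
        table[feature_id][group_id] = float(value)
    return dict(table)


# Made-up rows; a real file would be passed as open(path) instead.
rows = [
    "#feature_id\tgroup_id\tcount",
    "G1\tcell_type_A\t10.0",
    "G1\tcell_type_B\t2.5",
    "G2\tcell_type_A\t7.0",
]
print(pivot_grouped_table(rows))
```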

Exon and intron count format

Tab-separated values, the columns are:

chr - chromosome ID;
start - leftmost 1-based position of the feature;
end - rightmost 1-based position of the feature;
strand - feature strand;
flags - symbolic feature flags, can contain the following characters:
- X - terminal feature;
- I - internal feature;
- T - feature appears as both terminal and internal in different isoforms;
- S - feature has similar positions to some other feature;
- C - feature is contained in another feature;
- U - unique feature, appears only in a single known isoform;
- M - feature appears in multiple different genes.
gene_ids - list of gene ids the feature belongs to;
group_id - read group if provided (NA by default);
include_counts - number of reads that include this feature;
exclude_counts - number of reads that span, but do not include this feature.
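
The flags and the two count columns lend themselves to simple post-processing. The sketch below decodes the flag characters listed above and computes an inclusion ratio, include / (include + exclude). The ratio is a derived quantity shown for illustration only; it is not something IsoQuant reports, and both function names are hypothetical.

```python
from typing import List, Optional

FLAG_MEANINGS = {
    "X": "terminal feature",
    "I": "internal feature",
    "T": "feature appears as both terminal and internal in different isoforms",
    "S": "feature has similar positions to some other feature",
    "C": "feature is contained in another feature",
    "U": "unique feature, appears only in a single known isoform",
    "M": "feature appears in multiple different genes",
}


def decode_flags(flags: str) -> List[str]:
    """Translate each known flag character into its description."""
    return [FLAG_MEANINGS[char] for char in flags if char in FLAG_MEANINGS]


def inclusion_ratio(include_counts: float, exclude_counts: float) -> Optional[float]:
    """Fraction of spanning reads that include the feature; None if no reads span it."""
    total = include_counts + exclude_counts
    return include_counts / total if total else None


print(decode_flags("XU"))
print(inclusion_ratio(30, 10))  # 0.75
```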

Transcript models format

Constructed transcript models are stored in the usual GTF format and contain exon, transcript and gene features. Transcript ids have the following format: transcript_###.TYPE, where ### is the unique number (not necessarily consecutive) and TYPE can be one of the following:

known - previously annotated transcripts;
nic - novel in catalog, new transcript that contains only annotated introns;
nnic - novel not in catalog, new transcript that contains unannotated introns.

The attribute field also contains gene_id (either matches reference gene id or can be novel_gene_###), reference_gene_id (same value) and reference_transcript_id (either original isoform id or novel). In addition, it contains canonical property if --check_canonical is set.
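
The transcript id format described above can be split into its number and type with a regular expression. This is a small sketch with a hypothetical function name; ids that do not follow the transcript_###.TYPE pattern yield None.

```python
import re
from typing import Optional, Tuple

# transcript_###.TYPE, where TYPE is known, nic or nnic
MODEL_ID = re.compile(r"^transcript_(\d+)\.(known|nic|nnic)$")


def parse_model_id(transcript_id: str) -> Optional[Tuple[int, str]]:
    """Return (number, type) for a constructed transcript model id, else None."""
    match = MODEL_ID.match(transcript_id)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(parse_model_id("transcript_42.nnic"))  # (42, 'nnic')
print(parse_model_id("some_other_id"))       # None
```

Grouping models by the returned type gives a quick count of known, nic and nnic transcripts in a transcript_models.gtf file.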

Event classification figures

Consistent match classifications

Misalignment classifications

Inconsistency classifications

PolyA classifications

Citation

The preprint is available at https://www.researchsquare.com/article/rs-1571850/v1

Feedback and bug reports

Your comments, bug reports, and suggestions are very welcome. They will help us to further improve IsoQuant. If you have any trouble running IsoQuant, please send us isoquant.log from the <output_dir> directory.

You can leave your comments and bug reports at our GitHub repository tracker or send them via email: [email protected].
