CATT

What is CATT

CATT is an ultra-sensitive and accurate tool for characterizing T cell receptor sequences in bulk and single-cell TCR-Seq and RNA-Seq data. The tool can be found at http://bioinfo.life.hust.edu.cn/CATT and https://github.com/GuoBioinfoLab/CATT.

Overview

CATT (CharActerizing TCR repertoires) is a tool for detecting CDR3 sequences in any raw sequencing data that contains TCR reads, including TCR-seq, RNA-seq, and scRNA-seq.

The tool has the following features:

  • One-button: CATT employs a fully data-driven algorithm that adapts itself to the input data without any additional parameters.
  • Precise and efficient: CATT extracts T cell CDR3 sequences from most types of raw sequencing data that contain TCR reads. Thanks to its specially designed assembly step, CATT can recover more CDR3 sequences than other tools, even from short reads.
  • Easy installation: Using Docker, CATT can be installed on any platform with one command.

Overview of the core algorithm of CATT. (A) Candidate CDR3 detection. All reads are aligned to V and J reference genes to select candidate (brown) reads for micro-assembly. Potential CDR3 sequences are reconstructed from k-mers overlapping by k-1 bases, using a greedy-feasible-flow algorithm based on k-mer frequency. (B) Error correction. The motif criteria from the IMGT project are used to identify putative CDR3 sequences among the directly found and the assembled CDR3 sequences. CATT eliminates erroneous CDR3 sequences with a data-driven transition-probability learning algorithm, which retrieves the probability of erroneous CDR3s from the observed CDR3 distribution and merges erroneous sequences (red) according to transition rates based on the frequency of, and the Hamming distance between, the root and leaf sequences in the same subgroup. (C) Annotation and confidence assessment. After error correction, CATT employs a Bayes classification algorithm to assess the reliability of CDR3 sequences (which differ from other protein-coding genes).
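
The merging idea in (B) can be pictured with a much simpler heuristic than the one CATT uses. The sketch below is not CATT's transition-probability algorithm: it only folds a rare sequence into a more abundant sequence of the same length at Hamming distance 1. The function names and the max_ratio threshold are hypothetical and chosen for illustration.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def merge_likely_errors(counts: dict, max_ratio: float = 0.1) -> dict:
    """Fold each rare sequence (leaf) into an abundant neighbour (root).

    A leaf is merged into a root when both have the same length, differ at
    exactly one position, and the leaf count is at most max_ratio times the
    root count. max_ratio is an illustrative threshold, not a CATT parameter.
    """
    merged = Counter(counts)
    # Visit sequences from most to least abundant so roots are settled first.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    for i, root in enumerate(ordered):
        if merged[root] == 0:
            continue  # already absorbed into another root
        for leaf in ordered[i + 1:]:
            if merged[leaf] == 0 or len(leaf) != len(root):
                continue
            if hamming(root, leaf) == 1 and merged[leaf] <= max_ratio * merged[root]:
                merged[root] += merged[leaf]
                merged[leaf] = 0
    return {seq: n for seq, n in merged.items() if n > 0}
```

With made-up counts, a sequence seen 3 times that differs by one residue from a sequence seen 100 times is absorbed into it, while an unrelated sequence is left untouched.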

Recent updates


Version 1.9 (2020-08)

  • Update BioSequences to 2.X

  • Significantly reduce the startup time

Version 1.8 (2020-04)

  • Bug fixes

  • Reduce memory consumption

  • Update reference genome to latest IMGT version

  • Add an option for users to specify the k-mer length used in assembly

Version 1.7 (2020-03)

  • Bug fixes

Version 1.6 (2020-01)

  • Bug fixes

  • Reduce the startup time

Version 1.4 (2019-10)

  • Add support for TCR CDR3 profiling for Pig
  • Improve multi-thread performance
  • Reduce Docker image size
  • Currently supported profiling:

    Chain-Region Homo sapiens Mus musculus Sus scrofa
    TRA - CDR1 Yes
    TRA - CDR2 Yes
    TRA - CDR3 Yes
    TRB - CDR1 Yes
    TRB - CDR2 Yes Yes
    TRB - CDR3 Yes Yes Yes
    IGH - CDR3 Yes

Version 1.3 (2019-9-4)

  • Refactor code to improve extensibility and performance
  • Add support for BCR (IGH CDR3 only)

Version 1.2

  • Change from Python to Julia, improve multi-thread performance.
  • Add support for 10X scTCR sequencing data.
  • Add support for alpha chain, CDR1, and CDR2

Installation

Docker Image (Recommended)

CATT can be installed using Docker, a program that performs operating-system-level virtualization. With Docker, users can easily install CATT and run it in a virtual environment.

  1. Download and install Docker, preferably from the Docker homepage (requires Ubuntu ≥ 14.04).

  2. Download the latest CATT Docker image:

docker pull guobioinfolab/catt:latest

This command pulls the CATT image from Docker Hub (the download takes about 5 minutes, depending on network speed). When the command finishes, CATT has been installed successfully.

Source code (Not well tested)

CATT is written in Julia and Python. Download the latest version and accompanying files with:

git clone https://github.com/GuoBioinfoLab/CATT.git

Pre-requisites

To run CATT stand-alone, the following packages and software are needed:

  • Python >= 3.7

    • argparse
  • Julia >= 1.3

    • DataFrames
    • CSV >= 0.5.14
    • GZip
    • BioAlignments
    • BioSequences >= 2.0
    • FASTX
    • XAM
    • DataStructures
  • BWA

  • Samtools

Known Issue

The current version of BioSequences.jl seems to have a bug that may report UndefVarError: x not defined at runtime.

The solution is to modify the file .julia/packages/BioSequences/k4j4J/src/composition.jl so that the loop reads as follows:

# around line 80
    for mer in iter
        counts[mer.fw] = get(counts, mer.fw, 0) + 1
    end

Configure

  1. Set the following parameters in reference.jl:
  • ref_prefix: The path of the resource folder, like /home/XXX/catt/resource
  • bwa_path: The path to the bwa executable, like /usr/bin/bwa. If bwa is in $PATH, this can simply be set to bwa
  • Samtools_path: The path to the samtools executable
  2. In the file catt.jl, set PATH2CATT:
#Set the path of reference.jl prob.csv Jtool.jl and config.jl to PATH2CATT
const PATH2CATT = "/Users/kroaity/Documents/catt/github"
  3. Set the following paths in the catt script:
  • In line 48, the path of config.jl; make sure it is consistent with the path in catt.jl
  • In line 56, the path to catt.jl
  4. Make catt executable and add it to your PATH:

    chmod u+x catt
    #add catt to ~/.bashrc
    export PATH="/path/to/catt:$PATH"
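
Before running CATT from source, it can help to confirm that the external tools listed in the pre-requisites are reachable. The following is a minimal sketch using only the Python standard library; the helper name find_tools is hypothetical and is not part of CATT.

```python
import shutil


def find_tools(names=("bwa", "samtools")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        status = path if path else "NOT FOUND (set its full path in reference.jl)"
        print(f"{tool}: {status}")
```

A tool reported as not found is not necessarily missing; it may simply be installed outside $PATH, in which case its full path has to be given in reference.jl as described above.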

Test sample

We prepared a sample file (testSample.fq, which can be downloaded from GitHub) so that users can test their installation and see how CATT is used. Enter the folder that contains the sample file and run:

# Docker image
docker run -it --rm -v $PWD:/output -w /output guobioinfolab/catt \
catt -f testSample.fq -o testSampleOutput -t 2

# Source code
catt -f testSample.fq -o testSampleOutput -t 2

If all goes well, a CSV file named testSampleOutput.TRB.CDR3.CATT.csv should be created in the current folder.

Docker command explained:

The -it and --rm flags set container attributes and are not important here. -v mounts the specified host directory inside the container at the specified path. In this case, $PWD (the current directory; Linux users only) is mounted at /output inside the CATT container. Because -w makes /output the working directory, -o testSampleOutput writes the results to /output/testSampleOutput. As /output is the same directory as $PWD, you can find the results in $PWD outside the container.
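
For scripting, the Docker invocation explained above can be assembled programmatically. This is a sketch that assumes the image name and flags shown in this README; build_docker_command is a hypothetical helper, not something shipped with CATT.

```python
import os


def build_docker_command(catt_args, host_dir=None, image="guobioinfolab/catt"):
    """Return the argument list for running catt inside the Docker image.

    host_dir is mounted at /output, which is also the working directory,
    so relative input and output names resolve inside host_dir.
    """
    host_dir = os.path.abspath(host_dir or os.getcwd())
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{host_dir}:/output",  # mount the host folder into the container
        "-w", "/output",              # run catt from the mounted folder
        image,
        "catt", *catt_args,
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True), for example with catt_args set to ["-f", "testSample.fq", "-o", "testSampleOutput", "-t", "2"].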

Usage

For consistent usage, users who installed CATT with Docker should always add the following prefix before a command:

docker run -it --rm -v $PWD:/output -w /output -u $UID guobioinfolab/catt

# For example, change
catt -f testSample.fq -o testSampleOutput -t 2
# to
docker run -it --rm -v $PWD:/output -w /output -u $UID guobioinfolab/catt \
catt -f testSample.fq -o testSampleOutput -t 2

Here $PWD is the path of the folder that contains your input data (an absolute path, or just $PWD if the input file is in the current folder).

Basic

# For single-end input:
catt [option] -f inputFile -o outputName

# For paired-end input:
catt [option] --f1 inputFile1 --f2 inputFile2 -o outputName

# For bam input
catt [option] --bam -f inputFile -o outputName

option:

  • -t {numberOfThreads}: Number of threads used by CATT. Default: 4
  • -botw [int]: Number of threads used for alignment. Default: 4
  • -sc: Use single-cell mode, which applies a more aggressive error correction model. For single-cell analysis, each cell should be provided as a separate input file.
  • --bam [file_path]: Input format is BAM/SAM.
  • --region: CDR region to analyze. One of CDR1/CDR2/CDR3. Default: CDR3
  • --chain: Chain to analyze. One of TRA/TRB/IGH. Default: TRB
  • --species: One of hs, ms, pig. Default: hs
  • -k: K-mer length used in assembly. If the option is not set, the length is inferred automatically from the data; otherwise it accepts an integer in the range [5, 32].

Advanced

Multiple input files

The parameters -f (--f1 and --f2 for paired-end input) and -o accept multiple files, like:

catt [option] -f inputFile1 inputFile2 inputFile3 ... inputFileN \
-o outputName1 outputName2 outputName3 ... outputNameN

The inputs and outputs must correspond one-to-one.

As the current version of Julia (v1.1) has a long startup time (~3 s, to be fixed in the next version), we recommend putting all inputs in one command.
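
When batches are assembled by a script, the one-to-one requirement is easy to check before CATT is launched. The sketch below builds the argument list for several single-end inputs in one call; build_batch_args is a hypothetical helper, and only the -t, -f, and -o options documented above are used.

```python
def build_batch_args(inputs, outputs, threads=None):
    """Build catt arguments for several single-end inputs in one call."""
    if len(inputs) != len(outputs):
        raise ValueError(
            f"{len(inputs)} input files but {len(outputs)} output names; "
            "they must correspond one-to-one"
        )
    if not inputs:
        raise ValueError("at least one input file is required")
    args = ["catt"]
    if threads is not None:
        args += ["-t", str(threads)]
    # All inputs follow -f and all output names follow -o, in matching order.
    args += ["-f", *inputs, "-o", *outputs]
    return args
```

Failing early on a length mismatch avoids paying the startup cost only to discover that an output name is missing.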

10X format data

As 10X sequencing is becoming popular, we added support for processing 10X scTCR-Seq data. (In our evaluation, current 10X scRNA-seq data is not suitable for TCR profiling, as the read number and length are below the minimum requirements.) CATT automatically reads the data, trims the UMI, and performs TCR profiling. Only the current version of the scTCR toolkit is supported: 150 bp paired-end reads, where the first 16 bp of Read1 are the UMI and barcode sequence. CATT outputs TCRs for every cell (every barcode), some of which might be empty cells or derive from barcode errors. Users need to filter out such cells themselves.

catt [option] --tenX --f1 R1 --f2 R2 -o outputName
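
The trimming step described above can be pictured as follows. This only illustrates splitting off the first 16 bp of Read 1; it is not CATT's implementation, and the function name is hypothetical. The text does not say how the 16 bp are divided between barcode and UMI, so the sketch keeps them together as one prefix.

```python
PREFIX_LEN = 16  # barcode + UMI length stated in the text for Read 1


def split_read1(sequence: str, quality: str, prefix_len: int = PREFIX_LEN):
    """Split a Read 1 record into (prefix, trimmed_sequence, trimmed_quality)."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    if len(sequence) < prefix_len:
        raise ValueError("read is shorter than the barcode/UMI prefix")
    # The quality string is trimmed by the same amount to stay aligned.
    return sequence[:prefix_len], sequence[prefix_len:], quality[prefix_len:]
```

Grouping reads by the returned prefix is one way to count reads per barcode when deciding which cells to filter out afterwards.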

Output explained

The output of CATT is a CSV file named like {prefix}_{chain}_{region}.CATT.csv. The file contains 7 columns:

  • AAseq: The amino acid sequence of the TCR/BCR
  • NNseq: The nucleotide sequence of the TCR/BCR
  • Vregion, Jregion, Dregion: The V(D)J gene segments used
  • Frequency: The frequency of the TCR/BCR
  • Probability: The probability that the sequence exists in current databases (VDJdb and immunoSEQ). Note that the absolute value of this probability has no biological meaning; a higher value only implies a higher probability that the sequence occurs in currently known databases.
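
A result file with these columns can be inspected with the Python standard library. The sketch below assumes that the CSV header uses exactly the column names listed above and that Frequency is numeric; both are assumptions, and top_clonotypes is a hypothetical helper. The sample rows are made up for illustration and are not real CATT output.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "AAseq", "NNseq", "Vregion", "Jregion", "Dregion",
    "Frequency", "Probability",
]


def top_clonotypes(handle, n=10):
    """Return the n rows with the highest Frequency from a CATT-style CSV."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    rows.sort(key=lambda row: float(row["Frequency"]), reverse=True)
    return rows[:n]


if __name__ == "__main__":
    # Made-up rows with placeholder gene names, for illustration only.
    sample = io.StringIO(
        "AAseq,NNseq,Vregion,Jregion,Dregion,Frequency,Probability\n"
        "CAAA,TGTGCAGCAGCA,V-x,J-x,D-x,3,0.1\n"
        "CAAS,TGTGCAGCAAGC,V-y,J-y,D-y,12,0.4\n"
    )
    for row in top_clonotypes(sample, n=1):
        print(row["AAseq"], row["Frequency"])
```

For a real file, open it with open(path, newline="") and pass the handle to top_clonotypes.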

FAQ

  • Q: Got permission denied while trying to connect to the Docker when trying to build the Docker image

    A: Make sure your user is in the docker group, which has permission to use the docker command.

  • Q: Got ERROR: LoadError: failed process: Process(samtools view -F 2308, ProcessExited(1))

    A: Please update samtools to the latest version.

Terms of use

CATT is a sensitive and accurate tool for characterizing T cell receptor sequences in bulk and single-cell TCR-Seq and RNA-Seq data, maintained by An-Yuan Guo Bioinformatics Lab (Guo Lab). Guo Lab may, from time to time, update the content on http://bioinfo.life.hust.edu.cn/CATT and https://github.com/GuoBioinfoLab/CATT. Guo Lab makes no warranties or representations, express or implied, with respect to any of the Content, including as to the present accuracy, completeness, timeliness, adequacy, or usefulness of any of the Content. By using this website, you agree that Guo Lab will not be liable for any losses or damages arising from your use of or reliance on the Content, or other websites or information to which this website may be linked.

CATT is freely accessible for research use in an academic setting. You may view the Content solely for your own personal reference or use for research in an academic setting. All academic research use of the Content must credit CATT as the source of the Content and reference these Terms of Use; outside of scientific publication, you may not otherwise redistribute or share the Content with any third party, in part or in whole, for any purpose, without the express permission of Guo Lab.

Unless you have signed a license agreement with Guo Lab, you may not use any part of the Content for any other purpose, including:

use or incorporation into a commercial product or towards performance of a commercial service; research use in a commercial setting; use for patient services; or generation of reports in a hospital or other patient care setting. You may not copy, transfer, reproduce, modify or create derivative works of CATT for any commercial purpose without the express permission of Guo Lab. If you seek to use CATT for such purposes, please request the license which best describes your anticipated use of CATT below:

  • Research use in commercial setting
  • Use in a commercial product
  • Use for patient services or reports in a hospital setting
  • Please contact me at [email protected]

Credit

Please cite our paper when using CATT


Copyright Guo Lab , College of Life Science and Technology , HUST , China