mortazavilab / Transcriptclean

Licence: MIT
Correct mismatches, microindels, and noncanonical splice junctions in long reads that have been mapped to the genome

TranscriptClean

TranscriptClean is a Python program that corrects mismatches, microindels, and noncanonical splice junctions in long reads that have been mapped to the genome. It is designed for use with SAM files from the PacBio Iso-Seq and Oxford Nanopore transcriptome sequencing technologies. A variant-aware mode is available for users who want to avoid correcting away known variants in their data.

Note: At the present time, TranscriptClean does not work on SAM files that use X/= operators rather than M to represent matches in the CIGAR field. We are working on adding support for this in a future version.
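As a possible workaround until X/= support is added, the CIGAR strings can be rewritten before running the program. The following is a minimal sketch, not part of TranscriptClean: the function name collapse_cigar is hypothetical, and it relies only on the SAM convention that M covers both matches (=) and mismatches (X).

```python
import re

# One CIGAR operation: a length followed by an operator from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def collapse_cigar(cigar):
    """Rewrite '=' and 'X' operations as 'M', merging adjacent M runs."""
    if cigar == "*":  # CIGAR unavailable (e.g. unmapped read)
        return cigar
    merged = []
    for length, op in CIGAR_OP.findall(cigar):
        if op in "=X":
            op = "M"
        if merged and merged[-1][1] == op:
            # Extend the previous run instead of emitting e.g. 5M1M4M
            merged[-1][0] += int(length)
        else:
            merged.append([int(length), op])
    return "".join(f"{n}{op}" for n, op in merged)
```

To use it on a SAM file, apply the function to the sixth tab-separated column of every alignment line and leave header lines (those starting with @) unchanged.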

Installation

The current TranscriptClean version is designed to be run with Python 3.7. It requires Bedtools to be installed, as well as the Python modules pybedtools and pyfasta.
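Before a first run, it can help to confirm that these dependencies are visible from the Python environment you plan to use. The sketch below is an illustration only: missing_dependencies is a hypothetical helper, and it assumes the Bedtools executable is named bedtools and is on your PATH.

```python
import importlib.util
import shutil


def missing_dependencies(executables=("bedtools",),
                         modules=("pybedtools", "pyfasta")):
    """Return the names of required tools and modules that cannot be found."""
    missing = []
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(exe) is None:
            missing.append(exe)
    for mod in modules:
        # find_spec returns None when a top-level module is not importable
        if importlib.util.find_spec(mod) is None:
            missing.append(mod)
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        print("Missing dependencies: " + ", ".join(problems))
    else:
        print("All dependencies found.")
```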

In addition, R (tested with v.3.3.2) is needed to run the visualization script, generate_report.R.

To install TranscriptClean, download the files using GitHub's "Download ZIP" button, then unzip them in the directory where you would like to install the program. Alternatively, you can download a specific version of the program from the Releases tab. The TranscriptClean script can then be run directly from the command line by including its path.

Usage

TranscriptClean is run from the command line as follows. Please note that releases 2.0+ can be run in multithreaded fashion. For fastest performance, we recommend sorting your input SAM file. For additional details and examples, please see the Wiki section.

python TranscriptClean.py --sam transcripts.sam --genome hg38.fa --outprefix /my/path/outfile
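If you launch TranscriptClean from another Python script, for example in a pipeline, the same command can be assembled as an argument list. This is a sketch under assumptions: build_command and run_transcriptclean are hypothetical helper names, and the script argument must point to wherever you unzipped TranscriptClean.py. Only flags described in this README are used.

```python
import subprocess
import sys


def build_command(sam, genome, outprefix="out", threads=1,
                  script="TranscriptClean.py"):
    """Assemble the TranscriptClean command line as a list of arguments."""
    return [
        sys.executable, script,   # run with the current Python interpreter
        "--sam", sam,
        "--genome", genome,
        "--threads", str(threads),
        "--outprefix", outprefix,
    ]


def run_transcriptclean(**kwargs):
    """Run TranscriptClean and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(**kwargs), check=True)
```

For example, run_transcriptclean(sam="transcripts.sam", genome="hg38.fa", outprefix="/my/path/outfile", threads=4) would issue the command shown above with four threads.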

Basic Options

| Option | Shortcut | Description |
|---|---|---|
| --help | -h | Print a list of the input options with descriptions. |
| --sam | -s | Input SAM file (mandatory). The aligner used to create it must be splice aware if you want to correct splice junctions. |
| --genome | -g | Reference genome FASTA file (mandatory). Should be the same one used during alignment to generate the SAM file. |
| --threads | -t | Number of threads to run the program with. Default = 1. |
| --outprefix | -o | Prefix for the output files. Default = "out". |

Options that control run mode

| Option | Shortcut | Description |
|---|---|---|
| --dryRun | n/a | Include this option to run an inventory of all indels in the data without performing any correction. Useful for choosing values for maxLenIndel and maxSJOffset. |
| --correctMismatches | -m | If set to false, TranscriptClean will skip mismatch correction. Default = True. |
| --correctIndels | -i | If set to false, TranscriptClean will skip indel correction. Default = True. |
| --variants | -v | Optional: VCF-formatted file of variants to avoid correcting (this enables variant-aware correction). Irrelevant if correctMismatches is set to false. |
| --spliceJns | -j | High-confidence splice junction file obtained by mapping Illumina short reads to the genome using STAR. More formats may be supported in the future. This file is necessary if you want to correct noncanonical splice junctions. |
| --maxLenIndel | n/a | Maximum indel size to correct. Default = 5 bp. |
| --maxSJOffset | n/a | Maximum distance from an annotated splice junction to correct. Default = 5 bp. |
| --primaryOnly | n/a | If this option is set, TranscriptClean will only output primary mappings of transcripts (i.e., it will filter out unmapped and multimapped lines from the SAM input). |
| --canonOnly | n/a | If this option is set, TranscriptClean will only output transcripts that are either canonical or that contain annotated noncanonical junctions to the clean SAM and FASTA files at the end of the run. |
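To make the --primaryOnly description concrete, the sketch below shows one way to recognize primary mappings from the SAM FLAG field. It is an illustration based on the SAM specification's flag bits, not TranscriptClean's own code; in particular, treating supplementary alignments (0x800) as non-primary is an assumption, and both function names are hypothetical.

```python
# FLAG bits defined by the SAM specification
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800  # assumption: excluded along with secondary alignments


def is_primary_mapping(sam_line):
    """Return True if an alignment line is a mapped, primary alignment."""
    flag = int(sam_line.split("\t")[1])  # FLAG is the second SAM column
    return not flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY)


def keep_primary(lines):
    """Yield header lines unchanged and alignment lines only if primary."""
    for line in lines:
        if line.startswith("@") or is_primary_mapping(line):
            yield line
```

A filter like keep_primary can be applied to an open SAM file handle to preview which records such an option would retain.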

Other options that may help tune performance

| Option | Shortcut | Description |
|---|---|---|
| --tmp_dir | n/a | If you would like the tmp files to be written somewhere different than the final output, provide the path to that location here. For example, a tmp directory on the local drive of a compute node. |
| --bufferSize | n/a | Number of lines to output to file at once by each thread during run. Default = 100. |
| --deleteTmp | n/a | If this option is set, the temporary directory generated by TranscriptClean (TC_tmp) will be removed at the end of the run. |
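The idea behind a line buffer such as --bufferSize can be sketched as follows: lines are held in memory and written in batches, which reduces the number of write calls at the cost of a little memory. This is a generic illustration, not TranscriptClean's implementation, and the class name BufferedLineWriter is hypothetical.

```python
import io


class BufferedLineWriter:
    """Collect lines and write them to a file handle in batches."""

    def __init__(self, handle, buffer_size=100):
        self.handle = handle
        self.buffer_size = buffer_size
        self._lines = []

    def write(self, line):
        self._lines.append(line)
        # Flush once the buffer holds buffer_size lines
        if len(self._lines) >= self.buffer_size:
            self.flush()

    def flush(self):
        if self._lines:
            self.handle.write("".join(self._lines))
            self._lines.clear()


# Small demonstration with an in-memory file
sink = io.StringIO()
writer = BufferedLineWriter(sink, buffer_size=2)
for record in ("read1\n", "read2\n", "read3\n"):
    writer.write(record)
writer.flush()  # write any lines still held in the buffer
```

A larger buffer means fewer, bigger writes, which may help when the output or tmp directory is on a slow or shared file system.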

Output files

TranscriptClean outputs the following files:

  • SAM file of corrected transcripts. Unmapped/non-primary transcript alignments from the input file are included in their original form.
  • FASTA file of corrected transcript sequences. Unmapped transcripts from the input file are included in their original form.
  • Transcript error log file (.TE.log): Each row represents a potential error in a given transcript. The column values track whether the error was corrected or not and why.
  • Transcript log file (.log): Each row represents a transcript. The columns track the mapping status of the transcript, as well as how many errors of each type were found and corrected/not corrected in the transcript.

Credit

Please cite our paper when using TranscriptClean:

Dana Wyman, Ali Mortazavi, TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts, Bioinformatics, Volume 35, Issue 2, 15 January 2019, Pages 340–342, https://doi.org/10.1093/bioinformatics/bty483
