mortazavilab / Transcriptclean

License: MIT

TranscriptClean

TranscriptClean is a Python program that corrects mismatches, microindels, and noncanonical splice junctions in long reads that have been mapped to the genome. It is designed for use with SAM files from the PacBio Iso-Seq and Oxford Nanopore transcriptome sequencing technologies. A variant-aware mode is available for users who want to avoid correcting away known variants in their data.

Note: At the present time, TranscriptClean does not work on SAM files that use X/= operators rather than M to represent matches in the CIGAR field. We are working on adding support for this in a future version.
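
To check whether an input file is affected by this limitation, a short script can scan the CIGAR column (the sixth tab-separated field of each alignment line) for the = and X operators. The following is a minimal sketch and is not part of TranscriptClean; the function name uses_extended_cigar and the example records are made up for illustration.

```python
import re

# CIGAR operators '=' (sequence match) and 'X' (sequence mismatch),
# each preceded by a length, e.g. "20=1X29=".
EXTENDED_OPS = re.compile(r"\d+[=X]")


def uses_extended_cigar(sam_lines):
    """Return True if any alignment line has '=' or 'X' in its CIGAR field."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines start with '@'
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:  # not a complete alignment record
            continue
        if EXTENDED_OPS.search(fields[5]):  # column 6 is the CIGAR string
            return True
    return False


# Illustrative records only (hypothetical reads).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "read2\t0\tchr1\t200\t60\t20=1X29=\t*\t0\t0\t*\t*",
]
print(uses_extended_cigar(example))      # True
print(uses_extended_cigar(example[:2]))  # False
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `with open("transcripts.sam") as fh: uses_extended_cigar(fh)`.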

Installation

The current TranscriptClean version is designed to be run with Python 3.7. It requires Bedtools and Samtools to be installed, as well as the Python modules pybedtools and pyfasta. These can be found at the links listed below:

Bedtools (v2.25.0): http://bedtools.readthedocs.io/en/latest/content/installation.html
Samtools (v1.9): https://github.com/samtools/samtools/releases/
pybedtools (v0.7.8): https://daler.github.io/pybedtools/
pyfasta (v0.5.2): https://pypi.python.org/pypi/pyfasta/

In addition, R (tested with v3.3.2) is needed to run the visualization script, generate_report.R.

To install TranscriptClean, download the files using GitHub's "Download ZIP" button, then unzip them in the directory where you would like to install the program. Alternatively, you can download a specific version of the program from the Releases tab. The TranscriptClean script can then be run directly from the command line by including its path.

Usage

TranscriptClean is run from the command line as follows. Releases 2.0 and later can be run with multiple threads. For fastest performance, we recommend sorting your input SAM file. For additional details and examples, please see the Wiki section.

python TranscriptClean.py --sam transcripts.sam --genome hg38.fa --outprefix /my/path/outfile

Basic Options

Option	Shortcut	Description
--help	-h	Print a list of the input options with descriptions
--sam	-s	Input SAM file (mandatory). The aligner used to create it must be splice-aware if you want to correct splice junctions.
--genome	-g	Reference genome FASTA file (mandatory). Should be the same one used during alignment to generate the SAM file.
--threads	-t	Number of threads to run program with. Default = 1.
--outprefix	-o	Prefix for the output files. Default = "out".
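
When TranscriptClean is launched from a pipeline script, the basic options above can be assembled into an argument list. This is a minimal sketch: build_command is a hypothetical helper, not part of TranscriptClean, and it only uses the flags documented in the table.

```python
import shlex


def build_command(sam, genome, outprefix="out", threads=1,
                  script="TranscriptClean.py"):
    """Assemble a TranscriptClean command line from the basic options.

    The defaults mirror the documented ones (threads = 1, outprefix = "out").
    `script` should be the path to TranscriptClean.py on your system.
    """
    return [
        "python", script,
        "--sam", sam,
        "--genome", genome,
        "--threads", str(threads),
        "--outprefix", outprefix,
    ]


cmd = build_command("transcripts.sam", "hg38.fa",
                    outprefix="/my/path/outfile", threads=4)

# Print a shell-safe rendering of the command for logging.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it without going through a shell, so paths containing spaces need no extra quoting.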

Options that control run mode

Option	Shortcut	Description
--dryRun	n/a	Include this option to run an inventory of all indels in the data without performing any correction. Useful for choosing values for maxLenIndel and maxSJOffset.
--correctMismatches	-m	If set to false, TranscriptClean will skip mismatch correction. Default = True.
--correctIndels	-i	If set to false, TranscriptClean will skip indel correction. Default = True.
--variants	-v	Optional: VCF-formatted file of variants to avoid correcting (this enables variant-aware correction). Irrelevant if correctMismatches is set to false.
--spliceJns	-j	High-confidence splice junction file obtained by mapping Illumina short reads to the genome using STAR. More formats may be supported in the future. This file is necessary if you want to correct noncanonical splice junctions.
--maxLenIndel	n/a	Maximum size indel to correct. Default = 5 bp.
--maxSJOffset	n/a	Maximum distance from annotated splice junction to correct. Default = 5 bp.
--primaryOnly	n/a	If this option is set, TranscriptClean will only output primary mappings of transcripts (i.e., it will filter out unmapped and multimapped lines from the SAM input).
--canonOnly	n/a	If this option is set, TranscriptClean will only output transcripts that are either canonical or that contain annotated noncanonical junctions to the clean SAM and FASTA files at the end of the run.
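
To illustrate what keeping only primary mappings can mean in terms of the SAM FLAG field, the sketch below filters out records with the unmapped (0x4), secondary (0x100), or supplementary (0x800) bits set. These bit values come from the SAM specification. Treating all three as non-primary is an assumption made here, and TranscriptClean's own --primaryOnly logic may differ. The function names are hypothetical.

```python
# FLAG bits defined by the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def is_primary_mapped(flag):
    """True if the read is mapped and the record is neither secondary
    nor supplementary (assumed definition of 'primary' for this sketch)."""
    return (flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY)) == 0


def keep_primary(sam_lines):
    """Yield header lines unchanged and alignment lines that pass the filter."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        flag = int(line.split("\t")[1])  # column 2 is the FLAG
        if is_primary_mapped(flag):
            yield line


# Illustrative records only (hypothetical reads).
records = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "read1\t256\tchr1\t500\t0\t50M\t*\t0\t0\t*\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "read3\t16\tchr1\t300\t60\t50M\t*\t0\t0\t*\t*",
]
kept = list(keep_primary(records))
print(len(kept))  # 3: the header plus read1 (flag 0) and read3 (flag 16)
```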

Other options that may help tune performance

Option	Shortcut	Description
--tmp_dir	n/a	If you would like the tmp files to be written somewhere other than the final output location, provide the path here (for example, a tmp directory on the local drive of a compute node).
--bufferSize	n/a	Number of lines to output to file at once by each thread during run. Default = 100.
--deleteTmp	n/a	If this option is set, the temporary directory generated by TranscriptClean (TC_tmp) will be removed at the end of the run.

Output files

TranscriptClean outputs the following files:

SAM file of corrected transcripts. Unmapped/non-primary transcript alignments from the input file are included in their original form.
FASTA file of corrected transcript sequences. Unmapped transcripts from the input file are included in their original form.
Transcript error log file (.TE.log): Each row represents a potential error in a given transcript. The column values track whether the error was corrected or not and why.
Transcript log file (.log): Each row represents a transcript. The columns track the mapping status of the transcript, as well as how many errors of each type were found and corrected/not corrected in the transcript.

Credit

Please cite our paper when using TranscriptClean:

Dana Wyman, Ali Mortazavi, TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts, Bioinformatics, Volume 35, Issue 2, 15 January 2019, Pages 340–342, https://doi.org/10.1093/bioinformatics/bty483
