Licence: mit
SquiggleKit: A toolkit for manipulating nanopore signal data

SquiggleKit

A toolkit for accessing and manipulating nanopore signal data

Full documentation: https://psy-fer.github.io/SquiggleKitDocs/

Publication: SquiggleKit: A toolkit for manipulating nanopore signal data

Pre-print: SquiggleKit: A toolkit for manipulating nanopore signal data

Coming changes

Fast5_fetcher: merge single files into multi-fast5 files

SquigglePull: python3, read from multi-fast5

SquigglePlot: python3, read from multi-fast5, image size args, arg clean-up

Segmenter: dynamic file formats and more stability

MotifSeq: Improved background modelling, custom modelling, RNA specific tools, custom alignment methods

Overview

| Tool | Category | Description |
|---|---|---|
| Fast5_fetcher | File management | Fetches fast5 files given a filtered input list |
| SquigglePull | Signal extraction | Extracts event or raw signal from data files |
| SquigglePlot | Signal visualisation | Visualisation tool for signal data |
| Segmenter | Signal analysis | Finds adapter, stall, and homopolymer regions |
| MotifSeq | Signal analysis | Finds nucleotide sequence motifs in signal, i.e. "Ctrl+F" for signal |

Requirements

Following a self-imposed guideline, most things written to handle nanopore data, or bioinformatics in general, use as few third-party libraries as possible, aiming for only core libraries, or include all required files in the package.

In the case of fast5_fetcher.py and batch_tater.py, only core Python libraries are used. So as long as Python 2.7+ is present, everything should work with no extra steps.

There is one catch. Everything is written primarily for use with Linux. Because macOS is Unix-based, there should be minimal issues running it there, so long as the GNU tools are installed (see below). Windows, however, may require more massaging: the Windows Subsystem for Linux (WSL) must be installed. Follow the instructions here to do this.

SquiggleKit tools are deliberately not shipped as executables, so that they can be used with varying Python environments on various operating systems. To make them executable, add a #! (shebang) line, such as #!/usr/bin/env python2.7, as the first line of each file, then add the SquiggleKit directory to the PATH variable in ~/.bashrc: export PATH="$HOME/path/to/SquiggleKit:$PATH"
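The shebang step above can also be scripted. The following is a minimal Python 3 sketch, not part of SquiggleKit: the helper name `make_executable` is hypothetical, and it additionally sets the executable permission bit, which a shell normally requires before a script can be run directly.

```python
import stat
from pathlib import Path

# Shebang taken from the text above; adjust it to your own interpreter.
SHEBANG = "#!/usr/bin/env python2.7\n"


def make_executable(script_path, shebang=SHEBANG):
    """Prepend a shebang line (if missing) and set the executable bits.

    Hypothetical helper for illustration; not part of SquiggleKit.
    """
    path = Path(script_path)
    text = path.read_text()
    if not text.startswith("#!"):
        path.write_text(shebang + text)
    # Add execute permission for user, group and others.
    path.chmod(path.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path
```

For example, `make_executable("SquiggleKit/SquigglePull.py")` would prepare a single tool. The PATH export still has to be added to ~/.bashrc by hand.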

Install

git clone https://github.com/Psy-Fer/SquiggleKit.git

Use pip for python 2 and pip3 for python 3. User environments may vary.

for fast5_fetcher.py, SquigglePull.py, segmenter.py:

  • numpy
  • matplotlib
  • h5py
  • sklearn
  • ont_fast5_api
pip install numpy h5py sklearn matplotlib
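Because user environments vary, it can help to confirm that the listed packages are visible to the interpreter you intend to use. The snippet below is a small sketch for that purpose. The function name `missing_modules` is an illustrative assumption, and the list uses the import names of the packages above.

```python
import importlib.util

# Import names for the packages listed above.
REQUIRED_MODULES = ["numpy", "matplotlib", "h5py", "sklearn", "ont_fast5_api"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

Run it with the same interpreter you will use for the tools (for example `python3 check.py`, where `check.py` is whatever file name you saved it under).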

for MotifSeq.py:

  • all of the above
  • scipy
  • scrappie
  • mlpy 3.5.0 (only use pip3 in python 3 - see below)
pip install scipy scrappie

Installing mlpy:

Python2
Python 3.6 (Python 3.7 does not work because of a Cython/GSL issue):
pip3 install machine-learning-py

Quick start

fast5_fetcher

If you are using macOS and do not have Homebrew, install it first:

homebrew installation instructions

then install gnu-tar with:

brew install gnu-tar

Building the index (not required for multi_fast5)

How the index is built depends on which file structure you are using. It will work with both tarred and un-tarred file structures. Tarred is preferred. (zip and other archive methods are being investigated)

- Raw structure (not preferred)
for file in $(pwd)/reads/*/*;do echo $file; done >> name.index

gzip name.index
- Local basecalled structure
for file in $(pwd)/reads.tar; do echo $file; tar -tf $file; done >> name.index

gzip name.index
- Parallel basecalled structure
for file in $(pwd)/fast5/*fast5.tar; do echo $file; tar -tf $file; done >> name.index

If you have multiple experiments, then cat them all together and gzip.

for file in ./*.index; do cat $file; done >> ../all.name.index

gzip ../all.name.index
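For the tarred structures, the shell loops above write each tar file's path followed by a listing of its contents. The same idea can be sketched in Python with the standard library. This is an illustration only, under stated assumptions: the function name `build_tar_index` is hypothetical, the file is overwritten rather than appended to, and `tarfile` reports directory members without the trailing `/` that `tar -tf` prints, so the shell commands remain the reference way to build an index for fast5_fetcher.

```python
import gzip
import tarfile
from pathlib import Path


def build_tar_index(tar_paths, index_path):
    """Write a gzipped index: each tar path, then the names of its members."""
    with gzip.open(index_path, "wt") as out:
        for tar_path in tar_paths:
            tar_path = Path(tar_path).resolve()
            out.write(str(tar_path) + "\n")      # like: echo $file
            with tarfile.open(tar_path) as tar:
                for name in tar.getnames():      # like: tar -tf $file
                    out.write(name + "\n")
    return index_path
```

A call such as `build_tar_index(sorted(Path("fast5").glob("*fast5.tar")), "name.index.gz")` would mirror the parallel basecalled case.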
Basic use on a local computer

using a filtered paf file as input:

python fast5_fetcher.py -p my.paf -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5
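To make the relationship between these inputs concrete, here is a rough Python sketch of how read IDs from a PAF file could be matched to fast5 filenames in a sequencing summary. It is not fast5_fetcher's implementation. The function names are hypothetical, and it assumes the summary is tab-separated with a header containing `read_id` and `filename` columns, which you should check against your own file. The only PAF detail relied on is that the first column holds the query (read) name.

```python
import csv
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def paf_read_ids(paf_path):
    """Collect query names (PAF column 1) from a PAF file."""
    with open_text(paf_path) as handle:
        return {line.split("\t", 1)[0].strip() for line in handle if line.strip()}


def filenames_for_reads(summary_path, read_ids):
    """Map read_id -> filename using a tab-separated summary with a header.

    Assumes columns named 'read_id' and 'filename'; check your own file.
    """
    with open_text(summary_path) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["read_id"]: row["filename"]
                for row in reader if row["read_id"] in read_ids}
```

The resulting filenames are what would then be looked up in the index to locate each fast5 file on disk or inside a tar archive.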

SquigglePull

All raw data:

python SquigglePull.py -rv -p ~/data/test/reads/1/ -f all > data.tsv

Positional event data:

python SquigglePull.py -ev -p ./test/ -t 50,150 -f pos1 > data.tsv

SquigglePlot

Plot individual fast5 file:

python SquigglePlot.py -i ~/data/test.fast5

Plot files in path

python SquigglePlot.py -p ~/data/ --plot_colour -g

Plot the first 2000 data points of each read from a signal file and save as a 300 dpi PDF:

python SquigglePlot.py -s signals.tsv.gz --plot_colour teal -n 2000 --dpi 300 --no_show --save test.pdf --save_path ./test/plots/

Segmenter

Identify any segments in folder and visualise each one

Use f to full screen a plot, and ctrl+w to close a plot and move to the next one.

python segmenter.py -p ./test/ -v

Stall identification

python segmenter.py -s signals.tsv.gz -ku -j 100 > signals_stall_segments.tsv

MotifSeq

Find kmer motif:

fasta format for model:

my_kmer.fa

>my_kmer_name
ATCGATCGCTATGCTAGCATTACG

find the best match to that k-mer in the signal:

python MotifSeq.py -s signals.tsv -i my_kmer.fa > signals_kmer.tsv

Limitations

The k-mer should not be shorter than 12 nt; below this length, results become unreliable given the current modelling.

The p-values and hit probabilities provided are based on loose modelling of negative background scores for a number of k-mers. The background is currently modelled only on R9.4, not R10 or RNA.
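The length guideline above can be checked before running MotifSeq. The snippet below is a minimal sketch with hypothetical helper names (`read_fasta`, `short_kmers`). It reads a FASTA file such as my_kmer.fa and reports any records shorter than 12 nt. It does not reproduce any check that MotifSeq itself may perform.

```python
MIN_KMER_LENGTH = 12  # guideline from the limitations above


def read_fasta(path):
    """Return a dict of {record name: sequence} from a FASTA file."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:].split()
                name = header[0] if header else ""
                records[name] = ""
            elif name is not None:
                records[name] += line.upper()
    return records


def short_kmers(records, min_length=MIN_KMER_LENGTH):
    """Return the names of records shorter than min_length."""
    return [name for name, seq in records.items() if len(seq) < min_length]
```

For example, `short_kmers(read_fasta("my_kmer.fa"))` returns an empty list when every k-mer in the file meets the guideline.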

Acknowledgements

I would like to thank the members of my lab, Shaun Carswell, Kirston Barton, Hasindu Gamaarachchi, Kai Martin, Tansel Ersavas, Brent Bevear, Jillian Hammond, and Martin Smith, of the Genomic Technologies team at the Garvan Institute, for their feedback on the development of these tools.

License

The MIT License