refresh-bio / PHIST

Licence: GPL-3.0 license
Phage-Host Interaction Search Tool

PHIST

Phage-Host Interaction Search Tool

A tool to predict prokaryotic hosts for phage (meta)genomic sequences. PHIST links viruses to hosts based on the number of k-mers shared between their sequences.
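To make the idea of "k-mers shared between sequences" concrete, here is a minimal Python sketch that counts the distinct k-mers two sequences have in common. It is an illustration only, not PHIST's implementation (PHIST delegates this work to Kmer-db). The function names are hypothetical, and treating a k-mer and its reverse complement as the same k-mer is an assumption made for this sketch.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence made of A/C/G/T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def canonical_kmers(seq: str, k: int) -> set:
    """Collect the distinct k-mers of seq, one representative per strand pair.

    Assumption for this sketch: a k-mer and its reverse complement count as
    the same k-mer, represented by the lexicographically smaller of the two.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def shared_kmer_count(virus: str, host: str, k: int) -> int:
    """Number of distinct k-mers present in both sequences."""
    return len(canonical_kmers(virus, k) & canonical_kmers(host, k))
```

For example, shared_kmer_count("ACGTAC", "GTACGT", 4) counts the 4-mers the two toy sequences share on either strand.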

Quick start

git clone --recurse-submodules https://github.com/refresh-bio/PHIST

cd PHIST
make

./phist.py ./example/virus ./example/host ./out/

Installation

PHIST uses Kmer-db as a submodule, so the repository must be cloned recursively:

git clone --recurse-submodules https://github.com/refresh-bio/PHIST

Under Linux/OS X the package can be built by running make in the project directory (tested with G++ 5.3):

cd PHIST
make

Under Windows one has to build the Visual Studio 2015 solutions in the kmer-db and utils subdirectories (use the Release 64-bit configuration, as the Python script depends on the default VS output directory structure).

Usage

PHIST takes as input two directories containing FASTA files (gzipped or not) with genomic sequences of viruses and candidate hosts (see example).

./phist.py [options] <virus_dir> <host_dir> <out_dir>

Positional arguments:

  • virus_dir Input directory with virus FASTA files (plain or gzip),
  • host_dir Input directory with host FASTA files (plain or gzip),
  • out_dir Output directory (will be created if it does not exist).

Options:

  • -k <kmer-length> k-mer length (default: 25, max: 30),
  • -t <num-threads> Number of threads (default: number of cores),
  • -h, --help Show this help message and exit,
  • --version Show tool's version number and exit.
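When PHIST is called from another Python program, the command line can be assembled from the documented arguments. The sketch below is a hypothetical wrapper, not part of PHIST: the function names and the default script path are assumptions, and only the -k and -t options listed above are used.

```python
import subprocess


def build_phist_command(virus_dir, host_dir, out_dir, k=25, threads=None,
                        script="./phist.py"):
    """Build the argument list for a phist.py call (hypothetical helper).

    Only the documented options are used: -k (default 25, max 30) and -t.
    """
    if k < 1 or k > 30:
        raise ValueError("k-mer length must be positive and at most 30")
    cmd = [script, "-k", str(k)]
    if threads is not None:
        cmd += ["-t", str(threads)]  # omitted: PHIST uses all cores
    cmd += [virus_dir, host_dir, out_dir]
    return cmd


def run_phist(*args, **kwargs):
    """Run PHIST and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_phist_command(*args, **kwargs), check=True)
```

Calling run_phist("./example/virus", "./example/host", "./out/") from the project directory would then correspond to the quick-start command with an explicit -k 25.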

Output format

PHIST outputs two CSV files: one contains a table of common k-mers between phages and hosts, and the other contains the virus-host predictions.

Common k-mers table

The common_kmers.csv file stores the numbers of common k-mers between phages (in columns) and hosts (in rows) in a sparse form. Specifically, zeros are omitted, while non-zero k-mer counts are represented as pairs (column_number : value) with 1-based column indexing. Thus, rows may have different numbers of elements, e.g.:

kmer-length: k fraction: f   phages        φ1                  φ2                  ...                 φn
hosts                        total-kmers   |φ1|                |φ2|                ...                 |φn|
h1                           |h1|          i11 : |h1 ∩ φi11|   i12 : |h1 ∩ φi12|
h2                           |h2|          i21 : |h2 ∩ φi21|   i22 : |h2 ∩ φi22|   i23 : |h2 ∩ φi23|
h3                           |h3|
...                          ...           ...
hm                           |hm|          im1 : |hm ∩ φim1|

where:

  • k - k-mer length,
  • φ1, φ2, ..., φn - phage names,
  • h1, h2, ..., hm - host names,
  • |a| - number of k-mers in sample a,
  • |a ∩ b| - number of k-mers common for samples a and b.
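The sparse layout described above can be read back into a dictionary with a few lines of Python. This is a sketch under stated assumptions: fields are taken to be comma-separated, column number 1 is taken to refer to the first phage (φ1), and the function name parse_common_kmers is hypothetical. Check these assumptions against an actual common_kmers.csv before relying on it.

```python
import csv


def parse_common_kmers(lines):
    """Parse the sparse common k-mers table (hypothetical helper).

    `lines` is any iterable of text lines, e.g. an open file.
    Returns (phage_totals, host_totals, common), where common maps
    host name -> {phage name: number of common k-mers}.
    """
    rows = csv.reader(lines)
    header = next(rows)   # kmer-length / fraction, "phages", phage names
    totals = next(rows)   # "hosts", "total-kmers", k-mer count per phage
    phages = [name.strip() for name in header[2:] if name.strip()]
    phage_totals = dict(zip(phages, (int(t) for t in totals[2:] if t.strip())))

    host_totals, common = {}, {}
    for row in rows:
        if not row:
            continue
        host = row[0].strip()
        host_totals[host] = int(row[1])
        counts = {}
        for field in row[2:]:
            if not field.strip():
                continue
            column, value = field.split(":")
            # Assumed: 1-based column numbers index the phage list.
            counts[phages[int(column) - 1]] = int(value)
        common[host] = counts
    return phage_totals, host_totals, common
```

A host row with no pairs simply yields an empty dictionary, which mirrors the omitted zeros in the file.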

Host predictions

The predictions.csv file assigns each phage to its most likely host (i.e., the one having the most k-mers in common). If multiple potential hosts share the same number of common k-mers, all are reported. Each virus-host interaction is followed by a p-value and a p-value adjusted for multiple comparisons.

phage   host         common k-mers         p-value   adj. p-value
φ1      host(φ1)     |φ1 ∩ host(φ1)|       ...       ...
φ2      host(φ2)     |φ2 ∩ host(φ2)|       ...       ...
φ3      host1(φ3)    |φ3 ∩ host1(φ3)|      ...       ...
φ3      host2(φ3)    |φ3 ∩ host2(φ3)|      ...       ...
...     ...          ...                   ...       ...
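The assignment rule (most common k-mers wins, ties are all reported) can be sketched in Python as follows. This is an illustration of the rule only: the function name and the input layout (host name mapped to per-phage counts) are assumptions, and the p-values reported by PHIST are not reproduced here.

```python
def best_hosts(common):
    """Pick, for each phage, the host(s) sharing the most k-mers.

    `common` maps host name -> {phage name: number of common k-mers}
    (an assumed layout for this sketch). Returns a dictionary mapping
    phage name -> (list of best hosts, common k-mer count).
    """
    best = {}
    for host, counts in common.items():
        for phage, count in counts.items():
            hosts, top = best.get(phage, ([], 0))
            if count > top:
                best[phage] = ([host], count)   # new best host
            elif count == top and count > 0:
                hosts.append(host)              # tie: report both hosts
    return best
```

Phages without any common k-mers do not appear in the result of this sketch.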

Further analysis

The utils/matcher tool retrieves the list of all exact matches of length >= k for a given pair of phage and host FASTA sequences. The matches are reported with their coordinates in the viral and the corresponding bacterial genome (a reversed interval in the latter indicates a reverse complement match).

Usage

./utils/matcher [options] <virus> <host> <output>

Positional arguments:

  • virus virus FASTA file (gzipped or not),
  • host host FASTA file (gzipped or not),
  • output output CSV file

Options:

  • -k --k <kmer-length> k-mer length (default: 25, max: 30; may differ from the one used in the PHIST execution).

Example

./utils/matcher example/virus/NC_024123.fna example/host/NC_017548.fna shared_regions.csv
example/virus/NC_024123.fna,example/host/NC_017548.fna
NC_024123.1:52942-52968,NC_017548.1:1456873-1456847
NC_024123.1:52970-53009,NC_017548.1:1456845-1456806
NC_024123.1:53011-53102,NC_017548.1:1456804-1456713
NC_024123.1:53107-53147,NC_017548.1:1456708-1456668
NC_024123.1:53830-53854,NC_017548.1:2647971-2647947
NC_024123.1:54794-54827,NC_017548.1:679998-679965
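Output in the form shown above can be post-processed with a short Python sketch. It assumes that each match line has the form id:start-end,id:start-end, that coordinates are inclusive, and that any line not matching this pattern (such as the first line with the two file names) can be skipped. The names Match and parse_matches are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed line layout: <virus id>:<start>-<end>,<host id>:<start>-<end>
MATCH_LINE = re.compile(r"^(.+):(\d+)-(\d+),(.+):(\d+)-(\d+)$")


class Match(NamedTuple):
    virus_id: str
    virus_start: int
    virus_end: int
    host_id: str
    host_start: int
    host_end: int

    @property
    def length(self) -> int:
        # Assumes inclusive coordinates.
        return abs(self.virus_end - self.virus_start) + 1

    @property
    def is_reverse_complement(self) -> bool:
        # A reversed host interval marks a reverse complement match.
        return self.host_start > self.host_end


def parse_matches(lines):
    """Parse matcher output lines, skipping those that are not matches."""
    matches = []
    for line in lines:
        found = MATCH_LINE.match(line.strip())
        if found is None:
            continue  # e.g. the line listing the two FASTA file names
        v_id, v_start, v_end, h_id, h_start, h_end = found.groups()
        matches.append(Match(v_id, int(v_start), int(v_end),
                             h_id, int(h_start), int(h_end)))
    return matches
```

Under these assumptions, the first match in the example above spans 27 positions in both genomes and lies on the reverse strand of the host.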

Citing

Zielezinski A, Deorowicz S, Gudyś A. PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences, Bioinformatics. 2022, 38(5):1447-9. doi:10.1093/bioinformatics/btab837.