All Projects → pirovc → multitax

pirovc / multitax

Licence: MIT license
Python package to obtain, parse and explore biological taxonomies (GTDB, NCBI, Silva, Greengenes, OTT)

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to multitax

ncbitax2lin
🐞 Convert NCBI taxonomy dump into lineages
Stars: ✭ 113 (+413.64%)
Mutual labels:  taxonomy, ncbi
ncbi acc2gtdb acc
Mapping NCBI Genbank accession to GTDB accession
Stars: ✭ 14 (-36.36%)
Mutual labels:  ncbi, gtdb
vue-virtualised
Blazing fast scrolling and updating for any amount of list and hierarchical data.
Stars: ✭ 18 (-18.18%)
Mutual labels:  tree
immutable-gametree
An immutable game tree data type.
Stars: ✭ 18 (-18.18%)
Mutual labels:  tree
LoveTree
🌴爱情树,将相爱的时刻永远珍藏 (微信,QQ可完美查看)https://ajlovechina.github.io/LoveTree/
Stars: ✭ 295 (+1240.91%)
Mutual labels:  tree
vscode-gcode-syntax
G Code Language Extension for Visual Studio Code. Turn VSCode into a fully capable G-Code editor, including language support & more.
Stars: ✭ 59 (+168.18%)
Mutual labels:  tree
service-tree
[ABANDONED] A tree that stores services in its node for a given key, and allows traversing them.
Stars: ✭ 33 (+50%)
Mutual labels:  tree
mcts
🌳 Domain independent implementation of Monte Carlo Tree Search methods.
Stars: ✭ 15 (-31.82%)
Mutual labels:  tree
svelte-mindmap
Svelte component for MindMap
Stars: ✭ 122 (+454.55%)
Mutual labels:  tree
android-thinkmap-treeview
Tree View; Mind map; Think map; tree map; custom view; 自定义;关系图;树状图;思维导图;组织机构图;层次图
Stars: ✭ 314 (+1327.27%)
Mutual labels:  tree
pytaxize
python port of taxize (taxonomy toolbelt) for R
Stars: ✭ 29 (+31.82%)
Mutual labels:  taxonomy
flora-on-server
Work collaboratively in IUCN Red List assessments and manage online biodiversity databases, including a graph-based taxonomy system, species traits and habitats, and species occurrences.
Stars: ✭ 13 (-40.91%)
Mutual labels:  taxonomy
gtree
Output tree🌳 or Make directories📁 from #Markdown or Programmatically. Provide CLI, Golang library and Web (using #Wasm ).
Stars: ✭ 88 (+300%)
Mutual labels:  tree
kvstore
KVStore is a simple Key-Value Store based on B+Tree (disk & memory) for Java
Stars: ✭ 88 (+300%)
Mutual labels:  tree
bio
A lightweight and high-performance bioinformatics package in Golang
Stars: ✭ 80 (+263.64%)
Mutual labels:  taxonomy
Raskoh
Raskoh - easy and object oriented way to interact with WordPress custom post types and taxonomies.
Stars: ✭ 12 (-45.45%)
Mutual labels:  taxonomy
ztree-for-react
jQuery zTreeV3.x 插件react封装
Stars: ✭ 22 (+0%)
Mutual labels:  tree
github-pr-diff-tree
🌲 This action provide a comment that displays the diff of the pull request in a tree format.
Stars: ✭ 31 (+40.91%)
Mutual labels:  tree
merkle-patricia-tree
☔️🌲 A fast, in-memory optimized merkle patricia tree
Stars: ✭ 22 (+0%)
Mutual labels:  tree
DrawRacket4Me
DrawRacket4Me draws trees and graphs from your code, making it easier to check if the structure is what you wanted.
Stars: ✭ 43 (+95.45%)
Mutual labels:  tree

MultiTax Build Status codecov install with bioconda

Python package to obtain, parse and explore biological taxonomies

Description

MultiTax is a Python package that provides a common and generalized set of functions to download, parse, filter and explore multiple biological taxonomies (GTDB, NCBI, Silva, Greengenes, Open Tree taxonomy) and custom formatted taxonomies. Main goals are:

  • Be fast, intuitive, generalized and easy to use
  • Explore different taxonomies with same set of commands
  • Enable integration and compatibility with multiple taxonomies
  • Translate and convert taxonomies (not yet implemented)

MultiTax does not link sequence identifiers to taxonomic nodes, it just handles the taxonomy alone. Some kind of integration to work with sequence ids is planned, but not yet implemented.

Installation

pip

pip install multitax

conda

conda install -c bioconda multitax

local

git clone https://github.com/pirovc/multitax.git
cd multitax
python setup.py install --record files.txt

Documentation

https://pirovc.github.io/multitax/

Basic Example with GTDB

from multitax import GtdbTx

# Download and parse taxonomy
tax = GtdbTx()

# Get lineage for the Escherichia genus  
tax.lineage("g__Escherichia")
# ['1', 'd__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Enterobacterales', 'f__Enterobacteriaceae', 'g__Escherichia']

Further Examples

List of all functions

Obtain/load/parse taxonomy

from multitax import GtdbTx  # or NcbiTx, SilvaTx, ...

# Download and parse in memory
tax = GtdbTx()

# Parse local files
tax = GtdbTx(files=["bac120_taxonomy.tsv.gz", "ar122_taxonomy.tsv.gz"])

# Download, write and parse files
tax = GtdbTx(output_prefix="my/path/") 

# Download and filter only specific branch
tax = GtdbTx(root_node="p__Proteobacteria") 

Explore

# List parent node
tax.parent("g__Escherichia")
# f__Enterobacteriaceae

# List children nodes
tax.children("g__Escherichia")
# ['s__Escherichia flexneri', 's__Escherichia coli', 's__Escherichia dysenteriae', 's__Escherichia coli_D', 's__Escherichia albertii', 's__Escherichia marmotae', 's__Escherichia coli_C', 's__Escherichia sp005843885', 's__Escherichia sp000208585', 's__Escherichia fergusonii', 's__Escherichia sp001660175', 's__Escherichia sp004211955', 's__Escherichia sp002965065']

# Get parent node from a defined rank
tax.parent_rank("s__Lentisphaera araneosa", "phylum")
# 'p__Verrucomicrobiota'

# Get the closest parent from a list of ranks
tax.closest_parent("s__Lentisphaera araneosa", ranks=["phylum", "class", "family"])
# 'f__Lentisphaeraceae'

# Get lineage
tax.lineage("g__Escherichia")
# ['1', 'd__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Enterobacterales', 'f__Enterobacteriaceae', 'g__Escherichia']

# Get lineage of names
tax.name_lineage("g__Escherichia")
# ['root', 'Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacterales', 'Enterobacteriaceae', 'Escherichia']

# Get lineage of ranks
tax.rank_lineage("g__Escherichia")
# ['root', 'domain', 'phylum', 'class', 'order', 'family', 'genus']

# Get lineage with defined ranks and root node
tax.lineage("g__Escherichia", root_node="p__Proteobacteria", ranks=["phylum", "class", "family", "genus"])
# ['p__Proteobacteria', 'c__Gammaproteobacteria', 'f__Enterobacteriaceae', 'g__Escherichia']

# Build lineages in memory for faster access
tax.build_lineages()

# Get leaf nodes
tax.leaves("p__Hadarchaeota")
# ['s__DG-33 sp004375695', 's__DG-33 sp001515185', 's__Hadarchaeum yellowstonense', 's__B75-G9 sp003661465', 's__WYZ-LMO6 sp004347925', 's__B88-G9 sp003660555']

# Search names and filter by rank
tax.search_name("Escherichia", exact=False, rank="genus")
# ['g__Escherichia', 'g__Escherichia_C']

# Show stats of loaded tax
tax.stats()
#{'leaves': 31910,
# 'names': 45503,
# 'nodes': 45503,
# 'ranked_leaves': Counter({'species': 31910}),
# 'ranked_nodes': Counter({'species': 31910,
#                          'genus': 9428,
#                          'family': 2600,
#                          'order': 1034,
#                          'class': 379,
#                          'phylum': 149,
#                          'domain': 2,
#                          'root': 1}),
# 'ranks': 45503}

# Filter ancestors (desc=True for descendants)
tax.filter(['g__Escherichia', 's__Pseudomonas aeruginosa'])
tax.stats()
#{'leaves': 2,
# 'names': 11,
# 'nodes': 11,
# 'ranked_leaves': Counter({'genus': 1, 'species': 1}),
# 'ranked_nodes': Counter({'genus': 2,
#                          'family': 2,
#                          'order': 2,
#                          'class': 1,
#                          'phylum': 1,
#                          'domain': 1,
#                          'species': 1,
#                          'root': 1}),
# 'ranks': 11}

# Write tax to file
tax.write("custom_tax.tsv", cols=["node", "rank", "name_lineage"])

#g__Escherichia             genus    root|Bacteria|Proteobacteria|Gammaproteobacteria|Ent#erobacterales|Enterobacteriaceae|Escherichia
#f__Enterobacteriaceae      family   root|Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae
#o__Enterobacterales        order    root|Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales
#c__Gammaproteobacteria     class    root|Bacteria|Proteobacteria|Gammaproteobacteria
#...

The same goes for the other taxonomies

# NCBI
from multitax import NcbiTx
tax = NcbiTx()
tax.lineage("561")    
# ['1', '131567', '2', '1224', '1236', '91347', '543', '561']

# Silva
from multitax import SilvaTx
tax = SilvaTx()
tax.lineage("46463")    
# ['1', '3', '2375', '3303', '46449', '46454', '46463']

# Open Tree taxonomy
from multitax import OttTx
tax = OttTx()
tax.lineage("474503")
# ['805080', '93302', '844192', '248067', '822744', '768012', '424023', '474503']

# GreenGenes
from multitax import GreengenesTx
tax = GreengenesTx()
tax.lineage("f__Enterobacteriaceae")
# ['1', 'k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Enterobacteriales', 'f__Enterobacteriaceae']

LCA integration

Using pylca: https://github.com/pirovc/pylca

conda install -c bioconda pylca
from pylca.pylca import LCA
from multitax import GtdbTx

# Download and parse GTDB Taxonomy
tax = GtdbTx()

# Build LCA structure
L = LCA(tax._nodes)

# Get LCA
L("s__Escherichia dysenteriae", "s__Pseudomonas aeruginosa")
# 'c__Gammaproteobacteria'

General information

  • Taxonomies are parsed into nodes. Each node is annotated with a name and a rank.
  • Some taxonomies have a numeric taxonomic identifier (e.g. NCBI) and other use the rank + name as an identifier (e.g. GTDB). In MultiTax all identifiers are treated as strings.
  • A single root node is defined by default for each taxonomy (or 1 when not defined). This can be changed with root_node when loading the taxonomy (as well as annotations root_parent, root_name, root_rank). If the root_node already exists, the tree will be filtered.
  • Standard values for unknown/undefined nodes can be configured with undefined_node,undefined_name and undefined_rank. Those are default values returned when nodes/names/ranks are not found.
  • Taxonomy files are automatically download or can be loaded from disk (files parameter). Alternative urls can be provided. When downloaded, files are handled in memory. It is possible to save the downloaded file to disk with output_prefix.

Translation between taxonomies

Not yet implemented. The goal here is to map different taxonomies if the linkage data is available. That's what I think will be possible.

from/to NCBI GTDB SILVA OTT GG
NCBI - part part part no
GTDB full - no no no
SILVA full no - part no
OTT part no part - no
GG no no no no -

Further ideas

  • Add/remove/update nodes
  • Conversion between taxonomies (write on specific files/format)

Similar projects

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].