
tanghaibao / treecut

Licence: other
Find nodes in hierarchical clustering that are statistically significant

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to treecut

Keras deep clustering
How to do Unsupervised Clustering with Keras
Stars: ✭ 202 (+676.92%)
Mutual labels:  clustering, unsupervised-learning
LabelPropagation
A NetworkX implementation of Label Propagation from a "Near Linear Time Algorithm to Detect Community Structures in Large-Scale Networks" (Physical Review E 2008).
Stars: ✭ 101 (+288.46%)
Mutual labels:  clustering, unsupervised-learning
Gemsec
The TensorFlow reference implementation of 'GEMSEC: Graph Embedding with Self Clustering' (ASONAM 2019).
Stars: ✭ 210 (+707.69%)
Mutual labels:  clustering, unsupervised-learning
Text Summarizer
Python Framework for Extractive Text Summarization
Stars: ✭ 96 (+269.23%)
Mutual labels:  clustering, unsupervised-learning
T-CorEx
Implementation of linear CorEx and temporal CorEx.
Stars: ✭ 31 (+19.23%)
Mutual labels:  clustering, unsupervised-learning
Awesome Community Detection
A curated list of community detection research papers with implementations.
Stars: ✭ 1,874 (+7107.69%)
Mutual labels:  clustering, unsupervised-learning
Revisiting-Contrastive-SSL
Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations. [NeurIPS 2021]
Stars: ✭ 81 (+211.54%)
Mutual labels:  clustering, unsupervised-learning
Unsupervised Classification
SCAN: Learning to Classify Images without Labels (ECCV 2020), incl. SimCLR.
Stars: ✭ 605 (+2226.92%)
Mutual labels:  clustering, unsupervised-learning
dbscan
DBSCAN Clustering Algorithm C# Implementation
Stars: ✭ 38 (+46.15%)
Mutual labels:  clustering, unsupervised-learning
acoustic-keylogger
Pipeline of a keylogging attack using just an audio signal and unsupervised learning.
Stars: ✭ 80 (+207.69%)
Mutual labels:  clustering, unsupervised-learning
Self Supervised Learning Overview
📜 Self-Supervised Learning from Images: Up-to-date reading list.
Stars: ✭ 73 (+180.77%)
Mutual labels:  clustering, unsupervised-learning
LinearCorex
Fast, linear version of CorEx for covariance estimation, dimensionality reduction, and subspace clustering with very under-sampled, high-dimensional data
Stars: ✭ 39 (+50%)
Mutual labels:  clustering, unsupervised-learning
Bagofconcepts
Python implementation of bag-of-concepts
Stars: ✭ 18 (-30.77%)
Mutual labels:  clustering, unsupervised-learning
Danmf
A sparsity aware implementation of "Deep Autoencoder-like Nonnegative Matrix Factorization for Community Detection" (CIKM 2018).
Stars: ✭ 161 (+519.23%)
Mutual labels:  clustering, unsupervised-learning
Minisom
🔴 MiniSom is a minimalistic implementation of the Self Organizing Maps
Stars: ✭ 801 (+2980.77%)
Mutual labels:  clustering, unsupervised-learning
Spectralcluster
Python re-implementation of the spectral clustering algorithm in the paper "Speaker Diarization with LSTM"
Stars: ✭ 220 (+746.15%)
Mutual labels:  clustering, unsupervised-learning
dti-clustering
(NeurIPS 2020 oral) Code for "Deep Transformation-Invariant Clustering" paper
Stars: ✭ 60 (+130.77%)
Mutual labels:  clustering, unsupervised-learning
L2c
Learning to Cluster. A deep clustering strategy.
Stars: ✭ 262 (+907.69%)
Mutual labels:  clustering, unsupervised-learning
M-NMF
An implementation of "Community Preserving Network Embedding" (AAAI 2017)
Stars: ✭ 119 (+357.69%)
Mutual labels:  clustering, unsupervised-learning
kmeans
A simple implementation of K-means (and Bisecting K-means) clustering algorithm in Python
Stars: ✭ 18 (-30.77%)
Mutual labels:  clustering, unsupervised-learning

TREECUT: Dynamic tree cut algorithm


Author: Haibao Tang (tanghaibao), Jingping Li (Jingping)
Email: [email protected]
License: BSD

Description

Hierarchical clustering is an important tool for mining useful relationships among multivariate biological data. However, there is no obvious way to define a set of useful, non-overlapping groups from the identified hierarchy. Most efforts have focused on trying different cut-off values, evaluating the relative strengths of intra- versus inter-group variances, and then heuristically determining a "good" cutoff. This study introduces a more dynamic approach that extracts clades that are significantly enriched or different from other clades. Incorporating phylogenetic information removes the false positives observed in a conventional analysis and thus improves the prediction of trait association.

The algorithm takes two inputs: a tree model and a mapping of values for all the terminal branches. Briefly, the algorithm performs independent statistical tests on all the internal branches and calculates a P-value for each node. At the exploratory stage, the statistical tests are: 1) for quantitative values, test the difference between the two groups separated by each node (Student's t-test); 2) for categorical values, test the association of a particular category with the descendants of each internal node (Fisher's exact test).
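
As a rough illustration, both tests are available directly in scipy (this is a minimal sketch, not the internals of treecut.py; the group splits below are made up):

# Minimal sketch of the two statistical tests via scipy; the values are made up.
from scipy import stats

# Quantitative trait: compare leaf values inside a clade against the rest
# (Student's t-test).
inside = [57.2, 66.13, 61.5]
outside = [40.1, 38.7, 45.2, 42.0]
t_stat, p_quant = stats.ttest_ind(inside, outside)

# Categorical trait: test enrichment of one category under a node with a
# 2x2 contingency table (Fisher's exact test):
# [[in-clade with term, in-clade without term],
#  [out-of-clade with term, out-of-clade without term]]
table = [[8, 2], [10, 40]]
odds_ratio, p_cat = stats.fisher_exact(table)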

The candidate nodes are determined using the following rule: the P-value of a candidate node v has to be the smallest along every root-to-leaf path that passes through v. In other words, the group rooted at node v should carry the strongest association, thus avoiding redundant clades.
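
A minimal sketch of this rule, using the ete2 API that treecut already depends on (the pvalue attribute is hypothetical here; assume the statistical tests above attached it to each internal node):

# Sketch of the candidate-node rule; "pvalue" is a hypothetical node
# attribute attached elsewhere by the statistical tests.
from ete2 import Tree

def candidate_nodes(tree):
    candidates = []
    for node in tree.traverse():
        if node.is_leaf() or not hasattr(node, "pvalue"):
            continue
        p = node.pvalue
        # The node's P-value must be the smallest on every root-to-leaf path
        # through it, i.e. no ancestor and no descendant scores lower.
        above = all(getattr(a, "pvalue", 1.0) >= p for a in node.get_ancestors())
        below = all(getattr(d, "pvalue", 1.0) >= p for d in node.get_descendants())
        if above and below:
            candidates.append(node)
    return candidates

# Usage sketch: tree = Tree("data/tree.nwk"); modules = candidate_nodes(tree)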

Server version

A server version of the TREECUT software can be found here: http://chibba.agtec.uga.edu/duplication/cut/

Installation

  • Python version >= 2.6
  • scipy for the t-test and Fisher's exact test
  • ete2 for parsing the tree structure

Install the Python dependencies with:

pip install scipy ete2

Usage

Take a look at examples in the data/ folder: treefile and listfile.

The treefile should be a Newick-formatted file (typically the output of phylogenetic reconstruction software, e.g. PHYLIP or MEGA).

The listfile should contain the quantitative value for each taxon (comma-separated). Make sure that the taxon names match between the treefile and the listfile:

# continuous example
IS13,57.2
IS35,66.13

If the data type is discrete, separate the classes with semicolons. For example:

# discrete example
AT1G02150,GO:0009507;GO:0005488
AT1G02160,GO:0005575;GO:0003674;GO:0008150

Note that lines starting with # are comments and will be ignored.
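
For reference, a listfile in this layout can be read with a few lines of Python (a minimal sketch; treecut.py has its own parser, and read_listfile is just an illustrative name):

# Sketch of a listfile reader for the comma/semicolon layout shown above.
def read_listfile(path, discrete=False):
    values = {}
    for line in open(path):
        line = line.strip()
        if not line or line.startswith("#"):  # skip comments and blank lines
            continue
        taxon, value = line.split(",", 1)
        # Discrete: semicolon-separated classes; continuous: a single float.
        values[taxon] = value.split(";") if discrete else float(value)
    return values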

To run the software:

python treecut.py data/tree.nwk data/continuous.csv tree.pdf

A summary of the extracted modules will be written to stdout. Each row contains a subclade that shows either a significantly high or a significantly low phenotypic value. Furthermore, a visualization is written to tree.pdf (supported image formats include svg, png, pdf, jpg, etc.). The modules are highlighted in green (low-value modules) and red (high-value modules).

Figure: tree-value mapping

Cookbook

There are several immediate applications of TREECUT. Below we showcase two examples, but there are more.

Extract taxonomic groups with high/low phenotype values

See the example in the data/ folder. This is flowering time data for a sorghum diversity panel. flowering.nwk is a phylogenetic tree of the sorghum accessions used in the study. flowering.assoc maps each accession to its trait value (in this case, the number of days until flowering). To run:

python treecut.py data/flowering.nwk data/flowering.assoc

If you instead want to treat the flowering data as discrete values, say "high" versus "low", you can add the --discrete option:

python treecut.py data/flowering.nwk data/flowering_discrete.assoc --discrete flowering_discrete.png

The significantly different clades (e.g. those with extreme trait values) will be written to the screen.

Extract co-expressed genes with functional enrichment

In this example, I used Eisen's CLUSTER software (here) to process a series of Arabidopsis microarrays from AtGenExpress. Running CLUSTER produced two files - microarray.cdt and microarray.gtr. The .gtr file contains the hierarchical tree structure, but it needs to be converted to the .nwk format before treecut.py can process it.
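
For the curious, the conversion done by scripts/eisen_to_newick.py looks roughly like the sketch below, assuming the standard Cluster 3.0 layout where each .gtr row merges two items (GENE or NODE ids) into a new NODE id, and the .cdt file lists GENE ids alongside their display names:

# Rough sketch of a .gtr -> Newick conversion under the assumptions above;
# the real converter is scripts/eisen_to_newick.py.
import csv

def gtr_to_newick(gtr_file, cdt_file):
    # Map GENE ids to readable names from the .cdt file.
    names = {}
    for row in csv.reader(open(cdt_file), delimiter="\t"):
        if row and row[0].startswith("GENE"):
            names[row[0]] = row[1]

    # Each .gtr row: NODE id, left child, right child, similarity score.
    subtrees, last = {}, None
    for row in csv.reader(open(gtr_file), delimiter="\t"):
        node, left, right = row[0], row[1], row[2]
        l = subtrees.pop(left, names.get(left, left))
        r = subtrees.pop(right, names.get(right, right))
        subtrees[node] = "(%s,%s)" % (l, r)
        last = node

    return subtrees[last] + ";"  # the last merge is the root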

Take a look at microarray.assoc, which contains the mapping from Arabidopsis genes to GO terms, based on information downloaded from the Gene Ontology website. Note that a gene can have multiple GO terms associated with it. Here is the script I used to create microarray.assoc:

python scripts/parse_tair_go.py
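
The target .assoc format itself is easy to produce: collect the GO terms for each gene and join them with semicolons. The sketch below assumes a simple two-column (gene, GO term) tab-separated input, rather than the exact TAIR download that scripts/parse_tair_go.py handles:

# Sketch: build a gene -> "GO;GO;..." .assoc file from (gene, GO term) pairs.
from collections import defaultdict

def write_assoc(pairs_file, assoc_file):
    go_terms = defaultdict(list)
    for line in open(pairs_file):
        gene, term = line.rstrip().split("\t")[:2]
        go_terms[gene].append(term)
    out = open(assoc_file, "w")
    for gene, terms in sorted(go_terms.items()):
        out.write("%s,%s\n" % (gene, ";".join(terms)))
    out.close()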

Once everything is set, just run treecut.py as usual (make sure to turn on the --discrete option):

python scripts/eisen_to_newick.py data/microarray.gtr data/microarray.cdt data/microarray.nwk
python treecut.py data/microarray.nwk data/microarray.assoc --discrete

The clades that are significantly enriched in certain GO terms will be written to the screen.

Reference

Tang et al. TREECUT: algorithm for extracting significant modules from hierarchical clustering
