abhilash1910 / ClusterTransformer

Licence: other
Topic clustering library built on Transformer embeddings and cosine similarity metrics. Compatible with all BERT-base transformers from Hugging Face.

ClusterTransformer

A Topic Clustering Library made with Transformer Embeddings 🤖

This is a topic clustering library that builds transformer embeddings and analyses the cosine similarity between them. The topics are clustered either by k-means or agglomeratively, depending on the use case, and the embeddings are obtained by passing the text through any of the Transformers available in HuggingFace. The library can be found here.

Dependencies

PyTorch

Transformers

Usability

Installation is carried out using the pip command as follows:

pip install ClusterTransformer==0.1

To use it inside a Jupyter Notebook or a Python IDE:

import ClusterTransformer.ClusterTransformer as ct

The 'ClusterTransformer_test.py' file contains an example of using the Library in this context.

Usability Overview

The steps to operate this library are as follows:

Initialise the class: ClusterTransformer()

Provide the input list of sentences: in this case, the Quora similar-questions dataframe has been used for experimental purposes.

Declare hyperparameters:

  • batch_size: Batch size for running model inference.
  • max_seq_length: Maximum sequence length for the transformer; longer inputs are truncated.
  • convert_to_numpy: If enabled, returns the embeddings as a numpy array; otherwise keeps them as a torch.Tensor.
  • normalize_embeddings: If set to True, normalizes the embeddings.
  • neighborhood_min_size: Used by the neighborhood_detection method; the minimum number of entries in each cluster.
  • cutoff_threshold: Used by the neighborhood_detection method; the cutoff cosine similarity score for clustering the embeddings.
  • kmeans_max_iter: Hyperparameter for the kmeans_detection method; the maximum number of iterations for convergence.
  • kmeans_random_state: Hyperparameter for the kmeans_detection method; the random initial state.
  • kmeans_no_clusters: Hyperparameter for the kmeans_detection method; the number of clusters.
  • model_name: Transformer model name; any transformer from the HuggingFace pretrained library.
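To make normalize_embeddings and cutoff_threshold more concrete, here is a minimal, dependency-free sketch of cosine similarity on toy vectors. The helper names (l2_normalize, cosine_similarity) and the three-dimensional "embeddings" are assumptions made for illustration only; they are not part of the library, and real embeddings would come from the transformer model.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (what normalization is assumed to mean here)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, not real model output)
emb_a = [0.9, 0.1, 0.0]
emb_b = [0.8, 0.2, 0.0]
emb_c = [0.0, 0.1, 0.9]

cutoff_threshold = 0.95
sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)

# Two texts are candidates for the same cluster only if their similarity
# reaches the cutoff.
same_cluster_ab = sim_ab >= cutoff_threshold
same_cluster_ac = sim_ac >= cutoff_threshold

# After L2 normalization, the plain dot product equals the cosine similarity.
dot_normalized = sum(
    x * y for x, y in zip(l2_normalize(emb_a), l2_normalize(emb_b))
)
print(same_cluster_ab, same_cluster_ac)
```

A high cutoff such as 0.95 groups only near-duplicate sentences; lowering it produces larger, looser clusters.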

Call the methods:

  • ClusterTransformer.model_inference: Creates the embeddings by running inference through any Transformer model (BERT, Albert, Roberta, Distilbert etc.). Returns a torch.Tensor containing the embeddings.
  • ClusterTransformer.neighborhood_detection: Performs agglomerative clustering on the embeddings created by the model_inference method. Returns a dictionary.
  • ClusterTransformer.kmeans_detection: Performs k-means clustering on the embeddings created by the model_inference method. Returns a dictionary.
  • ClusterTransformer.convert_to_df: Converts the dictionary returned by the neighborhood_detection/kmeans_detection methods into a dataframe.
  • ClusterTransformer.plot_cluster: Used for simple plotting of the clusters for each text topic.
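As a rough intuition for how a threshold-based neighborhood method could turn embeddings into a dictionary of clusters, the sketch below groups toy vectors greedily. This is an assumption about the general technique, not the library's actual implementation: threshold_clusters and to_rows are hypothetical helpers, and the two-dimensional vectors stand in for real embeddings. Only the 'Cluster' and 'Text' column names mirror the dataframe used elsewhere in this README.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def threshold_clusters(texts, embeddings, cutoff_threshold, neighborhood_min_size):
    """Greedy threshold clustering (hypothetical helper, not the library's code).

    Returns a dict mapping cluster id -> list of texts.
    """
    n = len(texts)
    # Candidate neighborhoods: for each item, every item at least as
    # similar as the cutoff (the item itself included).
    neighborhoods = []
    for i in range(n):
        members = [
            j for j in range(n)
            if cosine_similarity(embeddings[i], embeddings[j]) >= cutoff_threshold
        ]
        if len(members) >= neighborhood_min_size:
            neighborhoods.append(members)
    # Largest neighborhoods first; each text joins at most one cluster.
    neighborhoods.sort(key=len, reverse=True)
    assigned, clusters = set(), {}
    for members in neighborhoods:
        free = [j for j in members if j not in assigned]
        if len(free) >= neighborhood_min_size:
            clusters[len(clusters)] = [texts[j] for j in free]
            assigned.update(free)
    return clusters

def to_rows(clusters):
    """Flatten the dict into (Cluster, Text) rows, the shape a dataframe would take."""
    return [(cid, text) for cid, members in clusters.items() for text in members]

# Toy data: three texts pointing one way, two pointing another (made-up vectors)
texts = ["a1", "a2", "a3", "b1", "b2"]
embeddings = [[1.0, 0.0], [0.98, 0.05], [0.95, 0.1], [0.0, 1.0], [0.05, 0.99]]

clusters = threshold_clusters(texts, embeddings, cutoff_threshold=0.95,
                              neighborhood_min_size=2)
rows = to_rows(clusters)
print(clusters)
```

Texts whose neighborhood is smaller than neighborhood_min_size are left out of the result in this sketch, which is why a strict cutoff can leave part of the corpus unclustered.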

Code Sample

The code below covers all the steps required to create the clusters. The 'compute_topics' method has the following steps:

  • Instantiate the object of the ClusterTransformer
  • Specify the transformer name from pretrained transformers
  • Specify the hyperparameters
  • Get the embeddings from 'model_inference' method
  • For agglomerative neighborhood detection use 'neighborhood_detection' method
  • For kmeans detection, use the 'kmeans_detection' method
  • For converting the dictionary to a dataframe use the 'convert_to_df' method
  • For optional plotting of the clusters with respect to the corpus samples, use the 'plot_cluster' method

%%time
import ClusterTransformer.ClusterTransformer as cluster_transformer

def compute_topics(transformer_name):
    # NOTE: merged_set is assumed to be a list of sentences defined earlier
    # in the notebook (here, questions from the Quora similar-questions data).

    # Instantiate the object
    ct = cluster_transformer.ClusterTransformer()
    # Transformer model used for inference
    model_name = transformer_name

    # Hyperparameters for model inference
    batch_size = 500
    max_seq_length = 64
    convert_to_numpy = False
    normalize_embeddings = False

    # Hyperparameters for agglomerative clustering
    neighborhood_min_size = 3
    cutoff_threshold = 0.95
    # Hyperparameters for k-means clustering
    kmeans_max_iter = 100
    kmeans_random_state = 42
    kmeans_no_clusters = 8

    # Use the first 200 sentences as the input list
    sub_merged_sent = merged_set[:200]
    # Embeddings from the chosen transformer model
    embeddings = ct.model_inference(sub_merged_sent, batch_size, model_name, max_seq_length, normalize_embeddings, convert_to_numpy)
    # Hierarchical agglomerative detection
    output_dict = ct.neighborhood_detection(sub_merged_sent, embeddings, cutoff_threshold, neighborhood_min_size)
    # K-means detection
    output_kmeans_dict = ct.kmeans_detection(sub_merged_sent, embeddings, kmeans_no_clusters, kmeans_max_iter, kmeans_random_state)
    # Agglomerative clustering results as a dataframe
    neighborhood_detection_df = ct.convert_to_df(output_dict)
    # K-means clustering results as a dataframe
    kmeans_df = ct.convert_to_df(output_kmeans_dict)
    return neighborhood_detection_df, kmeans_df

Calling the driver code:

%%time
import matplotlib.pyplot as plt
import numpy as np
n_df,k_df=compute_topics('bert-large-uncased')
kg_df=k_df.groupby('Cluster').agg({'Text':'count'}).reset_index()
ng_df=n_df.groupby('Cluster').agg({'Text':'count'}).reset_index()

#Plotting
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(15,5))
rng = np.random.RandomState(0)
s=1000*rng.rand(len(kg_df['Text']))
s1=1000*rng.rand(len(ng_df['Text']))
ax1.scatter(kg_df['Cluster'],kg_df['Text'],s=s,c=kg_df['Cluster'],alpha=0.3)
ax1.set_title('Kmeans clustering')
ax1.set_xlabel('No of clusters')
ax1.set_ylabel('No of topics')
ax2.scatter(ng_df['Cluster'],ng_df['Text'],s=s1,c=ng_df['Cluster'],alpha=0.3)
ax2.set_title('Agglomerative clustering')
ax2.set_xlabel('No of clusters')
ax2.set_ylabel('No of topics')
plt.show()
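For readers unfamiliar with the k-means hyperparameters used above (number of clusters, maximum iterations, random state), the following is a small pure-Python sketch of Lloyd's k-means on toy two-dimensional points. It is an illustration under stated assumptions, not the code behind kmeans_detection: the kmeans and squared_distance functions are hypothetical, and real inputs would be high-dimensional transformer embeddings.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, no_clusters, max_iter, random_state):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = random.Random(random_state)  # fixed seed gives a reproducible start
    centroids = [list(p) for p in rng.sample(points, no_clusters)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(no_clusters),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged before reaching max_iter
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for c in range(no_clusters):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if the cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Two well-separated groups of made-up points
points = [[0.0, 0.2], [0.3, 0.0], [0.1, 0.1], [5.0, 5.3], [5.4, 5.0]]
labels = kmeans(points, no_clusters=2, max_iter=100, random_state=42)
print(labels)
```

The random state only fixes which points seed the centroids, so rerunning with the same value reproduces the same clusters; the cluster ids themselves carry no meaning beyond grouping.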

Samples

Colab-Demo

Kaggle Notebook

Quantum Stat Repository

Images

Cluster Images (created with Facebook BART)

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT
