
squaresLab / VarCLR

License: MIT
VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

Programming Languages

python

Projects that are alternatives to or similar to VarCLR

TCE
This repository contains the code implementation used in the paper Temporally Coherent Embeddings for Self-Supervised Video Representation Learning (TCE).
Stars: ✭ 51 (+70%)
Mutual labels:  embeddings, contrastive-learning
datastories-semeval2017-task6
Deep-learning model presented in "DataStories at SemEval-2017 Task 6: Siamese LSTM with Attention for Humorous Text Comparison".
Stars: ✭ 20 (-33.33%)
Mutual labels:  embeddings
RadiologyReportEmbedding
Intelligent Word Embeddings of Free-Text Radiology Reports
Stars: ✭ 22 (-26.67%)
Mutual labels:  embeddings
GCL
List of Publications in Graph Contrastive Learning
Stars: ✭ 25 (-16.67%)
Mutual labels:  contrastive-learning
dpar
Neural network transition-based dependency parser (in Rust)
Stars: ✭ 41 (+36.67%)
Mutual labels:  embeddings
mirror-bert
[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.
Stars: ✭ 56 (+86.67%)
Mutual labels:  contrastive-learning
info-nce-pytorch
PyTorch implementation of the InfoNCE loss for self-supervised learning.
Stars: ✭ 160 (+433.33%)
Mutual labels:  contrastive-learning
deep-char-cnn-lstm
Deep Character CNN LSTM Encoder with Classification and Similarity Models
Stars: ✭ 20 (-33.33%)
Mutual labels:  embeddings
Archived-SANSA-ML
SANSA Machine Learning Layer
Stars: ✭ 39 (+30%)
Mutual labels:  embeddings
HEAPUtil
Code for the RA-L (IROS) 2021 paper "A Hierarchical Dual Model of Environment- and Place-Specific Utility for Visual Place Recognition"
Stars: ✭ 46 (+53.33%)
Mutual labels:  contrastive-learning
DiGCL
The PyTorch implementation of Directed Graph Contrastive Learning (DiGCL), NeurIPS-2021
Stars: ✭ 27 (-10%)
Mutual labels:  contrastive-learning
info-retrieval
Information Retrieval in High Dimensional Data (class deliverables)
Stars: ✭ 33 (+10%)
Mutual labels:  embeddings
codesnippetsearch
Neural bag of words code search implementation using PyTorch and data from the CodeSearchNet project.
Stars: ✭ 67 (+123.33%)
Mutual labels:  embeddings
awesome-graph-self-supervised-learning-based-recommendation
A curated list of awesome graph & self-supervised-learning-based recommendation.
Stars: ✭ 37 (+23.33%)
Mutual labels:  contrastive-learning
I-CTF-FWHIBBIT
Challenges source code
Stars: ✭ 41 (+36.67%)
Mutual labels:  source-code
android-source-codes
⚙️ Code analysis of common Android projects and components.
Stars: ✭ 59 (+96.67%)
Mutual labels:  source-code
CVC
CVC: Contrastive Learning for Non-parallel Voice Conversion (INTERSPEECH 2021, in PyTorch)
Stars: ✭ 45 (+50%)
Mutual labels:  contrastive-learning
SoCo
[NeurIPS 2021 Spotlight] Aligning Pretraining for Detection via Object-Level Contrastive Learning
Stars: ✭ 125 (+316.67%)
Mutual labels:  contrastive-learning
Deep-Learning-Experiments-implemented-using-Google-Colab
Colab Compatible FastAI notebooks for NLP and Computer Vision Datasets
Stars: ✭ 16 (-46.67%)
Mutual labels:  embeddings
phpBolt
Best php encoder - free | Encrypt php source code
Stars: ✭ 113 (+276.67%)
Mutual labels:  source-code

VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

New: Our paper has been accepted at ICSE 2022. A preprint is available on arXiv!

This repository contains code and pre-trained models for VarCLR, a contrastive-learning-based approach to learning semantic representations of variable names. VarCLR effectively captures variable similarity and achieves state-of-the-art results on the IdBench benchmark (ICSE 2021).

Step 0: Install

Run the following from the root of a local clone of this repository:

pip install -e .

Step 1: Load a Pre-trained VarCLR Model

from varclr.models.model import Encoder
model = Encoder.from_pretrained("varclr-codebert")
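
The same loader is used later (Step 4) for the original CodeBERT baseline. Presumably the other variants named in the pre-training commands further below load the same way, but whether they are published under those identifiers is an assumption here, not something stated in this README:

codebert = Encoder.from_pretrained("codebert")  # baseline encoder, also used in Step 4 below
# Assumption: the variants below may not be available under these names.
# model_avg = Encoder.from_pretrained("varclr-avg")
# model_lstm = Encoder.from_pretrained("varclr-lstm")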

Step 2: VarCLR Variable Embeddings

Get the embedding of a single variable

emb = model.encode("squareslab")
print(emb.shape)
# torch.Size([1, 768])

Get embeddings of a list of variables (supports batching)

emb = model.encode(["squareslab", "strudel"])
print(emb.shape)
# torch.Size([2, 768])
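
The embeddings are ordinary PyTorch tensors, so you can work with them directly. As a sketch, the similarity scores in the next step presumably correspond to cosine similarity between embeddings; that correspondence is an assumption here, not stated above:

import torch.nn.functional as F

emb = model.encode(["squareslab", "strudel"])          # shape: [2, 768]
cos = F.cosine_similarity(emb[0:1], emb[1:2]).item()   # scalar cosine similarity of the two rows
print(cos)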

Step 3: Get VarCLR Similarity Scores

Get similarity scores of N variable pairs

print(model.score("squareslab", "strudel"))
# [0.42812108993530273]
print(model.score(["squareslab", "average", "max", "max"], ["strudel", "mean", "min", "maximum"]))
# [0.42812108993530273, 0.8849745988845825, 0.8035818338394165, 0.889922022819519]

Get pairwise (N * M) similarity scores from two lists of variables

variable_list = ["squareslab", "strudel", "neulab"]
print(model.cross_score("squareslab", variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832]]
print(model.cross_score(variable_list, variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832],
#  [0.4281214475631714, 1.0000004768371582, 0.549992561340332],
#  [0.7207341194152832, 0.549992561340332, 1.000000238418579]]
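
Because cross_score returns a full N * M score matrix, a simple nearest-neighbor lookup over a candidate pool only needs an argmax per row. A minimal sketch built on the calls shown above (the candidate names are made up for illustration):

queries = ["squareslab", "maximum"]
candidates = ["strudel", "neulab", "max", "min"]  # illustrative candidate pool
scores = model.cross_score(queries, candidates)   # nested list: one row of M scores per query
for query, row in zip(queries, scores):
    best = max(range(len(candidates)), key=lambda j: row[j])
    print(query, "->", candidates[best], row[best])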

Step 4: Reproduce IdBench Benchmark Results

Load the IdBench benchmarks

from varclr.benchmarks import Benchmark

# Similarity on IdBench-Medium
b1 = Benchmark.build("idbench", variant="medium", metric="similarity")
# Relatedness on IdBench-Large
b2 = Benchmark.build("idbench", variant="large", metric="relatedness")

Compute VarCLR scores and evaluate

id1_list, id2_list = b1.get_inputs()
predicted = model.score(id1_list, id2_list)
print(b1.evaluate(predicted))
# {'spearmanr': 0.5248567181503295, 'pearsonr': 0.5249843473193132}

print(b2.evaluate(model.score(*b2.get_inputs())))
# {'spearmanr': 0.8012168379981921, 'pearsonr': 0.8021791703187449}
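
The dictionary returned by evaluate reports Spearman and Pearson correlations between the predicted scores and IdBench's human ratings. Purely to illustrate the metrics themselves (not the benchmark's internals), here is a sketch with made-up gold ratings:

from scipy.stats import pearsonr, spearmanr

predicted = model.score(["squareslab", "average", "max", "max"],
                        ["strudel", "mean", "min", "maximum"])
gold = [0.4, 0.9, 0.8, 0.9]  # hypothetical human ratings, for illustration only
print(spearmanr(predicted, gold).correlation)
print(pearsonr(predicted, gold)[0])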

Let's compare with the original CodeBERT

codebert = Encoder.from_pretrained("codebert")
print(b1.evaluate(codebert.score(*b1.get_inputs())))
# {'spearmanr': 0.2056582946575104, 'pearsonr': 0.1995058696927054}
print(b2.evaluate(codebert.score(*b2.get_inputs())))
# {'spearmanr': 0.3909218857993804, 'pearsonr': 0.3378219622284688}
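
To cover the full grid reported in the tables below, the same two calls can be looped over all variants and metrics. A sketch, assuming the "small" variant is accepted by Benchmark.build in the same way as "medium" and "large" (only the latter two appear above):

for variant in ["small", "medium", "large"]:  # "small" is assumed to follow the same naming
    for metric in ["similarity", "relatedness"]:
        bench = Benchmark.build("idbench", variant=variant, metric=metric)
        result = bench.evaluate(model.score(*bench.get_inputs()))
        print(variant, metric, result)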

Pre-train your own VarCLR models

You can pre-train the same VarCLR model variants yourself with the following commands.

python -m varclr.pretrain --model avg --name varclr-avg
python -m varclr.pretrain --model lstm --name varclr-lstm
python -m varclr.pretrain --model bert --name varclr-codebert --sp-model split --last-n-layer-output 4 --batch-size 64 --lr 1e-5 --epochs 1

Training progress and test results are logged to the wandb dashboard. For reference, our training curves look like the following:

[Figure: VarCLR training curves from the wandb dashboard]

Results on IdBench benchmarks

Similarity

Method            Small  Medium  Large
FT-SG             0.30   0.29    0.28
LV                0.32   0.30    0.30
FT-cbow           0.35   0.38    0.38
VarCLR-Avg        0.47   0.45    0.44
VarCLR-LSTM       0.50   0.49    0.49
VarCLR-CodeBERT   0.53   0.53    0.51
Combined-IdBench  0.48   0.59    0.57
Combined-VarCLR   0.66   0.65    0.62

Relatedness

Method            Small  Medium  Large
LV                0.48   0.47    0.48
FT-SG             0.70   0.71    0.68
FT-cbow           0.72   0.74    0.73
VarCLR-Avg        0.67   0.66    0.66
VarCLR-LSTM       0.71   0.70    0.69
VarCLR-CodeBERT   0.79   0.79    0.80
Combined-IdBench  0.71   0.78    0.79
Combined-VarCLR   0.79   0.81    0.85

Cite

If you find VarCLR useful in your research, please cite our ICSE 2022 paper:

@inproceedings{ChenVarCLR2022,
  author = {Chen, Qibin and Lacomis, Jeremy and Schwartz, Edward J. and Neubig, Graham and Vasilescu, Bogdan and {Le~Goues}, Claire},
  title = {{VarCLR}: {Variable} Semantic Representation Pre-training via Contrastive Learning},
  booktitle = {International Conference on Software Engineering},
  year = {2022},
  series = {ICSE '22}
}