
parasj / Contracode

License: apache-2.0
Contrastive Code Representation Learning: functionality-based JavaScript embeddings through self-supervised learning

Projects that are alternatives to or similar to Contracode

Numpile
A tiny 1000 line LLVM-based numeric specializer for scientific Python code.
Stars: ✭ 341 (+416.67%)
Mutual labels:  compiler, jupyter-notebook
Lfortran
Official mirror of https://gitlab.com/lfortran/lfortran. Please submit pull requests (PR) there. Any PR sent here will be closed automatically.
Stars: ✭ 220 (+233.33%)
Mutual labels:  compiler, jupyter-notebook
448project
High Frequency Trading
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Ml Nlp
This project covers the knowledge points and code implementations commonly asked about in Machine Learning, Deep Learning, and NLP interviews, and serves as the essential theoretical foundation for an algorithm engineer.
Stars: ✭ 10,826 (+16303.03%)
Mutual labels:  jupyter-notebook
Emotiw2016
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Introduction To Data Science
This repository hosts the open-source course "Introduction to Data Science" (《数据科学导论》) by Professor Chaolemen of Renmin University of China.
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Shaderconductor
ShaderConductor is a tool designed for cross-compiling HLSL to other shading languages
Stars: ✭ 1,146 (+1636.36%)
Mutual labels:  compiler
Bayesian Analysis With Python Second Edition
Bayesian Analysis with Python - Second Edition, published by Packt
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Hacktoberfest2020 Expert
Hacktoberfest 2020. Don't forget to spread the love, and if you like it, give me a ⭐️
Stars: ✭ 67 (+1.52%)
Mutual labels:  jupyter-notebook
Tensorflow deep taylor lrp
Layerwise Relevance Propagation with Deep Taylor Series in TensorFlow
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Compiler Explorer
Run compilers interactively from your web browser and interact with the assembly
Stars: ✭ 9,844 (+14815.15%)
Mutual labels:  compiler
Pandas Tutorial
Tutorial on Using Pandas
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Short Text Classification
A comparison of SVM, FastText, TextCNN, BiGRU, and CNN-BiGRU on short text classification
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Web develop
Source code from the book "Python Web Development in Practice" (《Python Web开发实战》)
Stars: ✭ 1,146 (+1636.36%)
Mutual labels:  jupyter-notebook
Sdc course
Short course about self-driving cars
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Mimic Workshop
Introduction to MIMIC-III, the Critical Care Database
Stars: ✭ 65 (-1.52%)
Mutual labels:  jupyter-notebook
Credit Card Score
A Python-based analysis of an application credit scorecard model
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Charly Vm
Fibers, Closures, C-Module System | NaN-boxing, bytecode-VM written in C++
Stars: ✭ 66 (+0%)
Mutual labels:  compiler
Jupyter tfbook
Jupyter Notebooks for TensorFlow Book
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Face Mask Detection
Code for building a CNN model that detects whether a person is wearing a face mask, using the PC's webcam.
Stars: ✭ 67 (+1.52%)
Mutual labels:  jupyter-notebook

Contrastive Code Representation Learning

By Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez and Ion Stoica (website)

Learning functionality-based representations of programs

Machine-aided programming tools such as type predictors and code summarizers are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning.

Our approach uses no human-provided labels, relying only on the raw text of programs. In particular, we design an unsupervised pretext task by generating textually divergent copies of source functions via automated source-to-source compiler transforms that preserve semantics. We train a neural model to identify variants of an anchor program within a large batch of negatives. To solve this task, the network must extract program features that represent the functionality, not the form, of the program. To our knowledge, this is the first application of instance discrimination to code representation learning. We pre-train ContraCode on 1.8M unannotated JavaScript methods mined from GitHub. ContraCode pre-training improves code summarization accuracy by 7.9% over supervised approaches and 4.8% over BERT pre-training. Moreover, our approach is agnostic to model architecture; for a type prediction task, contrastive pre-training consistently improves the accuracy of existing baselines.
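
For intuition, the sketch below shows an InfoNCE-style batch contrastive loss of the kind described above: each anchor is paired with a transformed variant of the same function (e.g., a renamed or reformatted copy), and every other program in the batch serves as a negative. The function name, the temperature value, and the use of cosine similarity are illustrative assumptions, not the repository's exact implementation.

import torch
import torch.nn.functional as F

def contrastive_loss(anchor_embeddings, variant_embeddings, temperature=0.07):
    # anchor_embeddings, variant_embeddings: (batch, dim) encoder outputs for two
    # semantics-preserving variants of the same batch of source functions.
    anchors = F.normalize(anchor_embeddings, dim=1)
    variants = F.normalize(variant_embeddings, dim=1)
    # Pairwise cosine similarities, scaled by an (assumed) temperature.
    logits = anchors @ variants.t() / temperature
    # Row i's positive is column i (its own variant); all other columns are negatives.
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)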

This repository contains code to augment JavaScript programs with code transformations, to pre-train LSTM and Transformer models with ContraCode, and to finetune those models on downstream tasks.

Installation

Dependencies: Python 3.7, NodeJS, NPM

$ npm install
$ pip install -e "."
$ python scripts/download_data.py

Data and checkpoints

Download the data subfolder from this Google Drive link and place it at the root of the repository. The folder contains training and evaluation data, vocabularies, and model checkpoints.
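
For reference, the commands in this README expect roughly the following layout under data/ (non-exhaustive and inferred from the commands below; the Google Drive folder is authoritative):

data/
  codesearchnet_javascript/   # augmented pretraining corpus, supervised splits, SentencePiece model
  types/                      # type prediction train/valid/test files and type vocabulary (target_wl)
  pretrain/                   # pre-training checkpoints (ckpt_*_pretrain_*.pth)
  ft/                         # finetuned checkpoints for type prediction and method naming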

Pretraining models with ContraCode

Pretrain a Bidirectional LSTM with ContraCode (port 10001 should be available; change it if the port is in use):

python representjs/pretrain_distributed.py pretrain_lstm2l_hidden \
  --num_epochs=200 --batch_size=512 --lr=1e-4 --num_workers=4 \
  --subword_regularization_alpha 0.1 --program_mode contrastive --label_mode contrastive --save_every 5000 \
  --train_filepath=data/codesearchnet_javascript/javascript_augmented.pickle.gz \
  --spm_filepath=data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --min_alternatives 2 --dist_url tcp://localhost:10001 --rank 0 \
  --encoder_type lstm --lstm_project_mode hidden --n_encoder_layers 2

Pretrain Transformer with ContraCode:

python representjs/pretrain_distributed.py pretrain_transformer \
  --num_epochs=200 --batch_size=96 --lr=1e-4 --num_workers=6 \
  --subword_regularization_alpha 0.1 --program_mode contrastive --label_mode contrastive --save_every 5000 \
  --train_filepath=/dev/shm/codesearchnet_javascript/javascript_augmented.pickle.gz \
  --spm_filepath=/dev/shm/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --min_alternatives 1 --dist_url tcp://localhost:10001 --rank 0

Pretrain Transformer with hybrid MLM + ContraCode objective:

python representjs/pretrain_distributed.py pretrain_transformer_hybrid \
  --num_epochs=200 --batch_size=96 --lr=4e-4 --num_workers=8 \
  --subword_regularization_alpha 0. --program_mode contrastive --loss_mode hybrid --save_every 5000 \
  --train_filepath=data/codesearchnet_javascript/javascript_augmented.pickle.gz \
  --spm_filepath=data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --min_alternatives 1 --dist_url "tcp://localhost:10001" --rank 0

Finetuning and evaluating on downstream type prediction task

Commands to reproduce the key type prediction results are provided below. If you are using the released pretraining checkpoints from the Google Drive folder, these commands should work without modification. However, if you pretrained a model from scratch, you will need to update the --resume_path argument.

Checkpoint paths if you pre-trained a model from scratch:
  • data/ft/ckpt_lstm_ft_types.pth becomes data/runs/types_contracode/ckpt_best.pth
  • data/pretrain/ckpt_transformer_ft_types.pth becomes data/runs/types_contracode_transformer/ckpt_best.pth
  • data/ft/ckpt_transformer_hybrid_ft_types.pth becomes data/runs/types_hybrid_transformer/ckpt_best.pth
  • data/ft/ckpt_transformer_ft_names.pth becomes data/runs/names_ft/ckpt_best.pth

Type prediction with an LSTM (pretrained with ContraCode)

Evaluate our finetuned Bidirectional LSTM (Table 2, DeepTyper with ContraCode pre-training):

python representjs/type_prediction.py eval \
  --eval_filepath data/types/test_projects_gold_filtered.json \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 1 --max_seq_len -1 \
  --no_output_attention True --encoder_type lstm --n_encoder_layers 2 \
  --resume_path data/ft/ckpt_lstm_ft_types.pth

Finetune Bidirectional LSTM pretrained with ContraCode:

python representjs/type_prediction.py train --run_name types_contracode \
  --train_filepath data/types/train_nounk.txt --eval_filepath data/types/valid_nounk.txt \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 16 --max_seq_len 2048 --max_eval_seq_len 2048 --lr 1e-3 \
  --no_output_attention True --encoder_type lstm --n_encoder_layers 2 --warmup_steps 10000 \
  --pretrain_resume_path data/pretrain/ckpt_lstm_pretrain_20k.pth \
  --pretrain_resume_encoder_name encoder_q

Type prediction with a Transformer (pretrained with ContraCode)

Evaluate our finetuned Transformer (Table 2, Transformer with ContraCode pre-training):

python representjs/type_prediction.py eval \
  --eval_filepath data/types/test_projects_gold_filtered.json \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 1 --max_seq_len -1 \
  --resume_path data/pretrain/ckpt_transformer_ft_types.pth

Finetune Transformer pretrained with ContraCode:

python representjs/type_prediction.py train --run_name types_contracode_transformer \
  --train_filepath data/types/train_nounk.txt --eval_filepath data/types/valid_nounk.txt \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 16 --max_seq_len 2048 --max_eval_seq_len 2048 \
  --pretrain_resume_path data/pretrain/ckpt_transformer_pretrain_240k.pth \
  --pretrain_resume_encoder_name encoder_q --lr 1e-4

Type prediction with a hybrid Transformer (pretrained with both MLM and ContraCode)

Evaluate our finetuned hybrid Transformer (Table 2, Transformer (RoBERTa MLM pre-training) with ContraCode pre-training):

python representjs/type_prediction.py eval \
  --eval_filepath data/types/test_projects_gold_filtered.json \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 1 --max_seq_len -1 \
  --resume_path data/ft/ckpt_transformer_hybrid_ft_types.pth

Finetune Transformer after hybrid pretraining:

python representjs/type_prediction.py train --run_name types_hybrid_transformer \
  --train_filepath data/types/train_nounk.txt --eval_filepath data/types/valid_nounk.txt \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 16 --max_seq_len 2048 --max_eval_seq_len 2048 \
  --pretrain_resume_path data/pretrain/ckpt_transformer_hybrid_pretrain_240k.pth \
  --pretrain_resume_encoder_name encoder_q --lr 1e-4

Finetuning and evaluating on downstream method naming task

Evaluate (Table 3, Transformer + ContraCode + augmentation):

python representjs/main.py test --batch_size 64 --num_workers 8 --n_decoder_layers 4 \
  --checkpoint_file data/ft/ckpt_transformer_ft_names.pth \
  --test_filepath data/codesearchnet_javascript/javascript_test_0.jsonl.gz \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model

Finetune:

python representjs/main.py train --run_name names_ft \
  --program_mode identity --label_mode identifier --n_decoder_layers=4 --subword_regularization_alpha 0 \
  --num_epochs 100 --save_every 5 --batch_size 32 --num_workers 4 --lr 1e-4 \
  --train_filepath data/codesearchnet_javascript/javascript_train_supervised.jsonl.gz \
  --eval_filepath data/codesearchnet_javascript/javascript_valid_0.jsonl.gz \
  --resume_path data/pretrain/ckpt_transformer_pretrain_20k.pth

Citation

If you find this code or our paper relevant to your work, please cite our arXiv paper:

@article{jain2020contrastive,
  title={Contrastive Code Representation Learning},
  author={Paras Jain and Ajay Jain and Tianjun Zhang
  and Pieter Abbeel and Joseph E. Gonzalez and Ion Stoica},
  year={2020},
  journal={arXiv preprint}
}