
parasj / Contracode

License: apache-2.0
Contrastive Code Representation Learning: functionality-based JavaScript embeddings through self-supervised learning

Projects that are alternatives to or similar to Contracode

Numpile
A tiny 1000 line LLVM-based numeric specializer for scientific Python code.
Stars: ✭ 341 (+416.67%)
Mutual labels:  compiler, jupyter-notebook
Lfortran
Official mirror of https://gitlab.com/lfortran/lfortran. Please submit pull requests (PR) there. Any PR sent here will be closed automatically.
Stars: ✭ 220 (+233.33%)
Mutual labels:  compiler, jupyter-notebook
448project
High Frequency Trading
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Ml Nlp
This project covers the knowledge points and code implementations commonly asked about in Machine Learning, Deep Learning, and NLP interviews, and serves as the essential theoretical foundation for an algorithm engineer.
Stars: ✭ 10,826 (+16303.03%)
Mutual labels:  jupyter-notebook
Emotiw2016
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Introduction To Data Science
This repository hosts the open-source course "Introduction to Data Science" (《数据科学导论》) by Professor Chaolemen of Renmin University of China.
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Shaderconductor
ShaderConductor is a tool designed for cross-compiling HLSL to other shading languages
Stars: ✭ 1,146 (+1636.36%)
Mutual labels:  compiler
Bayesian Analysis With Python Second Edition
Bayesian Analysis with Python - Second Edition, published by Packt
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Hacktoberfest2020 Expert
Hacktoberfest 2020. Don't forget to spread the love, and if you like it, give me a ⭐️
Stars: ✭ 67 (+1.52%)
Mutual labels:  jupyter-notebook
Tensorflow deep taylor lrp
Layerwise Relevance Propagation with Deep Taylor Series in TensorFlow
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Compiler Explorer
Run compilers interactively from your web browser and interact with the assembly
Stars: ✭ 9,844 (+14815.15%)
Mutual labels:  compiler
Pandas Tutorial
Tutorial on Using Pandas
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Short Text Classification
A comparison of SVM, FastText, TextCNN, BiGRU, and CNN-BiGRU on short text classification
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Web develop
Source code from the book "Python Web Development in Practice" (《Python Web开发实战》)
Stars: ✭ 1,146 (+1636.36%)
Mutual labels:  jupyter-notebook
Sdc course
Short course about self-driving cars
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Mimic Workshop
Introduction to MIMIC-III, the Critical Care Database
Stars: ✭ 65 (-1.52%)
Mutual labels:  jupyter-notebook
Credit Card Score
A Python-based analysis of an application credit scorecard model
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Charly Vm
Fibers, Closures, C-Module System | NaN-boxing, bytecode-VM written in C++
Stars: ✭ 66 (+0%)
Mutual labels:  compiler
Jupyter tfbook
Jupyter Notebooks for TensorFlow Book
Stars: ✭ 66 (+0%)
Mutual labels:  jupyter-notebook
Face Mask Detection
Code for building a CNN model that detects whether a person is wearing a face mask, using the PC's webcam.
Stars: ✭ 67 (+1.52%)
Mutual labels:  jupyter-notebook

Contrastive Code Representation Learning

By Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez and Ion Stoica (website)

Learning functionality-based representations of programs

Machine-aided programming tools such as type predictors and code summarizers are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning.

Our approach uses no human-provided labels, relying only on the raw text of programs. In particular, we design an unsupervised pretext task by generating textually divergent copies of source functions via automated source-to-source compiler transforms that preserve semantics. We train a neural model to identify variants of an anchor program within a large batch of negatives. To solve this task, the network must extract program features that represent the functionality, not the form, of the program. To our knowledge, this is the first application of instance discrimination to code representation learning. We pre-train ContraCode on 1.8M unannotated JavaScript methods mined from GitHub. ContraCode pre-training improves code summarization accuracy by 7.9% over supervised approaches and 4.8% over BERT pre-training. Moreover, our approach is agnostic to model architecture; for a type prediction task, contrastive pre-training consistently improves the accuracy of existing baselines.
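
For intuition, the sketch below shows an InfoNCE-style batch contrastive loss of the kind described above: each anchor is paired with a transformed variant of the same function (e.g., a renamed or reformatted copy), and every other program in the batch serves as a negative. The function name, the temperature value, and the use of cosine similarity are illustrative assumptions, not the repository's exact implementation.

import torch
import torch.nn.functional as F

def contrastive_loss(anchor_embeddings, variant_embeddings, temperature=0.07):
    # anchor_embeddings, variant_embeddings: (batch, dim) encoder outputs for two
    # semantics-preserving variants of the same batch of source functions.
    anchors = F.normalize(anchor_embeddings, dim=1)
    variants = F.normalize(variant_embeddings, dim=1)
    # Pairwise cosine similarities, scaled by an (assumed) temperature.
    logits = anchors @ variants.t() / temperature
    # Row i's positive is column i (its own variant); all other columns are negatives.
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)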

This repository contains code to augment JavaScript programs with code transformations, to pre-train LSTM and Transformer models with ContraCode, and to finetune those models on downstream tasks.

Installation

Dependencies: Python 3.7, NodeJS, NPM

$ npm install
$ pip install -e "."
$ python scripts/download_data.py

Data and checkpoints

Download the data subfolder from this Google Drive link and place it at the root of the repository. The folder contains training and evaluation data, vocabularies, and model checkpoints.
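
For reference, the commands in this README expect roughly the following layout under data/ (non-exhaustive and inferred from the commands below; the Google Drive folder is authoritative):

data/
  codesearchnet_javascript/   # augmented pretraining corpus, supervised splits, SentencePiece model
  types/                      # type prediction train/valid/test files and type vocabulary (target_wl)
  pretrain/                   # pre-training checkpoints (ckpt_*_pretrain_*.pth)
  ft/                         # finetuned checkpoints for type prediction and method naming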

Pretraining models with ContraCode

Pretrain a Bidirectional LSTM with ContraCode (port 10001 should be available; change it if the port is in use):

python representjs/pretrain_distributed.py pretrain_lstm2l_hidden \
  --num_epochs=200 --batch_size=512 --lr=1e-4 --num_workers=4 \
  --subword_regularization_alpha 0.1 --program_mode contrastive --label_mode contrastive --save_every 5000 \
  --train_filepath=data/codesearchnet_javascript/javascript_augmented.pickle.gz \
  --spm_filepath=data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --min_alternatives 2 --dist_url tcp://localhost:10001 --rank 0 \
  --encoder_type lstm --lstm_project_mode hidden --n_encoder_layers 2

Pretrain Transformer with ContraCode:

python representjs/pretrain_distributed.py pretrain_transformer \
  --num_epochs=200 --batch_size=96 --lr=1e-4 --num_workers=6 \
  --subword_regularization_alpha 0.1 --program_mode contrastive --label_mode contrastive --save_every 5000 \
  --train_filepath=/dev/shm/codesearchnet_javascript/javascript_augmented.pickle.gz \
  --spm_filepath=/dev/shm/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --min_alternatives 1 --dist_url tcp://localhost:10001 --rank 0

Pretrain Transformer with hybrid MLM + ContraCode objective:

python representjs/pretrain_distributed.py pretrain_transformer_hybrid \
  --num_epochs=200 --batch_size=96 --lr=4e-4 --num_workers=8 \
  --subword_regularization_alpha 0. --program_mode contrastive --loss_mode hybrid --save_every 5000 \
  --train_filepath=data/codesearchnet_javascript/javascript_augmented.pickle.gz \
  --spm_filepath=data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --min_alternatives 1 --dist_url "tcp://localhost:10001" --rank 0

Finetuning and evaluating on downstream type prediction task

Commands to reproduce the key type prediction results are provided below. If you are using the released pretraining checkpoints from the Google Drive folder, these commands should work without modification. However, if you pretrained a model from scratch, you will need to update the --resume_path argument.

Checkpoint paths if you pre-trained a model from scratch:
  • data/ft/ckpt_lstm_ft_types.pth becomes data/runs/types_contracode/ckpt_best.pth
  • data/pretrain/ckpt_transformer_ft_types.pth becomes data/runs/types_contracode_transformer/ckpt_best.pth
  • data/ft/ckpt_transformer_hybrid_ft_types.pth becomes data/runs/types_hybrid_transformer/ckpt_best.pth
  • data/ft/ckpt_transformer_ft_names.pth becomes data/runs/names_ft/ckpt_best.pth

Type prediction with an LSTM (pretrained with ContraCode)

Evaluate our finetuned Bidirectional LSTM (Table 2, DeepTyper with ContraCode pre-training):

python representjs/type_prediction.py eval \
  --eval_filepath data/types/test_projects_gold_filtered.json \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 1 --max_seq_len -1 \
  --no_output_attention True --encoder_type lstm --n_encoder_layers 2 \
  --resume_path data/ft/ckpt_lstm_ft_types.pth

Finetune Bidirectional LSTM pretrained with ContraCode:

python representjs/type_prediction.py train --run_name types_contracode \
  --train_filepath data/types/train_nounk.txt --eval_filepath data/types/valid_nounk.txt \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 16 --max_seq_len 2048 --max_eval_seq_len 2048 --lr 1e-3 \
  --no_output_attention True --encoder_type lstm --n_encoder_layers 2 --warmup_steps 10000 \
  --pretrain_resume_path data/pretrain/ckpt_lstm_pretrain_20k.pth \
  --pretrain_resume_encoder_name encoder_q

Type prediction with a Transformer (pretrained with ContraCode)

Evaluate our finetuned Transformer (Table 2, Transformer with ContraCode pre-training):

python representjs/type_prediction.py eval \
  --eval_filepath data/types/test_projects_gold_filtered.json \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 1 --max_seq_len -1 \
  --resume_path data/pretrain/ckpt_transformer_ft_types.pth

Finetune Transformer pretrained with ContraCode:

python representjs/type_prediction.py train --run_name types_contracode_transformer \
  --train_filepath data/types/train_nounk.txt --eval_filepath data/types/valid_nounk.txt \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 16 --max_seq_len 2048 --max_eval_seq_len 2048 \
  --pretrain_resume_path data/pretrain/ckpt_transformer_pretrain_240k.pth \
  --pretrain_resume_encoder_name encoder_q --lr 1e-4

Type prediction with a hybrid Transformer (pretrained with both MLM and ContraCode)

Evaluate our finetuned hybrid Transformer (Table 2, Transformer (RoBERTa MLM pre-training) with ContraCode pre-training):

python representjs/type_prediction.py eval \
  --eval_filepath data/types/test_projects_gold_filtered.json \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 1 --max_seq_len -1 \
  --resume_path data/ft/ckpt_transformer_hybrid_ft_types.pth

Finetune Transformer after hybrid pretraining:

python representjs/type_prediction.py train --run_name types_hybrid_transformer \
  --train_filepath data/types/train_nounk.txt --eval_filepath data/types/valid_nounk.txt \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 16 --max_seq_len 2048 --max_eval_seq_len 2048 \
  --pretrain_resume_path data/pretrain/ckpt_transformer_hybrid_pretrain_240k.pth \
  --pretrain_resume_encoder_name encoder_q --lr 1e-4

Finetuning and evaluating on downstream method naming task

Evaluate (Table 3, Transformer + ContraCode + augmentation):

python representjs/main.py test --batch_size 64 --num_workers 8 --n_decoder_layers 4 \
  --checkpoint_file data/ft/ckpt_transformer_ft_names.pth \
  --test_filepath data/codesearchnet_javascript/javascript_test_0.jsonl.gz \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model

Finetune:

python representjs/main.py train --run_name names_ft \
  --program_mode identity --label_mode identifier --n_decoder_layers=4 --subword_regularization_alpha 0 \
  --num_epochs 100 --save_every 5 --batch_size 32 --num_workers 4 --lr 1e-4 \
  --train_filepath data/codesearchnet_javascript/javascript_train_supervised.jsonl.gz \
  --eval_filepath data/codesearchnet_javascript/javascript_valid_0.jsonl.gz \
  --resume_path data/pretrain/ckpt_transformer_pretrain_20k.pth

Citation

If you find this code or our paper relevant to your work, please cite our arXiv paper:

@article{jain2020contrastive,
  title={Contrastive Code Representation Learning},
  author={Paras Jain and Ajay Jain and Tianjun Zhang
  and Pieter Abbeel and Joseph E. Gonzalez and Ion Stoica},
  year={2020},
  journal={arXiv preprint}
}