All Projects → michiyasunaga → BIFI

michiyasunaga / BIFI

Licence: MIT license
[ICML 2021] Break-It-Fix-It: Unsupervised Learning for Program Repair

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects
Cuda
1817 projects
C++
36643 projects - #6 most used programming language
lua
6591 projects
Batchfile
5799 projects

Projects that are alternatives of or similar to BIFI

KD3A
Here is the official implementation of the model KD3A in paper "KD3A: Unsupervised Multi-Source Decentralized Domain Adaptation via Knowledge Distillation".
Stars: ✭ 63 (-14.86%)
Mutual labels:  unsupervised-learning, domain-adaptation
Awesome Transfer Learning
Best transfer learning and domain adaptation resources (papers, tutorials, datasets, etc.)
Stars: ✭ 1,349 (+1722.97%)
Mutual labels:  unsupervised-learning, domain-adaptation
Transferlearning
Transfer learning / domain adaptation / domain generalization / multi-task learning etc. Papers, codes, datasets, applications, tutorials.-迁移学习
Stars: ✭ 8,481 (+11360.81%)
Mutual labels:  unsupervised-learning, domain-adaptation
TA3N
[ICCV 2019 Oral] TA3N: https://github.com/cmhungsteve/TA3N (Most updated repo)
Stars: ✭ 45 (-39.19%)
Mutual labels:  unsupervised-learning, domain-adaptation
Mmt
[ICLR-2020] Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification.
Stars: ✭ 345 (+366.22%)
Mutual labels:  unsupervised-learning, domain-adaptation
Deep-Unsupervised-Domain-Adaptation
Pytorch implementation of four neural network based domain adaptation techniques: DeepCORAL, DDC, CDAN and CDAN+E. Evaluated on benchmark dataset Office31.
Stars: ✭ 50 (-32.43%)
Mutual labels:  unsupervised-learning, domain-adaptation
django-i18nfield
Store internationalized strings in Django models with full forms support
Stars: ✭ 32 (-56.76%)
Mutual labels:  translation
NTFk.jl
Unsupervised Machine Learning: Nonnegative Tensor Factorization + k-means clustering
Stars: ✭ 36 (-51.35%)
Mutual labels:  unsupervised-learning
textstat
Ruby gem to calculate statistics from text to determine readability, complexity and grade level of a particular corpus.
Stars: ✭ 25 (-66.22%)
Mutual labels:  translation
Natural-language-understanding-papers
NLU: domain-intent-slot; text2SQL
Stars: ✭ 77 (+4.05%)
Mutual labels:  domain-adaptation
mobility-actiontext
Translate Rails Action Text rich text with Mobility.
Stars: ✭ 27 (-63.51%)
Mutual labels:  translation
RolX
An alternative implementation of Recursive Feature and Role Extraction (KDD11 & KDD12)
Stars: ✭ 52 (-29.73%)
Mutual labels:  unsupervised-learning
dbscan
DBSCAN Clustering Algorithm C# Implementation
Stars: ✭ 38 (-48.65%)
Mutual labels:  unsupervised-learning
mastodon-translate
Tampermonkey Script to add option to Translate Toots
Stars: ✭ 20 (-72.97%)
Mutual labels:  translation
proto
Proto-RL: Reinforcement Learning with Prototypical Representations
Stars: ✭ 67 (-9.46%)
Mutual labels:  unsupervised-learning
Caiyun
Clean, only translation (Chrome/Firefox extension)
Stars: ✭ 222 (+200%)
Mutual labels:  translation
SimCLR
Pytorch implementation of "A Simple Framework for Contrastive Learning of Visual Representations"
Stars: ✭ 65 (-12.16%)
Mutual labels:  unsupervised-learning
translations
Translations for Snipaste
Stars: ✭ 56 (-24.32%)
Mutual labels:  translation
Effective-Java-3rd-Edition-zh
📖 Effective Java (Third Edition) | Effective Java(第三版)翻译计划稿
Stars: ✭ 57 (-22.97%)
Mutual labels:  translation
bem-flashcards
Simple single-page flashcards application based on the bem-core/bem-history and BEM methodology
Stars: ✭ 19 (-74.32%)
Mutual labels:  translation

Break-It-Fix-It: Learning to Repair Programs from Unlabeled Data

This repo provides the source code & data of our paper: Break-It-Fix-It: Unsupervised Learning for Program Repair (ICML 2021).

@InProceedings{yasunaga2021break,
  author =  {Michihiro Yasunaga and Percy Liang},
  title =   {Break-It-Fix-It: Unsupervised Learning for Program Repair},
  year =    {2021},  
  booktitle = {International Conference on Machine Learning (ICML)},  
}

Problem: Repair Task

Our approach: BIFI

0. Dependencies

Specifically, run the following commands to create a conda environment (assuming CUDA10.1):

conda create -n BIFI python=3.7.7
conda activate BIFI
pip install tqdm
pip install torch==1.4.0 torchvision==0.5.0
cd utils/fairseq
pip install -e .
pip numpy==1.20.1 editdistance

1. Download Data

Download all the data from here (data.zip) and unzip it (note: 67GB when compressed, 400GB when decompressed). This includes the GitHub-Python dataset, and all the processed training data and trained models associated with BIFI. If you only want the original GitHub-Python dataset, you can download it from here (data_minimal.zip; 1GB). After unzipping the data.zip, the resulting file structure will look like:

.
├── README.md
└── data/
    ├── orig_bad_code/       (GitHub-Python dataset's bad code)
    ├── orig_good_code/      (GitHub-Python dataset's good code)
    └── round0/
        ├── data_paired      (paired data used to train fixer in round0)
        └── model-fixer      (fixer trained in round0)
    ├── round1-BIFI-part1/
        ├── data_paired      (paired data used to train breaker in BIFI round1)
        └── model-breaker    (breaker trained in BIFI round1)
    ├── round1-BIFI-part2/
        ├── data_paired      (paired data used to train fixer in BIFI round1)
        └── model-fixer      (fixer trained in BIFI round1)
    ├── ...

About the GitHub-Python dataset

We collected 3 million Python3 snippets from GitHub. Using the critic (Python AST parser), the code snippets are split into a set of bad code (with AST parse errors) and a set of good code (with no errors). The set of bad code is located at data/orig_bad_code/orig.bad.json and good code at data/orig_good_code/orig.good.json. Each entry of orig.bad.json or orig.good.json is a dictionary consisting of

  • "code_string": raw code in the string format
  • "code_toks_joined": the raw code is split into tokens by Python tokenizer, anonymized (string/number is replaced with special tokens <STRING>/<NUMBER>), and then joined by whitespace. The tokenization was done by utils/code_utils.py: tokenize_python_code()
  • "anonymize_dict": mapping betweens raw string/number and <STRING>/<NUMBER> so that "code_string" can be recovered from "code_toks_joined". This recovery can be done by utils/code_utils.py: code_toks_to_code_string()
  • "err_obj": type of the error caught by the critic (e.g. unbalanced parentheses, indentation error). This is only applicable to orig.bad.json.

The bad code snippets in orig.bad.json are split into 5 chunks (orig.0.bad to orig.4.bad in data/orig_bad_code/), where 3,4 is heldout as the test set and 0,1,2 is made available for BIFI training. This splitting was done by scripts/split_orig_bad_and_good.py

2. Training and Evaluation

First, train the initial fixer by running commands in src/run-round0.py one by one. We then consider three training algorithms on top of it: BIFI (our proposed method), FixerOnly (BIFI without breaker), and BackTranslation (BT; our baseline). For each algorithm,

  • BIFI: run commands in src/run-BIFI.py one by one
  • FixerOnly: run commands in src/run-FixerOnly.py one by one
  • BT: run commands in src/run-BT.py one by one

Below is an illustration for the case of BIFI.

run-round0.sh

export PYTHONPATH=.

#Train initial fixer on synthetic paired data
python src/c001__train_fixer.py --round_name round0 --gpu_id 0 --max_epoch 2

#Run the trained fixer on the bad code (chunk 0-4) and check the outputs by critic
python src/c003__run_fixer.py   --round_name round0 --gpu_ids '0,1,2,3,4'

#Evaluate the fixer outputs on the test set (chunk 3,4)
python src/c005__eval_fixer.py  --round_name round0

run-BIFI.sh (round 1)

#Use the fixer outputs on the bad code (chunk 0,1,2) to get new paired data (Equation 6 in the paper)
python src/c006__generate_paired_data_from_fixer.py --round_name round0 --out_round_name round1-BIFI-part1

#Train breaker on the new paired data (Equation 7 in the paper)
python src/c002__train_breaker.py --round_name round1-BIFI-part1 --gpu_id 0 --max_epoch 3

#Run the trained breaker on the good code and get new paired data (Equation 8 in the paper)
python src/c004__run_breaker.py   --round_name round1-BIFI-part1 --gpu_ids '0,1,2,3,4'
python src/c007__generate_paired_data_from_breaker.py --round_name round1-BIFI-part1 --out_round_name round1-BIFI-part2

#Train fixer on the new paired data (Equation 9 in the paper)
python src/c001__train_fixer.py --round_name round1-BIFI-part2 --gpu_id 0 --max_epoch 2 --continue_from 'data/round0/model-fixer/checkpoint.pt'

#Run the trained fixer on the bad code (chunk 0-4) and check the outputs by critic
python src/c003__run_fixer.py   --round_name round1-BIFI-part2 --gpu_ids '0,1,2,3,4'

#Evaluate the fixer outputs on the test set (chunk 3,4)
python src/c005__eval_fixer.py  --round_name round1-BIFI-part2

This is repeated similarly for round 2.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].