
wangcongcong123 / auto_coding

License: Apache-2.0
A basic and simple tool for code auto completion

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to auto_coding

typed python
An llvm-based framework for generating and calling into high-performance native code from Python.
Stars: ✭ 178 (+323.81%)
Mutual labels:  python-programming
naru
Neural Relation Understanding: neural cardinality estimators for tabular data
Stars: ✭ 76 (+80.95%)
Mutual labels:  generative-model
PREREQ-IAAI-19
Inferring Concept Prerequisite Relations from Online Educational Resources (IAAI-19)
Stars: ✭ 22 (-47.62%)
Mutual labels:  generative-model
90 Python Examples
The best way to learn Python is by practicing examples. The repository contains examples of basic concepts of Python. You are advised to take the references from these examples and try them on your own.
Stars: ✭ 190 (+352.38%)
Mutual labels:  python-programming
GPT2-Telegram-Chatbot
GPT-2 Telegram Chat bot
Stars: ✭ 67 (+59.52%)
Mutual labels:  gpt-2
worlds
Building Virtual Reality Worlds using Three.js
Stars: ✭ 23 (-45.24%)
Mutual labels:  generative-model
Wgan
Tensorflow Implementation of Wasserstein GAN (and Improved version in wgan_v2)
Stars: ✭ 228 (+442.86%)
Mutual labels:  generative-model
feed forward vqgan clip
Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt
Stars: ✭ 135 (+221.43%)
Mutual labels:  generative-model
glico-learning-small-sample
Generative Latent Implicit Conditional Optimization when Learning from Small Sample ICPR 20'
Stars: ✭ 20 (-52.38%)
Mutual labels:  generative-model
caffe-simnets
The SimNets Architecture's Implementation in Caffe
Stars: ✭ 13 (-69.05%)
Mutual labels:  generative-model
NPTEL-The-Joy-of-Computing-using-Python
Study materials related to this course.
Stars: ✭ 29 (-30.95%)
Mutual labels:  python-programming
Awesome Python Books
📚 Directory of Python books
Stars: ✭ 3,154 (+7409.52%)
Mutual labels:  python-programming
InpaintNet
Code accompanying ISMIR'19 paper titled "Learning to Traverse Latent Spaces for Musical Score Inpainting"
Stars: ✭ 48 (+14.29%)
Mutual labels:  generative-model
Plagiarism-checker-Python
A python project for checking plagiarism of documents based on cosine similarity
Stars: ✭ 114 (+171.43%)
Mutual labels:  python-programming
AC-VRNN
PyTorch code for CVIU paper "AC-VRNN: Attentive Conditional-VRNN for Multi-Future Trajectory Prediction"
Stars: ✭ 21 (-50%)
Mutual labels:  generative-model
Sgan
Stacked Generative Adversarial Networks
Stars: ✭ 240 (+471.43%)
Mutual labels:  generative-model
pistoBot
Create an AI that chats like you
Stars: ✭ 121 (+188.1%)
Mutual labels:  gpt-2
texturize
🤖🖌️ Generate photo-realistic textures based on source images. Remix, remake, mashup! Useful if you want to create variations on a theme or elaborate on an existing texture.
Stars: ✭ 495 (+1078.57%)
Mutual labels:  generative-model
eccv16 attr2img
Torch Implemention of ECCV'16 paper: Attribute2Image
Stars: ✭ 93 (+121.43%)
Mutual labels:  generative-model
trVAE
Conditional out-of-distribution prediction
Stars: ✭ 47 (+11.9%)
Mutual labels:  generative-model

AutoCoder

Contributions welcome

A basic and simple tool for code auto-completion, fine-tuned from PyTorch pre-trained GPT-2 variants offered by the awesome 🤗 transformers library.

Demo

(demo GIF)

Play with it on 🤗 HF's Model Hub.

Features

  • Write with Python or Java.

Blog linked to this project

Quick Start

Three quick-start options are provided below.

Load from 🤗transformers models

Two fine-tuned models have now been uploaded to the 🤗 transformers model hub. They can be used easily as long as you pip install transformers:

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("congcongwang/gpt2_medium_fine_tuned_coder")
model = AutoModelWithLMHead.from_pretrained("congcongwang/gpt2_medium_fine_tuned_coder")
# or the lighter distilled variant:
# tokenizer = AutoTokenizer.from_pretrained("congcongwang/distilgpt2_fine_tuned_coder")
# model = AutoModelWithLMHead.from_pretrained("congcongwang/distilgpt2_fine_tuned_coder")

use_cuda = True
context = "def factorial"
lang = "python"  # can be "java" as well

if use_cuda:
    model.to("cuda")

# Prepend the language control token so the model knows which language to complete.
prefix = "<python> " if lang == "python" else "<java> "
input_ids = tokenizer.encode(prefix + context, return_tensors="pt")

outputs = model.generate(input_ids=input_ids.to("cuda") if use_cuda else input_ids,
                         max_length=128,
                         temperature=0.7,
                         num_return_sequences=1)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)
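
Note that generate only applies temperature when sampling is enabled; with the defaults above it decodes greedily. A minimal sampling variant (the sampling parameters here are illustrative, not settings recommended by the project):

# do_sample=True makes temperature (and top_k/top_p) take effect during generation.
sampled = model.generate(input_ids=input_ids.to("cuda") if use_cuda else input_ids,
                         max_length=128,
                         do_sample=True,
                         temperature=0.7,
                         top_k=50,
                         top_p=0.95,
                         num_return_sequences=1)
print(tokenizer.decode(sampled[0], skip_special_tokens=True))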

Ready-to-go Interaction

git clone https://github.com/wangcongcong123/auto_coding.git
pip install -r requirements.txt
  1. Download the fine-tuned models; two versions are provided.
  2. Unzip the model and move it to ./model (create the directory first)
  3. Run the interactive script: python interact.py

Fine-tuning yours

git clone <this repository>
pip install -r requirements.txt
  1. Prepare the dataset (see the illustrative sketch below)
  2. Start fine-tuning the model: python train.py --model_select distilgpt2
  3. After fine-tuning, the model will be saved to ./model/distilgpt2_fine_tuned_coder/0_GPTSingleHead, which is exactly the fine-tuned version provided in Ready-to-go Interaction.

* For more training parameter options, run python train.py -h
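
To make step 1 (dataset preparation) more concrete, here is a minimal, illustrative sketch; the project's actual preprocessing script may differ. Each training example is a source snippet prefixed with the same language control token ("<python>" / "<java>") used at inference time. The build_examples helper and the raw_data/ path are hypothetical.

# Illustrative only: collect source files and prefix each with its language tag,
# mirroring the "<python> " / "<java> " prefixes used during generation.
from pathlib import Path

def build_examples(source_dir, lang):
    tag, ext = ("<python>", ".py") if lang == "python" else ("<java>", ".java")
    examples = []
    for path in Path(source_dir).rglob("*" + ext):
        code = path.read_text(encoding="utf-8", errors="ignore")
        examples.append(tag + " " + code)
    return examples

# e.g. examples = build_examples("raw_data/python", "python")  # hypothetical path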

Generative examples

Good Python generation examples by the fine-tuned GPT2-medium

-------------Example 1--------------------------------------
Context code: def factorial
Generated: 
def factorial(n):
    return 1 if n == 1 else n * factorial(n - 1)

--------------Example 2-------------------------------------
Context code: def sum_of_series(num_of_terms:int, power: int)
Generated:
def sum_of_series(num_of_terms:int, power: int) -> float:
    """Calculate Sum of Series.
    >>> sum_of_series(1, 1, 10)
    55.0
    >>> sum_of_series(1, 10, 100)
    49600.0
    """
    sum = (num_of_terms / 2) * power
    return sum

--------------Example 3-------------------------------------
Context code: def quick_sort(collection)
Generated:
def quick_sort(collection):
    """
    >>> quick_sort([0, 5, 3, 2, 2])
    [0, 2, 2, 3, 5]
    >>> quick_sort([])
    []
    >>> quick_sort([-2, -5, -45])
    [-45, -5, -2]
    """
    length = len(collection)
    if length <= 1:
        return collection
    else:
        # Use the last element as the first pivot
        pivot = collection.pop()
        # Put elements greater than pivot in greater list
        # Put elements lesser than pivot in lesser list
        greater, lesser = [], []
        for element in collection:
            if element > pivot:
                ...

Good Java generation examples by the fine-tuned GPT2-medium

--------------Example 1-------------------------------------
Context code: Scanner input= new Scanner(System.in)
Generated:
Scanner input= new Scanner(System.in)
System.out.println("Enter the Octal number: ");
// Take octal number as input from user in a string
String oct = input.next();

--------------Example 2-------------------------------------
Context code: private static int CountCharacters(String str)
Generated:
private static int CountCharacters(String str) {
        return str.replaceAll("\\s", "").length();
}

* Although some generated examples look good, the model's actual performance should be judged with a grain of salt: it may simply "remember" existing code from the training set well.
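
One rough way to probe this is to check whether a generated snippet appears near-verbatim in the training data. A minimal sketch, assuming the training examples are available as plain strings (read_training_code is a hypothetical helper; a real contamination study would use n-gram overlap instead):

# Rough memorization check: does any training snippet contain the generated code
# after whitespace normalization?
def normalize(code):
    return " ".join(code.split())

def appears_in_training(generated, training_snippets):
    g = normalize(generated)
    return any(g in normalize(t) for t in training_snippets)

# training_snippets = read_training_code("dataset/")  # hypothetical helper
# print(appears_in_training(decoded, training_snippets))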

TODO list

  • Expand the dataset (and construct it more carefully) and increase the context window. Try larger generative models such as GPT-2 large, or even the recently proposed GPT-3 variants, if computational resources allow.
  • Remove overlap between training examples and dev examples for contamination studies, i.e., to determine to what extent the model memorizes examples rigidly or relies on surface heuristics learned during training.
  • Try some adversarial examples (more complicated ones, to probe the model's reasoning capability) to test the robustness of the model.
  • Integrate this into a real-life use case such as a code editor (e.g., Sublime Text), where a joint-probability threshold for code snippet recommendations may need to be studied.
  • Try some ideas for location-aware code generation. For example, if a human coder is writing a comment, the autocoder should be aware of the coder's context (left and right, if available) to help complete the corresponding content.
  • Model size and inference efficiency are a problem for real-life use cases.
  • Survey this problem domain to get a general idea of what work has been done in the literature on this particular problem.

Extra notes

  • Multi-GPU training only works with torch==1.4.0; it does not work with torch==1.5.0. No fix has been found for this issue so far.