PLUR

PLUR (Programming-Language Understanding and Repair) is a collection of source code datasets suitable for graph-based machine learning. We provide scripts for downloading, processing, and loading the datasets. This is done by offering a unified API and data structures for all datasets.

Installation

SRC_DIR=${PWD}/src
mkdir -p ${SRC_DIR} && cd ${SRC_DIR}
# For CuBERT.
git clone https://github.com/google-research/google-research --depth=1
export PYTHONPATH=${PYTHONPATH}:${SRC_DIR}/google-research
git clone https://github.com/google-research/plur && cd plur
python -m pip install -r requirements.txt
python setup.py install
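
As a quick sanity check of the installation, the following should succeed (a minimal sketch; it only verifies that the package imports):

# Verify that PLUR is importable from the current environment.
import plur
print(plur.__file__)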

Test execution on small dataset

cd plur
python3 plur_data_generation.py --dataset_name=manysstubs4j_dataset \
  --stage_1_dir=/tmp/manysstubs4j_dataset/stage_1 \
  --stage_2_dir=/tmp/manysstubs4j_dataset/stage_2 \
  --train_data_percentage=40 \
  --validation_data_percentage=30 \
  --test_data_percentage=30

Usage

Basic usage

Data generation (step 1)

Data generation is done by calling plur.plur_data_generation.create_dataset(). The data generation runs in two stages:

  1. Convert raw data to plur.utils.GraphToOutputExample.
  2. Convert plur.utils.GraphToOutputExample to TFExample.

Stage 1 is unique for each dataset, but stage 2 is the same for almost all datasets.

from plur.plur_data_generation import create_dataset

dataset_name = 'manysstubs4j_dataset'
dataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'
stage_2_kwargs = dict()
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

plur_data_generation.py also provides a command line interface, but it offers less flexibility.

python3 plur_data_generation.py --dataset_name=manysstubs4j_dataset --stage_1_dir=/tmp/manysstubs4j_dataset/stage_1 --stage_2_dir=/tmp/manysstubs4j_dataset/stage_2

Data loader (step 2)

After the data is generated, you can use PlurDataLoader to load the data. The data loader loads TFExamples but returns them as numpy arrays.

from plur.plur_data_loader import PlurDataLoader
from plur.utils import constants

dataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'
split = constants.TRAIN_SPLIT_NAME
batch_size = 32
repeat_count = -1
drop_remainder = True
train_data_generator = PlurDataLoader(dataset_stage_2_directory, split, batch_size, repeat_count, drop_remainder)

for batch_data in train_data_generator:
  # batch_data is a batch of numpy arrays; plug in your training loop here.
  pass

Training (step 3)

This is where users of the PLUR framework plug in their custom ML models and code to train and generate predictions for PLUR tasks.

We provide implementations of the GGNN, Transformer, and GREAT models from the PLUR paper. See below for sample commands; for the full set of command-line flags, see plur/model_design/train.py.

Training

python3 train.py \
 --data_dir=/tmp/manysstubs4j_dataset/stage_2 \
 --exp_dir=/tmp/experiments/exp12345

Evaluation / Generating predictions

python3 train.py \
 --data_dir=/tmp/manysstubs4j_dataset/stage_2 \
 --exp_dir=/tmp/experiments/exp12345 \
 --evaluate=true

Evaluating (step 4)

Once the training is finished and you have generated natural text predictions on the test data, you can use plur_evaluator.py to evaluate the performance. plur_evaluator.py works in offline mode, meaning that it expects one or more files containing the ground truths, and matching files containing the predictions.

python3 plur_evaluator.py --dataset_name=manysstubs4j_dataset --target_file_pattern=/tmp/manysstubs4j_dataset/targets.txt --prediction_file_pattern=/tmp/manysstubs4j_dataset/predictions.txt

When evaluation runs in multiple "rounds", the evaluator may create multiple target and prediction files, named like ...predictions-0-of-5.txt; you can refer to all of them at once using a glob file pattern such as ...predictions-?-of-5.txt in the command above.
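
For a quick end-to-end check, you could write a matching pair of target and prediction files by hand. A minimal sketch, assuming plain-text files with one example per line (the format here is an assumption, not the documented one):

# Hypothetical smoke-test inputs for plur_evaluator.py: aligned targets and
# predictions, one example per line (format assumed, not documented here).
targets = ['fix off-by-one error', 'rename variable']
predictions = ['fix off-by-one error', 'rename local variable']

with open('/tmp/manysstubs4j_dataset/targets.txt', 'w') as f:
  f.write('\n'.join(targets) + '\n')
with open('/tmp/manysstubs4j_dataset/predictions.txt', 'w') as f:
  f.write('\n'.join(predictions) + '\n')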

For more details about how plur_evaluator works see plur/eval/README.md.

Transforming and filtering data

If there is something fundamental you want to change in the dataset, apply it in stage 1 of data generation; otherwise, apply it in stage 2. The idea is that stage 1 should only be run once per dataset (to create the plur.utils.GraphToOutputExample files), while stage 2 should be run each time you want to train on different data (to create the TFRecords).

All transformation and filtering functions are applied on plur.utils.GraphToOutputExample, see plur.utils.GraphToOutputExample for more information.

For example, a transformation that belongs in stage 1: if your model expects graphs without loops, write a transformation function that removes loops. Stage 2 will then only ever read loop-free graphs.

For example, a filter that belongs in stage 2: to check your model's performance across graph sizes (in number of nodes), write a filter function that drops graphs above a node-count threshold.

from plur.plur_data_generation import create_dataset

dataset_name = 'manysstubs4j_dataset'
dataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'
# Keep only graphs with at most graph_size nodes.
def _filter_graph_size(graph_to_output_example, graph_size=1024):
  return len(graph_to_output_example.get_nodes()) <= graph_size
stage_2_kwargs = dict(
    train_filter_funcs=(_filter_graph_size,),
    validation_filter_funcs=(_filter_graph_size,)
)
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)
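
By analogy, a stage-1 transformation could be sketched as below. This is a hypothetical sketch: the transformation_funcs keyword is assumed by analogy with the filter kwargs above, and the get_edges()/set_edges() accessors and edge-dict keys are assumptions; check plur.utils.GraphToOutputExample for the real API.

from plur.plur_data_generation import create_dataset

# Hypothetical stage-1 transformation that removes self-loop edges. The
# 'transformation_funcs' kwarg and the edge accessors are assumptions.
def _remove_self_loops(graph_to_output_example):
  edges = [edge for edge in graph_to_output_example.get_edges()
           if edge['src'] != edge['dst']]
  graph_to_output_example.set_edges(edges)
  return graph_to_output_example

stage_1_kwargs = dict(transformation_funcs=(_remove_self_loops,))
stage_2_kwargs = dict()
create_dataset('manysstubs4j_dataset',
               '/tmp/manysstubs4j_dataset/stage_1',
               '/tmp/manysstubs4j_dataset/stage_2',
               stage_1_kwargs, stage_2_kwargs)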

Advanced usage

plur.plur_data_generation.create_dataset() is just a thin wrapper around plur.stage_1.plur_dataset and plur.stage_2.graph_to_output_example_to_tfexample.

from plur.plur_data_generation import create_dataset

dataset_name = 'manysstubs4j_dataset'
dataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'
stage_1_kwargs = dict()
dataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'
stage_2_kwargs = dict()
create_dataset(dataset_name, dataset_stage_1_directory, dataset_stage_2_directory, stage_1_kwargs, stage_2_kwargs)

is equivalent to

from plur.stage_1.manysstubs4j_dataset import ManySStuBs4JDataset
from plur.stage_2.graph_to_output_example_to_tfexample import GraphToOutputExampleToTfexample

dataset_name = 'manysstubs4j_dataset'
dataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'
dataset_stage_2_directory = '/tmp/manysstubs4j_dataset/stage_2'
dataset = ManySStuBs4JDataset(dataset_stage_1_directory)
dataset.stage_1_mkdirs()
dataset.download_dataset()
dataset.run_pipeline()

dataset = GraphToOutputExampleToTfexample(dataset_stage_1_directory, dataset_stage_2_directory, dataset_name)
dataset.stage_2_mkdirs()
dataset.run_pipeline()

You can check out plur.stage_1.manysstubs4j_dataset for dataset-specific arguments.

from plur.stage_1.manysstubs4j_dataset import ManySStuBs4JDataset

dataset_name = 'manysstubs4j_dataset'
dataset_stage_1_directory = '/tmp/manysstubs4j_dataset/stage_1'

dataset = ManySStuBs4JDataset(dataset_stage_1_directory, dataset_size='large')
dataset.stage_1_mkdirs()
dataset.download_dataset()
dataset.run_pipeline()

Adding a new dataset

All datasets should inherit from plur.stage_1.plur_dataset.PlurDataset and be placed under plur/stage_1/. Inheriting from PlurDataset requires you to implement the following methods (a hypothetical skeleton follows the list):

  • download_dataset(): Code to download the dataset. We provide download_dataset_using_git() to download from a Git repository and download_dataset_using_requests() to download from a URL, which also works with a Google Drive URL. download_dataset_using_git() downloads the dataset at a specific commit id, and download_dataset_using_requests() checks the sha1sum of the downloaded files; this ensures that the same version of PLUR always downloads the same raw data.
  • get_all_raw_data_paths(): It should return a list of paths, where each path points to a file containing the raw data in the dataset.
  • raw_data_paths_to_raw_data_do_fn(): It should return a beam.DoFn class that overrides process(). The process() method tells Beam how to open the files returned by get_all_raw_data_paths(); it is also where we assign each datum to a split (train/validation/test).
  • raw_data_to_graph_to_output_example(): This function transforms raw data from raw_data_paths_to_raw_data_do_fn() into a GraphToOutputExample.
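
The skeleton below is hypothetical: only the four overridden method names come from the list above; the module path, URLs, helper arguments, and method bodies are illustrative assumptions, not the real PlurDataset contract.

# Hypothetical skeleton for plur/stage_1/foo_dataset.py; see caveats above.
import apache_beam as beam

from plur.stage_1.plur_dataset import PlurDataset


class FooDataset(PlurDataset):

  def download_dataset(self):
    # Pin a specific commit id so the same PLUR version always downloads
    # the same raw data (exact helper arguments are assumptions).
    self.download_dataset_using_git(
        'https://github.com/example/foo-dataset', commit_id='abc123')

  def get_all_raw_data_paths(self):
    # One entry per file containing raw data.
    return ['/tmp/foo_dataset/raw/train.jsonl',
            '/tmp/foo_dataset/raw/test.jsonl']

  def raw_data_paths_to_raw_data_do_fn(self):
    # Return a beam.DoFn class whose process() opens each file and tags
    # every record with its split.
    class ReadFooData(beam.DoFn):
      def process(self, file_path):
        split = 'train' if 'train' in file_path else 'test'
        with open(file_path) as f:
          for line in f:
            yield {'split': split, 'raw': line.strip()}
    return ReadFooData

  def raw_data_to_graph_to_output_example(self, raw_data):
    # Transform one raw record into a plur.utils.GraphToOutputExample.
    ...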

Then add/change the following lines in plur/plur_data_generation.py:

from plur.stage_1.foo_dataset import FooDataset

flags.DEFINE_enum(
    'dataset_name',
    'dummy_dataset',
    (
        'code2seq_dataset',
        'convattn_dataset',
        'dummy_dataset',
        # [...]
        'retrieve_and_edit_dataset',
        'foo_dataset',
    ),
    'Name of the dataset to generate data.')

# [...]
def get_dataset_class(dataset_name):
  """Get the dataset class based on dataset_name."""
  if dataset_name == 'code2seq_dataset':
    return Code2SeqDataset
  elif dataset_name == 'convattn_dataset':
    return ConvAttnDataset
  elif dataset_name == 'dummy_dataset':
    return DummyDataset
  # [...]
  elif dataset_name == 'retrieve_and_edit_dataset':
    return RetrieveAndEditDataset
  elif dataset_name == 'foo_dataset':
    return FooDataset
  else:
    raise ValueError('{} is not supported.'.format(dataset_name))

Evaluation details

The details of how evaluation is performed are in plur/eval/README.md.

License

Licensed under the Apache 2.0 License.

Disclaimer

This is not an officially supported Google product.

Citation

Please cite the PLUR paper: Chen et al., "PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair", NeurIPS 2021. https://proceedings.neurips.cc//paper/2021/hash/c2937f3a1b3a177d2408574da0245a19-Abstract.html
