
EdinburghNLP / Code Docstring Corpus

Licence: other
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Code Docstring Corpus

Indian ParallelCorpus
Curated list of publicly available parallel corpus for Indian Languages
Stars: ✭ 23 (-83.21%)
Mutual labels:  corpus, neural-machine-translation
Subword Nmt
Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Stars: ✭ 1,819 (+1227.74%)
Mutual labels:  neural-machine-translation
Avo
Generate x86 Assembly with Go
Stars: ✭ 1,862 (+1259.12%)
Mutual labels:  code-generation
Artman
Artifact Manager, a build and packaging tool for Google API client libraries.
Stars: ✭ 123 (-10.22%)
Mutual labels:  code-generation
Swift Doc
A documentation generator for Swift projects
Stars: ✭ 1,674 (+1121.9%)
Mutual labels:  documentation-generator
Cluedatasetsearch
Search all Chinese NLP datasets, with commonly used English NLP datasets also included
Stars: ✭ 2,112 (+1441.61%)
Mutual labels:  corpus
Colibri Core
Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller`` which allows you to build, view, manipulate and query pattern models.
Stars: ✭ 112 (-18.25%)
Mutual labels:  corpus
Awesome Chatbot
Awesome chatbot projects, corpora, papers, and tutorials, including Chinese chatbot resources.
Stars: ✭ 1,785 (+1202.92%)
Mutual labels:  corpus
Documentalist
📝 A sort-of-static site generator optimized for living documentation of software projects
Stars: ✭ 130 (-5.11%)
Mutual labels:  documentation-generator
Io Ts Codegen
Code generation for io-ts
Stars: ✭ 123 (-10.22%)
Mutual labels:  code-generation
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-11.68%)
Mutual labels:  corpus
Nlp Models Tensorflow
Gathers machine learning and Tensorflow deep learning models for NLP problems, 1.13 < Tensorflow < 2.0
Stars: ✭ 1,603 (+1070.07%)
Mutual labels:  neural-machine-translation
Khcoder
KH Coder: for Quantitative Content Analysis or Text Mining
Stars: ✭ 126 (-8.03%)
Mutual labels:  corpus
Graphql Markdown
The easiest way to document your GraphQL schema.
Stars: ✭ 114 (-16.79%)
Mutual labels:  documentation-generator
Go Poet
A Go package for generating Go code
Stars: ✭ 134 (-2.19%)
Mutual labels:  code-generation
Goreadme
Generate readme file from Go doc. Now available with Github actions!
Stars: ✭ 113 (-17.52%)
Mutual labels:  code-generation
Quenya
Quenya is a framework to build high-quality REST API applications based on extended OpenAPI spec
Stars: ✭ 121 (-11.68%)
Mutual labels:  code-generation
Dialog corpus
Corpora for training Chinese and English dialogue systems (datasets for training chatbot systems)
Stars: ✭ 1,662 (+1113.14%)
Mutual labels:  corpus
Toolkit
Collection of useful patterns
Stars: ✭ 137 (+0%)
Mutual labels:  code-generation
Marian Dev
Fast Neural Machine Translation in C++ - development repository
Stars: ✭ 136 (-0.73%)
Mutual labels:  neural-machine-translation

code-docstring-corpus

This repository contains preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.

Paper: https://arxiv.org/abs/1707.02275

Update

Version 2 of the code-docstring-corpus, which adds class declarations, class methods, module docstrings and commit SHAs, is now available in the V2 directory.

Installation

The dependencies can be installed using pip:

pip install -r requirements.txt

The extraction scripts require AST Unparser ( https://github.com/simonpercivall/astunparse ), and NMT tokenization requires the Moses tokenizer scripts ( https://github.com/moses-smt/mosesdecoder ).
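If the original Perl scripts are inconvenient, the sacremoses package offers a Python port of the Moses tokenizer. The sketch below is only an illustrative alternative under that assumption; it is not what the repository's scripts invoke.

```python
# Illustrative alternative only: the repository's preprocessing uses the
# original Moses Perl scripts, not sacremoses.
from sacremoses import MosesTokenizer, MosesDetokenizer

tokenizer = MosesTokenizer(lang="en")
detokenizer = MosesDetokenizer(lang="en")

docstring = "Return the element-wise sum of two lists."
tokens = tokenizer.tokenize(docstring, return_str=True)
print(tokens)                                  # tokenized docstring text
print(detokenizer.detokenize(tokens.split()))  # detokenized text
```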

Details

We release a parallel corpus of 150370 triples of function declarations, function docstrings and function bodies. We include multiple corpus splits, and an additional "monolingual" code-only corpus with corresponding synthetically generated docstrings.

The corpora were assembled by scraping open-source GitHub repositories with the GitHub scraper used by Bhoopchand et al. (2016), "Learning Python Code Suggestion with a Sparse Pointer Network" (paper: https://arxiv.org/abs/1611.08307 - code: https://github.com/uclmr/pycodesuggest ).

The Python code was then preprocessed to normalize the syntax, extract top-level functions, remove comments and semantically irrelevant whitespace, and separate each function into its declaration, docstring (if present) and body. We did not extract classes or their methods.
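As a rough illustration of this step, the split into declaration, docstring and body can be approximated with the standard ast module plus astunparse. This is only a minimal sketch, not the released preprocessing pipeline, which also normalizes syntax, strips comments and applies escape tokens.

```python
import ast
import astunparse  # https://github.com/simonpercivall/astunparse


def split_functions(source):
    """Yield a (declaration, docstring, body) triple per top-level function.

    Minimal sketch only: decorators, nested functions and the corpus
    escape tokens are not handled here.
    """
    for node in ast.parse(source).body:
        if not isinstance(node, ast.FunctionDef):
            continue  # top-level functions only; classes/methods are skipped (as in v1)
        docstring = ast.get_docstring(node) or ""
        # Remove the docstring statement before unparsing the body.
        if docstring and isinstance(node.body[0], ast.Expr):
            node.body = node.body[1:] or [ast.Pass()]
            ast.fix_missing_locations(node)
        lines = [l for l in astunparse.unparse(node).splitlines() if l.strip()]
        declaration, body = lines[0], "\n".join(lines[1:])
        yield declaration, docstring, body


example = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''
for decl, doc, body in split_functions(example):
    print(decl, doc, body, sep="\n")
```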

directory: description

parallel-corpus: Main parallel corpus with a canonical split into 109108 training triples, 2000 validation triples and 2000 test triples. Each triple is annotated with metadata (repository owner, repository name, source file and line number). The directory also contains two versions of the corpus reassembled into pairs, (declaration+body, docstring) and (declaration+docstring, body), for code documentation and code generation tasks respectively. Refer to the README in this directory for a description of the escape tokens; a loading sketch follows this table.
code-only-corpus: A code-only corpus of 161630 pairs of function declarations and function bodies, annotated with metadata.
backtranslations-corpus: A corpus of docstrings automatically generated from the code-only corpus using Neural Machine Translation, to enable data augmentation by "backtranslation".
nmt-outputs: Test and validation outputs of the baseline Neural Machine Translation models.
repo_split.parallel-corpus: An alternative train/validation/test split of the parallel corpus that is "repository-consistent": no repository is split across the training, validation and test sets.
repo_split.code-only-corpus: A "repository-consistent" filtered version of the code-only corpus: it only contains fragments from repositories that appear in the training set of the above split.
scripts: Preprocessing scripts used to generate the corpora.
V2: code-docstring-corpus version 2, with class declarations, class methods, module docstrings and commit SHAs.
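Since the corpus sides are plain text files aligned line by line, a split can be read roughly as below. The file names are hypothetical placeholders, not the actual names in the repository; check the README in parallel-corpus for the real names and for how newlines and indentation are escaped.

```python
# File names below are hypothetical placeholders; see the README in
# parallel-corpus for the actual file names and escape-token conventions.
SPLIT_FILES = {
    "declarations": "parallel-corpus/train.declarations",
    "docstrings": "parallel-corpus/train.descriptions",
    "bodies": "parallel-corpus/train.bodies",
}


def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


decls = read_lines(SPLIT_FILES["declarations"])
docs = read_lines(SPLIT_FILES["docstrings"])
bodies = read_lines(SPLIT_FILES["bodies"])

# The three files are aligned: line i of each file belongs to the same triple.
assert len(decls) == len(docs) == len(bodies)

for decl, doc, body in list(zip(decls, docs, bodies))[:3]:
    print(decl, doc, body, sep="\n---\n")
```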

Baseline results

In order to compute baseline results, the data from the canonical split (parallel-corpus directory) was further sub-tokenized with Byte-Pair Encoding (Sennrich et al. 2016, paper: https://arxiv.org/abs/1508.07909 - code: https://github.com/rsennrich/subword-nmt ). Finally, we trained baseline Neural Machine Translation models for both the code2doc and the doc2code tasks using Nematus (Sennrich et al. 2017, paper: https://arxiv.org/abs/1703.04357 - code: https://github.com/rsennrich/nematus ).
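For reference, this kind of BPE sub-tokenization can be reproduced with the subword-nmt Python package. The file names and merge count below are illustrative assumptions, not the settings used for the released baselines.

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn merge operations on the training-side text (file name and number
# of merges are illustrative, not the baseline settings).
with open("train.descriptions", encoding="utf-8") as infile, \
        open("bpe.codes", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)

# Apply the learned merges to new text, one line at a time.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)

print(bpe.process_line("concatenate the docstring tokens"))
```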

Baseline outputs are available in the nmt-outputs directory.

We also used the code2doc model to generate docstrings for the code-only corpus; these are available in the backtranslations-corpus directory.

Model                        Validation BLEU    Test BLEU
declbodies2desc.baseline     14.03              13.84
decldesc2bodies.baseline     10.32              10.24
decldesc2bodies.backtransl   10.85              10.90

BLEU scores are computed using the Moses multi-bleu.perl script.
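An approximate way to recompute such scores in Python is sacrebleu. Note that this is not the Moses multi-bleu.perl script used for the numbers above, so values may differ slightly due to tokenization, and the file names below are placeholders.

```python
import sacrebleu  # approximation only; the reported numbers use Moses multi-bleu.perl

# Placeholder file names: one hypothesis/reference per line, aligned by line number.
with open("nmt-outputs/test.hypotheses", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("parallel-corpus/test.descriptions", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```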

Reference

If you use this corpus for a scientific publication, please cite: Miceli Barone, A. V. and Sennrich, R. (2017). "A parallel corpus of Python functions and documentation strings for automated code documentation and code generation", arXiv:1707.02275, https://arxiv.org/abs/1707.02275
