mast-group / tassal

License: BSD-3-Clause
Tree-based Autofolding Software Summarization Algorithm

TASSAL: Tree-based Autofolding Software Summarization ALgorithm

TASSAL is a tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.

NEW: For a live demo of TASSAL that allows you to summarize any GitHub project see:
https://code-summarizer.herokuapp.com

This is an implementation of the code summarizer from our paper:
Autofolding for Source Code Summarization
J. Fowkes, R. Ranca, M. Allamanis, M. Lapata and C. Sutton. arXiv preprint 1403.4503, 2015.

There are two main variants of the algorithm:

  • TASSAL VSM which uses a Vector Space Model for source code - less accurate but very fast (real-time)
  • TASSAL which uses a Topic Model for source code - more accurate but slower (requires training)

Both are described below.

Installation

Installing in Eclipse

Import the project into Eclipse as a Maven project using the File -> Import... menu option (note that this requires m2eclipse).

It's also possible to export a runnable jar from Eclipse using the File -> Export... menu option.

Compiling a Runnable Jar

To compile a standalone runnable jar, simply run

mvn package

in the main tassal directory (note that this requires Maven).

This will create the standalone runnable jar tassal-1.1-SNAPSHOT.jar in the tassal/target subdirectory.

Running TASSAL VSM

TASSAL VSM uses a Vector Space Model of source code tokens to determine which code regions are the least relevant and should therefore be autofolded. TASSAL VSM can run in real-time.
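
As a rough illustration of the general technique, the sketch below reduces a code region and its enclosing file to bags of tokens and scores the region by the cosine similarity of their term-frequency vectors. This is not TASSAL's implementation: the class and method names (VsmSketch, termFrequencies, cosine) are hypothetical, the tokenizer is a naive split on non-word characters, and the scoring actually used by TASSAL VSM is the one described in the paper.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the TASSAL code base.
final class VsmSketch {

    /** Counts how often each token occurs (naive split on non-word characters). */
    static Map<String, Integer> termFrequencies(String code) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : code.split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine similarity of two sparse term-frequency vectors (0 if either is empty). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * (double) b.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a term-frequency vector. */
    private static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0.0;
        for (int count : v.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        String file = "Intent intent = getIntent(); log(intent); return view;";
        String region = "log(intent);";
        // A low score would mark the region as a candidate for folding.
        System.out.println(cosine(termFrequencies(region), termFrequencies(file)));
    }
}
```

Under this scheme, regions whose tokens overlap little with the rest of the file receive low scores and would be the first candidates for folding.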

Autofolding a source file

codesum.lm.tui.FoldSourceFileVSM folds a specified source file. It has the following command line options:

  • -f   source file to autofold
  • -c   desired compression ratio for the file (%)
  • -o   (optional) where to save the folded file

See the individual file javadocs in codesum.lm.tui for information on the Java interface. In Eclipse you can set command line arguments for the TASSAL interface using the Run Configurations... menu option.

Example Usage

A complete example using the command line interface on a runnable jar.

First clone the ActionBarSherlock project into /tmp/java_projects/

$ mkdir /tmp/java_projects/
$ cd /tmp/java_projects/
$ git clone https://github.com/JakeWharton/ActionBarSherlock.git 

We can then fold a specific file:

$ java -cp tassal/target/tassal-1.1-SNAPSHOT.jar codesum.lm.tui.FoldSourceFileVSM \
    -c 50 \
    -f /tmp/java_projects/ActionBarSherlock/actionbarsherlock/src/com/actionbarsherlock/app/SherlockFragment.java \
    -o /tmp/SherlockFragmentFolded.java

which will output the folded file to /tmp/SherlockFragmentFolded.java.

Running TASSAL

TASSAL uses a scoped Topic Model of source code tokens to determine which code regions are the least relevant and should therefore be autofolded. TASSAL requires the topic model to be trained on a dataset (the larger the better) before it can fold files in that dataset. While this is slower than using a VSM, it is considerably more accurate.

Training the source code topic model

codesum.lm.tui.TrainTopicModel trains the underlying topic model. It has the following command line options:

  • -d   directory containing java projects
  • -w   working directory where the topic model creates necessary files
  • -i   (optional) number of iterations to train the topic model for

This will output a summary of the top 25 tokens in some of the discovered topics.
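
To show what such a summary amounts to, here is a minimal sketch that picks the k highest-weighted tokens from a single topic. It assumes a topic can be represented as a map from token to weight; that representation and the names TopTokensSketch and topTokens are hypothetical and are not taken from the TASSAL code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; not part of the TASSAL code base.
final class TopTokensSketch {

    /** Returns the k tokens with the largest weights, breaking ties alphabetically. */
    static List<String> topTokens(Map<String, Double> tokenWeights, int k) {
        return tokenWeights.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Double>comparingByKey()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up weights for a made-up topic, purely for demonstration.
        Map<String, Double> topic = Map.of("view", 0.5, "intent", 0.3, "bundle", 0.2);
        System.out.println(topTokens(topic, 2));
    }
}
```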

Autofolding a source file

codesum.lm.tui.FoldSourceFile folds a specified source file. It has the following command line options:

  • -w   working directory where the topic model creates necessary files (same as above)
  • -f   source file to autofold
  • -p   project containing the file to fold
  • -c   desired compression ratio for the file (%)
  • -b   (optional) background topic to back off to (0-2, default=2)
  • -o   (optional) where to save the folded file

See the individual file javadocs in codesum.lm.tui for information on the Java interface. In Eclipse you can set command line arguments for the TASSAL interface using the Run Configurations... menu option.

Example Usage

A complete example using the command line interface on a runnable jar.

First clone the ActionBarSherlock project into /tmp/java_projects/

$ mkdir /tmp/java_projects/
$ cd /tmp/java_projects/
$ git clone https://github.com/JakeWharton/ActionBarSherlock.git

Now you can train the topic model on the java projects in /tmp/java_projects/

$ java -cp tassal/target/tassal-1.1-SNAPSHOT.jar codesum.lm.tui.TrainTopicModel \
    -d /tmp/java_projects/ -w /tmp/ -i 100

This trains the topic model for 100 iterations and outputs the model to /tmp/. We can then fold a specific file:

$ java -cp tassal/target/tassal-1.1-SNAPSHOT.jar codesum.lm.tui.FoldSourceFile \
    -w /tmp/ -c 50 -p ActionBarSherlock \
    -f /tmp/java_projects/ActionBarSherlock/actionbarsherlock/src/com/actionbarsherlock/app/SherlockFragment.java \
    -o /tmp/SherlockFragmentFolded.java

which will output the folded file to /tmp/SherlockFragmentFolded.java.

Summarizing a Project

TASSAL can also summarize an entire project by finding the source files (and therefore classes) that are most representative of that project. These top project files can then be autofolded using TASSAL as above (see Autofolding a source file).

codesum.lm.tui.ListSalientFiles lists the most representative source files for a given project. It has the following command line options:

  • -s   working directory where the topic model creates necessary files (same as above)
  • -d   directory containing java projects
  • -p   project to summarize
  • -c   desired compression ratio (% of project files to list)
  • -b   (optional) background topic to back off to (0-2, default=2)
  • -o   (optional) where to save the salient files
  • -i   (optional) whether to ignore unit test files (default=true)

Note that this requires the topic model to first be trained on the given project (see Training the source code topic model above).
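
No example invocation is given for this command, so the following sketch simply assembles one from the options listed above. The paths and project name mirror the earlier examples, the value 10 for -c is an arbitrary choice, and the helper class ListSalientFilesCommand is hypothetical. It only builds and prints the command line; the commented-out line shows one way the command could be launched once the jar has been built.

```java
import java.util.List;

// Hypothetical helper; it only assembles the documented command-line options.
final class ListSalientFilesCommand {

    /** Builds the java invocation for codesum.lm.tui.ListSalientFiles. */
    static List<String> build(String jar, String workingDir, String projectsDir,
                              String project, int percentOfFiles) {
        return List.of(
                "java", "-cp", jar, "codesum.lm.tui.ListSalientFiles",
                "-s", workingDir,       // working directory of the trained topic model
                "-d", projectsDir,      // directory containing java projects
                "-p", project,          // project to summarize
                "-c", Integer.toString(percentOfFiles)); // % of project files to list
    }

    public static void main(String[] args) {
        List<String> cmd = build(
                "tassal/target/tassal-1.1-SNAPSHOT.jar",
                "/tmp/", "/tmp/java_projects/", "ActionBarSherlock", 10);
        System.out.println(String.join(" ", cmd));
        // To run it for real (assumes the jar exists and the model has been trained):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```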

Bugs

Please report any bugs using GitHub's issue tracker.

License

This tool is released under the new BSD license.
