Interactive tree topic modeling

Yuening Hu, Jordan Boyd-Graber, and Brianna Satinoff. Interactive Topic Modeling. Association for Computational Linguistics, 2011.

Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic.

The project code has been Mavenified and lightly edited by Travis Brown. All dependencies are now managed by Maven and are not packaged with the project. The Java source files have also been moved out of the MALLET namespace, and it is no longer necessary to merge them manually with the MALLET source.

Compiling

Apache Maven is required to build this project. The following command will download all dependencies (as necessary) and compile the code:

mvn compile

The class files will now be available in target/classes, and will be used when you run the bin/mallet script in subsequent steps.

Importing documents

The following command will convert documents into the MALLET format as described in the MALLET documentation:

bin/mallet import-dir --input ../../../data/synthetic/synth_word \
  --output input/synthetic-topic-input.mallet --keep-sequence

This synthetic data set does not need --remove-stopwords, but for real text you would generally include it here. The input directory already contains the synthetic-topic-input.mallet file, so you can skip this step and continue directly to the steps below.

Generating vocabulary file

bin/mallet train-topics --input input/synthetic-topic-input.mallet \
  --use-tree-lda true --generate-vocab true --vocab input/synthetic/synthetic.voc

Generating the tree

The following command requires Python 2, so you may need to change the python command if Python 3 is the default on your system.

python tree/ontology_writer_wordleaf.py --vocab=input/synthetic/synthetic.voc \
  --constraints=input/empty.cons --write_wordnet=False \
  --write_constraints=True --wnname=input/synthetic/synthetic.wn

The constraints file can be empty, in which case the output is a tree with symmetric priors and the model behaves like standard LDA.
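
The reason an unconstrained tree behaves like LDA can be sketched as follows. The notation here is introduced for illustration and does not come from the project code, and the sketch assumes that an empty constraints file yields a flat tree in which every vocabulary word is a leaf directly under the root r. In a tree-structured topic, each internal node i of topic k has a distribution over its children, and a word's probability is the product of the edge probabilities along its path:

```latex
% Assumed notation: \pi_{k,i \to j} is the probability, in topic k,
% of moving from internal node i to its child j.
p(w \mid z = k) \;=\; \prod_{(i \to j) \,\in\, \mathrm{path}(w)} \pi_{k,\, i \to j},
\qquad \pi_{k,i} \sim \mathrm{Dirichlet}(\beta_i).

% Flat tree (assumption): path(w) is the single edge r \to w, so
p(w \mid z = k) \;=\; \pi_{k,\, r \to w},
\qquad \pi_{k,r} \sim \mathrm{Dirichlet}(\beta, \ldots, \beta),
% which has the same form as the LDA topic-word prior
% \phi_k \sim \mathrm{Dirichlet}(\beta).
```

Under that assumption, constraints change the model only by adding internal nodes, which lengthens the paths of the constrained words.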

You can check the generated tree structure with the following commands (note that Protobuf 2.3 is required):

cat input/synthetic/synthetic.wn.0 | protoc tree/lib/proto/wordnet_file.proto \
  --decode=topicmod_projects_ldawn.WordNetFile \
  --proto_path=tree/lib/proto/ > input/synthetic/tmp0.txt

cat input/synthetic/synthetic.wn.1 | protoc tree/lib/proto/wordnet_file.proto \
  --decode=topicmod_projects_ldawn.WordNetFile \
  --proto_path=tree/lib/proto/ > input/synthetic/tmp1.txt
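
The two commands above differ only in the file index, so the same check can be scripted. The following is a minimal sketch using only the Python standard library. The helper names (decode_command, output_path, decode_all) are hypothetical, and it assumes the tree writer's output files all match the pattern synthetic.wn.* with a numeric suffix, as the two files shown above do.

```python
import glob
import os
import subprocess

# Paths and message type copied from the protoc commands above.
PROTO_DIR = "tree/lib/proto/"
PROTO_FILE = PROTO_DIR + "wordnet_file.proto"
MESSAGE_TYPE = "topicmod_projects_ldawn.WordNetFile"


def decode_command():
    """Return the protoc invocation; the binary tree is read from stdin."""
    return ["protoc", PROTO_FILE,
            "--decode=" + MESSAGE_TYPE,
            "--proto_path=" + PROTO_DIR]


def output_path(tree_file):
    """Map .../synthetic.wn.0 to .../tmp0.txt, mirroring the commands above."""
    index = tree_file.rpartition(".")[2]
    return os.path.dirname(tree_file) + "/tmp" + index + ".txt"


def decode_all(prefix="input/synthetic/synthetic.wn"):
    """Decode every file matching prefix.* and return the text files written."""
    written = []
    for tree_file in sorted(glob.glob(prefix + ".*")):
        target = output_path(tree_file)
        with open(tree_file, "rb") as src, open(target, "wb") as dst:
            # Equivalent to: cat tree_file | protoc ... > target
            subprocess.check_call(decode_command(), stdin=src, stdout=dst)
        written.append(target)
    return written


if __name__ == "__main__":
    for path in decode_all():
        print(path)
```

Run from the project root, this should produce the same tmp0.txt and tmp1.txt files as the commands above, provided protoc is on the PATH.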

Training the tree topic model

bin/mallet train-topics --input input/synthetic-topic-input.mallet --num-topics 5 \
  --num-iterations 300 --alpha 0.5 --random-seed 0 --output-interval 10 \
  --output-dir output/model --use-tree-lda True --tree-model-type fast \
  --tree input/synthetic/synthetic.wn --tree-hyperparameters input/tree_hyperparams \
  --vocab input/synthetic/synthetic.voc --clear-type term --constraint input/empty.cons

Resuming the tree topic model

bin/mallet train-topics --input input/synthetic-topic-input.mallet --num-topics 5 \
  --num-iterations 600 --alpha 0.5 --random-seed 0 --output-interval 10 \
  --output-dir output/model --use-tree-lda True \
  --tree input/synthetic/synthetic.wn --tree-hyperparameters input/tree_hyperparams \
  --vocab input/synthetic/synthetic.voc --clear-type term --constraint input/empty.cons \
  --resume true --resume-dir output/model
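
The training and resume commands share most of their flags. The sketch below makes the differences explicit by building both argument lists from one function. It uses only flags and values that appear in the commands above. The function name train_topics_args and its parameters are hypothetical, and the sketch simply mirrors the fact that the resume command above omits --tree-model-type, without claiming that the omission is required.

```python
def train_topics_args(num_iterations, tree_model_type=None, resume_dir=None):
    """Build the bin/mallet train-topics argument list for the synthetic example."""
    args = [
        "bin/mallet", "train-topics",
        "--input", "input/synthetic-topic-input.mallet",
        "--num-topics", "5",
        "--num-iterations", str(num_iterations),
        "--alpha", "0.5",
        "--random-seed", "0",
        "--output-interval", "10",
        "--output-dir", "output/model",
        "--use-tree-lda", "True",
    ]
    # Only the initial training command above passes --tree-model-type.
    if tree_model_type is not None:
        args += ["--tree-model-type", tree_model_type]
    args += [
        "--tree", "input/synthetic/synthetic.wn",
        "--tree-hyperparameters", "input/tree_hyperparams",
        "--vocab", "input/synthetic/synthetic.voc",
        "--clear-type", "term",
        "--constraint", "input/empty.cons",
    ]
    # Only the resume command above passes --resume and --resume-dir.
    if resume_dir is not None:
        args += ["--resume", "true", "--resume-dir", resume_dir]
    return args


# The two invocations shown above.
initial = train_topics_args(300, tree_model_type="fast")
resumed = train_topics_args(600, resume_dir="output/model")

if __name__ == "__main__":
    print(" ".join(initial))
    print(" ".join(resumed))
```

Either list can be passed to subprocess.check_call from the project root to run the corresponding command.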
