About SubModLib

SubModLib is an easy-to-use, efficient and scalable Python library for submodular optimization with a C++ optimization engine. Submodlib finds its application in summarization, data subset selection, hyper parameter tuning, efficient training etc. Through a rich API, it offers a great deal of flexibility in the way it can be used.

Please check out our latest arxiv preprint: https://arxiv.org/abs/2202.10680

Salient Features

Rich suite of functions for a wide variety of subset selection tasks:
- regular set (submodular) functions
- submodular mutual information functions
- conditional gain functions
- conditional mutual information functions
Supports different types of optimizers
- naive greedy
- lazy (accelerated) greedy
- stochastic (random) greedy
- lazier than lazy greedy
Combines the best of Python's ease of use and C++'s efficiency
Rich API which gives a variety of options to the user. See this notebook for an example of different usage patterns
De-coupled function and optimizer paradigm makes it suitable for a wide-variety of tasks
Comprehensive documentation (available here)

Google Colab Notebooks Demonstrating the power of SubModLib and sample usage

Modelling capabilities of regular submodular functions
Modelling capabilities of submodular mutual information (SMI) functions
Modelling capabilities of conditional gain (CG) functions
Modelling capabilities of conditional mutual information (CMI) functions
This notebook contains a quantitative analysis of performance of different functions and role of the parameterization in aspects like query-coverage, query-relevance, privacy-irrelevance and diversity for different SMI, CG and CMI functions as observed on synthetically generated dataset.
This notebook contains similar analysis on ImageNet dataset.
For a more detailed discussion on all possible usage patterns, please see Different Options of Usage

Setup

Alternative 1

$ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ submodlib

Alternative 2 (if local docs need to be built and test cases need to be run)

$ git clone https://github.com/decile-team/submodlib.git
$ cd submodlib
$ pip install .
Latest documentation is available at readthedocs. However, if local documentation is required to be built, follow these steps::
- $ pip install -U sphinx
- $ pip install sphinxcontrib-bibtex
- $ pip install sphinx-rtd-theme
- $ cd docs
- $ make clean html
To run the tests, follow these steps:
- $ pip install pytest
- $ pytest # this runs ALL tests
- $ pytest -m <marker> --verbose --disable-warnings -rA # this runs test specified by the . Possible markers are mentioned in pyproject.toml file.

Usage

It is very easy to get started with submodlib. Using a submodular function in submodlib essentially boils down to just two steps:

instantiate the corresponding function object
invoke the desired method on the created object

The most frequently used methods are:

f.evaluate() - takes a subset and returns the score of the subset as computed by the function f
f.marginalGain() - takes a subset and an element and returns the marginal gain of adding the element to the subset, as computed by f
f.maximize() - takes a budget and an optimizer to return an optimal set as a result of maximizing f

For example,

from submodlib import FacilityLocationFunction
objFL = FacilityLocationFunction(n=43, data=groundData, mode="dense", metric="euclidean")
greedyList = objFL.maximize(budget=10,optimizer='NaiveGreedy')

For a more detailed discussion on all possible usage patterns, please see Different Options of Usage

Functions

Regular set (submodular) functions (for vanilla subset selection requiring representation, diversity, importance, relevance, coverage)
Submodular mutual information functions (for query-focused subset selelction)
Conditional gain functions (for query-irrelevant/privacy-preserving subset selection)
Conditional mutual information functions (for joint query-focused and privacy-preserving subset selection)

Modelling Capabilities of Different Functions

We demonstrate the representational power and modeling capabilities of different functions qualitatively in the following Google Colab notebooks:

This notebook contains a quantitative analysis of performance of different functions and role of the parameterization in aspects like query-coverage, query-relevance, privacy-irrelevance and diversity for different SMI, CG and CMI functions as observed on synthetically generated dataset. This notebook contains similar analysis on ImageNette dataset.

Optimizers

NaiveGreedy
LazyGreedy
StochasticGreedy
LazierThanLazyGreedy
This notebook demonstrates the use and comparison of different optimizers.

Sample Application (Image collection summarization)

This notebook contains demonstration of using submodlib for an image collection summarization application.

Timing Analysis

To gauge the performance of submodlib, selection by Facility Location was performed on a randomly generated dataset of 1024-dimensional points. Specifically the following code was run for the number of data points ranging from 50 to 10000.

K_dense = helper.create_kernel(dataArray, mode="dense", metric='euclidean', method="other")
obj = FacilityLocationFunction(n=num_samples, mode="dense", sijs=K_dense, separate_rep=False,pybind_mode="array")
obj.maximize(budget=budget,optimizer=optimizer, stopIfZeroGain=False, stopIfNegativeGain=False, verbose=False, show_progress=False)

The above code was timed using Python's timeit module averaged across three executions each. We report the following numbers:

Number of data points	Time taken (in seconds)
50	0.00043
100	0.001074
200	0.003024
500	0.016555
1000	0.081773
5000	2.469303
6000	3.563144
7000	4.667065
8000	6.174047
9000	8.010674
10000	9.417298

Contributors

Vishal Kaushal, Ganesh Ramakrishnan and Rishabh Iyer. Currently maintained by CARAML Lab

Contact

Should you face any issues or have any feedback or suggestions, please feel free to contact vishal[dot]kaushal[at]gmail.com

Acknowledgements

This work is supported by the Ekal Fellowship (www.ekal.org). This work is also supported by the National Science Foundation(NSF) under Grant Number 2106937, a startup grant from UT Dallas, as well as Google and Adobe awards.

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

decile-team / submodlib

Programming Languages

Labels

Projects that are alternatives of or similar to submodlib