Markovian

Kotlin 1.6.0 CI

Automatic integration and probabilistic programming in the spirit of Church and Anglican.

Research Questions

  • Is there a way to compile PGMs to probabilistic circuits?
  • Is there an algebra that unifies [back/belief/expectation/equilibrium]-prop on PGMs/PCs?
  • Is there a family of functions which is closed over differentiation and integration?
  • Is there a tractable inversion sampling procedure for higher dimensional quantiles?
  • Is there a way to perform inference on Bayesian networks using backprop?
  • Is there a formula for propagating uncertainty through elementary functions?

Summary

The way we teach probability in school is too abstract. First we learn about probability distributions. Where did these distributions come from? How did they arise? To gain a better understanding of probabilistic reasoning, we should start from first principles with an application firmly in mind, and build up our intuition through simple programming exercises.

Why does probability exist? Probability exists because we have imperfect information about the world. Whether due to inaccessibility, poor visibility, measurement error or other sources, our sensing instruments are only capable of gathering a blurry representation of their environment. We are thus permanently stranded in a state of uncertainty about the world around us. Nevertheless, we would like to reconstruct and reason about possible realities to make informed decisions.

Most computers are, by contrast, deterministic and fully observable. How could we use a deterministic machine to simulate a stochastic one? We must first have a source of noise. This may come from the outside world through a special input device called a random number generator (RNG), or, lacking one, from an artificial or pseudo-RNG (PRNG). A good PRNG is indistinguishable from a true RNG to any observer who does not know its internal state, and its internal state cannot easily be guessed by observing its outputs. Many suitable candidates have been proposed.
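To make this concrete, here is a minimal sketch of a PRNG, a linear congruential generator with constants borrowed from Knuth's MMIX. It is purely illustrative and is not the generator used by this library:

```kotlin
// A minimal PRNG sketch: a linear congruential generator (LCG).
// Constants are Knuth's MMIX parameters; the class is illustrative only.
class LCG(seed: Long) {
    private var state: Long = seed

    // Each call scrambles the hidden state deterministically.
    fun nextLong(): Long {
        state = state * 6364136223846793005L + 1442695040888963407L
        return state
    }

    // Map the top 53 bits to a uniform Double in [0, 1).
    fun nextDouble(): Double = (nextLong() ushr 11) / (1L shl 53).toDouble()
}

fun main() {
    val rng = LCG(42L)
    repeat(3) { println(rng.nextDouble()) }  // same seed, same sequence
}
```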

What is a probability distribution? A probability distribution is a set of elements, accompanied by the relative frequency with which they occur. Suppose we have a foreign language which we do not understand. How could we learn it? First, we might read some text and gather various statistics, such as the alphabet size and how many times each symbol occurs. We could summarize this information in a data structure called a histogram. There are many useful algorithms for computing these summaries efficiently, either exactly or approximately.
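For instance, a character histogram is a one-liner in Kotlin (a sketch, not this library's API):

```kotlin
// Sketch: summarize a corpus as a histogram of symbol frequencies.
fun histogram(text: String): Map<Char, Int> = text.groupingBy { it }.eachCount()

fun main() {
    val counts = histogram("abcbbbbccba")
    println(counts)                           // {a=2, b=6, c=3}
    println("alphabet size = ${counts.size}") // 3
}
```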

Now, suppose we want to sample from our probabilistic model. To do so, we could take our histogram and, for each symbol, compute a running sum up to its histogram index, called a cumulative distribution function (CDF). We draw a sample from our PRNG to get a number uniformly distributed between 0 and 1, then select the first symbol whose CDF value exceeds that number, and voila! we have obtained a sample.
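A sketch of this inversion sampler, reusing the histogram from above (illustrative code, not this library's API):

```kotlin
import kotlin.random.Random

// Sketch of inversion sampling: accumulate the histogram into a CDF, then map
// a uniform draw in [0, 1) to the first symbol whose CDF exceeds it.
fun sample(hist: Map<Char, Int>, rng: Random): Char {
    val total = hist.values.sum().toDouble()
    val u = rng.nextDouble()                 // uniform in [0, 1)
    var cdf = 0.0
    for ((symbol, count) in hist) {
        cdf += count / total                 // running sum of probabilities
        if (u < cdf) return symbol
    }
    return hist.keys.last()                  // guard against rounding error
}

fun main() {
    val rng = Random(0)
    val hist = mapOf('a' to 2, 'b' to 6, 'c' to 3)
    val draws = List(10_000) { sample(hist, rng) }
    println(draws.groupingBy { it }.eachCount())  // roughly in the ratio 2:6:3
}
```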

In order to sample longer sequences, we might want to incorporate some context, such as pairs of adjacent characters. To do so, we could build a two-dimensional histogram, sample the first symbol from the "marginal" distribution P(T₁=t₁), and the second from the "conditional" distribution P(T₂=t₂|T₁=t₁), the probability of the second character being t₂ given the preceding character was t₁. This data structure is called a Markov or transition matrix.

P(Tโ‚=tโ‚,Tโ‚‚=tโ‚‚) = P(Tโ‚‚=tโ‚‚|Tโ‚=tโ‚)P(Tโ‚=tโ‚)

String: abcbbbbccba…
        1   2   3   4   5   6   7   8   9   10  …
Window: ab, bc, cb, bb, bb, bb, bc, cc, cb, ba, …
Counts: 1 , 2 , 2 , 3 , 3 , 3 , 2 , 1 , 2 , 1 , …
Transition matrix at index=10:
   a  b  c …
 ┌─────────
a│ 0  1  0
b│ 1  3  2
c│ 0  2  1
⋮          / 10
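The same bigram tally can be built as a nested map, i.e. a sparse transition matrix, and sampled one conditional row at a time. A sketch (not this library's API):

```kotlin
import kotlin.random.Random

// Sketch: tally bigram counts into prev -> (next -> count), i.e. a sparse
// Markov transition matrix, then sample from the conditional row P(T₂|T₁=prev).
fun transitionCounts(text: String): Map<Char, Map<Char, Int>> =
    text.zipWithNext()                        // [(a,b), (b,c), (c,b), ...]
        .groupBy({ it.first }, { it.second }) // prev -> [next, next, ...]
        .mapValues { (_, next) -> next.groupingBy { it }.eachCount() }

fun sampleNext(matrix: Map<Char, Map<Char, Int>>, prev: Char, rng: Random): Char {
    val row = matrix.getValue(prev)           // conditional distribution given prev
    var u = rng.nextInt(row.values.sum())
    for ((symbol, count) in row) { u -= count; if (u < 0) return symbol }
    return row.keys.last()
}

fun main() {
    val matrix = transitionCounts("abcbbbbccba")
    println(matrix)                           // {a={b=1}, b={c=2, b=3, a=1}, c={b=2, c=1}}
    println(sampleNext(matrix, 'b', Random(0)))
}
```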

More generally, we might have longer windows containing triples or n-tuples of contiguous symbols. To represent longer contexts, we could record their probabilities into a multidimensional array or transition tensor, representing the probability of a subsequence t₁t₂…tₙ. This tensor is a probability distribution whose conditionals "slice" or disintegrate the tensor along a dimension, producing an (n-1)-dimensional hyperplane, the conditional probability of observing a given symbol in a given slot:

P(Tโ‚=tโ‚,Tโ‚‚=tโ‚‚,โ€ฆ,Tโ‚™=tโ‚™) = P(Tโ‚™=tโ‚™|Tโ‚™โ‚‹โ‚=tโ‚™โ‚‹โ‚, Tโ‚™โ‚‹โ‚‚=tโ‚™โ‚‹โ‚‚, โ€ฆ,Tโ‚=tโ‚),

where the tensor rank n is given by the context length, T₁…Tₙ are random variables and t₁…tₙ are their concrete instantiations. This tensor is a hypercube with shape |alphabet|ⁿ: each entry identifies a unique subsequence of n symbols, together with the probability of observing them in the same context. Note the exponential state space of this model: as n grows larger, it quickly requires a very large amount of space to represent.
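To see the blow-up concretely, here is a sketch of a dense n-gram tensor stored as one flat array of |alphabet|ⁿ cells (illustrative only, not this library's API):

```kotlin
// Sketch: a dense transition "tensor" over contexts of length n, stored as a
// single flat array with |alphabet|^n cells. Most cells stay zero, and the
// array size grows exponentially in n.
class DenseNGramTensor(val alphabet: List<Char>, val n: Int) {
    private val cells = IntArray((1..n).fold(1) { acc, _ -> acc * alphabet.size })

    // Mixed-radix index: each symbol selects a coordinate along one dimension.
    private fun index(ngram: String): Int =
        ngram.fold(0) { acc, c -> acc * alphabet.size + alphabet.indexOf(c) }

    fun increment(ngram: String) { cells[index(ngram)]++ }
    fun count(ngram: String): Int = cells[index(ngram)]
}

fun main() {
    val tensor = DenseNGramTensor(listOf('a', 'b', 'c'), n = 2)
    "abcbbbbccba".windowed(2).forEach { tensor.increment(it) }
    println(tensor.count("bb"))  // 3
    println(tensor.count("cb"))  // 2
}
```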

The first improvement we could make is to sparsify the tensor, i.e., only record its nonzero entries in some sort of list-based data structure, or sparse dictionary. Suppose, in the previous example where n=2, we only stored the nonzero entries as a list of pairs mapping bigrams to their frequencies. By doing so, we could reduce the space consumption by 1/3. We could further reduce the space by only storing duplicate frequencies once. This would improve the space consumption by a small factor for multimodal distributions.

  Matrix               Sparse List           Bidirectional Map 
  
   a  b  c …
 ┌─────────                                  (ab,ba,cc) <-> 1
a│ 0  1  0       [(ab,1),                       (bc,cb) <-> 2
b│ 1  3  2        (ba,1),(bb,3),(bc,2)             (bb) <-> 3
c│ 0  2  1               (cb,2),(cc,1)]            else  -> 0
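A sketch of the sparse and deduplicated representations in Kotlin (illustrative, not this library's API):

```kotlin
// Sketch: store only the nonzero bigram counts, then invert the map so each
// distinct frequency is stored once, as in the bidirectional map above.
fun main() {
    val sparse: Map<String, Int> =
        "abcbbbbccba".windowed(2).groupingBy { it }.eachCount()
    println(sparse)          // {ab=1, bc=2, cb=2, bb=3, cc=1, ba=1}

    val dedup: Map<Int, List<String>> =
        sparse.entries.groupBy({ it.value }, { it.key })
    println(dedup)           // {1=[ab, cc, ba], 2=[bc, cb], 3=[bb]}

    fun count(bigram: String): Int = sparse[bigram] ?: 0   // the "else -> 0" case
    println(count("aa"))     // 0
}
```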

However, we can do even better! Since the prefixes b and c occur more than once, we could store the transition counts as a prefix tree of pairs, whose first entry records the prefix and whose second records its frequency. As before, we could compress this tree into a DAG by deduplicating leaves with equal frequencies. This might be depicted as follows:

          Prefix Tree                         Prefix DAG
             (*,10)                              (*,10)
  ┌────────────┼───────────────┐          ┌────────┼────────────┐
(a,1)        (b,6)           (c,3)        a──┐  6──b   ┌────────c
  │      ┌─────┼─────┐      ┌──┴──┐       │  │  ┌──┼───│──┐   ┌─┴─┐
(b,1)  (a,1) (b,3) (c,2)  (b,2) (c,1)     b  │  a  b   │  c   b   c
                                          │  ├──│──│───│──│───│───┘
                                          └──┼──┘  └─┬─┘  └─┬─┘
                                             1       3      2  
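A sketch of the count-annotated prefix tree (the equal-count DAG compression is omitted for brevity; this is illustrative, not this library's API):

```kotlin
// Sketch: a prefix tree (trie) whose nodes carry the frequency of the prefix
// spelled out by the path from the root.
class TrieNode(var count: Int = 0, val children: MutableMap<Char, TrieNode> = mutableMapOf())

fun buildTrie(text: String, n: Int): TrieNode {
    val root = TrieNode()
    for (window in text.windowed(n)) {
        root.count++                                   // (*, total number of windows)
        var node = root
        for (symbol in window) {
            node = node.children.getOrPut(symbol) { TrieNode() }
            node.count++                               // frequency of this prefix
        }
    }
    return root
}

fun main() {
    val root = buildTrie("abcbbbbccba", n = 2)
    println(root.count)                                     // 10, i.e. (*,10)
    println(root.children['b']?.count)                      // 6,  i.e. (b,6)
    println(root.children['b']?.children?.get('b')?.count)  // 3,  i.e. "bb" seen 3 times
}
```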

Space complexity, however important, is only part of the picture. Often the limiting factor of a data structure is how well it can be parallelized. While concurrent tries and dictionaries are available, they are nontrivial to implement and have suboptimal scaling properties. A far more scalable approach is to recursively decompose the data into many disjoint subsets, summarize each one, and recombine the summaries. By designing the summary carefully, this process can be made embarrassingly parallel.

Mathematically, the structure we are looking for is something called a monoid. If the summary of interest can be computed in any order, it is called a commutative monoid. Many summaries naturally exhibit this property: sum, min, max, top-k and various probability distributions. Summaries which can be decomposed and recombined in this fashion are embarrassingly parallelizable.

                             abcbbbbccba ────────────────────────────────────────────┐
              ┌───────────────────┴───────────────┐                                  │
            abcbbb                              bbccba                               │
            (*,5)              +                (*,5)              =               (*,10)
  ┌───────────┼───────────┐               ┌───────┴────────┐            ┌────────────┼───────────────┐
(a,1)       (b,3)       (c,1)  +        (b,3)            (c,2)     =  (a,1)        (b,6)           (c,3)
  │        ┌──┴──┐        │         ┌─────┼─────┐       ┌──┴──┐         │      ┌─────┼─────┐      ┌──┴──┐
(b,1)    (b,2) (c,1)    (b,1)  +  (a,1) (b,1) (c,1)   (b,1) (c,1)  =  (b,1)  (a,1) (b,3) (c,2)  (b,2) (c,1)
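A sketch of the merge: bigram counts form a commutative monoid under pointwise addition, so each shard can be summarized independently and the summaries combined in any order. (As in the figure, the shards below overlap by one character so the window spanning the cut is not lost.) Illustrative code, not this library's API:

```kotlin
// Sketch: summaries (bigram count maps) form a commutative monoid whose
// identity is the empty map and whose operation is pointwise addition.
fun summarize(shard: String): Map<String, Int> =
    shard.windowed(2).groupingBy { it }.eachCount()

fun merge(a: Map<String, Int>, b: Map<String, Int>): Map<String, Int> =
    (a.keys + b.keys).associateWith { (a[it] ?: 0) + (b[it] ?: 0) }

fun main() {
    val left = summarize("abcbbb")    // summarized on one worker
    val right = summarize("bbccba")   // summarized on another
    val merged = merge(left, right)
    println(merged)                              // {ab=1, bc=2, cb=2, bb=3, cc=1, ba=1}
    println(merged == summarize("abcbbbbccba"))  // true: same as the exact tally
}
```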

So far, we have considered exact methods. What if we did not care about computing the exact transition probabilities, but only about approximating them? How could we achieve that? Perhaps by using a probabilistic data structure, we could reduce the complexity even further.

One approach would be to use an approximate counting algorithm, or sketch-based summary. Sketches are probabilistic data structures for computing some statistic approximately but efficiently. Without going into the details, sketching algorithms are designed to smoothly trade off error bounds for space efficiency and can be used to compute a summary statistic over a very large number of items. Even with a very low error tolerance, we can often obtain a dramatic reduction in space complexity.
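As an illustration, here is a minimal count-min sketch; the width, depth, and hash mixing below are arbitrary choices for the example, not tuned parameters from this library:

```kotlin
// A minimal count-min sketch: approximate counts in O(width * depth) space.
// Estimates never undercount; hash collisions can only inflate them.
class CountMinSketch(private val width: Int = 256, private val depth: Int = 4) {
    private val table = Array(depth) { IntArray(width) }
    private val seeds = IntArray(depth) { (it + 1) * 1_000_003 }   // arbitrary row seeds

    private fun bucket(item: String, row: Int): Int =
        Math.floorMod(item.hashCode() * 31 + seeds[row], width)

    fun add(item: String) {
        for (row in 0 until depth) table[row][bucket(item, row)]++
    }

    fun estimate(item: String): Int =
        (0 until depth).minOf { row -> table[row][bucket(item, row)] }
}

fun main() {
    val sketch = CountMinSketch()
    "abcbbbbccba".windowed(2).forEach { sketch.add(it) }
    println(sketch.estimate("bb"))   // 3 (possibly more, never less)
    println(sketch.estimate("cb"))   // 2 (possibly more, never less)
}
```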

What about sample complexity? In many cases, we are not constrained by space or time, but by samples. In many high-dimensional settings, even if we had an optimally efficient sparse encoding, obtaining a faithful approximation to the true distribution would require more data than we could plausibly obtain. How could we do better in terms of sample efficiency? We need two things: (1) inductive priors and (2) learnable parameters. This is where algebraic structures like groups, rings and their cousins come in handy.

If we squint a little, neural networks are a bit like mergeable summaries which deconstruct their inputs and recombine them in specific ways. For images, we have the special Euclidean group, SE(2). There are many other groups which are interesting to consider in various domains. By constructing our models around these invariants, we can recover latent structure with far fewer samples than a naive encoding scheme would require. For our purposes, we are particularly interested in semiring algebras, a specific kind of algebra that can be employed to compute many useful properties of graphs, such as their longest, shortest and widest paths.
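A sketch of the semiring idea: the same matrix product, parameterized by a semiring, computes different path properties; below, the tropical (min, +) semiring yields shortest paths. Illustrative code, not this library's API:

```kotlin
// Sketch: a semiring packages "plus" and "times" so one generic matrix product
// can answer different path queries; (min, +) gives shortest paths.
data class Semiring<T>(
    val zero: T,                 // additive identity
    val one: T,                  // multiplicative identity (unused below)
    val plus: (T, T) -> T,
    val times: (T, T) -> T
)

val tropical = Semiring(
    zero = Double.POSITIVE_INFINITY,
    one = 0.0,
    plus = { a: Double, b: Double -> minOf(a, b) },  // "addition" keeps the better path
    times = { a: Double, b: Double -> a + b }        // "multiplication" concatenates edges
)

fun <T> matMul(a: List<List<T>>, b: List<List<T>>, s: Semiring<T>): List<List<T>> =
    a.indices.map { i ->
        b[0].indices.map { j ->
            a[i].indices.fold(s.zero) { acc, k -> s.plus(acc, s.times(a[i][k], b[k][j])) }
        }
    }

fun main() {
    val inf = Double.POSITIVE_INFINITY
    // Edge weights of a tiny graph; entry [i][j] is the cost of edge i -> j.
    val g = listOf(
        listOf(0.0, 1.0, 4.0),
        listOf(inf, 0.0, 2.0),
        listOf(inf, inf, 0.0)
    )
    // Squaring the matrix under the tropical semiring gives shortest paths of
    // at most two edges: the [0][2] entry becomes 3.0 (via the middle vertex).
    println(matMul(g, g, tropical))
}
```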

TODO.

Example

Suppose we have two Gaussian distributions with known parameters and want to combine them somehow:

How could we combine them to form a new distribution? We could simply average their densities:

But this might not be a valid operation depending on the units. We could "mix" them by flipping a coin:

But the mean of the mixture might not give the mean of the two datasets. Or we could multiply the PDFs:

Two Gaussian densities, when multiplied together, form another Gaussian (up to a normalizing constant)! This is a very nice property.

Now we do not need to sample from the parents, but can discard them and sample directly from the child!
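A sketch of this closure property in code: the renormalized product of two Gaussian densities is a Gaussian whose parameters have a closed form (precisions add, means are precision-weighted). The class below is illustrative, not this library's API:

```kotlin
// Sketch: closed-form parameters of the (renormalized) product of two Gaussians.
data class Gaussian(val mean: Double, val variance: Double)

fun product(p: Gaussian, q: Gaussian): Gaussian {
    val variance = 1.0 / (1.0 / p.variance + 1.0 / q.variance)        // precisions add
    val mean = variance * (p.mean / p.variance + q.mean / q.variance) // precision-weighted mean
    return Gaussian(mean, variance)
}

fun main() {
    val child = product(Gaussian(0.0, 1.0), Gaussian(2.0, 1.0))
    println(child)   // Gaussian(mean=1.0, variance=0.5) -- sample from this directly
}
```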

Combinatorial Properties

  • Stable distributions are closed under convolution and linear combination of their random variables.
  • A distribution is called infinitely divisible if it can be expressed as the sum of an arbitrary number of IID RVs.
  • Gaussian distributions form a monoid.

We can use these algebraic properties to significantly simplify certain mixture distributions.

See notebook for further implementation details.

References

Symbolic Methods

Closure

Algebraic Methods

Uncertainty Propagation

Tutorials

Fast Sampling/Inference

Tensor Methods

Online Estimation

Sketching

Probabilistic Circuits (e.g. ACs, SPNs, PSDDs, et al.)

Probabilistic Programming

Software

Sketching libraries

Probabilistic programming libraries

NLP libraries
