ICFP 2019 Paper #105 Reviews and Comments
===========================================================================
Paper #105 From high-level inference algorithms to efficient code


Review #105A
===========================================================================

Overall merit
-------------
B. OK paper, but I will not champion it.

Reviewer expertise
------------------
Y. I am knowledgeable in the area, though not an expert.

Paper summary
-------------
This paper is about inference algorithms for probabilistic programs that use arrays. It presents three different things: i) an extension of Carette and Shan's symbolic simplifier to programs that use arrays, ii) a "histogram" transformation, which transforms programs using loops (and arrays) into programs using map-reduce expressions, and iii) a full system and compilation pipeline for a probabilistic programming system that implements the concepts presented in the paper. Some benchmarks are also presented.

Comments for author
-------------------
Wow. On one hand, this is a quite impressive and potentially quite significant paper. On the other hand, it's one of the most technically dense papers I've seen in a while.

I agree with the authors that arrays are one of the major constructs that lack good support in today's probabilistic programming languages, and I find the goal of the paper quite valuable. It looks as if the authors made significant progress that is worth publishing.

However, I had major trouble understanding the paper. If a reader isn't an expert in probabilistic programming and recent research on it, attempting to read the paper is a rather hopeless endeavour. The paper lacks focus, good examples, and precision. It's very hard to pinpoint particular single novel ideas.

I think what we have in this submission is basically three different papers packed into one super-dense paper that hence lacks focus. Sections 3, 4, and 5 should, in the opinion of this reviewer, each be separate papers. Each of them is about a rather different problem; mixing all these issues in a single paper is a recipe for overwhelming complexity. As it is now, there's just not enough space in a single paper to explain all the things that are described in sufficient detail and precision.

Some more detailed comments:

- The lengthy citation style (e.g. on p.1) hinders readability.
- p.1: how about explaining to the reader what "simplification", "latent variables", and "conjugate likelihoods" are?
- l.83: it is not clear why transformation to map-reduce style necessarily implies asymptotic complexity improvements.
- Sec. 2 didn't do much for me. It informs me that there is some sort of compilation pipeline, but the whole presentation is much too high-level to yield any particular insights.
- In general, I think the introduction of the paper would benefit a lot from a concrete example involving arrays, together with a discussion of how existing PPLs deal (or not) with such examples.
- l.137: do you assume that every reader knows what "disintegration" is?
- l.288ff: I was not able to follow here.


Review #105B
===========================================================================

Overall merit
-------------
B. OK paper, but I will not champion it.

Reviewer expertise
------------------
Z. I am not an expert. My evaluation is that of an informed outsider.

Paper summary
-------------
This paper deals with the generation of efficient and accurate code from high-level specifications of probabilistic computations.
Compared with previous work, the solution proposed in the paper can handle (and simplify) probabilistic programs with arrays of arbitrary size, and can reuse existing simplification methods for the scalars underlying those arrays. The paper also contributes optimization techniques, based on map-reduce, loop reordering, code motion, and static analysis, for optimizing array manipulations.

Comments for author
-------------------
This review is that of an outsider with respect to probabilistic program optimization. Even so, from that perspective, I found the paper very well written, and was able to grasp the problem, understand (from a high-level view) the presented state of the art, and follow the intuition of the proposed solution. At my level, the paper seems sound and solid.

Moreover, the paper comes with an evaluation section that seems to indicate that the solution is indeed relevant: the proposed solution (and its implementation) generates code that is competitive with highly optimized hand-written counterparts.


Review #105C
===========================================================================

Overall merit
-------------
A. Good paper. I will champion it at the PC meeting.

Reviewer expertise
------------------
X. I am an expert in the subject area of this paper.

Paper summary
-------------
This system-description submission concerns Hakaru-with-arrays, an aggressively optimizing compiler for a density-based probabilistic programming language with exact inference. The array extension is essential in applications that contain large datasets. The relevant programming constructs (e.g., plates, together with an additional static analysis called 'unproduct') are needed to help the optimizing compiler 'see' that different iterations of the same loop do not introduce unnecessary dependencies between samples; extra dependencies make inference higher-dimensional, i.e., harder. The extension to support arrays also involves extensions to the intermediate representations in the system, including representations for integration over arrays of arbitrary length. It also involves a novel optimisation phase, called 'histogram': a rule-rewriting-based transformation over a special IR that flattens out loops. Finally, the submission methodically compares the proposed system with existing systems, and with itself, by evaluating the relative speed-up each layer in the compilation pipeline delivers.

Comments for author
-------------------
I recommend accepting this submission.

A full-blown Bayesian inference system is a sophisticated optimising compiler. As such, building such a system can benefit from programming language techniques; moreover, this problem domain imposes an additional challenge on these techniques, as the 'correct' result is not easy to see or test, because it may involve integrals that are difficult to solve by hand.

This submission describes such a full system. The system occupies an interesting spot in the probabilistic systems design space for several reasons:

1. It performs exact inference.
2. It can discover sophisticated conjugacy relations automatically.
3. It is simply typed.
4. It has support for arrays and conditionally-independent iteration.

These four properties single it out in the probabilistic programming landscape. Properties 2 and 4 are new in this submission, which extends the original Hakaru system while maintaining properties 1 and 3. This extension adds a few source-level types and constructs and, of equal importance, several new IRs and transformation phases.

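To convey concretely the kind of loop reshaping that the 'histogram' phase automates, here is a minimal sketch, assuming nothing about Hakaru's actual IR: plain Haskell over lists, with illustrative function names (perClusterSumsNaive, perClusterSumsHist). The naive form rescans the n data points once per cluster (O(n·m) work); the histogram form makes one accumulating pass keyed by each point's cluster assignment (O(n + m) work).

```haskell
import qualified Data.Map.Strict as M

-- Per-cluster sums, naive form: for each of the m clusters, scan all
-- n data points, giving O(n*m) work overall.
perClusterSumsNaive :: Int -> [Int] -> [Double] -> [Double]
perClusterSumsNaive m ys ts =
  [ sum [ t | (y, t) <- zip ys ts, y == i ] | i <- [0 .. m - 1] ]

-- Histogram form: one pass over the data, accumulating each point into
-- the bucket named by its cluster assignment, O(n + m) work overall.
perClusterSumsHist :: Int -> [Int] -> [Double] -> [Double]
perClusterSumsHist m ys ts =
  let buckets = M.fromListWith (+) (zip ys ts)
  in  [ M.findWithDefault 0 i buckets | i <- [0 .. m - 1] ]

main :: IO ()
main = do
  let ys = [0, 2, 1, 0, 2]            -- cluster assignments
      ts = [1.0, 2.0, 3.0, 4.0, 5.0]  -- data points
  print (perClusterSumsNaive 3 ys ts)
  print (perClusterSumsHist 3 ys ts)
```

(The analogous rewriting happens on Hakaru's own IR rather than on Haskell lists, but the asymptotic effect is the same.)
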
I found the overall contribution impressive, and its presentation well thought out and well separated. The authors have thought about how each feature fits into the overall architecture, and argued well for its inclusion in the system. For example, the fact that Dirichlet conjugacy relations can be discovered automatically is impressive. This is something I had heard was possible in folklore, but had never seen in a fully fledged system with a clean design and a clear type system.

In addition to the front and middle ends, I find the back end to be a solid contribution to the area. The back-end parts of the compiler are fairly standard from a programming-language perspective, but I consider injecting such a clean design into the probabilistic programming community an important contribution.

The comparison to other systems, and to sub-parts of the proposed system, is solid, and highlights the relative strengths and tradeoffs.

Below I have only superficial comments for the authors, which should be straightforward to address or take on board.

112-118: These comments on the use of exact inference in approximate inference are interesting. Since you later (209-211) refer back to them as justifying the usefulness of exact inference, consider expanding with an example, either in this section or at the end of Section 3.

225: Consider separating the reference to Pfeffer and Ramsey from the reference to Giry, and tag the Pfeffer and Ramsey reference as your tutorial of choice, or as a popular motivation for using monads in probabilistic modelling.

273-277 and 285-288: Suggestion --- consider binding the variables next to the integrals they are relevant to, which may keep the formula more similar to the source code (2):
$$\int dx \int dy \int dz \frac{e^{-\frac12 (x - \mu)^2}}{2\pi} \cdots$$

290-293: Consider making the reference to holonomic representations more self-contained, at least to give the reader a better idea of what they are and what they are good for. (The references you gave are a good pointer to follow up.) For example:

> Holonomic representations [Chyzak and Salvy 1998; Wilf and Zeilberger
> 1992], which involve differential equations with polynomial recurrence
> relations, let us represent a large class of functions with useful
> compositional closure properties, such as closure under integration,
> differentiation, and composition with algebraic functions. For more
> details, see Kauers [2013].

411: Wording is a bit weird; consider changing: "A Dirichlet distribution [distributes] over arrays of numbers".

578: Consider explaining what happens if solving for j in $a = i$ fails. More generally, I found missing a discussion of whether the symbolic operations can ever fail, and if so, how the programmer deals with this challenge. If you know a class of input programs for which Hakaru won't fail (for example, all STAN models), that would make a strong addition.

1013: Typo: "Our generated code is 9 times [faster than] MALLET".

------------------------------------------------------------------------

Post-response and discussion:

Besides updating some comments in my review above, I don't have mandatory requests for changes. Making this article more accessible to non-experts and self-contained could go a long way in spreading your techniques.


Response by Chung-chieh Shan (Author) (437 words)
---------------------------------------------------------------------------
We thank all the reviewers for their very helpful work.
**Review A**

Review A's fundamental complaint is that our paper includes too many contributions for one paper. That is no reason not to accept it. It can even be a reason to champion it.

The focus of the paper is to compile a functional DSL efficiently. This goal necessitates and connects all the components in Sections 3--5, even though they are drawn from multiple disparate communities. Breaking up the paper, regardless of technical density, would strip it of motivation and context and fall short of our goal:

- The symbolic mathematics of Section 3 by itself produces inefficient code.
- Similarly, the histogram transformation of Section 4 is precisely motivated by the output of Section 3.
- Finally, the optimizations shown in Section 5 are tailored to the intermediate programs produced by Sections 3 and 4.

Take for example the comment in Review A about line 83. Indeed, it is not true in general that "transformation to map-reduce style necessarily implies asymptotic complexity improvements". But such improvements do accrue in the context of the preceding simplification transformation, the following loop-optimizing code generation, and the inference algorithms that we efficiently compile overall. In the ablation study shown in Table 2, the histogram transformation and loop optimization deliver <2x speedup separately but >100x speedup combined!

Although Sections 3--6 each make particular single novel contributions listed on page 2, their coherence is underscored by the progression of worked examples provided throughout. Section 3 explains "what 'simplification', 'latent variables', and 'conjugate likelihoods' are" using three examples of increasing complexity, culminating in the classic Gaussian mixture model (GMM). The complex transformations in Sections 4 and 5 are each detailed using a simple excerpt from the GMM example. Section 6 shows quantitative measurements on GMM and other examples.

"The lengthy citation style" is required by PACMPL (https://pacmpl.acm.org/authors.cfm#author-year-citations).

**Review C**

We just want to clarify two points.

- We cite Ramsey and Pfeffer on line 225 and again on line 238 because (1) their paper recently popularized the Giry monad, and (2) like them, we interpret it both by sampling and by expectation.
- On line 578, if solving $a=i$ fails, then as a last resort unproduct returns $(H[e],1)$, which trivially satisfies the equational specification on line 573. Consequently, simplification still returns a semantically equivalent program as required, but not one in which a conjugacy is recognized or a variable is eliminated. In some cases, this is because there is really nothing to do; in other cases, our tool lags behind the mathematical prowess of professional applied statisticians. In an automated optimizing compiler, both possibilities will always remain.


Review #105D
===========================================================================

Overall merit
-------------
B. OK paper, but I will not champion it.

Reviewer expertise
------------------
X. I am an expert in the subject area of this paper.

Paper summary
-------------
This paper presents a compilation pipeline that generates collapsed Gibbs samplers for Bayesian models. As I understand it, this paper makes the following contributions:

1) The identification of a symbolic language that supports integrals of arbitrary (finite) dimension, and associated transformations (i.e., unproduct) that enable the reuse of existing computer algebra systems (CASs).
This enables the system to marginalize out (collapse) variables.

2) A histogramming transformation that enables the generation of efficient sampling code for mixture models. A human writing sampling code for mixture models would do something very similar to the histogramming transformation, so it is useful to automate it.

3) The identification of loop optimizations that improve the efficiency of the loops one obtains from generating collapsed Gibbs samplers.

The main value of the paper as I see it is the application of a CAS and compilation techniques to generate efficient collapsed Gibbs samplers for high-dimensional Bayesian models with conjugacy relations. This improves the state of the art for generating a specific kind of MCMC sampler that works well for models with conjugacy relations.

Comments for author
-------------------
Comments and Questions

- [Lines 223-224] Gaussian(mu, 1) >>= \x -> Gaussian(x, 1) >>= \y -> Gaussian(x, 1) >>= \z -> return (y, z) might be more readable.

- [Lines 239-243] Both equations (1) and (2) are in a form amenable to inference, just different kinds of inference.

- [Lines 274-276] Maybe introduce the concept of a pdf, and use something like pdf(x; Gaussian(mu, 1)) in all the examples. You're symbolically manipulating pdfs.

- [Line 277] h should be a measurable function, so it is not an arbitrary function.

- [Lines 273-313] There is some tension in the presentation of integration. On one hand, the grammar of expressions (Figure 3) uses Riemann integration notation. On the other hand, the type of integrate(m, h) is that of Lebesgue integration. Just something to think about.

- [Lines 292-304] I need some more explanation here. What seems to be going on is that a holonomic representation is a good normal form for recognizing the density of a distribution, so you still need to pattern-match on a normal form. As you know, there are other normal forms, such as rewriting densities in exponential form, which is how conjugacy relations are usually derived. How does the holonomic representation compare to representing distributions in exponential form?

- [Lines 391-403] I feel a simpler way to express this is to say that you group the corresponding data with whichever cluster it is currently assigned to (which is what clustering is literally doing). The way it's presented currently, I'm trying to figure out why negative signs and squares are occurring (because I forgot that you are working with the Gaussian density with mean x[i] and standard deviation 1).

- [Lines 409-457] I'm on the fence about this point, in that I cannot think of another standard primitive distribution that this would apply to besides the Dirichlet distribution. The decomposition of a k-dimensional Dirichlet distribution can also be done another (simpler) way: sample k i.i.d. gamma variables and return the normalized vector, in which case we do not have a chain of dependencies. Why not just do that? Is the general point that you want to find relations between random vectors and a finite set of i.i.d. random variables? How would this work with the decomposition of a multivariate Gaussian distribution into i.i.d. standard Gaussian distributions? Do you cast arrays of arrays into matrices?

- [Line 533] Typo: the purpose of the unproduct ...?

- [Figure 5] It may help to remind the reader that the high-level picture of what's going on is that you are factoring expressions with exponentials in them: exp(a) * exp(b) = exp(a + b), and that's why you have H^+ and H^x.
It may also help to remind the reader that the summations occur in mixture models because you are partitioning the data by the cluster assignment each point is in.

- [Lines 585-663] As far as I understand, there are at least two things going on with the histogramming operation. First, as you observe, it is important for mixture models because you want to accumulate information by iterating over the population and looking at the cluster, as opposed to iterating over the clusters and the population. Second, collapsing in mixture models will trigger more instances of histogramming. Intuitively, when you eliminate a parameter (by collapsing) in a mixture model, the information in that parameter gets distributed to and accumulated at each point's cluster assignment. This is reflected in the experiments, where essentially all the speedup comes from histogramming (1805x slowdown with no histogramming as opposed to 1848x with no optimization). The general-purpose optimizer seems to be able to do the LICM. I think it is important to discuss the second point in the paper.

- [Lines 684-685] Typo: r(e) --> r(i)?

- [Lines 842-847] I don't understand why the ability to preallocate memory is related to JITing. From my understanding, it has to do with the fact that you restrict the language to finite-dimension and fixed-structure graphical models, and Gibbs sampling uses a bounded amount of memory in those models.

- [Lines 911-915] I think it's important to say that STAN works well on a very different class of models. Bayesian logistic regression (not conjugate, and with no closed form as far as I know) is a widely used model that STAN would be good at and that the techniques described in this paper would not apply to.

- [Lines 929-930] It is prudent to triple-check the parameters of a Gaussian, as some systems use standard deviation and others use variance.

- [Lines 1015-1021] It would be nice to compare against a C implementation of collapsed Gibbs sampling for LDA. How do you encode that different documents have different numbers of words?

- [Page 20] Given my understanding of the compilation pipeline, it should be possible to configure Hakaru not to collapse variables, so that it generates an ordinary Gibbs sampler. This way, you can indeed validate that collapsing gives the accuracy gains by comparing the uncollapsed Hakaru version against the collapsed Hakaru version. Such a comparison might also help explain why AugurV2 is 10-100x faster than Hakaru, if you see those improvements in speed in your own system by using a non-collapsed sampler.

- [Figure 10, Figure 11] I would also plot the accuracy obtained by the higher of a random or constant predictor. Ideally, you should also verify that the sampler implemented by each system is implementing the algorithm that you think it is. As you know, MCMC is notoriously hard to debug, so the difference may be because of a subtle bug or a different algorithm.

- [Figure 11, Figure 12] Just to double-check, am I looking at log-likelihood or log-predictive likelihood?

- [Figures 10, 11, 12] It is always better to run MCMC for as many iterations as possible. I'm not suggesting that you should draw 10000 samples in your experiments (because you aren't really using the results and this is not MCMC research), but hopefully > 10 samples and until the curves flatten out. It is also not clear if burn-in was used. Systems such as STAN use burn-in to tune their samplers, so if no burn-in was used, such systems will be placed at a disadvantage. Then again, the appeal of Gibbs in practice is that no tuning is required.
- [Footnote 5] The phrase "unsupervised classification task" is an oxymoron. On one hand, unsupervised implies that we have no labels. On the other, classification implies that we do have labels. It would be worth mentioning that this experiment is not how one would typically use a Gaussian Mixture Model.

Presentation and Accessibility

I would imagine that this paper would only be understandable by someone who has hand-implemented (collapsed) Gibbs samplers for mixture models. I think more explanation would better illuminate what was done and the limitations of the work. As far as I can tell, the techniques described in this paper really only work for mixture models with conjugacy relations. In particular, all benchmarks in the evaluation are mixture models with conjugacy relations. It's not clear from the paper what would happen if the system were applied to a model outside that class.

Something along the lines of giving an example of a Gibbs update equation and a collapsed Gibbs sampling equation (e.g., for the GMM), so the reader understands what the computations look like, would also help. It would help make concrete when and where histogramming applies. It would also highlight the tradeoffs: the reader could see that the collapsed Gibbs sampler samples from a lower-dimensional space, but variables that were once conditionally independent are now conditionally dependent. This improves the quality of samples provided that the algorithm can handle the dependencies, but makes implementing a parallel and/or distributed sampler more difficult.


Comment @A1
---------------------------------------------------------------------------
Here is the **author response to review 105D**.

We thank the reviewer for the valuable and detailed feedback on the paper. We respond to detailed questions below, but we want to emphasize one overall point.

Our central contributions are modular tools and techniques for compiling high-level inference algorithms to efficient code. The value of our contributions is not only demonstrated by a system that automates exact inference and collapsed Metropolis-Hastings sampling, but also underscored by how often practitioners trying to communicate and compose a far broader variety of inference algorithms end up manually performing variable elimination, conjugacy recognition, histogramming, and loop optimizations. Our goal is not to hand probabilistic programmers a black-box full-scale compiler that implements all those algorithms. For example, although the techniques in Sections 3 and 4 are certainly applicable to Hamiltonian Monte Carlo and variational inference, implementing those approaches or integrating with a production system like STAN is outside the scope of this paper. Rather, this paper removes several major obstacles in the way of compiling all those algorithms efficiently, and validates our approach by advancing the state of the art on multiple end-to-end flagship applications.

We agree that fully understanding all these contributions requires familiarity with both compiler optimization and Metropolis-Hastings sampling. By bringing together both kinds of important tools, the paper rewards the sizable and growing community of people at ICFP who should and do care about both.

Detailed responses:

> - [Lines 292-304] I need some more explanation here. What seems to be going
> on is that a holonomic representation is a good normal form for recognizing
> the density of a distribution, so you still need to pattern-match on a
> normal form.
> As you know, there are other normal forms, such as rewriting
> densities in exponential form, which is how conjugacy relations are usually
> derived. How does the holonomic representation compare to representing
> distributions in exponential form?

Matching on a holonomic representation is far more robust than matching on an exponential form: the coefficients of f(z), f'(z), ... (such as μ+y-2z and 3 on line 291) are polynomials in z, so we can use existing algorithms for matching polynomials that are efficient and robust. For example, the coefficient ratio (μ+y-2z)/3 is always a rational function in z, so we use Euclid's algorithm to normalize it. (This step is part of Carette and Shan's scalar simplifier [2016, Section 5.3] and not novel, but we will revise this paragraph to explain it in more detail.)

> - [Lines 409-457] I'm on the fence about this point, in that I cannot think
> of another standard primitive distribution that this would apply to besides
> the Dirichlet distribution. The decomposition of a k-dimensional Dirichlet
> distribution can also be done another (simpler) way: sample k i.i.d. gamma
> variables and return the normalized vector, in which case we do not have a
> chain of dependencies. Why not just do that?

Expressing Dirichlet in terms of Gamma rather than Beta seems simpler at first glance, but every element of the normalized vector depends on every element of the unnormalized vector, so it actually becomes more difficult to integrate out $\vec{\theta}$ and to recognize its conjugacy with the categorical variables $\vec{y},z$. (We will add this explanation to footnote 2.)

> Is the general point that you
> want to find relations between random vectors and a finite set of i.i.d.
> random variables? How would this work with the decomposition of a
> multivariate Gaussian distribution into i.i.d. standard Gaussian
> distributions? Do you cast arrays of arrays into matrices?

Indeed, expressing Dirichlet in terms of Beta is an instance of parameterizing a distribution in terms of i.i.d. random variables. In particular, it is promising to decompose a multivariate Gaussian into i.i.d. standard Gaussians, but we have only tried it (successfully) with known (non-diagonal) covariance matrices. Those matrices can indeed be expressed as arrays of arrays.

> - [Line 533] Typo: the purpose of the unproduct ...?

Fixed; thanks.

> - [Lines 585-663] As far as I understand, there are at least two things
> going on with the histogramming operation. First, as you observe, it is
> important for mixture models because you want to accumulate information by
> iterating over the population and looking at the cluster, as opposed to
> iterating over the clusters and the population. Second, collapsing in
> mixture models will trigger more instances of histogramming. Intuitively,
> when you eliminate a parameter (by collapsing) in a mixture model, the
> information in that parameter gets distributed to and accumulated at each
> point's cluster assignment. This is reflected in the experiments, where
> essentially all the speedup comes from histogramming (1805x slowdown with
> no histogramming as opposed to 1848x with no optimization). The
> general-purpose optimizer seems to be able to do the LICM. I think it is
> important to discuss the second point in the paper.

We agree on both points and will expand their discussion at the beginning of Section 4. We hasten to clarify that our code generator performs backend optimizations much more aggressively than a "general-purpose optimizer" can.
It is thanks to domain-specific properties of Hakaru programs (lines 761-765) that our optimizations remain sound and profitable. Seeing as the top 3 rows of Table 2 are all uncompetitive and within a factor of 2 of each other, our ablation study does not support crediting any single contribution of our paper with "essentially all the speedup". Rather, the parts of the paper provide motivation and context for each other.

> - [Lines 684-685] Typo: r(e) --> r(i)?

Actually r(e) is correct here, and r(i) would use i out of scope. (The variable i takes scope over r(i) so that we can express, for example, $\textsf{Idx}^{b_1}_{i_1}(e_1, \textsf{Idx}^{b_2(i_1)}_{i_2}(e_2, \ldots))$ to produce a histogram that is a ragged array of arrays. The inner size $b_2(i_1)$ cannot be replaced by $b_2(e_1)$, because $e_1$ might well depend on the loop index $j$, and the typing rule for Idx in Figure 6 intentionally prevents the histogram size $b$ from depending on the loop index $j$---after all, the histogram is initialized before the loop commences. We will add this explanation.)

> - [Lines 842-847] I don't understand why the ability to preallocate memory
> is related to JITing. From my understanding, it has to do with the fact
> that you restrict the language to finite-dimension and fixed-structure
> graphical models, and Gibbs sampling uses a bounded amount of memory in
> those models.

Because we wait until not only receiving the data but also allocating intermediate arrays before generating code, the machine code we emit embeds not only the sizes but also the addresses of the arrays as constants.

> - [Lines 1015-1021] It would be nice to compare against a C implementation
> of collapsed Gibbs sampling for LDA. How do you encode that different
> documents have different numbers of words?

We used two one-dimensional integer arrays of equal length, one containing word identifiers and one containing document identifiers. We could as well have used a single ragged array of integer arrays, where each inner array contains the word identifiers that make up a document.

> - [Figure 11, Figure 12] Just to double-check, am I looking at
> log-likelihood or log-predictive likelihood?

No data was held out for LDA, so Figure 12 plots log-likelihood. Figure 11 plots the log-likelihood of the entire sample, which consists of supervised classifications and inferred classifications (which are correlated due to collapsing away the continuous latent variables).

> - [Figures 10, 11, 12] It is always better to run MCMC for as many
> iterations as possible. I'm not suggesting that you should draw 10000
> samples in your experiments (because you aren't really using the results
> and this is not MCMC research), but hopefully > 10 samples and until the
> curves flatten out. It is also not clear if burn-in was used. Systems such
> as STAN use burn-in to tune their samplers, so if no burn-in was used, such
> systems will be placed at a disadvantage. Then again, the appeal of Gibbs
> in practice is that no tuning is required.

Each STAN sample takes on the order of 1 second to produce, and STAN defaults to 1000 burn-in samples, so if we waited for STAN to finish burn-in before starting the clock in Figure 10, then Table 1 would show about 1000 seconds of GMM-STAN startup time. Instead, to show STAN's decent performance and quick startup, Figure 10 includes STAN burn-in, so it only plots the first couple dozen burn-in samples.
However, on the advice of a core STAN developer, we tried reducing the burn-in duration from 1000 samples to between 0 and 10 samples, and it did not improve the overall picture in Figure 10. Core STAN developers also advised us that there is no principled way to determine how much burn-in is appropriate for this model.

> - [Footnote 5] The phrase "unsupervised classification task" is an
> oxymoron. On one hand, unsupervised implies that we have no labels. On the
> other, classification implies that we do have labels. It would be worth
> mentioning that this experiment is not how one would typically use a
> Gaussian Mixture Model.

We will change "unsupervised classification" to "clustering".


Review #105E
===========================================================================

Overall merit
-------------
B. OK paper, but I will not champion it.

Reviewer expertise
------------------
X. I am an expert in the subject area of this paper.

Paper summary
-------------
The paper describes multiple extensions to the Hakaru probabilistic programming system. The authors introduce new language features that can express variable-length arrays of random variables. The programming language remains limited to generative models with a constant-depth conditional dependency structure. Hakaru source terms are then translated to linear expressions over test functions, as usual, but the intermediate term language is extended such that it contains product expressions and integrals over arrays. Integrals over arrays are simplified by attempting to transform them into products over one-dimensional integrals using the newly introduced unproduct operation. The resulting terms may still contain variadic products and sums. Those are optimized for (possibly asymptotically) faster computation using the authors' histogram transformation together with a few more traditional compiler optimizations. Finally, aggressive JIT compilation techniques further contribute to making the end-to-end approach competitive with handwritten inference algorithms, such that it beats other generic approaches on important metrics.

Comments for author
-------------------
Strengths:

+ A working approach for symbolic simplification of an expressive class of probabilistic programs with random arrays.
+ The unproduct and histogram transformations are simple and effective.
+ An end-to-end inference system that is made practical using special-purpose JIT compilation with an LLVM backend.

Weaknesses:

- Some parts of the paper were painful to read because of bad notation and missing or badly structured explanations.
- The comparison to existing systems could have been more accurate.
- The limitations of the approach are not discussed very precisely.

Detailed comments:

1: The title seems inaccurate. The input to the approach is a generative probabilistic model, not an inference algorithm.

Page 2:

63: "or unroll arrays entirely at prohibitive performance cost (as in PSI [Gehr et al. 2016])"
PSI actually does not unroll arrays, but it does not yet have variadic products and integrals, which would be needed to express populating those arrays with a variable number of random choices.
E.g., the following PSI program gives a closed-form result, even though the array has a symbolic length:
```
$ cat counterexample.psi
def main(n){
    x := array(n,0);
    x[uniformInt(0,n-1)] = uniform(0,1);
    x[uniformInt(0,n-1)] = 2*x[uniformInt(0,n-1)];
    return x[uniformInt(0,n-1)];
}
$ psi counterexample.psi
p(xᵢ|n) = ((-n²+n³)·[-1+xᵢ≤0]·[-xᵢ≤0]+(-n³+n⁴)·δ(0)[xᵢ]+1/2·[-2+xᵢ≤0]·[-xᵢ≤0]·n²)·[-n+1≤0]·[-⌊n⌋+n=0]·⅟n⁴
Pr[error|n] = [-1+n≤0]·[-n+1≠0]·[-⌊n⌋+n=0]+[-⌊n⌋+n≠0]
```
This clearly contradicts your claim that PSI unrolls all arrays.

67: "We show that these limitations are not necessary."
I think that was already clear (as techniques to overcome them are well known). Your contribution is that you are the first to show a practical automated method (which you argue handles a large enough subset of all interesting cases).

Page 3:

119: "The starting point is a probabilistic program that expresses the desired inference algorithm by denoting the conditional distribution to calculate and possibly sample."
This actually expresses a probabilistic model, not an inference algorithm.

Page 4:

185: "The measure monad." Which measure monad?

Page 8:

370: The Gaussian mixture example is unnecessarily obfuscated. "Suppose we would like to model data points drawn from a mixture of m normal distributions" does not explain what follows well enough. I would explicitly say that for illustrative purposes, you sample n+1 data points, where n of them are in arrays, and the last one is a single value. Furthermore, just tell the reader that θ is the array of cluster probabilities, x are the cluster centers, y,z are the cluster assignments, and s,t are your data points.

Page 9:

428: I suggest you replace the "..." in the parameters to the Beta distributions by actual expressions (to clarify that they depend on i). As far as I understand, they could simply be α⃗[i] and β⃗[i].

444: The formula for step 1 is similarly unclear. "^(...)" is not good notation.

454: Again, name the exponents, and name them *differently* to clarify that we are not simply looking at an integral over "step 1".

491: Fig. 4 is a huge jungle of ellipses. I don't think it is easy enough for a casual reader to figure out what your notation is supposed to mean.

506: "When integrate arrives at a primitive distribution such as Gaussian, it generates an integral whose body nests as many definite products as the list is long"
Maybe use an explicit normalization pass, for example:
Plate(n,i, Bind(Gaussian(0,1),x,Uniform(x,x+1))) → Bind(Plate(n,i,Gaussian(0,1)),x,Plate(n,i,Uniform(x[i],x[i]+1)))

558: You say: "We distinguish between heaps of two modes by what they distribute over". A function f: A→A distributes over an operation op: A×A→A if f(a op b) = f(a) op f(b). Note that for it to be a "distributive" law, the operation has to be the same on both sides.

571: "The multiplications in (22) are not typos, because our operation is unproduct, not unsum." This reads like an unfriendly response to a reader who got confused because of the previous point.

575: "The second case is the workhorse;" Even with that being the case, there is no discussion of how it actually works.

579: How can x⃗[k] even be matched by the second case? There is neither a product nor a sum.

585: "asymptotic time complexity of loops" → "asymptotic running time of loops"

655: The description of the semantics of (24) does not explain how it arises from its components.

670: I would move the definition of Hist here.
Furthermore, maybe you should merge the definitions of the reducers with the intuitive explanation of their semantics, and clarify earlier that initialization is done with the neutral element r^0 and updates are performed using r^1 and the monoid operation.

932: The diagrams are hard to read. It is not easy to see which curve represents which system.

1075: It appears you are not using exact numbers for "exact" inference. Inference with floats is a reasonable kind of approximate inference, but by default, PSI uses (a slow implementation of) arbitrary-precision rational numbers. Furthermore, while the asymptotic behavior of the PSI inference is similar to what I observe, the absolute numbers presented in the paper seem a bit high. Using my laptop, I can run 60 ClinicalTrial data points in only about 150s. Did you use ./build-release.sh and pass the --nocheck flag? Maybe you can also mention that, at the moment, PSI doesn't do any AOT or JIT compilation and is purely based on symbolic simplification.


Comment @A2
---------------------------------------------------------------------------
Here is the **author response to review 105E**.

We thank this reviewer as well for the valuable and detailed feedback on the paper.

We reiterate that our central contributions are not just a compilation pipeline but its constituent program transformations, which are "simple and effective" and enable automatic, efficient compilation of high-level inference algorithms. The Hakaru system validates our "working approach", but the tools and techniques we introduce are also applicable to other systems such as PSI.

> Detailed comments:
>
> 1: The title seems inaccurate. The input to the approach is a generative
> probabilistic model, not an inference algorithm.

> Page 3:
> 119: "The starting point is a probabilistic program that expresses the
> desired inference algorithm by denoting the conditional distribution to
> calculate and possibly sample."
> This actually expresses a probabilistic model, not an inference algorithm.

We meant an inference algorithm, such as a Gibbs sampling step. An innovation in probabilistic programming pioneered by Hakaru (Narayanan et al., FLOPS 2016) is to express both probabilistic models and inference algorithms as programs in a single language of measures. For example, both for modeling and for inference, it is useful to construct Markov chains, sequence random choices, sample from a given distribution, and perform deterministic computations. The probabilistic programs whose efficient compilation is automated for the first time in this paper are "algorithms whose idiomatic expression requires [symbolically sized] random array variables that are latent or whose likelihood is conjugate", including but not limited to Gibbs sampling and exact inference (line 110+). This "idiomatic expression" is labeled "conditional distribution" near the top of Figure 1 and is the starting point of this paper (line 119+). Turning a model into an approximate or exact inference algorithm is part of the Hakaru system and approach (Shan & Ramsey, POPL 2017; Zinkov & Shan, UAI 2017; Narayanan & Shan, ICFP 2017) but outside the scope of this paper (line 126+).

> Page 2:
> 63: "or unroll arrays entirely at prohibitive performance cost (as in PSI
> [Gehr et al. 2016])"
> PSI actually does not unroll arrays, but it does not yet have variadic
> products and integrals, which would be needed to express populating those
> arrays with a variable number of random choices.

Indeed, we will correct our paper to state that it is not arrays but random choices that PSI needs to unroll in order to simplify.

> Page 4:
> 185: "The measure monad." Which measure monad?

Line 226+ makes this specific. The precise semantics of Hakaru programs is defined in terms of s-finite kernels, as introduced by Staton (ESOP 2017).

> 575: "The second case is the workhorse;"
> Even with that being the case, there is no discussion of how it actually
> works.
> 579: How can x⃗[k] even be matched by the second case? There is neither a
> product nor a sum.

We agree that we need to give more detail in Figure 5. In particular, the second case handles unproduct(e,x⃗,i,H) when the term e uses the array x⃗ at only one index a(j), whose innermost-scoped free variable is j. Instead of using brittle syntactic pattern-matching to check whether "H has the form H₁[∏ⱼH₂] or H₁[∑ⱼH₂]", we first make sure that "solving for j in the equation i=a(j) yields the equivalent equation j=b(i)", then return the expression pair (1,H'[e(x⃗[i])]) where H' is EITHER the result of replacing ∏ⱼ or ∑ⱼ in H by "let j=b(i) in" OR the conditional "if i=a(j) then H[] else 1". The latter conditional arises due to algebraic simplifications that occur while "plugging an expression into a heap" (lines 557-558).

> 1075: It appears you are not using exact numbers for "exact" inference.

We do use exact numbers for exact inference during both the simplification and histogram phases -- no floating point is used there. When we obtain a closed form from just those phases, it is indeed "exact". To generate the numerical code whose performance is measured in Section 6, we agree that what we call "exact inference" is only exact under the pretense that floating-point computations on real numbers are exact. For example, to avoid underflow, the type ℝ⁺ on line 246 is compiled into floating point in log-space, a well-known necessity (a minimal log-space sketch follows at the end of this response).

> Furthermore, while the asymptotic behavior of the PSI inference is similar
> to what I observe, the absolute numbers presented in the paper seem a bit
> high.
> Using my laptop, I can run 60 ClinicalTrial data points in only about 150s.
> Did you use ./build-release.sh and pass the --nocheck flag?

We did not use ./build-release.sh or pass the --nocheck flag. These changes in how we invoke PSI reduced the running times shown in Figure 13 uniformly to about 45% of the reported values. (The apparent superlinear slowdown remains.)

Thanks again.
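
As referenced in the response to comment 1075 above, positive reals are compiled to floating point in log-space to avoid underflow. Below is a minimal sketch of that representation, assuming nothing about Hakaru's generated code: plain Haskell, with illustrative names (LogWeight, mulW, addW). A positive weight w is stored as log w, so products of many densities become sums and never underflow, and sums of weights use the log-sum-exp trick.

```haskell
-- A positive weight stored as its logarithm.
newtype LogWeight = LogWeight Double

-- Multiplication of weights is addition of logs.
mulW :: LogWeight -> LogWeight -> LogWeight
mulW (LogWeight a) (LogWeight b) = LogWeight (a + b)

-- Addition of weights uses log-sum-exp to stay stable.
addW :: LogWeight -> LogWeight -> LogWeight
addW (LogWeight a) (LogWeight b) =
  let m = max a b
  in  LogWeight (m + log (exp (a - m) + exp (b - m)))

fromDensity :: Double -> LogWeight
fromDensity = LogWeight . log

main :: IO ()
main = do
  -- Multiplying 1000 tiny densities directly would underflow to 0;
  -- in log-space the product is simply a sum of logs.
  let LogWeight lw = foldr1 mulW (replicate 1000 (fromDensity 1e-300))
  print lw  -- roughly -690775.5, i.e. the log of 1e-300000
```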