gprob note

Sergey Fedorov
Quantop, Niels Bohr Institute
Copenhagen, Denmark
(October 17, 2024)

Probabilistic programs define statistical models as compositions of primitive distributions conditioned on observations. Inference of general conditional distributions is a hard task. The computational overhead of general inference methods, such as Markov chain Monte Carlo or Hamiltonian Monte Carlo, can make it difficult for probabilistic programs to compete with problem-specific approaches. For Gaussian random variables, however, this is not necessarily the case: as the composition of distributions and their conditioning can be performed analytically, the overhead of their computation can be minimal.

This note presents the background theory for gprob, a probabilistic programming package that specializes to Gaussian random variables.

1 Distributions defined via linear maps.

Any Gaussian probabilistic program [1] can be expressed as a linear map of a number of independent, identically distributed (iid) normal variables e_i with zero mean and unit variance,

e_i \sim \mathcal{N}(0, 1), \qquad i = 1, \dots, n.   (1)

In gprob, all random arrays are represented in this way, as affine maps of latent variables. E.g., a random array named x indexed by a multi-dimensional index ijk… is represented as

x_{ijk\dots} = \sum_{\lambda} e_{\lambda}\, a_{\lambda,\, ijk\dots} + b_{ijk\dots},   (2)

where {e_λ ∼ 𝒩(0, 1)}, λ ∈ ℕ, is a set of latent univariate normal variables, and a and b are numerical arrays with real or complex coefficients. The mean of x_{ijk…} equals b_{ijk…}, and the covariance is

C_{ijk\dots,\, rst\dots} = \sum_{\lambda} a_{\lambda,\, ijk\dots}\, a^{*}_{\lambda,\, rst\dots},   (3)

where rst… runs over the same set of indices as ijk…, and * denotes complex conjugation.
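To make the representation concrete, here is a minimal numpy sketch of evaluating the mean and covariance of Eqs. (2)-(3) from an (a, b) pair; the layout, with the latent index λ along the first axis of a, is an assumption for illustration rather than the actual gprob storage format:

```python
import numpy as np

# Minimal sketch of the (a, b) representation: a random array is a pair
# of numerical arrays, with the latent index lambda along the first
# axis of `a`. (Illustrative layout, not the actual gprob storage.)

def mean(a, b):
    return b  # Eq. (2): the mean is the offset b

def cov(a, b):
    # Eq. (3): C = sum_lambda a_lambda (outer) conj(a_lambda),
    # with the array indices ijk... flattened.
    af = a.reshape(a.shape[0], -1)
    return af.T @ af.conj()

# A 1-D random variable with 2 elements built from 2 latent variables,
# x_1 = e_1 + 2 e_2 + 4 and x_2 = 3 e_2 + 5.
a = np.array([[1.0, 0.0], [2.0, 3.0]])  # a[lambda, i]
b = np.array([4.0, 5.0])
C = cov(a, b)
```

For this map the covariance comes out as ((5, 6), (6, 9)), i.e. var(x_1) = 1 + 4, cov(x_1, x_2) = 2·3, var(x_2) = 9.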

New latent variables are allocated during calls to normal(). For example, normal(0, 1) produces one latent variable, and normal(size=sz), where sz is an integer, produces sz new latent variables. A global inventory of all latent variables is maintained, but their joint distributions are only materialized within individual arrays. This has an effect on the memory footprint: roughly speaking, if one random variable takes m bytes, then two arrays of n variables each will take m×(2×n²) bytes, and not m×(2×n)² bytes.

Linear operations in gprob, such as addition or concatenation, produce new random arrays. The set of latent variables of the result is the union of the sets of latent variables of the operands. The b of the result is simply the operation applied to the b's of the operands. The a of the result is found by first extending the a's of the operands along their first dimension with zero entries, such that they all map the same union vector of latent variables, and then performing the operation iteratively over the first dimension of the extended a's. (In practice, no iteration takes place; the operations on the a's are vectorized.) To give a concrete example, the sum of two one-dimensional random variables x and y defined by the maps written in matrix form as

\mathbf{x} = \begin{pmatrix} 1 & 2 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \end{pmatrix} + \begin{pmatrix} 4 \\ 5 \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} 1 & 2 \\ 3 & 0 \end{pmatrix} \begin{pmatrix} e_2 \\ e_3 \end{pmatrix} + \begin{pmatrix} -7 \\ 1 \end{pmatrix},   (4)

is another variable, z = x + y, whose map is

\mathbf{z} = \begin{pmatrix} 1 & 3 & 2 \\ 0 & 6 & 0 \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ e_3 \end{pmatrix} + \begin{pmatrix} -3 \\ 6 \end{pmatrix}.   (5)
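The example of Eqs. (4)-(5) can be reproduced with a small sketch of this mechanism; the (lat, a, b) tuple layout and the add helper are illustrative assumptions, not the gprob internals:

```python
import numpy as np

# Illustrative sketch (not the gprob internals): each operand is a tuple
# (lat, a, b), where `lat` lists the ids of its latent variables and row
# a[k] holds the coefficients of latent variable lat[k].

def add(x, y):
    lat_x, a_x, b_x = x
    lat_y, a_y, b_y = y
    # Union of the latent variables of the two operands, ordered by id.
    lat = sorted(set(lat_x) | set(lat_y))
    idx = {e: k for k, e in enumerate(lat)}
    # Extend each `a` with zero rows so that both map the same
    # union vector of latent variables.
    ax = np.zeros((len(lat), a_x.shape[1]))
    ay = np.zeros((len(lat), a_y.shape[1]))
    ax[[idx[e] for e in lat_x]] = a_x
    ay[[idx[e] for e in lat_y]] = a_y
    # The operation is then applied to the extended a's and the b's.
    return lat, ax + ay, b_x + b_y

# x depends on (e1, e2) and y on (e2, e3), as in Eq. (4).
x = ([1, 2], np.array([[1.0, 0.0], [2.0, 3.0]]), np.array([4.0, 5.0]))
y = ([2, 3], np.array([[1.0, 3.0], [2.0, 0.0]]), np.array([-7.0, 1.0]))
lat, a_z, b_z = add(x, y)
# lat = [1, 2, 3], b_z = (-3, 6)
```

After the zero-padding both operands map the same latent vector (e_1, e_2, e_3), and the result reproduces the coefficients of Eq. (5).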

2 Conditioning.

In a probabilistic language that maintains a joint distribution over all latent variables, such as the one presented in Ref. [1], conditioning is a statement that updates the latent distribution. One can think of conditioning in this case as a post-selection of the realizations of the latent variables that ensures that all the conditions are fulfilled. In contrast, in gprob, conditioning of x on y = 0 is an expression that produces a new random variable, whose latent space is the union of the latent spaces of x and y. In this case, one can think of conditioning as creating a version of x that is feed-forward corrected based on the measurement of y.

To show how conditioning is implemented, we consider the one-dimensional real case. Any random variable can be transformed to this representation by flattening and, if the variable is complex, by separating its real and imaginary parts. Let the conditioned variable be x, the condition be y = 0, and the result of conditioning be z ≡ x | (y = 0). All more general cases can be reduced to this one: if a non-zero value of y was observed, this value can be subtracted from y, and if there is more than one observation, the observations can be stacked together into one y vector.

The variables ๐ฑ and ๐ฒ take values in โ„dx and โ„dy, and are represented by the maps of the same vector of latent variables ๐ž=(e1,โ€ฆ,en),

๐ฑ=๐€xTโข๐ž+๐›x,๐ฒ=๐€yTโข๐ž+๐›y, (6)

where ๐€x and ๐€y are, respectively, nร—dx and nร—dy matrices, and ๐›x and ๐›y are dx and dy-size mean vectors. The rank of the condition matrix, ๐€y, is assumed to be full and equal to dyโ‰คn, which means that the conditions are neither incompatible nor tautological. If the rank of ๐€y is less than dy, yet the conditions are compatible, some of them can be eliminated from ๐€y.

Conditioning splits the latent space into two orthogonal subspaces: one in which the allowed values of the latent variables are fully determined by the requirement to yield the observed value of y, and one in which the values of the latent variables are unconstrained. The conditional mean is found by evaluating x at the value ẽ of the latent vector from the first subspace, which satisfies

\mathbf{A}_y^T \tilde{\mathbf{e}} + \mathbf{b}_y = 0.   (7)

This value can be formally found as

\tilde{\mathbf{e}} = -(\mathbf{A}_y^{+})^T \mathbf{b}_y,   (8)

where ๐€y+ is the Mooreโ€“Penrose pseudoinverse of ๐€y. (As ๐€y is normally not a square matrix, it does not have an inverse). To obtain ๐ž~ in practice, it is more convenient to directly solve the system of equations in Eq.ย (7) in the least square sense. The mean of the conditional variable ๐ณ, ๐›z, is then given by

๐›z=๐›xโˆ’๐€xTโข(๐€y+)Tโข๐›y. (9)

The conditional map, ๐€z, is found by left-multiplying ๐€x by the projector on the latent subspace unaffected by the conditioning, i.e. the subspace orthogonal to the columns of the constraint matrix ๐€y. The operator that performs the desired projection is ๐ˆโˆ’๐€yโข๐€y+, where ๐ˆ is the identity matrix, and the conditional map is

๐€z=๐€xโˆ’๐€yโข๐€y+โข๐€x. (10)

One can verify that the conditional mean and covariance that follow from Eqs. (9)-(10) agree with those obtained from the moment representation of x and y (given in Appendix A). To make this comparison, we can use the facts that the cross-covariance matrix between x and y is

\mathbf{C}_{xy} = \mathbf{A}_x^T \mathbf{A}_y,   (11)

that P = A_y A_y^+ is an orthogonal projector satisfying P² = P and P^T = P, and also that for a full-rank matrix A_y the block C_yy = A_y^T A_y is invertible, with its inverse being

\mathbf{C}_{yy}^{-1} = \mathbf{A}_y^{+} (\mathbf{A}_y^{+})^T.   (12)

3 Causal conditioning of time series. Wiener filtering.

Sec. 2 described the conditioning of all elements of one random vector x on all elements of another vector y. There is also another, more specialized, problem that sometimes arises: the conditioning of individual elements x_i, for i = 1, …, d_x, on subsets {y_j : j ≤ s_i} for some non-decreasing set of integer indices s_i. This problem appears, in particular, in the causal analysis of time series via Kalman and Wiener filtering, where the elements of x and y correspond to a range of sequential moments in time. (For an introduction to filtering see, for example, Ref. [2].) The time stamps on x and y are non-decreasing, but not necessarily equally spaced, and can also repeat. The task is, for each x_i, to find its distribution conditioned only on those observations {y_j} that came before x_i. We therefore call the corresponding problem "causal conditioning".

Causal conditioning can be performed trivially using the result of Sec. 2 by iterating over the elements of x, each time selecting an appropriate range from y. This would be, roughly speaking, d_x times more computationally expensive than the simple conditioning of all x_i on all y_j. Fortunately, there is an algorithm that makes it possible to causally condition all elements of x in one go.

To introduce this algorithm, we first note that in the case of regular conditioning both the conditional mean and the conditional map (given by Eq. (9) and Eq. (10), respectively) can be expressed using a d_y × d_x matrix B,

\mathbf{B} = \mathbf{A}_y^{+} \mathbf{A}_x,   (13)

as

\mathbf{b}_z^T = \mathbf{b}_x^T - \mathbf{b}_y^T \mathbf{B}, \qquad \mathbf{A}_z = \mathbf{A}_x - \mathbf{A}_y \mathbf{B}.   (14)

The i-th column of B consists of the coefficients that give the closest, in Euclidean distance, approximation of the i-th column of A_x by a linear combination of the columns of A_y. From this point of view, it should be clear that for vectors whose components are indexed left to right, x = (x_1, …, x_{d_x}) and y = (y_1, …, y_{d_y}), a causal structure can be enforced in conditioning by constraining the closest-approximation matrix B to be upper-triangular,

\mathbf{B} = \begin{pmatrix}
\beta_{1 1} & \beta_{1 2} & \cdots & \beta_{1 d_x} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{s_1 1} & \beta_{s_1 2} & \cdots & \beta_{s_1 d_x} \\
0 & \beta_{(s_1+1) 2} & \cdots & \beta_{(s_1+1) d_x} \\
\vdots & \vdots & \ddots & \vdots \\
0 & \beta_{s_2 2} & \cdots & \beta_{s_2 d_x} \\
0 & 0 & \cdots & \beta_{(s_2+1) d_x} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \beta_{d_y d_x}
\end{pmatrix}.   (15)

Here, the list of indices s_i is a non-decreasing sequence of integers from the range [1, d_y] that is tailored to the causal structure of the problem. For example, in causal Wiener filtering of a scalar random process x(t) with a scalar noisy measurement y(t), discretized as x_i and y_i, we could set s_i = i − 1 to use all observations before the time t_i for the prediction of x(t_i).
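The zero pattern of Eq. (15) can be encoded as a boolean mask. A small sketch, using the 0-based convention that s[i] counts the observations available to element i (the helper name causal_mask is illustrative):

```python
import numpy as np

# Sketch of the zero pattern of Eq. (15): entry (j, i) of B may be
# nonzero only if observation y_j is available for x_i. Here s[i] is
# the 0-based count of observations available for element i.

def causal_mask(s, dy):
    j = np.arange(dy)[:, None]  # row index of B
    return (j < np.asarray(s)[None, :]).astype(float)

# The Wiener-filter choice of the text, s_i = i - 1 in its 1-based
# notation: x_i may use exactly the observations strictly before it.
M = causal_mask(np.arange(4), 4)
```

For this choice the mask is strictly upper-triangular: column i has ones in its first i rows and zeros below.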

What remains is to find a way of obtaining causally-constrained matrices B. Towards this goal, we introduce the QR decomposition of A_y,

\mathbf{A}_y = \mathbf{Q} \mathbf{R},   (16)

where ๐ is a nร—dy matrix whose columns is an orthonormal set of vectors, and ๐‘ is a dyร—dy non-degenerate upper triangular matrix. The left-multiplication of ๐ by ๐‘ preserves the triangular structure (meaning that the elements of ๐ that are zero in Eq.ย (15) are also zero in the multiplication product), and, therefore, the least-square solution of the system of equations

๐€yโข๐=๐€x, (17)

with the desired causal constraint on B can be expressed as

\mathbf{B} = \mathbf{R}^{-1}\, \mathcal{M}[\mathbf{Q}^T \mathbf{A}_x],   (18)

where ℳ[…] is a masking operator that zeros, in its argument, the matrix elements that are zero in Eq. (15). Finally, the causally conditioned mean and map are given by

\mathbf{b}_z^T = \mathbf{b}_x^T - \mathbf{b}_y^T \mathbf{R}^{-1}\, \mathcal{M}[\mathbf{Q}^T \mathbf{A}_x], \qquad \mathbf{A}_z = \mathbf{A}_x - \mathbf{Q}\, \mathcal{M}[\mathbf{Q}^T \mathbf{A}_x].   (19)
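Putting Eqs. (16)-(19) together, causal conditioning can be sketched as follows (an illustration, not the gprob API; latent variables index the rows of A_x and A_y, and s[i] is the 0-based count of observations available to x_i):

```python
import numpy as np

# Illustrative sketch of Eqs. (16)-(19): causal conditioning of all
# elements of x in one go, using a single QR decomposition of A_y.

def causal_condition(A_x, A_y, b_x, b_y, s):
    Q, R = np.linalg.qr(A_y)                   # Eq. (16), reduced QR
    # The mask of Eq. (15): entry (j, i) survives only when j < s[i].
    mask = np.arange(A_y.shape[1])[:, None] < np.asarray(s)[None, :]
    MQtAx = np.where(mask, Q.T @ A_x, 0.0)     # M[Q^T A_x]
    B = np.linalg.solve(R, MQtAx)              # Eq. (18)
    b_z = b_x - B.T @ b_y                      # Eq. (19), mean
    A_z = A_x - Q @ MQtAx                      # Eq. (19), map
    return b_z, A_z

rng = np.random.default_rng(2)
n, dx, dy = 7, 4, 4
A_x = rng.standard_normal((n, dx))
A_y = rng.standard_normal((n, dy))
b_x, b_y = rng.standard_normal(dx), rng.standard_normal(dy)
b_z, A_z = causal_condition(A_x, A_y, b_x, b_y, [1, 2, 2, 4])
```

Each column i of the result matches the regular conditioning of Sec. 2 performed on the first s[i] observations, at roughly 1/d_x of the cost of the element-by-element loop.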

A similar analysis can be performed when B is generalized lower-triangular, which in Wiener filtering theory corresponds to anti-causal filtering. The main difference in this case is that A_y needs to be decomposed as Q′L, where L is lower-triangular.

Appendix A: Gaussian distributions defined via moments.

The probability density function of a d-dimensional Gaussian random variable with the mean vector m = (m_1, …, m_d) and covariance matrix C to take the value x ∈ R^d is given by

p(\mathbf{x}) = (2\pi)^{-d/2} \left(\det \mathbf{C}\right)^{-1/2} \exp\!\left(-\tfrac{1}{2} (\mathbf{x}-\mathbf{m})^T \mathbf{C}^{-1} (\mathbf{x}-\mathbf{m})\right),   (A1)

where ๐‚โˆ’1 is the inverse of the covariance. If ๐‚ is degenerate and thus non-invertible, the possible realizations of the random variable do not cover the whole space โ„d. Our convention when defining the probability density function for such variables is to assign pโข(๐ฑ)=0 to impossible samples, and, when dealing with possible samples, to use the Mooreโ€“Penrose pseudoinverse ๐‚+ instead of ๐‚โˆ’1 and the determinant of the non-degenerate projection of ๐‚ instead of detโข(๐‚) in Eq.ย (A1). This is consistent with the way degenerate distributions are handled in scipy.statsย [3].

If the random variable consists of two stacked vectors, x with dimension d_x and y with dimension d_y, and C is their joint covariance matrix expressed in the block form

\mathbf{C} = \begin{pmatrix} \mathbf{C}_{xx} & \mathbf{C}_{xy} \\ \mathbf{C}_{yx} & \mathbf{C}_{yy} \end{pmatrix},   (A2)

the conditional probability density of x given y = ỹ is Gaussian, with the mean and covariance given by

\mathbf{m}_c = \mathbf{m}_x + \mathbf{C}_{xy} \mathbf{C}_{yy}^{-1} (\tilde{\mathbf{y}} - \mathbf{m}_y),   (A3)
\mathbf{C}_c = \mathbf{C}_{xx} - \mathbf{C}_{xy} \mathbf{C}_{yy}^{-1} \mathbf{C}_{yx}.   (A4)

Appendix B: Parametric derivatives and Fisher information.

A number of information-theoretic properties of distributions whose mean and covariance depend on some parameters θ are expressed via the logarithm of the probability density and its derivatives. For non-degenerate Gaussian distributions, these are given, respectively, by

\ln p(\mathbf{x}) = -\tfrac{1}{2} \Delta\mathbf{x}^T \mathbf{C}^{-1} \Delta\mathbf{x} - \tfrac{1}{2} \ln\det(\mathbf{C}) - \tfrac{d}{2} \ln 2\pi,   (B1)

and

\frac{\partial \ln p(\mathbf{x})}{\partial \theta_i} = \frac{\partial \mathbf{m}^T}{\partial \theta_i} \mathbf{C}^{-1} \Delta\mathbf{x} + \frac{1}{2} \Delta\mathbf{x}^T \mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial \theta_i} \mathbf{C}^{-1} \Delta\mathbf{x} - \frac{1}{2} \mathrm{Tr}\!\left[\mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial \theta_i}\right],   (B2)

where

ฮ”โข๐ฑ=๐ฑโˆ’๐ฆ. (B3)

The expression for the derivatives was obtained using ∂C^{-1}/∂θ = −C^{-1} (∂C/∂θ) C^{-1}, and the fact that the matrix determinant equals the product of the eigenvalues. The components of the Fisher information matrix F_ij are given by (see, e.g., Ref. [4])

Fiโขj=โˆ‚๐ฆTโˆ‚ฮธiโข๐‚โˆ’1โขโˆ‚๐ฆโˆ‚ฮธj+12โขTrโข[๐‚โˆ’1โขโˆ‚๐‚โˆ‚ฮธiโข๐‚โˆ’1โขโˆ‚๐‚โˆ‚ฮธj], (B4)

and are related to the log probability density as

Fiโขj=โŸจโˆ‚lnโขpโข(๐ฑ)โˆ‚ฮธiโขโˆ‚lnโขpโข(๐ฑ)โˆ‚ฮธjโŸฉ=โˆ’โŸจโˆ‚2lnโขpโข(๐ฑ)โˆ‚ฮธiโขโˆ‚ฮธjโŸฉ. (B5)

For completeness, here is also the expression for the second derivatives of the log probability density,

\frac{\partial^2 \ln p(\mathbf{x})}{\partial \theta_i \partial \theta_j} = -F_{ij} + \frac{\partial^2 \mathbf{m}^T}{\partial \theta_i \partial \theta_j} \mathbf{C}^{-1} \Delta\mathbf{x} + \mathrm{Tr}\!\left[\left(\frac{1}{2} \mathbf{C}^{-1} \frac{\partial^2 \mathbf{C}}{\partial \theta_i \partial \theta_j} \mathbf{C}^{-1} + \mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial \theta_i} \frac{\partial \mathbf{C}^{-1}}{\partial \theta_j}\right) \left(\Delta\mathbf{x}\, \Delta\mathbf{x}^T - \mathbf{C}\right)\right].   (B6)

References

  • [1] D. Stein and S. Staton, "Compositional Semantics for Probabilistic Programs with Exact Conditioning," in 2021 36th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), Jun. 2021, pp. 1–13. https://ieeexplore.ieee.org/document/9470552
  • [2] R. G. Brown, Introduction to Random Signal Analysis and Kalman Filtering. Wiley, 1983.
  • [3] SciPy reference, scipy.stats.multivariate_normal.
  • [4] Y. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi, "Theoretical foundation for CMA-ES from information geometric perspective," Jun. 2012. http://arxiv.org/abs/1206.0730