gprob note

Sergey Fedorov
Quantop, Niels Bohr Institute
Copenhagen, Denmark
(October 17, 2024)

Probabilistic programs define statistical models as compositions of primitive distributions conditioned on observations. Inference of general conditional distributions is a hard task. The computational overhead of general inference methods, such as Markov chain Monte Carlo or Hamiltonian Monte Carlo, can make it difficult for probabilistic programs to compete with problem-specific approaches. For Gaussian random variables, however, this is not necessarily the case: as the composition of distributions and their conditioning can be performed analytically, the overhead of their computation can be minimal.

This note presents the background theory for gprob, a probabilistic programming package that specializes to Gaussian random variables.

1 Distributions defined via linear maps.

Any Gaussian probabilistic program [1] can be expressed as a linear map of a number of independent, identically distributed (iid) normal variables e_i with zero mean and unit variance,

e_i \sim \mathcal{N}(0, 1), \qquad i = 1, \dots, n.   (1)

In gprob, all random arrays are represented in this way, as affine maps of latent variables. E.g., a random array named x indexed by a multi-dimensional index ijk… is represented as

x_{ijk\dots} = \sum_{\lambda} e_{\lambda}\, a_{\lambda,\, ijk\dots} + b_{ijk\dots},   (2)

where {e_λ ∼ 𝒩(0, 1)}, λ ∈ ℕ, is a set of latent univariate normal variables, and a and b are numerical arrays with real or complex coefficients. The mean of x_{ijk…} equals b_{ijk…}, and the covariance is

C_{ijk\dots,\, rst\dots} = \sum_{\lambda} a_{\lambda,\, ijk\dots}\, a^{*}_{\lambda,\, rst\dots},   (3)

where rst… runs over the same set of indices as ijk…, and * denotes complex conjugation.
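To make the representation concrete, here is a minimal numpy sketch of evaluating the mean and covariance of Eqs. (2)-(3) from an (a, b) pair; the layout, with the latent index λ along the first axis of a, is an assumption for illustration rather than the actual gprob storage format:

```python
import numpy as np

# Minimal sketch of the (a, b) representation: a random array is a pair
# of numerical arrays, with the latent index lambda along the first
# axis of `a`. (Illustrative layout, not the actual gprob storage.)

def mean(a, b):
    return b  # Eq. (2): the mean is the offset b

def cov(a, b):
    # Eq. (3): C = sum_lambda a_lambda (outer) conj(a_lambda),
    # with the array indices ijk... flattened.
    af = a.reshape(a.shape[0], -1)
    return af.T @ af.conj()

# A 1-D random variable with 2 elements built from 2 latent variables,
# x_1 = e_1 + 2 e_2 + 4 and x_2 = 3 e_2 + 5.
a = np.array([[1.0, 0.0], [2.0, 3.0]])  # a[lambda, i]
b = np.array([4.0, 5.0])
C = cov(a, b)
```

For this map the covariance comes out as ((5, 6), (6, 9)), i.e. var(x_1) = 1 + 4, cov(x_1, x_2) = 2·3, var(x_2) = 9.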

New latent variables are allocated during calls to normal(). For example, normal(0, 1) produces one latent variable, and normal(size=sz), where sz is an integer, produces sz new latent variables. A global inventory of all latent variables is maintained, but their joint distributions are only materialized within individual arrays. This has an effect on the memory footprint: roughly speaking, if one random variable takes m bytes, then two arrays of n variables each will take m×(2×n²) bytes, and not m×(2×n)² bytes.

Linear operations in gprob, such as addition or concatenation, produce new random arrays. The set of latent variables of the result is the union of the sets of latent variables of the operands. The b of the result is simply the operation applied to the b's of the operands. The a of the result is found by first extending the a's of the operands along their first dimension with zero entries, such that they all map the same union vector of latent variables, and then performing the operation iteratively over the first dimension of the extended a's. (In practice, no iteration takes place; the operations on the a's are vectorized.) To give a concrete example, the sum of two one-dimensional random variables x and y defined by the maps written in matrix form as

\mathbf{x} = \begin{pmatrix} 1 & 2 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \end{pmatrix} + \begin{pmatrix} 4 \\ 5 \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} 1 & 2 \\ 3 & 0 \end{pmatrix} \begin{pmatrix} e_2 \\ e_3 \end{pmatrix} + \begin{pmatrix} -7 \\ 1 \end{pmatrix},   (4)

is another variable, z = x + y, whose map is

\mathbf{z} = \begin{pmatrix} 1 & 3 & 2 \\ 0 & 6 & 0 \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ e_3 \end{pmatrix} + \begin{pmatrix} -3 \\ 6 \end{pmatrix}.   (5)
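The example of Eqs. (4)-(5) can be reproduced with a small sketch of this mechanism; the (lat, a, b) tuple layout and the add helper are illustrative assumptions, not the gprob internals:

```python
import numpy as np

# Illustrative sketch (not the gprob internals): each operand is a tuple
# (lat, a, b), where `lat` lists the ids of its latent variables and row
# a[k] holds the coefficients of latent variable lat[k].

def add(x, y):
    lat_x, a_x, b_x = x
    lat_y, a_y, b_y = y
    # Union of the latent variables of the two operands, ordered by id.
    lat = sorted(set(lat_x) | set(lat_y))
    idx = {e: k for k, e in enumerate(lat)}
    # Extend each `a` with zero rows so that both map the same
    # union vector of latent variables.
    ax = np.zeros((len(lat), a_x.shape[1]))
    ay = np.zeros((len(lat), a_y.shape[1]))
    ax[[idx[e] for e in lat_x]] = a_x
    ay[[idx[e] for e in lat_y]] = a_y
    # The operation is then applied to the extended a's and the b's.
    return lat, ax + ay, b_x + b_y

# x depends on (e1, e2) and y on (e2, e3), as in Eq. (4).
x = ([1, 2], np.array([[1.0, 0.0], [2.0, 3.0]]), np.array([4.0, 5.0]))
y = ([2, 3], np.array([[1.0, 3.0], [2.0, 0.0]]), np.array([-7.0, 1.0]))
lat, a_z, b_z = add(x, y)
# lat = [1, 2, 3], b_z = (-3, 6)
```

After the zero-padding both operands map the same latent vector (e_1, e_2, e_3), and the result reproduces the coefficients of Eq. (5).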

2 Conditioning.

In a probabilistic language that maintains a joint distribution over all latent variables, such as the one presented in Ref. [1], conditioning is a statement that updates the latent distribution. One can think of conditioning in this case as a post-selection of the realizations of the latent variables that ensures that all the conditions are fulfilled. In contrast, in gprob, conditioning of x on y = 0 is an expression that produces a new random variable, whose latent space is the union of the latent spaces of x and y. In this case, one can think of conditioning as creating a version of x that is feed-forward corrected based on the measurement of y.

To show how conditioning is implemented, we consider the one-dimensional real case. Any random variable can be transformed to this representation by flattening and, if the variable is complex, by separating its real and imaginary parts. Let the conditioned variable be x, the condition be y = 0, and the result of conditioning be z ≡ x | (y = 0). All more general cases can be reduced to this one: if a non-zero value of y was observed, this value can be subtracted from y, and if there is more than one observation, the observations can be stacked together into one y vector.

The variables ๐ฑ and ๐ฒ take values in โ„dx and โ„dy, and are represented by the maps of the same vector of latent variables ๐ž=(e1,โ€ฆ,en),

๐ฑ=๐€xTโข๐ž+๐›x,๐ฒ=๐€yTโข๐ž+๐›y, (6)

where ๐€x and ๐€y are, respectively, nร—dx and nร—dy matrices, and ๐›x and ๐›y are dx and dy-size mean vectors. The rank of the condition matrix, ๐€y, is assumed to be full and equal to dyโ‰คn, which means that the conditions are neither incompatible nor tautological. If the rank of ๐€y is less than dy, yet the conditions are compatible, some of them can be eliminated from ๐€y.

Conditioning splits the latent space into two orthogonal subspaces: one in which the allowed values of the latent variables are fully determined by the requirement to yield the observed value of y, and one in which the values of the latent variables are unconstrained. The conditional mean is found by evaluating x at the value ẽ of the latent vector from the first subspace, which satisfies

\mathbf{A}_y^T \tilde{\mathbf{e}} + \mathbf{b}_y = 0.   (7)

This value can be formally found as

\tilde{\mathbf{e}} = -(\mathbf{A}_y^{+})^T \mathbf{b}_y,   (8)

where ๐€y+ is the Mooreโ€“Penrose pseudoinverse of ๐€y. (As ๐€y is normally not a square matrix, it does not have an inverse). To obtain ๐ž~ in practice, it is more convenient to directly solve the system of equations in Eq.ย (7) in the least square sense. The mean of the conditional variable ๐ณ, ๐›z, is then given by

๐›z=๐›xโˆ’๐€xTโข(๐€y+)Tโข๐›y. (9)

The conditional map, ๐€z, is found by left-multiplying ๐€x by the projector on the latent subspace unaffected by the conditioning, i.e. the subspace orthogonal to the columns of the constraint matrix ๐€y. The operator that performs the desired projection is ๐ˆโˆ’๐€yโข๐€y+, where ๐ˆ is the identity matrix, and the conditional map is

๐€z=๐€xโˆ’๐€yโข๐€y+โข๐€x. (10)

One can verify that the conditional mean and covariance that follow from Eqs. (9)-(10) agree with those obtained from the moment representation of x and y (given in Appendix A). To make this comparison, we can use the facts that the cross-covariance matrix between x and y is

\mathbf{C}_{xy} = \mathbf{A}_x^T \mathbf{A}_y,   (11)

that P = A_y A_y^+ is an orthogonal projector satisfying P² = P and P^T = P, and also that for a full-rank matrix A_y the block C_yy = A_y^T A_y is invertible, with its inverse being

\mathbf{C}_{yy}^{-1} = \mathbf{A}_y^{+} (\mathbf{A}_y^{+})^T.   (12)

3 Causal conditioning of time series. Wiener filtering.

Sec. 2 described the conditioning of all elements of one random vector x on all elements of another vector y. There is also another, more specialized, problem that sometimes arises: the conditioning of individual elements x_i, for i = 1, …, d_x, on subsets {y_j : j ≤ s_i} for some non-decreasing set of integer indices s_i. This problem appears, in particular, in the causal analysis of time series via Kalman and Wiener filtering, where the elements of x and y correspond to a range of sequential moments in time. (For an introduction to filtering see, for example, Ref. [2].) The time stamps on x and y are non-decreasing, but not necessarily equally spaced, and can also repeat. The task is, for each x_i, to find its distribution conditioned only on those observations {y_j} that came before x_i. We therefore call the corresponding problem "causal conditioning".

Causal conditioning can be performed trivially using the result of Sec. 2 by iterating over the elements of x, each time selecting an appropriate range from y. This would be, roughly speaking, d_x times more computationally expensive than the simple conditioning of all x_i on all y_j. Fortunately, there is an algorithm that makes it possible to causally condition all elements of x in one go.

To introduce this algorithm, we first note that in the case of regular conditioning both the conditional mean and the conditional map (given by Eq. (9) and Eq. (10), respectively) can be expressed using a d_y × d_x matrix B,

\mathbf{B} = \mathbf{A}_y^{+} \mathbf{A}_x,   (13)

as

\mathbf{b}_z^T = \mathbf{b}_x^T - \mathbf{b}_y^T \mathbf{B}, \qquad \mathbf{A}_z = \mathbf{A}_x - \mathbf{A}_y \mathbf{B}.   (14)

The i-th column of B consists of the coefficients that give the closest, in Euclidean distance, approximation of the i-th column of A_x by a linear combination of the columns of A_y. From this point of view, it should be clear that for vectors whose components are indexed left to right, x = (x_1, …, x_{d_x}) and y = (y_1, …, y_{d_y}), a causal structure can be enforced in conditioning by constraining the closest-approximation matrix B to be upper-triangular,

\mathbf{B} = \begin{pmatrix}
\beta_{1 1} & \beta_{1 2} & \cdots & \beta_{1 d_x} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{s_1 1} & \beta_{s_1 2} & \cdots & \beta_{s_1 d_x} \\
0 & \beta_{(s_1+1) 2} & \cdots & \beta_{(s_1+1) d_x} \\
\vdots & \vdots & \ddots & \vdots \\
0 & \beta_{s_2 2} & \cdots & \beta_{s_2 d_x} \\
0 & 0 & \cdots & \beta_{(s_2+1) d_x} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \beta_{d_y d_x}
\end{pmatrix}.   (15)

Here, the list of indices s_i is a non-decreasing sequence of integers from the range [1, d_y] that is tailored to the causal structure of the problem. For example, in causal Wiener filtering of a scalar random process x(t) with a scalar noisy measurement y(t), discretized as x_i and y_i, we could set s_i = i − 1 to use all observations before the time t_i for the prediction of x(t_i).
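The zero pattern of Eq. (15) can be encoded as a boolean mask. A small sketch, using the 0-based convention that s[i] counts the observations available to element i (the helper name causal_mask is illustrative):

```python
import numpy as np

# Sketch of the zero pattern of Eq. (15): entry (j, i) of B may be
# nonzero only if observation y_j is available for x_i. Here s[i] is
# the 0-based count of observations available for element i.

def causal_mask(s, dy):
    j = np.arange(dy)[:, None]  # row index of B
    return (j < np.asarray(s)[None, :]).astype(float)

# The Wiener-filter choice of the text, s_i = i - 1 in its 1-based
# notation: x_i may use exactly the observations strictly before it.
M = causal_mask(np.arange(4), 4)
```

For this choice the mask is strictly upper-triangular: column i has ones in its first i rows and zeros below.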

What remains is to find a way of obtaining causally-constrained matrices B. Towards this goal, we introduce the QR decomposition of A_y,

\mathbf{A}_y = \mathbf{Q} \mathbf{R},   (16)

where ๐ is a nร—dy matrix whose columns is an orthonormal set of vectors, and ๐‘ is a dyร—dy non-degenerate upper triangular matrix. The left-multiplication of ๐ by ๐‘ preserves the triangular structure (meaning that the elements of ๐ that are zero in Eq.ย (15) are also zero in the multiplication product), and, therefore, the least-square solution of the system of equations

๐€yโข๐=๐€x, (17)

with the desired causal constraint on B can be expressed as

\mathbf{B} = \mathbf{R}^{-1}\, \mathcal{M}[\mathbf{Q}^T \mathbf{A}_x],   (18)

where ℳ[…] is a masking operator that zeros, in its argument, the matrix elements that are zero in Eq. (15). Finally, the causally conditioned mean and map are given by

\mathbf{b}_z^T = \mathbf{b}_x^T - \mathbf{b}_y^T \mathbf{R}^{-1}\, \mathcal{M}[\mathbf{Q}^T \mathbf{A}_x], \qquad \mathbf{A}_z = \mathbf{A}_x - \mathbf{Q}\, \mathcal{M}[\mathbf{Q}^T \mathbf{A}_x].   (19)
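Putting Eqs. (16)-(19) together, causal conditioning can be sketched as follows (an illustration, not the gprob API; latent variables index the rows of A_x and A_y, and s[i] is the 0-based count of observations available to x_i):

```python
import numpy as np

# Illustrative sketch of Eqs. (16)-(19): causal conditioning of all
# elements of x in one go, using a single QR decomposition of A_y.

def causal_condition(A_x, A_y, b_x, b_y, s):
    Q, R = np.linalg.qr(A_y)                   # Eq. (16), reduced QR
    # The mask of Eq. (15): entry (j, i) survives only when j < s[i].
    mask = np.arange(A_y.shape[1])[:, None] < np.asarray(s)[None, :]
    MQtAx = np.where(mask, Q.T @ A_x, 0.0)     # M[Q^T A_x]
    B = np.linalg.solve(R, MQtAx)              # Eq. (18)
    b_z = b_x - B.T @ b_y                      # Eq. (19), mean
    A_z = A_x - Q @ MQtAx                      # Eq. (19), map
    return b_z, A_z

rng = np.random.default_rng(2)
n, dx, dy = 7, 4, 4
A_x = rng.standard_normal((n, dx))
A_y = rng.standard_normal((n, dy))
b_x, b_y = rng.standard_normal(dx), rng.standard_normal(dy)
b_z, A_z = causal_condition(A_x, A_y, b_x, b_y, [1, 2, 2, 4])
```

Each column i of the result matches the regular conditioning of Sec. 2 performed on the first s[i] observations, at roughly 1/d_x of the cost of the element-by-element loop.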

A similar analysis can be performed when B is generalized lower-triangular, which in Wiener filtering theory corresponds to anti-causal filtering. The main difference in this case is that A_y needs to be decomposed as Q′L, where L is lower-triangular.

Appendix A: Gaussian distributions defined via moments.

The probability density function of a d-dimensional Gaussian random variable with the mean vector m = (m_1, …, m_d) and covariance matrix C to take the value x ∈ R^d is given by

p(\mathbf{x}) = (2\pi)^{-d/2} \left(\det \mathbf{C}\right)^{-1/2} \exp\!\left(-\tfrac{1}{2} (\mathbf{x}-\mathbf{m})^T \mathbf{C}^{-1} (\mathbf{x}-\mathbf{m})\right),   (A1)

where ๐‚โˆ’1 is the inverse of the covariance. If ๐‚ is degenerate and thus non-invertible, the possible realizations of the random variable do not cover the whole space โ„d. Our convention when defining the probability density function for such variables is to assign pโข(๐ฑ)=0 to impossible samples, and, when dealing with possible samples, to use the Mooreโ€“Penrose pseudoinverse ๐‚+ instead of ๐‚โˆ’1 and the determinant of the non-degenerate projection of ๐‚ instead of detโข(๐‚) in Eq.ย (A1). This is consistent with the way degenerate distributions are handled in scipy.statsย [3].

If the random variable consists of two stacked vectors, x with dimension d_x and y with dimension d_y, and C is their joint covariance matrix expressed in the block form

\mathbf{C} = \begin{pmatrix} \mathbf{C}_{xx} & \mathbf{C}_{xy} \\ \mathbf{C}_{yx} & \mathbf{C}_{yy} \end{pmatrix},   (A2)

the conditional probability density of x given y = ỹ is Gaussian, with the mean and covariance given by

\mathbf{m}_c = \mathbf{m}_x + \mathbf{C}_{xy} \mathbf{C}_{yy}^{-1} (\tilde{\mathbf{y}} - \mathbf{m}_y),   (A3)
\mathbf{C}_c = \mathbf{C}_{xx} - \mathbf{C}_{xy} \mathbf{C}_{yy}^{-1} \mathbf{C}_{yx}.   (A4)

Appendix B: Parametric derivatives and Fisher information.

A number of information-theoretic properties of distributions whose mean and covariance depend on some parameters θ are expressed via the logarithm of the probability density and its derivatives. For non-degenerate Gaussian distributions, these are given, respectively, by

\ln p(\mathbf{x}) = -\tfrac{1}{2} \Delta\mathbf{x}^T \mathbf{C}^{-1} \Delta\mathbf{x} - \tfrac{1}{2} \ln\det(\mathbf{C}) - \tfrac{d}{2} \ln 2\pi,   (B1)

and

\frac{\partial \ln p(\mathbf{x})}{\partial \theta_i} = \frac{\partial \mathbf{m}^T}{\partial \theta_i} \mathbf{C}^{-1} \Delta\mathbf{x} + \frac{1}{2} \Delta\mathbf{x}^T \mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial \theta_i} \mathbf{C}^{-1} \Delta\mathbf{x} - \frac{1}{2} \mathrm{Tr}\!\left[\mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial \theta_i}\right],   (B2)

where

ฮ”โข๐ฑ=๐ฑโˆ’๐ฆ. (B3)

The expression for the derivatives was obtained using ∂C^{-1}/∂θ = −C^{-1} (∂C/∂θ) C^{-1}, and the fact that the matrix determinant equals the product of the eigenvalues. The components of the Fisher information matrix F_ij are given by (see, e.g., Ref. [4])

Fiโขj=โˆ‚๐ฆTโˆ‚ฮธiโข๐‚โˆ’1โขโˆ‚๐ฆโˆ‚ฮธj+12โขTrโข[๐‚โˆ’1โขโˆ‚๐‚โˆ‚ฮธiโข๐‚โˆ’1โขโˆ‚๐‚โˆ‚ฮธj], (B4)

and are related to the log probability density as

Fiโขj=โŸจโˆ‚lnโขpโข(๐ฑ)โˆ‚ฮธiโขโˆ‚lnโขpโข(๐ฑ)โˆ‚ฮธjโŸฉ=โˆ’โŸจโˆ‚2lnโขpโข(๐ฑ)โˆ‚ฮธiโขโˆ‚ฮธjโŸฉ. (B5)

For completeness, here is also the expression for the second derivatives of the log probability density,

\frac{\partial^2 \ln p(\mathbf{x})}{\partial \theta_i \partial \theta_j} = -F_{ij} + \frac{\partial^2 \mathbf{m}^T}{\partial \theta_i \partial \theta_j} \mathbf{C}^{-1} \Delta\mathbf{x} + \mathrm{Tr}\!\left[\left(\frac{1}{2} \mathbf{C}^{-1} \frac{\partial^2 \mathbf{C}}{\partial \theta_i \partial \theta_j} \mathbf{C}^{-1} + \mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial \theta_i} \frac{\partial \mathbf{C}^{-1}}{\partial \theta_j}\right) \left(\Delta\mathbf{x}\, \Delta\mathbf{x}^T - \mathbf{C}\right)\right].   (B6)

References

  • [1] D. Stein and S. Staton, "Compositional Semantics for Probabilistic Programs with Exact Conditioning," in 2021 36th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), Jun. 2021, pp. 1–13. https://ieeexplore.ieee.org/document/9470552
  • [2] R. G. Brown, Introduction to Random Signal Analysis and Kalman Filtering. Wiley, 1983.
  • [3] SciPy reference, scipy.stats.multivariate_normal.
  • [4] Y. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi, "Theoretical foundation for CMA-ES from information geometric perspective," Jun. 2012. http://arxiv.org/abs/1206.0730