Bayesian Inference for the Gaussian#

Let \(\mathbf{x} \in \mathbb{R}^d\) be drawn from a multivariate normal distribution with unknown mean \(\boldsymbol{\mu} \in \mathbb{R}^d\) and known covariance \(\boldsymbol{\Sigma}\):

\[ \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \]

The density of this distribution is:

\[ p(\mathbf{x} \mid \boldsymbol{\mu}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \]

Rewriting this in exponential family form:

\[ p(\mathbf{x} \mid \boldsymbol{\mu}) = h(\mathbf{x}) \exp\left( \boldsymbol{\eta}^\top \mathbf{x} - A(\boldsymbol{\eta}) \right) \]

Where:

  • The natural parameter is: \(\boldsymbol{\eta} = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}\)

  • The sufficient statistic is: \(T(\mathbf{x}) = \mathbf{x}\)

  • The log-partition function is: \(A(\boldsymbol{\eta}) = \frac{1}{2} \boldsymbol{\eta}^\top \boldsymbol{\Sigma} \boldsymbol{\eta}\)

  • The base measure is: \(h(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} \mathbf{x}^\top \boldsymbol{\Sigma}^{-1} \mathbf{x} \right)\)

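As a quick numerical sanity check (a minimal sketch using numpy and scipy, not part of the derivation), we can confirm that this factorization reproduces the Gaussian log-density:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
B = rng.normal(size=(d, d))
Sigma = B @ B.T + d * np.eye(d)           # a random SPD covariance
x = rng.normal(size=d)

Sigma_inv = np.linalg.inv(Sigma)
eta = Sigma_inv @ mu                      # natural parameter
A_eta = 0.5 * eta @ Sigma @ eta           # log-partition function A(eta)
log_h = (-0.5 * x @ Sigma_inv @ x         # log of the base measure h(x)
         - 0.5 * d * np.log(2 * np.pi)
         - 0.5 * np.linalg.slogdet(Sigma)[1])

# h(x) * exp(eta^T x - A(eta)) should equal the Gaussian density
assert np.isclose(log_h + eta @ x - A_eta,
                  multivariate_normal(mu, Sigma).logpdf(x))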

Conjugate Prior for the Mean#

Given this exponential family form, the conjugate prior for the unknown mean \(\boldsymbol{\mu}\) (with fixed covariance \(\boldsymbol{\Sigma}\)) is a Gaussian prior:

\[ \boldsymbol{\mu} \sim \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Lambda}_0) \]

This conjugate prior leads to a posterior distribution that is again Gaussian.

Derivation of the Posterior Distribution#

Given:

  • Observations: \(\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d\) i.i.d. from \(\mathbf{x}_i \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) with known covariance \(\boldsymbol{\Sigma}\) and unknown mean \(\boldsymbol{\mu}\)

  • Prior: \(\boldsymbol{\mu} \sim \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Lambda}_0)\)

We want to compute the posterior distribution:

\[ p(\boldsymbol{\mu} \mid \mathbf{x}_{1:n}) \propto p(\boldsymbol{\mu}) \prod_{i=1}^n p(\mathbf{x}_i \mid \boldsymbol{\mu}) \]

Step 1: Likelihood and Prior (log form)#

Log likelihood:#

\[ \log p(\mathbf{x}_{1:n} \mid \boldsymbol{\mu}) = - \frac{n}{2} \log |2\pi \boldsymbol{\Sigma}| - \frac{1}{2} \sum_{i=1}^n (\mathbf{x}_i - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \]

Let \(\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i\) denote the sample mean. The sum of squared deviations then decomposes as:

\[ \sum_{i=1}^n (\mathbf{x}_i - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) = n (\boldsymbol{\mu} - \bar{\mathbf{x}})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \bar{\mathbf{x}}) + C \]

where \(C = \sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}})\) does not depend on \(\boldsymbol{\mu}\). So, up to constants:

\[ \log p(\mathbf{x}_{1:n} \mid \boldsymbol{\mu}) = \text{const} - \frac{n}{2} (\boldsymbol{\mu} - \bar{\mathbf{x}})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \bar{\mathbf{x}}) \]
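A quick numeric check of this decomposition (a standalone sketch with arbitrary values):

import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 6
X = rng.normal(size=(n, d))
mu = rng.normal(size=d)
Sigma_inv = np.linalg.inv(np.array([[0.5, 0.2], [0.2, 1.0]]))  # any SPD precision
x_bar = X.mean(axis=0)

lhs = sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)
C = sum((x - x_bar) @ Sigma_inv @ (x - x_bar) for x in X)      # independent of mu
rhs = n * (mu - x_bar) @ Sigma_inv @ (mu - x_bar) + C
assert np.isclose(lhs, rhs)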

Log prior:

\[ \log p(\boldsymbol{\mu}) = - \frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^\top \boldsymbol{\Lambda}_0^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) + \text{const} \]

Step 2: Posterior log density (unnormalized)#

Add log-prior and log-likelihood:

\[ \log p(\boldsymbol{\mu} \mid \mathbf{x}_{1:n}) = \text{const} - \frac{n}{2} (\boldsymbol{\mu} - \bar{\mathbf{x}})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \bar{\mathbf{x}}) - \frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^\top \boldsymbol{\Lambda}_0^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \]

Step 3: Complete the Square#

We want to write the log posterior in the form:

\[ - \frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_n)^\top \boldsymbol{\Lambda}_n^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_n) + \text{const} \]

To do this, combine the two quadratic terms:

\[ \log p(\boldsymbol{\mu} \mid \mathbf{x}_{1:n}) = \text{const} - \frac{1}{2} \boldsymbol{\mu}^\top (n \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Lambda}_0^{-1}) \boldsymbol{\mu} + \boldsymbol{\mu}^\top \left(n \boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Lambda}_0^{-1} \boldsymbol{\mu}_0 \right) \]

Now we complete the square:

Let

  • Precision matrix:

    \[ \boldsymbol{\Lambda}_n^{-1} = n \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Lambda}_0^{-1} \]
  • Mean:

    \[ \boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n \left(n \boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Lambda}_0^{-1} \boldsymbol{\mu}_0 \right) \]

Then:

\[ \log p(\boldsymbol{\mu} \mid \mathbf{x}_{1:n}) = - \frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_n)^\top \boldsymbol{\Lambda}_n^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_n) + \text{const} \]

✅ Final Posterior#

The posterior is a Gaussian distribution:

\[ \boxed{ p(\boldsymbol{\mu} \mid \mathbf{x}_{1:n}) = \mathcal{N} \left( \boldsymbol{\mu}_n,\; \boldsymbol{\Lambda}_n \right) } \]

With:

  • Posterior mean:

    \[ \boldsymbol{\mu}_n = \left(n \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Lambda}_0^{-1} \right)^{-1} \left(n \boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Lambda}_0^{-1} \boldsymbol{\mu}_0 \right) \]
  • Posterior covariance:

    \[ \boldsymbol{\Lambda}_n = \left(n \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Lambda}_0^{-1} \right)^{-1} \]

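These two formulas translate directly into code. Here is a minimal helper (the name gaussian_mean_posterior is ours, introduced for illustration; it assumes \(n \ge 1\)):

import numpy as np

def gaussian_mean_posterior(X, Sigma, mu0, Lambda0):
    """Posterior N(mu_n, Lambda_n) over the mean of a Gaussian with known
    covariance Sigma, given data X of shape (n, d) and prior N(mu0, Lambda0)."""
    n = X.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    Lambda0_inv = np.linalg.inv(Lambda0)
    Lambda_n = np.linalg.inv(n * Sigma_inv + Lambda0_inv)                   # posterior covariance
    mu_n = Lambda_n @ (n * Sigma_inv @ X.mean(axis=0) + Lambda0_inv @ mu0)  # posterior mean
    return mu_n, Lambda_n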
Interpretation#

  • The posterior mean is a precision-weighted average of the prior mean \(\boldsymbol{\mu}_0\) and the sample mean \(\bar{\mathbf{x}}\).

  • The prior can be read as supplying pseudo-observations: if \(\boldsymbol{\Lambda}_0 = \boldsymbol{\Sigma} / n_0\), the prior carries the same weight as \(n_0\) real observations, since the posterior precision becomes \((n + n_0)\boldsymbol{\Sigma}^{-1}\).

  • As \(n \to \infty\), the data dominate: \(\boldsymbol{\mu}_n \to \bar{\mathbf{x}}\) (the MLE) and \(\boldsymbol{\Lambda}_n \to \mathbf{0}\), as the short check after this list illustrates.

  • When \(n = 0\), the posterior is just the prior.

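The two limiting cases are easy to confirm with the helper above (values here are arbitrary, chosen for illustration):

import numpy as np

mu_true = np.array([2.0, -1.0])
Sigma = np.array([[0.5, 0.2], [0.2, 1.0]])
rng = np.random.default_rng(0)

for n in [1, 10, 1000]:
    X = rng.multivariate_normal(mu_true, Sigma, size=n)
    mu_n, Lambda_n = gaussian_mean_posterior(X, Sigma, np.zeros(2), np.eye(2))
    # mu_n approaches the sample mean (the MLE) and the posterior
    # covariance shrinks roughly like 1/n
    print(n, np.round(mu_n, 3), np.round(np.trace(Lambda_n), 4))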

Visualizing Posterior Updates for \(\boldsymbol{\mu}\)#

Let’s visualize how the posterior distribution over the mean \(\boldsymbol{\mu}\) updates as we observe more data points from a 2D Gaussian with known covariance.

We use a Gaussian prior with mean \(\boldsymbol{\mu}_0 = (0, 0)\) and covariance \(\boldsymbol{\Lambda}_0 = \mathbf{I}\), and draw 10 data points from a 2D Gaussian with known covariance \(\boldsymbol{\Sigma} = \begin{pmatrix} 0.5 & 0.2 \\ 0.2 & 1.0 \end{pmatrix}\). The panels below show the posterior after \(n = 0\) (the prior alone) through \(n = 9\) observations.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def draw_cov_ellipse(mean, cov, ax, n_std=2, **kwargs):
    """Draw an ellipse representing the covariance matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = eigvals.argsort()[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    angle = np.degrees(np.arctan2(*eigvecs[:, 0][::-1]))
    width, height = 2 * n_std * np.sqrt(eigvals)
    ellipse = Ellipse(xy=mean, width=width, height=height, angle=angle, **kwargs)
    ax.add_patch(ellipse)

def posterior_updates_viz():
    # Ground truth
    mu_true = np.array([2.0, -1.0])
    Sigma = np.array([[0.5, 0.2], [0.2, 1.0]])  # known covariance

    # Prior
    mu0 = np.array([0.0, 0.0])
    Lambda0 = np.eye(2)

    # Generate data
    n_points = 10
    X = np.random.multivariate_normal(mu_true, Sigma, size=n_points)

    fig, axes = plt.subplots(2, 5, figsize=(16, 6))
    axes = axes.flatten()

    # Precompute the fixed precision matrices
    Lambda0_inv = np.linalg.inv(Lambda0)
    Sigma_inv = np.linalg.inv(Sigma)

    for i in range(n_points):
        ax = axes[i]

        if i == 0:
            # No data yet: the posterior is just the prior
            mu_n = mu0
            Lambda_n = Lambda0
            title = "n = 0 (Prior)"
        else:
            x_bar = np.mean(X[:i], axis=0)  # mean of the first i observations
            Lambda_n = np.linalg.inv(Lambda0_inv + i * Sigma_inv)             # posterior covariance
            mu_n = Lambda_n @ (Lambda0_inv @ mu0 + i * Sigma_inv @ x_bar)     # posterior mean
            title = f"n = {i}"

        # Plot
        ax.scatter(X[:i, 0], X[:i, 1], c='black', s=20, label='Data')
        draw_cov_ellipse(mu_n, Lambda_n, ax, edgecolor='blue', lw=2, facecolor='none', label='Posterior')
        ax.scatter(*mu_true, color='green', label='True Mean', marker='x', s=100)
        ax.set_xlim(-2, 4)
        ax.set_ylim(-4, 2)
        ax.set_aspect('equal')
        ax.set_title(title)
        ax.grid(True)

        if i == 0:
            draw_cov_ellipse(mu0, Lambda0, ax, edgecolor='gray', lw=2, facecolor='none', label='Prior')

        if i == 5:
            ax.legend()

    plt.suptitle("Posterior Updates for Mean of 2D Gaussian", fontsize=16)
    plt.tight_layout()
    plt.show()

posterior_updates_viz()
[Figure: 2×5 grid of panels, "Posterior Updates for Mean of 2D Gaussian", showing the posterior over the mean after n = 0 (prior) through n = 9 observations]

🧠 What You’ll See#

  • The prior ellipse (gray) is centered at \(\boldsymbol{\mu}_0 = (0, 0)\)

  • As each new data point is observed:

    • The posterior mean moves toward the true mean

    • The posterior uncertainty (ellipse size) shrinks

  • By the last panel (\(n = 9\)), the posterior is tightly concentrated around the true mean

Bayesian Linear Regression#

Let’s derive Bayesian Linear Regression using:

  • A Gaussian prior on the weights: \(\mathbf{w} \sim \mathcal{N}(\mathbf{w}_0, \boldsymbol{\Lambda}_0)\)

  • A Gaussian likelihood for outputs: \(y_i \mid \mathbf{x}_i, \mathbf{w} \sim \mathcal{N}(\mathbf{x}_i^\top \mathbf{w}, \sigma^2)\)

  • The method of completing the square to compute the posterior over weights.


Setup: Likelihood and Prior#

Let:

  • \(\mathbf{X} \in \mathbb{R}^{n \times d}\): design matrix (rows are \(\mathbf{x}_i^\top\))

  • \(\mathbf{y} \in \mathbb{R}^n\): target vector

Likelihood#

\[ p(\mathbf{y} \mid \mathbf{w}, \sigma^2) = \mathcal{N}(\mathbf{y} \mid \mathbf{Xw}, \sigma^2 \mathbf{I}) \]

The log-likelihood is (ignoring constants):

\[ \log p(\mathbf{y} \mid \mathbf{w}) = -\frac{1}{2\sigma^2} \| \mathbf{y} - \mathbf{Xw} \|^2 \]

Prior#

\[ p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_0, \boldsymbol{\Lambda}_0) \]

Log prior:

\[ \log p(\mathbf{w}) = -\frac{1}{2} (\mathbf{w} - \mathbf{w}_0)^\top \boldsymbol{\Lambda}_0^{-1} (\mathbf{w} - \mathbf{w}_0) \]

🧠 Posterior (up to normalization)#

We combine the log prior and log likelihood:

\[ \log p(\mathbf{w} \mid \mathbf{y}) = \text{const} - \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{Xw})^\top (\mathbf{y} - \mathbf{Xw}) - \frac{1}{2} (\mathbf{w} - \mathbf{w}_0)^\top \boldsymbol{\Lambda}_0^{-1} (\mathbf{w} - \mathbf{w}_0) \]

We now complete the square to identify the posterior as a multivariate normal in \(\mathbf{w}\).


🧩 Completing the Square#

Expand each term:#

Likelihood term:

\[ \| \mathbf{y} - \mathbf{Xw} \|^2 = \mathbf{y}^\top \mathbf{y} - 2 \mathbf{y}^\top \mathbf{Xw} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{X} \mathbf{w} \]

Prior term:

\[ (\mathbf{w} - \mathbf{w}_0)^\top \boldsymbol{\Lambda}_0^{-1} (\mathbf{w} - \mathbf{w}_0) = \mathbf{w}^\top \boldsymbol{\Lambda}_0^{-1} \mathbf{w} - 2 \mathbf{w}_0^\top \boldsymbol{\Lambda}_0^{-1} \mathbf{w} + \mathbf{w}_0^\top \boldsymbol{\Lambda}_0^{-1} \mathbf{w}_0 \]

Combine terms:#

\[ \log p(\mathbf{w} \mid \mathbf{y}) = - \frac{1}{2} \mathbf{w}^\top \left( \frac{1}{\sigma^2} \mathbf{X}^\top \mathbf{X} + \boldsymbol{\Lambda}_0^{-1} \right) \mathbf{w} + \mathbf{w}^\top \left( \frac{1}{\sigma^2} \mathbf{X}^\top \mathbf{y} + \boldsymbol{\Lambda}_0^{-1} \mathbf{w}_0 \right) + \text{const} \]

✅ Posterior Distribution#

This is the canonical quadratic form of a Gaussian log-density:

\[ \log p(\mathbf{w} \mid \mathbf{y}) = \text{const} - \frac{1}{2} (\mathbf{w} - \mathbf{w}_n)^\top \boldsymbol{\Lambda}_n^{-1} (\mathbf{w} - \mathbf{w}_n) \]

Hence the posterior is Gaussian:

\[ \boxed{ p(\mathbf{w} \mid \mathbf{y}) = \mathcal{N}(\mathbf{w}_n, \boldsymbol{\Lambda}_n) } \]

Where:

  • Posterior covariance:

    \[ \boxed{ \boldsymbol{\Lambda}_n = \left( \frac{1}{\sigma^2} \mathbf{X}^\top \mathbf{X} + \boldsymbol{\Lambda}_0^{-1} \right)^{-1} } \]
  • Posterior mean:

    \[ \boxed{ \mathbf{w}_n = \boldsymbol{\Lambda}_n \left( \frac{1}{\sigma^2} \mathbf{X}^\top \mathbf{y} + \boldsymbol{\Lambda}_0^{-1} \mathbf{w}_0 \right) } \]

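As before, these formulas map directly to a few lines of code (bayesian_linreg_posterior is a name introduced here for illustration):

import numpy as np

def bayesian_linreg_posterior(X, y, sigma2, w0, Lambda0):
    """Posterior N(w_n, Lambda_n) over regression weights, given noise
    variance sigma2 and Gaussian prior N(w0, Lambda0)."""
    Lambda0_inv = np.linalg.inv(Lambda0)
    Lambda_n = np.linalg.inv((X.T @ X) / sigma2 + Lambda0_inv)  # posterior covariance
    w_n = Lambda_n @ (X.T @ y / sigma2 + Lambda0_inv @ w0)      # posterior mean
    return w_n, Lambda_n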
📌 Interpretation#

  • If \(\boldsymbol{\Lambda}_0 = \tau^2 \mathbf{I}\) and \(\mathbf{w}_0 = \mathbf{0}\), the posterior mean is exactly the ridge regression solution with \(\ell_2\) penalty \(\lambda = \sigma^2 / \tau^2\) (verified numerically below).

  • The prior acts as a regularizer, pulling the weights toward \(\mathbf{w}_0\).

  • As \(n \to \infty\), the likelihood dominates and the posterior mean converges to the MLE (least-squares) solution.

  • When \(n = 0\), the posterior equals the prior.

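The ridge connection in the first bullet can be checked on synthetic data (a small sketch using the helper above; all values are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

sigma2, tau2 = 0.1 ** 2, 1.0
w_n, _ = bayesian_linreg_posterior(X, y, sigma2, np.zeros(d), tau2 * np.eye(d))

lam = sigma2 / tau2  # effective ridge penalty
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
assert np.allclose(w_n, w_ridge)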

Bayesian Polynomial Regression Example#

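The example below also plots predictive uncertainty. For a test input \(\mathbf{x}_*\), integrating the Gaussian likelihood against the Gaussian posterior gives the posterior predictive distribution (the standard result computed by the y_pred_mean and y_pred_var lines in the cell):

\[ p(y_* \mid \mathbf{x}_*, \mathbf{y}) = \mathcal{N}\left( \mathbf{x}_*^\top \mathbf{w}_n,\; \mathbf{x}_*^\top \boldsymbol{\Lambda}_n \mathbf{x}_* + \sigma^2 \right) \]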
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures

def sine_polynomial_bayes_posterior_viz(n=15, degree=9, noise_std=0.1, alpha=1.0, seed=42):
    np.random.seed(seed)

    def true_func(x): return np.sin(2 * np.pi * x)

    # Pool of random inputs and standard-normal noise draws; the first n are used
    Xr = np.random.rand(1000, 1)
    yr = np.random.randn(1000)

    # Generate test data
    X_test = np.linspace(0, 1, 500).reshape(-1, 1)
    y_true = true_func(X_test)

    # Polynomial features
    poly = PolynomialFeatures(degree, include_bias=True)
    X_test_poly = poly.fit_transform(X_test)

    d = X_test_poly.shape[1]

    # Prior on weights
    w0 = np.zeros(d)
    Lambda0 = alpha * np.eye(d)  # prior covariance
    Lambda0_inv = np.linalg.inv(Lambda0)

    if n > 0:
        X = np.sort(Xr[:n], axis=0)
        y = true_func(X).ravel() + yr[:n] * noise_std

        X_poly = poly.transform(X)

        # Posterior parameters
        sigma2 = noise_std ** 2
        precision_lik = (1 / sigma2) * (X_poly.T @ X_poly)
        precision_post = precision_lik + Lambda0_inv
        cov_post = np.linalg.inv(precision_post)
        mean_post = cov_post @ ((1 / sigma2) * X_poly.T @ y + Lambda0_inv @ w0)

    else:
        mean_post = w0
        cov_post = Lambda0

    # Posterior predictive mean and variance
    y_pred_mean = X_test_poly @ mean_post
    y_pred_var = np.sum(X_test_poly @ cov_post * X_test_poly, axis=1) + noise_std ** 2
    y_pred_std = np.sqrt(y_pred_var)

    # Plot with uncertainty bands
    plt.figure(figsize=(10, 5))
    plt.plot(X_test, y_true, 'k--', label='True Function (sin)')
    if n > 0:
        plt.scatter(X, y, color='black', s=30, label='Noisy Samples')
    plt.plot(X_test, y_pred_mean, 'b-', label='Posterior Mean')
    plt.fill_between(
        X_test.ravel(),
        y_pred_mean - 2 * y_pred_std,
        y_pred_mean + 2 * y_pred_std,
        color='blue',
        alpha=0.2,
        label='±2 std dev'
    )
    plt.title(f'Bayesian Polynomial Regression (degree {degree}, n = {n})')
    plt.xlabel('$x$')
    plt.ylabel('$y$')
    plt.ylim(-2.5, 2.5)
    plt.xlim(0, 1)
    plt.grid(True)
    plt.legend()
    plt.tight_layout()
    plt.show()

# Visualize the posterior predictive for increasing sample sizes
n_values = [0, 1, 3, 5, 10, 15, 50]

for n in n_values:
    sine_polynomial_bayes_posterior_viz(n=n, degree=9, alpha=1.0)
[Figures: posterior mean and ±2 standard deviation predictive bands for degree-9 Bayesian polynomial regression at n = 0, 1, 3, 5, 10, 15, and 50]