The Hessian

The Hessian#

In one variable, the second derivative of a function is a number that tells us about the curvature of the function. In several variables, each partial derivative can itself change along every coordinate direction, so we need a matrix of second derivatives:

The Hessian matrix of a scalar-valued function \( f : \mathbb{R}^d \to \mathbb{R} \) is a square matrix of second-order partial derivatives:

\[\begin{split}\nabla^2 f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \dots & \frac{\partial^2 f}{\partial x_d^2} \end{bmatrix}, \quad\text{i.e.,}\quad [\nabla^2 f]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} \end{split}\]

Theorem (Clairaut-Schwarz)

Let \(f: \mathbb{R}^d \to \mathbb{R}\) be a function such that both mixed partial derivatives \(\frac{\partial^2 f}{\partial x_i \partial x_j}\) and \(\frac{\partial^2 f}{\partial x_j \partial x_i}\) exist and are continuous on an open set containing a point \(\mathbf{x}_0\).

Then:

\[ \boxed{ \frac{\partial^2 f}{\partial x_i \partial x_j}(\mathbf{x}_0) = \frac{\partial^2 f}{\partial x_j \partial x_i}(\mathbf{x}_0) } \]

That is, the order of differentiation can be interchanged.

Clairaut’s theorem implies that the Hessian matrix is symmetric wherever the second partial derivatives are continuous. We provide a proof sketch in the appendix.
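The symmetry can be checked numerically. The sketch below uses a hypothetical test function, \(f(x, y) = \sin(xy) + x^3 y\), which is not from the text and is chosen only for illustration. Its first partial derivatives are written out by hand, each is then differentiated in the other variable with a central difference, and the two results are compared. The evaluation point and step size are arbitrary assumptions.

```python
import math

# Hypothetical test function (an assumption for illustration):
#   f(x, y) = sin(x*y) + x**3 * y
# Its first partial derivatives, written out by hand:
def df_dx(x, y):
    return y * math.cos(x * y) + 3 * x**2 * y

def df_dy(x, y):
    return x * math.cos(x * y) + x**3

def central_diff(g, t, h=1e-5):
    """Central-difference estimate of g'(t)."""
    return (g(t + h) - g(t - h)) / (2 * h)

x0, y0 = 0.7, -1.3  # arbitrary evaluation point

# Differentiate df/dx with respect to y, and df/dy with respect to x.
d_dy_of_fx = central_diff(lambda y: df_dx(x0, y), y0)
d_dx_of_fy = central_diff(lambda x: df_dy(x, y0), x0)

print(d_dy_of_fx, d_dx_of_fy)  # the two estimates agree up to numerical error
```

Both orders of differentiation give the same off-diagonal entry, which is why only one number is needed for each pair \((i, j)\).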

Curvature in One Dimension#

Recall the second derivative in one dimension:

\(f(x) = x^2\): curve is “smiling” ⇒ second derivative is positive ⇒ function is curving upward.
\(f(x) = -x^2\): curve is “frowning” ⇒ second derivative is negative ⇒ function is curving downward.
The point: the second derivative tells us how the function curves.
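As a quick numerical check of the two examples above, the second derivative can be estimated with a central difference. This is only a sketch: the helper name `second_derivative`, the step size, and the evaluation point are assumptions made for illustration.

```python
def second_derivative(f, x, h=1e-3):
    """Central-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

smile = second_derivative(lambda x: x**2, 1.0)   # "smiling" curve: close to +2
frown = second_derivative(lambda x: -x**2, 1.0)  # "frowning" curve: close to -2
print(smile, frown)
```

The sign of the estimate matches the picture: positive for the upward-curving parabola, negative for the downward-curving one.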

../_images/280ad0c54080d3566fe03cd951ef9e3b951eb1ef49d7ca01435ef7f39920d428.png

The Hessian generalizes this intuition to multiple dimensions.

Curvature in Two Dimensions#

Now, let’s look at two simple 2D surfaces:

\(f(x, y) = x^2 + y^2\): bowl shape
\(f(x, y) = x^2 - y^2\): saddle shape

../_images/e0cf843493f13d1a7132675e668b3016c8c3bf1951a4dc7cb594c2ac341c0e08.png

At each point, the function curves more or less in certain directions. The Hessian is a matrix that captures all this curvature information—it tells us how the slope (the gradient) changes in every direction.

A Simple Example#

\[ f(x, y) = 3x^2 + 2xy + y^2 \]

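\(\frac{\partial f}{\partial x} = 6x + 2y\)
\(\frac{\partial f}{\partial y} = 2x + 2y\)
\(\frac{\partial^2 f}{\partial x^2} = 6\), \(\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} = 2\), \(\frac{\partial^2 f}{\partial y^2} = 2\)
Hessian: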

\[\begin{split} \nabla^2 f = \begin{bmatrix} 6 & 2 \\ 2 & 2 \end{bmatrix} \end{split}\]

Each entry corresponds to a second derivative—either in the x-direction, y-direction, or mixed for the off-diagonals.
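The same matrix can be recovered numerically. The following is a minimal sketch, assuming NumPy is available; `numerical_hessian` is a hypothetical helper written for this illustration (not a library function), and the evaluation point and step size are arbitrary.

```python
import numpy as np

def f(p):
    x, y = p
    return 3 * x**2 + 2 * x * y + y**2

def numerical_hessian(f, p, h=1e-3):
    """Estimate the Hessian of f at p with central differences."""
    p = np.asarray(p, dtype=float)
    d = p.size
    E = h * np.eye(d)  # row i is a step of size h along coordinate i
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(p + E[i] + E[j]) - f(p + E[i] - E[j])
                       - f(p - E[i] + E[j]) + f(p - E[i] - E[j])) / (4 * h**2)
    return H

H = numerical_hessian(f, [1.0, -2.0])
print(np.round(H, 4))  # close to [[6, 2], [2, 2]]
```

Because \(f\) is quadratic, the estimate does not depend on the evaluation point: the Hessian is the same constant matrix everywhere.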

Gradient Vector Fields#

The Hessian matrix describes how the gradient vector changes as you move through space. Let’s visualize this in a grid with arrows pointing in the direction of the gradient — i.e., where the function increases most steeply.

../_images/1b4b214a58ab48eab72c06aadc91143467dd8aba880c452e3cec108d21a5cb0b.png

The gradient vector field shows how gradients vary over space.
The Hessian is the rate of change of the gradient: it tells you how quickly the slope changes as you move in each direction.
The direction and length of arrows = the gradient vector at each point.
The rate of change of those arrows = what the Hessian captures.
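One way to make "rate of change of the arrows" concrete is to compare the change in the gradient between two nearby points with the Hessian applied to the displacement. The sketch below uses \(f(x, y) = 3x^2 + 2xy + y^2\) with its gradient and Hessian written out by hand; the base point and displacement are arbitrary assumptions.

```python
import numpy as np

H = np.array([[6.0, 2.0],
              [2.0, 2.0]])  # Hessian of f(x, y) = 3x^2 + 2xy + y^2

def grad(p):
    x, y = p
    return np.array([6 * x + 2 * y, 2 * x + 2 * y])

p = np.array([1.0, -1.0])     # arbitrary base point
step = np.array([0.1, 0.05])  # arbitrary small displacement

actual_change = grad(p + step) - grad(p)  # how the arrow actually changed
predicted_change = H @ step               # what the Hessian predicts

print(actual_change, predicted_change)
```

For a quadratic function the two agree exactly (up to rounding); for a general smooth function the Hessian prediction is a first-order approximation that improves as the displacement shrinks.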

🔍 How This Works in the Two Examples#

🟢 Bowl: \(f(x, y) = x^2 + y^2\)#

Gradient: \(\nabla f(x, y) = [2x,\ 2y]\)
Hessian:

\[\begin{split} \nabla^2 f = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \end{split}\]

This means:

In the x-direction, the x-component of the gradient increases by 2 units per unit of x.
In the y-direction, the y-component of the gradient increases by 2 units per unit of y.
The gradient field shows arrows pointing radially outward—getting longer linearly with distance from the origin.
This linear increase in slope is exactly what the constant entries (2) in the Hessian mean.

🔵 Saddle: \(f(x, y) = x^2 - y^2\)#

Gradient: \(\nabla f(x, y) = [2x,\ -2y]\)
Hessian:

\[\begin{split} \nabla^2 f = \begin{bmatrix} 2 & 0 \\ 0 & -2 \end{bmatrix} \end{split}\]

This means:

In the x-direction, the x-component of the gradient increases at the same rate as before: 2 per unit of x.
In the y-direction, the y-component of the gradient decreases (negative rate): -2 per unit of y.
The gradient field shows outward arrows in the x-direction, but inward arrows in the y-direction.
That flip in sign in the Hessian entry \(\partial^2 f/\partial y^2 = -2\) explains why the y-component of the gradient points back toward \(y = 0\).
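The outward and inward arrows can be read off by evaluating both gradients at a few sample points. This is an illustrative sketch; the sample points are arbitrary assumptions.

```python
import numpy as np

def grad_bowl(p):
    x, y = p
    return np.array([2 * x, 2 * y])

def grad_saddle(p):
    x, y = p
    return np.array([2 * x, -2 * y])

# Arbitrary sample points, none of them on a coordinate axis.
points = [np.array([1.0, 0.5]), np.array([-1.0, 2.0]), np.array([0.5, -1.5])]

for p in points:
    gb, gs = grad_bowl(p), grad_saddle(p)
    # Multiply each gradient component by the matching coordinate:
    # +1 means that component points away from the origin along its axis,
    # -1 means it points back toward the origin.
    print(p, "bowl:", np.sign(gb * p), "saddle:", np.sign(gs * p))
```

At every sample point the bowl gives `+1` in both coordinates, while the saddle gives `+1` in x and `-1` in y.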

🧩 Optional Extension: The Hessian as Jacobian of the Gradient#

We can think of the Hessian as the Jacobian of the gradient — it’s the matrix of all partial derivatives of the components of the gradient vector field.

That is:

\[\begin{split} \nabla f(x, y) = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} \quad\Rightarrow\quad \nabla^2 f(x, y) = \text{Jacobian}\left( \nabla f(x, y) \right) \end{split}\]
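To illustrate this view, the sketch below differentiates a gradient field numerically, one input coordinate at a time, and stacks the results into a matrix. `numerical_jacobian` is a hypothetical helper written for this example, and the evaluation point is arbitrary; it is applied to the saddle gradient \([2x,\ -2y]\).

```python
import numpy as np

def grad_saddle(p):
    x, y = p
    return np.array([2 * x, -2 * y])

def numerical_jacobian(g, p, h=1e-6):
    """Central-difference Jacobian of a vector-valued function g at p."""
    p = np.asarray(p, dtype=float)
    d = p.size
    E = h * np.eye(d)
    # Column j holds the derivative of every component of g with respect to p[j].
    cols = [(g(p + E[j]) - g(p - E[j])) / (2 * h) for j in range(d)]
    return np.column_stack(cols)

J = numerical_jacobian(grad_saddle, [0.3, -0.8])
print(np.round(J, 6))  # close to [[2, 0], [0, -2]]
```

The result matches the Hessian of the saddle, as expected when the Hessian is read as the Jacobian of the gradient.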

Gradient Descent and the Hessian: Why Off-Diagonal Terms Matter#

🧠 Key Idea#

Gradient descent minimizes functions by moving in the direction opposite the gradient.

For quadratic functions with a symmetric matrix \(A\):

\[ f(x) = \frac{1}{2} x^\top A x \quad \text{with gradient} \quad \nabla f(x) = A x \]

Here, \(A\) is the Hessian matrix, and it determines the shape of level sets and how gradient descent behaves.

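If \(A\) is diagonal with positive entries → level sets are axis-aligned ellipses (circles when the entries are equal).
If \(A\) has off-diagonal elements → the ellipses are rotated, the coordinates are coupled, and gradient descent tends to zig-zag when the ellipses are elongated.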
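The update rule behind these statements is short enough to write out. The following is a minimal sketch of gradient descent on \(f(x) = \frac{1}{2} x^\top A x\), assuming NumPy; the function name `gradient_descent`, the starting point, the step size, and the number of steps are all assumptions made for illustration, and this is not the plotting helper used for the figures.

```python
import numpy as np

def gradient_descent(A, x0, step, n_steps):
    """Run gradient descent on f(x) = 0.5 * x^T A x and return all iterates."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        x = x - step * (A @ x)  # the gradient of f at x is A x
        path.append(x.copy())
    return np.array(path)

A = np.array([[10.0, 6.0],
              [6.0, 8.0]])  # a symmetric matrix with off-diagonal terms
path = gradient_descent(A, x0=[1.0, 1.0], step=0.1, n_steps=50)
print(path[-1])  # close to the minimizer at the origin
```

Each row of `path` is one iterate, so plotting the rows on top of the level sets shows how the shape of \(A\) bends the trajectory.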

Case 1: Spherical Hessian (Identity Matrix)#

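import numpy as np
A_sphere = np.array([[1, 0], [0, 1]])  # identity Hessian: same curvature in every direction
# plot_descent is a plotting helper defined elsewhere in the notebook (not shown here)
plot_descent(A_sphere, "Spherical Hessian: $A = I$")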

../_images/b81f84419d1bba1ea196cff3cfcc29d30f2be3c1b28048aecd7e6b5d4230fa8f.png

Level sets are circles.
Gradient descent takes straight, efficient steps toward the minimum.

Case 2: Anisotropic Hessian (Different Curvatures)#

../_images/aa9a4e27b95901b8c0c548fa66b5fc873f027c945b90b5e9fc2693bae141d497.png

Level sets are stretched ellipses.
Gradient descent zig-zags across the steep direction while making slow progress along the shallow one.

Case 3: Skewed Hessian (Off-Diagonal Elements)#

../_images/0944ee110ce02ef6287964ccdf0578b1cddf124755fea114d7d6ebbd25b3b5e2.png

\(A = \begin{bmatrix} 10 & 6 \\ 6 & 8 \end{bmatrix}\)

Level sets are rotated ellipses.
Gradient descent strongly zig-zags and converges slowly.
The skew comes directly from the off-diagonal elements in the Hessian.
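To put a rough number on "converges slowly", the sketch below counts how many gradient descent steps it takes to get close to the minimum for the identity matrix and for the skewed matrix above. The step-size rule (2 divided by the sum of the smallest and largest eigenvalues of \(A\)), the starting point, and the tolerance are assumptions made for this comparison, not choices taken from the figures.

```python
import numpy as np

def steps_to_converge(A, x0, tol=1e-6, max_steps=10_000):
    """Count gradient descent steps on f(x) = 0.5 * x^T A x until ||x|| < tol."""
    eigvals = np.linalg.eigvalsh(A)          # eigenvalues of symmetric A, ascending
    step = 2.0 / (eigvals[0] + eigvals[-1])  # assumed step-size rule
    x = np.asarray(x0, dtype=float)
    for k in range(max_steps):
        if np.linalg.norm(x) < tol:
            return k
        x = x - step * (A @ x)
    return max_steps

A_sphere = np.array([[1.0, 0.0], [0.0, 1.0]])
A_skewed = np.array([[10.0, 6.0], [6.0, 8.0]])
x0 = [1.0, 1.0]

steps_sphere = steps_to_converge(A_sphere, x0)
steps_skewed = steps_to_converge(A_skewed, x0)
print(steps_sphere, steps_skewed)
```

With this setup the spherical case reaches the minimum in a single step, while the skewed case needs dozens of steps.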

Off-diagonal terms in the Hessian rotate the level curves. Each gradient descent step is perpendicular to the level curve through the current point, so when the level curves are elongated and skewed the steps do not point straight at the minimum and the iterates zig-zag. This is one of the motivations for using second-order methods that take the Hessian into account.
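As one example of a second-order method, the sketch below takes a single Newton step on \(f(x) = \frac{1}{2} x^\top A x\) using the skewed matrix from Case 3. Newton's method is named here as an illustration; the text does not say which second-order method it has in mind, and the starting point is an arbitrary assumption.

```python
import numpy as np

A = np.array([[10.0, 6.0],
              [6.0, 8.0]])  # Hessian of f(x) = 0.5 * x^T A x
x = np.array([1.0, 1.0])    # arbitrary starting point

grad = A @ x
# Newton step: rescale the gradient by the inverse Hessian.
# Solving A d = grad avoids forming the inverse explicitly.
x_new = x - np.linalg.solve(A, grad)
print(x_new)  # approximately [0, 0], the minimizer
```

Because the Hessian accounts for both the stretching and the rotation of the level sets, a single step lands on the minimum of this quadratic, with no zig-zag.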