Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

Young Ben
3 min read · Feb 9, 2023


Loss landscape and Hessian matrix

If you understand the loss landscape of a deep learning model, you can evaluate whether the algorithm is finding an optimal solution.

The shape of the loss landscape can be understood by examining the Hessian matrix. The Hessian is the matrix of second-order derivatives of the loss function with respect to the model parameters. A second derivative describes the curvature of a function: a positive second derivative means the function curves upward (is locally convex) at a particular input (i.e., parameter value), while a negative second derivative means it curves downward (is locally concave). The larger the magnitude of the second derivative, the sharper the curvature.
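To make this concrete, here is a minimal sketch (in Python, not from the paper) that estimates the second derivative of toy one-dimensional functions with central finite differences and reads the curvature off its sign:

```python
# A minimal sketch: the sign of the second derivative tells us whether a
# function curves upward (convex) or downward (concave) at a point, and
# its magnitude tells us how sharply. Central finite differences are used
# here purely for illustration.
import numpy as np

def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

convex = lambda x: x**2      # curves upward everywhere
concave = lambda x: -x**2    # curves downward everywhere
flat = lambda x: 2 * x + 1   # a line has no curvature

for name, f in [("x^2", convex), ("-x^2", concave), ("2x+1", flat)]:
    print(name, "f''(0) ~", round(second_derivative(f, 0.0), 4))
# x^2   f''(0) ~  2.0   (positive -> convex)
# -x^2  f''(0) ~ -2.0   (negative -> concave)
# 2x+1  f''(0) ~  0.0   (zero     -> flat)
```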

The eigenvalues of the Hessian matrix provide further information about the curvature of the function. If all eigenvalues are positive, the function is locally convex; if all are negative, it is locally concave; and if the eigenvalues are a mix of positive and negative values, the function has a saddle shape. (FYI: each eigenvector of the Hessian points along a principal direction of curvature, and the corresponding eigenvalue gives the degree of curvature in that direction, so the eigenvector with the largest-magnitude eigenvalue marks the most significantly curved direction.)

In simpler terms, the eigenvalues of the Hessian matrix allow us to determine the shape of a function at a specific point and thereby gain insight into the shape of the loss landscape. For example, if the eigenvalues are close to zero, the loss landscape is flatter.
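As an illustration, the sketch below (with toy matrices invented for this example) classifies a point's local shape from the eigenvalue signs of a symmetric Hessian:

```python
# A minimal sketch: read off local shape from the eigenvalues of a
# (symmetric) Hessian. The example matrices are made up for illustration.
import numpy as np

def classify(H, tol=1e-8):
    eigs = np.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix
    if np.all(eigs > tol):
        return "locally convex (minimum-like)"
    if np.all(eigs < -tol):
        return "locally concave (maximum-like)"
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle"
    return "flat in at least one direction"

print(classify(np.diag([2.0, 1.0])))    # all positive  -> convex
print(classify(np.diag([-2.0, -1.0])))  # all negative  -> concave
print(classify(np.diag([2.0, -1.0])))   # mixed signs   -> saddle
print(classify(np.diag([2.0, 1e-12])))  # near-zero eig -> flat direction
```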

Each eigenvalue of the loss function’s Hessian indicates how much the loss can change along the corresponding eigenvector direction. If the eigenvalue is large, the curvature is high and there is significant room for the loss to change in that direction. On the other hand, if the eigenvalue is close to zero, the loss surface is flat in that direction and there is limited potential for the loss to change.
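The sketch below illustrates this with a made-up 2×2 Hessian: a small step of size t along an eigenvector with eigenvalue λ changes a locally quadratic loss by roughly ½λt², so a large-λ direction moves the loss noticeably while a near-zero-λ direction barely does:

```python
# A minimal sketch: along an eigenvector v with eigenvalue lam, a small
# step of size t changes a locally quadratic loss by roughly 0.5*lam*t**2.
# The 2x2 Hessian below is invented for illustration.
import numpy as np

H = np.array([[4.0, 0.0],
              [0.0, 1e-6]])           # one sharp direction, one flat one
eigvals, eigvecs = np.linalg.eigh(H)

t = 0.1                               # step size
for lam, v in zip(eigvals, eigvecs.T):
    delta = 0.5 * (t * v) @ H @ (t * v)   # quadratic model of the loss change
    print(f"lam = {lam:.2e}:  dloss ~ {delta:.2e}  (0.5*lam*t^2 = {0.5*lam*t**2:.2e})")
# The large-eigenvalue direction changes the loss noticeably; the
# near-zero-eigenvalue direction barely changes it -- the surface is flat there.
```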

Main findings

Therefore, in this paper, the eigenvalue distribution of the Hessian matrix is analyzed from two angles (a toy bulk/edge split is sketched after the list):

  1. The bulk, which refers to values that are concentrated near zero in the distribution,
  2. The edge, which refers to values that are scattered away from zero in the distribution.
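The following sketch shows what such a bulk/edge split might look like in code: it computes the full Hessian spectrum of a tiny, randomly initialized MLP and counts how many eigenvalues fall near zero. The network size, random data, and cutoff `eps` are illustrative assumptions, not the paper’s setup.

```python
# A minimal sketch (not the paper's code): estimate the Hessian
# eigen-spectrum of a tiny MLP on random data, then split it into
# "bulk" (eigenvalues near zero) and "edge" (the rest).
import torch

torch.manual_seed(0)
X = torch.randn(64, 4)   # 64 random inputs, 4 features
y = torch.randn(64, 1)   # random regression targets

n_hidden = 8
n_params = 4 * n_hidden + n_hidden + n_hidden * 1 + 1  # weights + biases

def loss_fn(theta):
    """MSE loss of a 4-8-1 MLP, written as a function of one flat
    parameter vector so we can take a full Hessian."""
    i = 0
    W1 = theta[i:i + 4 * n_hidden].view(4, n_hidden); i += 4 * n_hidden
    b1 = theta[i:i + n_hidden];                       i += n_hidden
    W2 = theta[i:i + n_hidden].view(n_hidden, 1);     i += n_hidden
    b2 = theta[i:i + 1]
    pred = torch.tanh(X @ W1 + b1) @ W2 + b2
    return ((pred - y) ** 2).mean()

theta = torch.randn(n_params) * 0.1
H = torch.autograd.functional.hessian(loss_fn, theta)  # (n_params, n_params)
eigs = torch.linalg.eigvalsh(H)                        # real, sorted ascending

eps = 1e-3                        # assumed cutoff separating bulk from edge
bulk = eigs[eigs.abs() < eps]
edge = eigs[eigs.abs() >= eps]
print(f"{len(bulk)}/{len(eigs)} eigenvalues in the bulk (|eig| < {eps})")
print("largest edge eigenvalues:", edge[-3:])
```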

The paper conducted experiments with an MLP model on the MNIST dataset and on synthetic 2D Gaussian blob data, and the main results were as follows:

  • The larger the model size, the stronger the singularity, meaning the eigenvalue distribution is more concentrated near zero.
  • These results were consistent even when using different loss functions.
  • The more the model was trained (i.e., with more epochs), the stronger the singularity became.
  • As the complexity of the data increases (i.e., as variance increases), the eigenvalues also increase.
  • The bulk of the eigenvalue distribution depends on the size of the model, while the edge of the distribution depends on the data.

Sagun, L., Bottou, L., & LeCun, Y. (2016). Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476.
