“for foundational discoveries and inventions that enable machine learning with artificial neural networks” – Swedish Academy
The 2024 Nobel Prize in Physics recognizes two researchers for their impact on machine learning. As stated by the Swedish Academy, the award honors John Hopfield and Geoffrey Hinton for applying "tools from physics to develop methods that are the foundation of today's powerful machine learning". This post aims to provide background on the award of the 2024 prize to Hopfield and Hinton for their pioneering work, and to draw the connection between the dynamical nature of neural networks and ensembles of energetic particles in statistical physics. There is a body of work, going back to the 1980s, on applying equilibrium statistical mechanics to modeling cognitive processes in the brain. This work from the '80s was prefaced by earlier developments – Hodgkin and Huxley's model of the action potential in the squid giant axon, the McCulloch–Pitts neuron, and Rosenblatt's Perceptron – and is exemplified by the Hopfield network, a recurrent neural network whose energy-descending dynamics model associative or "content addressable" memory. An elaboration of these concepts from a thermodynamic viewpoint is the so-called Boltzmann machine, an augmented Hopfield model with Boltzmann sampling that exhibits dynamical behavior akin to simulated annealing.
Framework for Neural Attention: Energetic Interpretation of the Alignment Model (Bahdanau, Cho, Bengio)
Even as modern transformer architectures seem removed from the models described later and cited by the Academy in its press release, it's pertinent to note the first descriptions of neural attention by Bahdanau, Cho and Bengio in "Neural Machine Translation by Jointly Learning to Align and Translate" (BCB2015). There, the mechanism is described in terms of "the probability $\alpha_{ij}$, or its associated energy $e_{ij}$", which is derived from a sequence of hidden states. The authors proceed to interpret the energy as reflecting "the importance of the annotation $h_j$ with respect to the previous hidden state $s_{i-1}$ in deciding the next state $s_i$ … [which] intuitively implements a mechanism of attention in the decoder". BCB2015, as a sequence transduction model, uses a "biRNN" which acts on the input sequence in the forward and reverse directions to generate the annotations. In introducing the attention mechanism, the energy (or Hamiltonian) is computed by an "alignment model" from the hidden states of an RNN, whereas this RNN structure is dispensed with in Google's celebrated paper "Attention Is All You Need", which introduces stacked multi-head attention layers along with layer normalization, residual connections, and fully connected layers. BCB2015 determines the occupation probability of the network from classical Boltzmann statistics, implying an exponential distribution scaled by the energy and temperature of the network (i.e. the Boltzmann function – in ML parlance, "softmax").
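To make the alignment model concrete, here is a minimal NumPy sketch (not the authors' implementation) of BCB2015-style additive attention: alignment energies $e_j = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$ are normalized by a softmax into attention weights. The parameter names follow the paper's notation, while the dimensions and random values are purely illustrative.

```python
import numpy as np

def softmax(x):
    """Boltzmann-style normalization: exponentiate and normalize."""
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()

def additive_attention(s_prev, annotations, W_a, U_a, v_a):
    """
    Alignment energies e_j = v_a^T tanh(W_a s_{i-1} + U_a h_j), followed by a
    softmax to obtain attention weights alpha_j (BCB2015-style additive scoring).
    """
    # energy ("score") of each annotation h_j with respect to the previous state s_{i-1}
    energies = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in annotations])
    weights = softmax(energies)        # occupation probabilities over annotations
    context = weights @ annotations    # expected annotation (context vector)
    return weights, context

# toy usage with random, purely illustrative parameters and dimensions
rng = np.random.default_rng(0)
d_h, d_s, d_a, T_len = 6, 4, 5, 3               # annotation, state, alignment dims; sequence length
annotations = rng.normal(size=(T_len, d_h))     # h_1..h_T from the (bi)RNN encoder
s_prev = rng.normal(size=d_s)                   # previous decoder hidden state s_{i-1}
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=d_a)
alpha, c = additive_attention(s_prev, annotations, W_a, U_a, v_a)
print(alpha, alpha.sum())                       # attention weights sum to 1
```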
Ising Model
As we consider the work mentioned in the committee's award, we'll start with the Ising model of ferromagnetism. The setting for this model is a large lattice where each site (indexed by $i$) is associated with a random variable $\sigma_i \in \{+1, -1\}$ describing the intrinsic magnetic moment of a particle with spin. Here the interpretation of spin is the traditional one, i.e. an internal degree of freedom obeying angular momentum commutation rules, though, counterintuitively, the integer values $\pm 1$ do not imply bosons – we're really just talking about spin one-half particle states $\lvert\uparrow\rangle$ and $\lvert\downarrow\rangle$ whose z-component spin projections are mapped to $\pm 1$ (up to factors of $\hbar/2$). In the Hopfield model, these discrete states will be mapped to "firing" or "quiescent" neuronal states in a lattice or "network" of connected neurons.
Returning to the Ising model, an electrostatic interaction or "coupling" term $J_{ij}$ is associated with adjacent lattice sites $\langle i\,j \rangle$. In the presence of an external non-uniform magnetic field with local magnitude $h_i$, the Ising Hamiltonian $H(\sigma)$ for an overall spin configuration $\sigma = (\sigma_1, \ldots, \sigma_N)$ on the lattice is expressed as:

$$ H(\sigma) = -\sum_{\langle i\,j \rangle} J_{ij}\,\sigma_i \sigma_j \;-\; \mu \sum_i h_i\,\sigma_i $$

where $\mu$ is the magnetic moment.
Assuming particles are at thermal equilibrium with a heat bath (the lattice itself) at temperature $T$, the probability of a configuration $\sigma$ is the Boltzmann factor $e^{-H(\sigma)/k_B T}$ divided by the partition function

$$ Z = \sum_{\sigma} e^{-H(\sigma)/k_B T} $$

where the sum is over all (lattice-wide) spin configurations. Mean field approximations of the 2-D Ising model describe a partition function where the system spontaneously breaks symmetry along a temperature curve, characterized by a phase transition from disordered, high temperature (paramagnetic) states to (maximally probable) macroscopic states with high ferromagnetic ordering at low temperature, indicated by a nonzero spontaneous net magnetization $M = \sum_i \sigma_i$.
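As an illustration of these definitions, the following sketch computes the Ising energy and the exact Boltzmann distribution for a tiny square lattice by brute-force enumeration. For simplicity it assumes a uniform coupling $J$ and uniform field $h$ (rather than the site-dependent $J_{ij}$, $h_i$ above) and periodic boundaries; the exact sum over configurations is only feasible for very small lattices.

```python
import itertools
import numpy as np

def ising_energy(spins, J=1.0, h=0.0, mu=1.0):
    """
    H(sigma) = -J * sum_<ij> sigma_i sigma_j - mu * h * sum_i sigma_i
    on a small 2-D square lattice with nearest-neighbour coupling.
    """
    L = spins.shape[0]
    interaction = 0.0
    for i in range(L):
        for j in range(L):
            # count each nearest-neighbour bond once (right and down neighbours, periodic boundaries)
            interaction += spins[i, j] * (spins[i, (j + 1) % L] + spins[(i + 1) % L, j])
    return -J * interaction - mu * h * spins.sum()

def boltzmann_distribution(L=3, J=1.0, h=0.0, kT=1.0):
    """Exact enumeration of all 2^(L*L) configurations (tiny lattices only)."""
    configs = [np.array(c).reshape(L, L) for c in itertools.product([-1, 1], repeat=L * L)]
    energies = np.array([ising_energy(s, J, h) for s in configs])
    weights = np.exp(-energies / kT)    # Boltzmann factors
    Z = weights.sum()                   # partition function
    return configs, weights / Z

configs, probs = boltzmann_distribution(L=3, kT=1.0)
# low-temperature check: the two fully aligned configurations dominate the distribution
print(sorted(probs)[-2:])
```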
Hopfield Network (John Hopfield)
The Ising model associates a random variable at each lattice point with an intrinsic spin, that is, an angular momentum degree of freedom obeying the commutation relation $[\hat{S}_a, \hat{S}_b] = i\hbar\,\epsilon_{abc}\hat{S}_c$. In the Hopfield model, we leave the physical setting, and lattice variables no longer signify physical observables – instead, we interpret the spin up/down states as logical states of an idealized neuron (at each lattice point) corresponding to that neuron either firing an action potential (up) or remaining quiescent (down) within some prescribed time interval. The deterministic Hopfield network is a "zero temperature" limit of a more general class of network called the Boltzmann machine, which incorporates stochastic update of neurons (also called "units").
To elaborate further, we consider synchronous neurons where activity of the network is examined at discrete time intervals indexed by $t$. Examining activity at a given interval, we expect units to be firing at random. A "cognitively meaningful" event, such as associatively retrieving content based on a partial input pattern, is said to occur when units in the network maintain their firing state over several update periods. This repeated firing state of the network (or "pattern") is viewed as a binary representation of retrieved content. In a similar vein to the Ising system, we define a fictitious Hopfield Hamiltonian (or "Lyapunov energy function") for the system:

$$ E = -\frac{1}{2}\sum_{i \neq j} w_{ij}\, s_i s_j + \sum_i \theta_i\, s_i $$

where $w_{ij}$ is the coupling between units $i$ and $j$, $\theta_i$ is a threshold value for unit $i$, and $s_i$ represents its state. In the zero temperature limit, the activation function is a step function with threshold $\theta_i$: unit $i$ switches on if $\sum_j w_{ij} s_j \geq \theta_i$ and off otherwise. This formalism ensures the network decreases its energy under random updates, eventually converging to stable local minima in the energy. The binary representation of the global network state $(s_1, \ldots, s_N)$ at the extremum is taken as output.
Dynamic initialization of the network is achieved by setting input units to a desired start pattern – this may be some partial representation of content that the network is designed to reconstruct or "retrieve" as part of its energy minimizing update process. We consider the dynamical landscape of the network as an "energy" landscape, where the Lyapunov energy function is plotted as a function of the global state, and retrieval states of the system correspond to minima located within basins of attraction in this landscape. Hopfield showed that attractors in the system are stable rather than chaotic, so convergence is usually assured.
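A minimal sketch of this retrieval dynamic is given below. It assumes $\pm 1$ unit states, zero thresholds, and the standard Hebbian (outer product) storage rule – a common textbook variant rather than Hopfield's original 0/1 formulation. The network stores one pattern, is initialized with a corrupted copy, and recovers the stored pattern while the Lyapunov energy decreases.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_hebbian(patterns):
    """Hebbian (outer product) rule for the coupling matrix; no self-connections."""
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(W, 0.0)
    return W

def energy(W, s, theta=0.0):
    """Lyapunov energy E = -1/2 s^T W s + theta . s (non-increasing under updates)."""
    return -0.5 * s @ W @ s + np.sum(theta * s)

def retrieve(W, s, theta=0.0, sweeps=10):
    """Asynchronous updates: each unit takes the sign of its net input (zero-temperature limit)."""
    s = s.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(s)):
            s[i] = 1 if W[i] @ s >= theta else -1
    return s

# store one pattern, then recover it from a partially corrupted probe
pattern = rng.choice([-1, 1], size=64)
W = train_hebbian(pattern[None, :])
probe = pattern.copy()
probe[:16] *= -1                          # flip a quarter of the units (partial/corrupted input)
print("energy before:", energy(W, probe))
recalled = retrieve(W, probe)
print("energy after:", energy(W, recalled))
print("recovered:", np.array_equal(recalled, pattern))
```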
Boltzmann Machine (Geoffrey Hinton and others)
The Boltzmann machine can be described as a Hopfield network (with hidden layers) that incorporates stochastic update of units. Whereas the Hopfield network learns the exact form of the stored output, the Boltzmann machine learns distributions over desired outputs. Another way of stating this is that the network learns approximations of the output. These characteristics make the network generative, in the sense that it's able to create outputs that are unique – outputs that have never been seen or generated before. The energy for the network is as described by the Lyapunov function above, but the stochastic update process introduces the notion of an artificial non-zero temperature for the system. (In LLMs, this temperature has the effect of scaling the softmax function.) At each update interval, a unit $i$ computes its total input $z_i$ by adding a local bias $b_i$ to a sum of weighted connections from all other active units $j$ (i.e. $z_i = b_i + \sum_j w_{ij} s_j$). The probability that the unit is activated is given by the logistic function $p(s_i = 1) = \frac{1}{1 + e^{-z_i}}$. As units are updated, the equilibrium occupation probability for the network will eventually reach a Boltzmann distribution, with the probability of a global network state $v$ with energy $E(v)$ given by:

$$ P(v) = \frac{e^{-E(v)}}{\sum_{u} e^{-E(u)}} $$
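The update rule can be sketched in NumPy as follows, assuming binary 0/1 units, symmetric weights with zero diagonal, and an explicit temperature $T$ dividing the total input; the network and its parameters are toy values for illustration, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(2)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_update(W, b, s, T=1.0):
    """
    One sweep of stochastic updates: unit i turns on with probability
    p(s_i = 1) = logistic(z_i / T), where z_i = b_i + sum_j w_ij s_j.
    """
    s = s.copy()
    for i in rng.permutation(len(s)):
        z_i = b[i] + W[i] @ s            # total input: local bias plus weighted active units
        s[i] = 1 if rng.random() < logistic(z_i / T) else 0
    return s

# toy network: symmetric weights, zero diagonal, binary (0/1) units
n = 8
W = rng.normal(size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.normal(size=n)
s = rng.integers(0, 2, size=n)
for _ in range(100):                     # run to (approximate) equilibrium
    s = stochastic_update(W, b, s, T=1.0)
print(s)
```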
Coupling terms in the Lyapunov function are chosen so that the energies of global network states (or "vectors") represent the "cost" of those states – as such, the stochastic search dynamics of the machine will evade inconsistent local extrema while searching for a low energy vector corresponding to the machine's output. As alluded to above, this construction destabilizes poor local minima since the search is able to "jump" over the energy barriers confining these minima.
The Boltzmann model introduces simulated annealing as a search optimization strategy. Here, we scale parameters by an artificial temperature $T$ (multiplied by the Boltzmann constant $k_B$). Hopfield networks, discussed earlier, can be viewed as "zero temperature" variants of the Boltzmann machine.
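A generic simulated annealing sketch over the same kind of binary energy function is shown below. It uses Metropolis-style acceptance of single-unit flips with a geometrically decaying temperature – one common annealing scheme, offered here as an illustration rather than the specific update and schedule used in the Boltzmann machine literature.

```python
import numpy as np

rng = np.random.default_rng(3)

def energy(W, b, s):
    """E(s) = -1/2 s^T W s - b . s for binary units s_i in {0, 1}; the bias b_i plays the role of a negative threshold."""
    return -0.5 * s @ W @ s - b @ s

def simulated_annealing(W, b, s, T_start=5.0, T_end=0.05, steps=2000):
    """
    Propose single-unit flips and accept with the Metropolis rule at temperature T,
    lowering T geometrically so the search can jump barriers early and settle late.
    """
    s = s.copy()
    T = T_start
    decay = (T_end / T_start) ** (1.0 / steps)
    for _ in range(steps):
        i = rng.integers(len(s))
        candidate = s.copy(); candidate[i] = 1 - candidate[i]   # flip one unit
        dE = energy(W, b, candidate) - energy(W, b, s)
        if dE <= 0 or rng.random() < np.exp(-dE / T):           # Boltzmann acceptance factor
            s = candidate
        T *= decay
    return s

# toy energy landscape with symmetric weights and random biases
n = 10
W = rng.normal(size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.normal(size=n)
s0 = rng.integers(0, 2, size=n)
s_final = simulated_annealing(W, b, s0)
print(energy(W, b, s0), "->", energy(W, b, s_final))   # annealed state has lower energy
```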
In conclusion, it's interesting to note the recent uptake of statistical mechanics and related tools within interdisciplinary efforts to model brain function. This work, spanning fields such as applied physics, neuroscience, and electrical engineering, grapples with a range of interesting problems.
Segue on the Onsager Solution to the Ising Problem
An exact solution of the 2-D Ising problem on square lattices was provided by Onsager, widely considered a landmark result in statistical mechanics since, among other things, it demonstrates the theory's ability to model phase transitions. The solution predicts a logarithmic divergence in the specific heat of the system at the critical "Curie" temperature, along with an expression for its long range order. The classic solution leverages a transfer matrix method for evaluating the 2-D partition function – the details are beyond the scope of this article; the reader is referred to Bhattacharjee and Khare's retrospective on the subject. Yet another class of solution to the 2-D Ising problem starts by considering the quantum partition function of the 1-D quantum Ising system as an "imaginary time" path integral, and then noting the equivalence of the resulting expression to the partition function of the classical 2-D system evaluated via a transfer matrix. Leo Kadanoff, for example, motivates the blueprint for such a program by noting that every quantum mechanical trace [sum over histories] can be converted into a one-dimensional statistical mechanics problem and vice versa.
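To give a flavor of the transfer matrix method in the much simpler 1-D case (an illustration of the technique, not Onsager's 2-D calculation), the sketch below evaluates the partition function of a periodic 1-D Ising chain as $Z = \mathrm{Tr}(T^N)$ and cross-checks it against brute-force enumeration.

```python
import itertools
import numpy as np

def transfer_matrix_Z(N, J=1.0, h=0.0, kT=1.0):
    """
    Partition function of the 1-D Ising chain (periodic boundary conditions) via the
    2x2 transfer matrix: Z = Tr(T^N), with T_{s,s'} = exp((J s s' + h (s + s')/2) / kT).
    """
    beta = 1.0 / kT
    spins = [1, -1]
    T = np.array([[np.exp(beta * (J * s * sp + h * (s + sp) / 2.0)) for sp in spins]
                  for s in spins])
    eigvals = np.linalg.eigvalsh(T)     # T is symmetric, so real eigenvalues
    return np.sum(eigvals ** N)         # Tr(T^N) = lambda_1^N + lambda_2^N

def brute_force_Z(N, J=1.0, h=0.0, kT=1.0):
    """Direct sum over all 2^N spin configurations, for cross-checking small N."""
    beta, Z = 1.0 / kT, 0.0
    for config in itertools.product([1, -1], repeat=N):
        E = -J * sum(config[i] * config[(i + 1) % N] for i in range(N)) - h * sum(config)
        Z += np.exp(-beta * E)
    return Z

print(transfer_matrix_Z(8), brute_force_Z(8))   # the two methods agree
```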