Given the explosion of interest in deep learning, physics refugees will ponder the extent to which results from their home field can be applied. Enigmatically, Christopher Olah prefaces his excellent essay on low-dimensional embeddings of MNIST images (here) by saying “At some fundamental level, no one understands machine learning”. From another standpoint, an elaboration of that message might be: “In some real sense, no one understands non-equilibrium thermodynamics”. Nevertheless, startlingly high accuracy rates on established benchmarks show the effectiveness of machine learning across a range of applications, from simple classification to image recognition and language processing.

There is a body of work, going back to the 1980s, applying equilibrium statistical mechanics to models of cognitive processes in the brain. The work of that decade builds on some prior developments – Hodgkin and Huxley’s action potential model of the squid giant axon, the McCulloch–Pitts neuron, Rosenblatt’s Perceptron, and the Hopfield network, a recurrent neural network performing energy descent to model associative or “content addressable” memory. An extension of these concepts from a thermodynamic viewpoint is the so-called Boltzmann machine, an augmented Hopfield model with Boltzmann/Gibbs sampling whose dynamical behavior is akin to simulated annealing.

Ising Model

We consider the Ising model of ferromagnetism as the starting point for an investigation of the stochastic dynamics of deep learning. The setting for this model is a lattice $\Lambda$ where each point (indexed by $i$) on the lattice is associated with a random variable describing the intrinsic magnetic moment of a particle with spin $\sigma_{i} \in \{+1,-1\}$. The interpretation of spin here is the traditional one, i.e. an internal degree of freedom obeying angular momentum commutation rules. Counterintuitively, the integral values do not imply bosons: we are really just talking about spin one-half particle states $\lvert \uparrow \rangle$, $\lvert \downarrow \rangle$ whose z-component spin projections are mapped to $\{+1,-1\}$ (up to factors of $\hbar$). In the Hopfield model, these discrete states will be mapped to “firing” or “quiescent” neuronal states in a lattice or “network” of connected neurons.

Returning to the Ising model, an exchange interaction or “coupling” term $J_{i,j}$ is associated with each pair of adjacent lattice sites $(i,j)$. In the presence of an external non-uniform magnetic field $\overset{\rightharpoonup} B$ with local magnitude $B_{i}$, the Ising Hamiltonian $H(\gamma)$ for an overall spin configuration $\gamma$ on the lattice is expressed as:

$H(\gamma) = - \sum_{\langle i,j \rangle} J_{i,j} \sigma_{i} \sigma_{j} - \mu \sum_{j} B_{j} \sigma_{j}$

where $\mu$ is the magnetic moment.

Assuming particles are at thermal equilibrium with a heat bath (the lattice itself) at temperature $T$, the probability of a configuration $\gamma$ is the Boltzmann factor $e^{-H(\gamma)/{k_{B}T}}$ divided by the partition function $Z_{T} = \sum_{\gamma^{\prime}} e^{-H(\gamma^{\prime})/k_{B}T}$, where the sum runs over all (lattice-wide) spin configurations (i.e., the Boltzmann/Gibbs measure). Mean field approximations of the 2-D Ising model describe a partition function where the system spontaneously breaks symmetry along a temperature curve, characterized by a phase transition from disordered paramagnetic states at high temperature to (maximally probable) macroscopic states with high ferromagnetic ordering at low temperature (indicated by a nonzero net magnetization $\mu \sum_{i} \sigma_{i}$).
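To make the Boltzmann measure concrete, here is a minimal sketch of single-spin-flip Metropolis sampling on a small 2-D lattice. The function names, the uniform coupling $J$, and the periodic boundaries are illustrative simplifications, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def ising_energy(spins, J=1.0, mu_B=0.0):
    """H(gamma) = -J * sum over nearest-neighbour pairs - mu*B * sum of spins,
    specialized to uniform coupling and uniform field, periodic boundaries."""
    right = np.roll(spins, 1, axis=0)
    down = np.roll(spins, 1, axis=1)
    return -J * np.sum(spins * (right + down)) - mu_B * np.sum(spins)

def metropolis_sweep(spins, T, J=1.0, k_B=1.0):
    """One sweep of single-spin-flip Metropolis sampling of exp(-H/k_B T)/Z."""
    n, m = spins.shape
    for _ in range(n * m):
        i, j = rng.integers(n), rng.integers(m)
        # energy change from flipping spin (i, j) depends only on its four neighbours
        nb = (spins[(i + 1) % n, j] + spins[(i - 1) % n, j]
              + spins[i, (j + 1) % m] + spins[i, (j - 1) % m])
        dE = 2.0 * J * spins[i, j] * nb
        if dE <= 0 or rng.random() < np.exp(-dE / (k_B * T)):
            spins[i, j] *= -1
    return spins

# below the critical temperature, ferromagnetic ordering emerges from random spins
spins = rng.choice(np.array([-1, 1]), size=(16, 16))
for _ in range(200):
    metropolis_sweep(spins, T=1.0)
```

Accepting every energy-lowering flip, and energy-raising flips with probability $e^{-\Delta E / k_{B}T}$, drives the chain toward the Boltzmann/Gibbs measure described above.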

Nostalgic But Mostly Irrelevant Segue on the Onsager Solution:

An exact solution of the 2-D Ising problem on square lattices was provided by Onsager, widely considered a landmark result in statistical mechanics since, among other things, it captures the ability to model phase transitions in the theory. The solution predicts logarithmic divergences in the specific heat of the system at the critical “Curie” temperature, along with an expression for its long-range order. The classic solution leverages a transfer matrix method for evaluating the 2-D partition function – details are beyond the scope of this article; the reader is referred to Bhattacharjee and Khare’s retrospective on the subject. Yet another class of solution to the 2-D Ising problem starts by considering a quantum partition function for the 1-D quantum Ising system as an “imaginary time” path integral, and then noting the equivalence of the resulting expression to a partition function for the classical 2-D system evaluated as a transfer matrix. Kadanoff, for example, motivates the blueprint for such a program by noting that “every quantum mechanical trace [sum over histories] can be converted into a one-dimensional statistical mechanics problem and vice versa” (slide 6 of the lecture notes located here.)

Hopfield Network

Returning to the matter at hand, we saw that the Ising model associated a random variable at each lattice point with an intrinsic spin, that is, an angular momentum degree of freedom $\overset{\rightharpoonup} S$  obeying the commutation relation $[S_{i},S_{j}]= i \hbar \varepsilon_{ijk} S_{k}$ with z-axis component taking on values $\{+ \hbar/2, -\hbar/2\}$ corresponding to spin “up” and “down” states.  In the Hopfield model, we leave this physical setting and lattice variables no longer signify physical observables – instead, we  interpret spin up/down states as logical states of an idealized neuron (at each lattice point) corresponding to that neuron either firing an action potential (up) or remaining quiescent (down) within some prescribed time interval. The deterministic Hopfield network is a “zero temperature” limit of a more general class of network called the Boltzmann machine which incorporates stochastic update of neurons (also called “units”).

To elaborate further, we consider synchronous neurons where activity of the network is examined at discrete time intervals $t_{i}$. Examining activity at a given interval $t_{i}$, we expect units to be firing at random. A “cognitively meaningful” event such as associatively retrieving content based on a partial input pattern is said to occur when units in the network maintain their firing state over several update periods. This repeated firing state of the network (or “pattern”) is viewed as a binary representation of retrieved content. In a similar vein to the Ising system, we define a fictitious Hopfield Hamiltonian (or “Lyapunov Energy”) for the system:

$H(\gamma) = - \frac{1}{2} \sum_{i,j} J_{i,j} \sigma_{i} \sigma_{j} - \sum_{j} \theta_{j} \sigma_{j}$

where $\theta_{j}$ is a threshold value for a unit and $\sigma_{j}$ represents its state. In the zero temperature limit, the activation function is a step function with threshold $\theta_{j}$. This formalism ensures the network decreases its energy under random updates, eventually converging to stable local minima of the energy. The binary representation of the global network state $\gamma^{\prime}$ at the extremum is taken as output.

Dynamic initialization of the network is achieved by setting input units to a desired start pattern – this may be some partial representation of content that the network is designed to reconstruct or “retrieve” as part of its energy minimizing update process. We consider the dynamical landscape of the network as an “energy” landscape, where Lyapunov Energy is plotted as a function of the global state, and retrieval states of the system correspond to minima located within basins of attraction in this landscape. Hopfield showed that attractors in the system are stable rather than chaotic, so convergence is usually assured.
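The storage-and-retrieval cycle above can be sketched in a few lines. This is a minimal illustration using the standard Hebbian storage rule for the couplings; the helper names, pattern size, and corruption level are my own choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def hebbian_weights(patterns):
    """Store patterns (rows of +/-1 values) via the Hebb rule:
    J_ij = (1/N) * sum over stored patterns of x_i * x_j, zero diagonal."""
    n = patterns.shape[1]
    J = patterns.T @ patterns / n
    np.fill_diagonal(J, 0.0)  # no self-coupling
    return J

def recall(J, state, theta=0.0, sweeps=10):
    """Asynchronous zero-temperature updates:
    sigma_i <- sign(sum_j J_ij sigma_j - theta).
    Each accepted flip lowers the Lyapunov energy, so the network settles
    into a minimum within a basin of attraction."""
    state = state.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(state)):
            state[i] = 1 if (J[i] @ state - theta) >= 0 else -1
    return state

# store one 64-unit pattern, then retrieve it from a corrupted copy
pattern = rng.choice(np.array([-1, 1]), size=(1, 64))
J = hebbian_weights(pattern)
noisy = pattern[0].copy()
noisy[rng.choice(64, size=8, replace=False)] *= -1  # flip 8 of 64 bits
recovered = recall(J, noisy)
```

Starting the network at `noisy` places it inside the stored pattern’s basin of attraction, so the descent dynamics reconstruct the full pattern – the “content addressable” retrieval described above.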

Boltzmann Machine

The Boltzmann machine can be described as a Hopfield network (with hidden layers) that incorporates stochastic update of units. The energy of the network is given by the Lyapunov function above, but the stochastic update process introduces the notion of an artificial nonzero temperature $T$ for the system. At each update interval, a unit $i$ computes its total input $\lambda_{i}$ by adding a local bias $b_{i}$ to a sum of weighted connections from all other active units

(i.e. $\lambda_{i} = b_{i} + \sum_{j} J_{ij} \sigma_{j}$).

The probability that unit $i$ is activated is given by the logistic function $Pr(\sigma_{i} = 1) = \frac {1}{1 + e^{- \lambda_{i}}}$. As units are updated, the network eventually reaches equilibrium, with the occupation probability of a global network state $\gamma^{\prime}$ with energy $H(\gamma^{\prime})$ given by the Boltzmann distribution:

$Pr( \gamma^{\prime} ) = \frac{ e^{ (-H(\gamma^{ \prime})/k_{B}T)}}{ \sum_{\gamma} e^{ (-H(\gamma)/k_{B}T)} }$
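The stochastic update rule can be written out directly. A minimal sketch follows, assuming the common $\{0, 1\}$ convention for unit states (so inactive units contribute nothing to $\lambda_{i}$); the function name and temperature parameter are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_sweep(J, b, state, T=1.0):
    """One stochastic sweep: unit i turns on with probability
    Pr(sigma_i = 1) = 1 / (1 + exp(-lambda_i / T)),
    where lambda_i = b_i + sum_j J_ij sigma_j is its total input."""
    for i in rng.permutation(len(state)):
        lam = b[i] + J[i] @ state  # bias plus weighted input from active units
        p_on = 1.0 / (1.0 + np.exp(-np.clip(lam / T, -500, 500)))
        state[i] = 1 if rng.random() < p_on else 0
    return state
```

Repeated sweeps of this rule are what carry the network toward the Boltzmann distribution over global states given above.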

Coupling terms in the Lyapunov function are chosen so that energies of global network states (or “vectors”) represent the “cost” of these states – as such, the machine’s stochastic search dynamics will evade inconsistent local extrema while searching for a low energy vector corresponding to the machine’s output. As alluded to above, this construction destabilizes poor local minima, since the search is able to “jump” over the energy barriers confining them.

The Boltzmann model introduces simulated annealing as a search optimization strategy. Here, energies are scaled by an artificial temperature $T$ (multiplied by the Boltzmann constant $k_{B}$). Analogous to thermodynamic systems, the network reduces temperature from a large initial value toward an equilibrium distribution that makes low energy solutions highly probable. Hopfield networks, discussed earlier, can be viewed as “zero temperature” variants of the Boltzmann machine.
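A sketch of this cooling schedule, assuming $\{0, 1\}$ unit states and a geometric decay of temperature (the schedule constants and function names are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def anneal(J, b, state, T_start=10.0, T_end=0.1, sweeps=100):
    """Simulated annealing: stochastic sweeps under a geometrically decreasing
    temperature, so the network first explores freely (jumping energy barriers)
    and then settles into a low-energy state as T falls."""
    cool = (T_end / T_start) ** (1.0 / (sweeps - 1))
    T = T_start
    for _ in range(sweeps):
        for i in rng.permutation(len(state)):
            lam = b[i] + J[i] @ state
            p_on = 1.0 / (1.0 + np.exp(-np.clip(lam / T, -500, 500)))
            state[i] = 1 if rng.random() < p_on else 0
        T *= cool
    return state

def lyapunov_energy(J, b, state):
    """H = -1/2 sigma^T J sigma - b . sigma, biases playing the role of thresholds."""
    return -0.5 * state @ J @ state - b @ state
```

At high $T$ the acceptance probabilities are near one half regardless of $\lambda_{i}$, which is what lets the search escape poor minima; as $T \to 0$ the rule approaches the deterministic Hopfield threshold update.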

Conclusion

In conclusion, recent years have witnessed greater uptake of statistical mechanics and related tools within interdisciplinary efforts to model brain function. This work, spanning several fields such as applied physics, neuroscience and electrical engineering, grapples with a range of interesting problems, from models of category learning in infants to the machine learning applicability of implied entropy reversals over micro-timescales in non-equilibrium statistical mechanics. On the other hand, programming libraries that abstract most of the underlying construction are now widely available. This, along with the advent of hardware acceleration at relatively low cost, opens the field to application developers seeking to apply these methods to a range of automation tasks.