Archives for category: Uncategorized

Given the explosion of interest in deep learning, physics refugees will ponder the extent to which results from that field can be applied. Enigmatically, Christopher Olah prefaces his excellent essay on low dimensional embeddings of MNIST images (here) by saying “At some fundamental level, no one understands machine learning”. From another standpoint, an elaboration of that message could be: “In some real sense, no one understands non-equilibrium thermodynamics”. Nevertheless, startlingly high accuracy rates for established benchmarks show the effectiveness of machine learning for a number of applications ranging from simple classification, to image recognition and language processing.

There is a body of work, going back to the 1980’s, around applying equilibrium statistical mechanics to modeling cognitive processes in the brain. This work from the 80’s is prefaced by some prior developments – Hodgkin and Huxley’s potentiation model of the squid giant axon, the McCulloch–Pitts Neuron, Rosenblatt’s Perceptron model, and the gradient descent based Hopfield network, a recurrent neural network for modeling associative or “content addressable” memory. An extension of these concepts from a thermodynamic viewpoint is the so called Boltzmann Machine, an augmented Hopfield model with Boltzmann/Gibbs sampling that exhibits dynamical behavior akin to Simulated Annealing.

Ising Model

We consider the Ising model of ferromagnetism as the starting point for an investigation on the stochastic dynamics of deep learning. The setting for this model is a  lattice $\Lambda$ where each point (indexed by $i$) on the lattice is associated with a random variable describing the intrinsic magnetic moment of a particle  with spin $\sigma_{i} \in \{+1,-1\}$. In this setting, interpretation of spin is the traditional one, i.e. an internal degree of freedom obeying angular momentum commutation rules, though counter intuitively, integral spin in this case does not imply Bosons, we’re really just talking about spin one half particle states $\lvert \uparrow \rangle$  $\lvert \downarrow \rangle$ such that their z-component spin projections are mapped to $\{+1,-1\}$ (upto factors of $\hbar$). In the Hopfield model, these discrete  states will be mapped to “firing” or “quiescent” neuronal states in a lattice or “network” of connected neurons.

Returning to the Ising model, an electrostatic interaction or “coupling” term $J_{i,j}$ is associated with adjacent lattice sites $(i,j)$. In the presence of an external non uniform magnetic field $\overset{\rightharpoonup} B$ with local magnitude $B_{i}$, the Ising Hamiltonian $H(\gamma)$ for an overall spin configuration $\gamma$ on the lattice is expressed  as:

$H(\gamma) = - \sum_{i,j} J_{i,j} \sigma_{i} \sigma_{j} - \mu \sum_{j} B_{j} \sigma_{j}$

where $\mu$ is the magnetic moment.

Assuming particles are at thermal equilibrium with a heat bath (the lattice itself) at temperature $T$, the probability for a configuration $\gamma$ at temperature $T$ is the Boltzmann factor $e^{-H(\gamma)/{k_{B}T}}$ divided by the partition function $Z_{T} = \sum_{\gamma^{\prime}} e^{-H(\gamma^{\prime})/k_{B}T}$ where the sum is over all (lattice wide) spin configurations (i.e., Boltzmann/Gibbs measure). Mean field approximations of the 2-D Ising model describe a partition function where the system spontaneously breaks symmetry along a temperature curve characterized by phase transition from high temperature “spin glass” states to (maximally probable) macroscopic states with high ferromagnetic ordering at low temperature (indicated by an “excessive” net magnetization $\mu \sum_{i} \sigma_{i}$).

Nostalgic But Mostly Irrelevant Segue on the Onsager Solution:

An exact solution of the 2-D Ising problem on square lattices was provided by Onsager, widely considered a landmark result in statistical mechanics since, among other things, it captures the ability to model phase transitions in the theory. The solution predicts logarithmic divergences in the specific heat of the system at the critical “Curie” temperature along with an expression for it’s long range order. The classic solution leverages a transfer matrix method for evaluating the 2-D partition function – details are beyond the scope of this article, the reader is referred to Bhattarcahrjee and Khare’s retrospective on the subject. Yet another class of solution to the 2-D Ising problem starts by considering a quantum partition function for the 1-D quantum Ising system as an “imaginary time” path integral, and then noting equivalence of the resulting expression to a partition function for the classical 2-D system evaluated as a transfer matrix. Kadanoff, for example, motivates the blueprint for such a program by noting that “every quantum mechanical trace [sum over histories] can be converted into a one-dimensional statistical mechanics problem and vice versa” (slide 6 of the lecture notes located here.)

Hopfield Network

Returning to the matter at hand, we saw that the Ising model associated a random variable at each lattice point with an intrinsic spin, that is, an angular momentum degree of freedom $\overset{\rightharpoonup} S$  obeying the commutation relation $[S_{i},S_{j}]= i \hbar \varepsilon_{ijk} S_{k}$ with z-axis component taking on values $\{+ \hbar/2, -\hbar/2\}$ corresponding to spin “up” and “down” states.  In the Hopfield model, we leave this physical setting and lattice variables no longer signify physical observables – instead, we  interpret spin up/down states as logical states of an idealized neuron (at each lattice point) corresponding to that neuron either firing an action potential (up) or remaining quiescent (down) within some prescribed time interval. The deterministic Hopfield network is a “zero temperature” limit of a more general class of network called the Boltzmann machine which incorporates stochastic update of neurons (also called “units”).

To elaborate further, we consider synchronous neurons where activity of the network is examined at discrete time intervals $t$ indexed by $i$. Examining activity at a given interval  $t_{i}$, we expect units to be firing at random. A “cognitively meaningful” event such as associatively retrieving content based on a partial input pattern is said to occur when units in the network maintain their firing state over several update periods. This repeated firing state of the network (or “pattern”) is viewed as a binary representation of retrieved content. In a similar vein to the Ising system, we define a fictitious Hopfied Hamiltonian (or “Lyapunov Energy”) for the system:

$H(\gamma) = - \frac{1}{2} \sum_{i,j} J_{i,j} \sigma_{i} \sigma_{j} - \sum_{j} \theta_{j} \sigma_{j}$

where $\theta_{j}$ is a threshold value for a unit and $\sigma_{j}$ represents its state . In the zero temperature limit, the activation function is a step function with threshold $\theta_{j}$. This formalism ensures the network decreases its energy  under random updates eventually converging to stable local minima in the energy. Binary representation of the global network state $\gamma^{\prime}$ at the extremum is taken as output.

Dynamic initialization of the network is achieved by setting input units to a desired start pattern – this may be some partial representation of content that the network is designed to reconstruct or “retrieve” as part of its energy minimizing update process. We consider the dynamical landscape of the network as an “energy” landscape, where Lyapunov Energy is plotted as a function of the global state, and retrieval states of the system correspond to minima located within basins of attraction in this landscape. Hopfield showed that attractors in the system are stable rather than choatic, so convergence is usually assured.

Boltzmann Machine

The Boltzmann machine can be described as a Hopfield network (with hidden layers) that incorporates stochastic update of units. The energy for the network is as described by the Lyapunov function above, but the stochastic update process introduces the notion of an artificial non zero temperature $T$ for the system. At each update interval, a unit $i$ computes its total input $\lambda_{i}$ but adding a local bias $b_{i}$ to a sum of weighted connections from all other active units

(i.e. $\lambda_{i} = b_{i} + \sum_{j} J_{ij} \sigma_{j}$).

The probability that the unit $i$ is activated is given by the logistic function $Pr(\sigma_{i} = 1) = \frac {1}{1 + e^{- \lambda_{i}}}$. As units are updated, equilibrium occupation probability for the network will eventually reach a Boltzmann distribution with probability of a global network state $\gamma^{\prime}$ with energy $H(\gamma^{\prime})$ given by:

$Pr( \gamma^{\prime} ) = \frac{ e^{ (-H(\gamma^{ \prime})/k_{b}T)}}{ \sum_{\gamma} e^{ (-H(\gamma)/k_{b}T)} }$

Coupling terms in the Lyapunov function are chosen so that energies of global network states (or “vectors”) represent the “cost” of these states – as such, stochastic search dynamics for the machine will evade inconsistent local extrema while searching for a low energy vector corresponding to the machine’s output. As alluded, this construction destabilizes poor local minima since the search is able to “jump” over energy barriers confining these minima.

The Boltzmann model introduces simulated annealing as a search optimization strategy. Here, we scale parameters by an artificial temperature T (multiplied by the Boltzmann constant $k_{b}$). Analogous to thermodynamic systems, the network reduces temperature  from a large initial value to an equilibrium distribution that makes low energy solutions highly probable. Hopfield networks, discussed earlier, can be viewed as “zero temperature” variants of the Boltzmann machine.

Conclusion

In concluding, recent years have witnessed greater uptake of statistical mechanics and related tools within inter-disciplinary efforts to model brain function. This work, spanning several fields such as applied physics, neuroscience and electrical engineering, grapples with a range of interesting problems from models of category learning in infants to machine learning applicability of implied entropy reversals over micro-timescales in non-equilibrium statistical mechanics. On the other hand, there is wide availability of programming libraries that abstract most of the underlying construction. This, along with the advent of hardware acceleration methods at relatively low cost, open the field to application developers seeking to apply these methods to a range of automation tasks.

Slides from a recent presentation on homomorphic encryption schemes based on approximate GCD, learning with errors (LWE) and ring LWE assumptions (click on the link below):

# Homomorphic Encryption from the LWE Assumption

Slides from a recent talk on secure computation / zero knowledge proofs (click on the link below):

# Multi-Party Computation / Zero Knowledge Proofs / Oblivious Transfer

At some point, you may find yourself questioning the security of NIST specified prime curves. Such thoughts are not uncommon given the compromised status of Dual_EC_DRBG. More significantly, NSA has dropped NIST p-256 and current guidance is to use p-384:

https://www.nsa.gov/ia/programs/suiteb_cryptography/

Interestingly, the agency has also stated that “With respect to IAD customers using large, unclassified PKI systems, remaining at 112 bits of security (i.e. 2048-bit RSA) may be preferable (or sometimes necessary due to budget constraints) for the near-term in anticipation of deploying quantum resistant asymmetric algorithms upon their first availability”.

The intent of this post is to try and question some theories around deliberate specification of weak curves  by selection of “weak seed” inputs to the SHA-1 function as part of the ANSI / NIST curve generation process. The reasoning presented expands on implied requirements around the number of weak curves that must exist over 256 bit prime fields in order for such theories to hold.

Weak Curves:

It is widely known that certain classes of “weak curves” exist where the standard ECDLP problem is reduced to a simpler problem. For example, attacks on curves which leverage pairings are documented in the literature. Such attacks leverage bilinear maps $\langle \cdot,\cdot\rangle: \mathbb{G}_{1}\times \mathbb{G}_{2} \rightarrow \mathbb{J}$ from abelian groups $\mathbb{G}_{1}$ and $\mathbb{G}_{2}$ that evaluate to elements in some other abelian group $\mathbb{J}$. For the Weil and Tate pairings, we map from groups in $E(\mathbb{F}_{p^{k}})$ to a multiplicative group $\mathbb{F}^{*}_{p^{k}}$ where  $\mathbb{F}_{p^{k}}$ is some finite extension of $\mathbb{F}_{p}$ with $k > 1$. Pairings, when applied to so called supersingular curves with embedding degree less than 6, reduce standard ECDLP to the sub exponential DLP in multiplicative groups. Another case of weak curves are the so called prime field anomolous curves with $\sharp E(\mathbb{F}_{p}) = p$, though in general we don’t expect such an equality to hold (On the other hand, the so called Hasse bound tells us that the most that $\sharp E(\mathbb{F}_{p})$ can differ from $p + 1$ is by the trace of Frobenius $t$ where $\mid t \mid \le 2 \sqrt{p}$)

Koblitz and Menezes (KM2015):

For the rest of this post, we’ll dissect some parts of a post Snowden defense of ECC authored by Koblitz and Menezes, two mathematicians of high repute (Koblitz in particular is the co-inventor of elliptic curve cryptography, author of the landmark text “A Course in Number Theory and Cryptography”, and noted sceptic of reductionist methods / provable security) . The ECC defense is entitled “An Riddle Wrapped in an Enigma” published here –

https://eprint.iacr.org/2015/1018.pdf

The line of thought we’re concerned with traces lineage of the NIST prime curves to NSA  involvement in the ANSI X9F1 committee, the exact nature of NSA involvement is not well outlined in the paper, but the implication is that NSA was willing to generate curves on behalf of the ANSI effort, and at the time, had the technology to enumerate orders of the resulting groups.

In order to ensure that known weak curves (known to the NSA, that is) were not selected, it was decided that curve co-efficients be selected by passing a seed to the SHA-1 function. The security of this method would be assured given the fact that it’s hard to invert SHA-1, i.e., in the event that NSA was aware of certain weak curves, it would be hard to find the SHA-1 pre-image that would result in the required co-efficients for the weak curve.

So this brings us to one of the primary arguments against security of the prime curves, and it’s centered on the provenance / legitimacy of the SHA-1 seeds used for curve generation. In the vein of classic conspiracy theories, the argument goes that the NSA could have known about “many” weak curves, and repeatedly picked random seeds until one of the seeds resulted in a weak curve. Koblitz and Menezes in the referenced paper refute this line of thought by arguing that in order to find a seed for a weak prime curve in $2^{48}$ SHA-1 trials, there would have to be approximately $2^{209}$ weak curves over 256 bit prime fields. The argument is as follows:

For a field $\mathbb{F}_{p}$, the number the number of “isomorphism classes” of curves over the field is $2p$. Therefore, for a 256 bit prime there are about $2^{257}$ isomorphism classes of elliptic curves over $\mathbb{F}_{p}$.  Also, for a given curve, we would prefer $\sharp E(\mathbb{F}_{p})$ to be as close to a prime as possible since complexity of ECDLP is based on the largest prime subgroup of $E(\mathbb{F}_{p})$ – put another way, an attacker can use the group order to solve ECDLP in subgroups whose orders are co-prime and then map back to the full group through Chinese Remainder Theorem (CRT) – therefore, it’s important to know the group order, which circa 1997 was purportedly a computation only within reach of the NSA – Koblitz and Menezes (KM2015) imply this as one of the reasons behind involvement of NSA in the ANSI process.

With $E(\mathbb{F}_{p})$ we’re interested primarily in groups of prime order (given CRT derived weaknesses), so to pick up the analysis of KM2015 from where we left off, for $p$ being a 256 bit number $p$, the prime number theorem tells us that there are around $p/ln \ p = 2^{256}/ln(2^{256}) = 2^{256}/256 \ ln2 \approx 2^{248}$ primes. Put another way, a proportion $2^{-8}$ of 256 bit numbers are prime, this tells us that the proportion of strong curves over the field is about $2^{-8}$ of $2^{257}$ (since there are $2^{257}$ isomorphism classes of curves over a 256 bit field) which puts the number of strong curves at about $2^{249}$. Now the argument goes that if NSA knew how to break a proportion $2^{-40}$ of the strong curves, then there would be $2^{249} / 2^{40} = 2^{209}$ curves that NSA could break, but that the rest of the world believes to be secure.  In this case, $2^{209}/2^{257} = 1/2^{48}$ is the proportion of total curves that are known to be insecure only to the NSA. So if you go with the math presented, the argument goes that NSA could have found a known weak curve after $2^{48}$ seed input trials to SHA-1.

KM2015 then proceed to argue that for a 256 bit prime, $2^{209}$ is an extremely large number of weak curves to have gone undetected by the community for 18 years. The argument goes – assume the NSA could conduct week seed searches in $2^{48}$ work – if we take this to be true, we can then work backward with the prime number theorem etc. to arrive at the number of required weak curves for NSA to have “plausibly” searched for seeds producing desired weak curves in $2^{48}$ trials.

Deep Crack: Bespoke Silicon Circa 1998

Though KM2015 go on to cite other possible reasons for dropping p-256, including adaptations of Bernstein and Lange’s so called $n^{1/3}$  pre-computation based attacks, our focus here was to understand the capping of possible NSA SHA-1 trials / hashing power at $2^{48}$, and the derived requirement around $2^{209}$ weak curves.  Should such a bound on available hashing power be considered high or low given the state of 1997 technology? Alas, we must draw our own conclusions, but we present one important benchmark around the state of bespoke silicon in that era. Specifically, we cite the case of the EFF 1998 DES Cracker, colloquially called “Deep Crack”. This was a machine that brute forced the $2^{56}$ DES key space in a matter of days. Deep Crack was designed and built by Kocher et. al. at CRI with 1856 custom ASICS. The cost of the DES cracker is stated at $250,000 in 1998 dollars. Conclusions: It is not the intent of the author to promote the view that there are a large number of weak curves nor that the NSA exploited this fact to pre-determine SHA-1 seeds in order to ensure selection of weak curves and their eventual standardization through NIST. The situation with weak curves is quite different from that of Dual_EC_DRBG, since in some sense, eventual knowledge of such weak curves would confer an “equal opportunity” advantage to anyone with cryptanalytic ability, rather than bestowing an asymmetric advantage to any one party. Rather, the point of this post is to merely trace reasoning around existence of weak curves and provide more insight around quantifying the size of classes of purported weak curves. If one allows NSA to have about $2^{80}$ hashing power in 1997 rather than $2^{48}$, by the reasoning outlined above, this would imply the existence of about $2^{257}/2^{80} = 2^{177}$ weak curves which is a very large number and seems unlikely. In concluding, we echo the remarks of KM2015: “Although there is no plausible reason to mistrust the NIST curves, there are two reasons why it is nevertheless preferable to use other curves .. The first reason is public perception — even though it is unlikely that the NIST curves are unsafe, there has been enough speculation to the contrary that to keep everybody happy one might as well avoid the NIST curves. The second reason is that the other curves do have some advantages … It is no discredit to the NIST curves that more efficient alternatives are now available — after all, it has been 18 years, and it would be surprising if no one had come up with anything better by now” Fully homomorphic encryption has evolved from Craig Gentry’s original work based on worst case problems in ideal lattices and problems based on sparse subset sums, to newer schemes such as the BGV Ring Learning With Errors (LWE) scheme (based on lattice SVP) – This is a scheme where the plaintext space is taken to consist of integer subrings $\mathcal{O}_{\mathbb{Q}(\zeta_m)}$ = $\mathbb{A}_p$ : = $\mathbb{Z}[X]/\langle \Phi_m(X),p \rangle$ $\cong$ $\mathbb{Z}_p(\zeta_m)$ of cyclotomic number fields $\mathbb{Q}(\zeta_m)/\mathbb{Q}$. Of the lattice based schemes, other than the Ring LWE based BGV scheme, (named after Brakerski, Gentry and Vaikuntanathan) there is also the “LATV” scheme of Lopez-Alt, Tromer and Vaikuntanathan, based on the NTRU problem, and Brakerski’s “Scale Invariant” scheme. Notably, the BGV scheme introduces the novel modulus switching technique that, while ensuring the correctness of the scheme, allows the evaluator to scale down the magnitude of the noise in the ciphertexts without knowing the secret key. But here, we pause and take note of the original noise management technique in Gentry’s original work, which is commonly known as the bootstrapping theorem – it states that efficient fully homomorphic encryption can be constructed from a somewhat homomorphic scheme that “compactly” evaluates its decryption circuit. To expand on this notion, consider a ‘somewhat’ homomorphic scheme $\varepsilon$ with plaintext space P = {0,1}, a ciphertext $\Psi_1$ that encrypts a message $\pi$ under a public key $pk_1$, and a secret key $sk_1$ associated with $pk_1$. The key $sk_1$ is then encrypted under a second public key $pk_2$. Denote the encryption of the jth bit of $sk_1$ under $pk_2$ as $\bar{sk_1}_j$. Define a ‘re-encryption’ procedure for the ciphertext $\Psi_1$ as the following – $Recrypt_\varepsilon(pk_2,D_\varepsilon, \langle \bar{sk_1}_j \rangle, \Psi_1)$ : 1) $\bar{\Psi_1}_j \gets Encrypt_\varepsilon(pk_2, \Psi_1j)$ 2) $\Psi_2 \gets Evaluate_\varepsilon(pk_2, D_\varepsilon, \langle \langle \bar{sk_1}_j \rangle, \langle \bar{\Psi_1}_j \rangle \rangle)$ Where $D_\varepsilon$ denotes the scheme’s decryption circuit. The algorithm takes the bits of the ciphered message $\Psi_1$ (which is an encryption of the message $\pi$ under $pk_1$), and encrypts them under the ‘outer’ encryption key $pk_2$ to yield the ciphertext bits $\bar{\Psi_1}_j$. The decryption circuit is then homomorphically evaluated with the ciphered decryption secret $\bar{sk_1}$ (the secret key $sk_1$ encrypted under $pk_2$) and the ‘doubly’ encrypted message ciphertext $\bar{\Psi_1}$. This latter step removes the ‘inner’ encryption of the message $\pi$ under the key $pk_1$, thereby yielding the ciphertext $\Psi_2$ which is just the encryption of $\pi$ under $pk_2$. The bootstrapping theorem states that such an algorithm can be used as an effective noise reduction mechanism to create a fully homomorphic encryption scheme from a somewhat homomorphic one so long as the decryption circuit is “compact” and does not increase noise components to a level that results in ciphertext “wraparound” in the underlying ring. Another way to say this is that the scheme cannot work if the magnitude of ciphertext noise results in reductions modulo q during decryption, where q is a parameter that characterizes the ciphertext space. Interestingly, the decryption circuit associated with Gentry’s original somewhat homomorphic scheme was not sufficiently compact for bootstrapping fully homomorphic encryption. The curcuit was therefore “squashed” along with an approach where the party encrypting data initiates the homomorphic decryption process. In the context of privacy preserving outsourced computation, the Recrypt procedure described would typically be evaluated by an untrusted counterparty. Therefore, it cannot compromise the security of secret keys (This is the case since $\bar{sk_1}$ used by the decryption circuit is the encryption of a secret key rather than the secret key itself) [Todo: discuss circular security and security of key dependent messages]. Lets take a step back and consider some essential aspects of the problem and of Gentry’s original scheme – Fully homomorphic encryption (FHE) schemes allow for the computation of arbitrary functions on encrypted data viz. you can do anything on the data that can be efficiently expressed as a circuit. FHE schemes have eluded cryptographers for at least 3 decades. In 2009 Craig Gentry published an initial scheme based on the ideal coset problem instantiated over ideal lattices, that is lattices corresponding to ideals in polynomial rings of form $\mathbb{A} = \mathbb{Z}[X] / \Phi_m(X)$, consisting of residues of integer polynomials modulo the $m'th$ cyclotomic polynomial$latex \Phi_m(X) $of degree m. Specifically, in his initial somewhat homomorphic construction, Gentry employs computational assumptions around the ideal coset problem (ICP) – when instantiated over ideal lattices. this is equivalent to the so called the bounded distance decoding problem (BDDP). The second computational assumption used involves sparse subset sums – this arises out of a requirement to ‘squash’ the decryption circuit in order to address its essential non-compactness . At the time, lattices were chosen to further minimize complexity of the decryption circuit and facilitate the bootstrap while maximizing the scheme’s evaluative capacity – decryption in lattice based schemes is typically dominated by matrix vector multiplication (NC1 complexity). Contrast this to RSA, for instance, that requires evaluating modular exponents. To recap, a central aspect of Gentry’s result is precisely bootstrapping, which brings us to US patent 2011/0110525 A1 granted May 12 2011 “Fully homorphic encryption method based on a bootstrappable encryption scheme, computer program and apparatus” (Asignee IBM). Within the detailed description of the invention, the following is stated – “It is shown that if the decryption circuit of a somewhat homomorphic encryption scheme is shallow enough, in particular, if it is shallow enough to be evaluated homomorphically by the somewhat homomorphic scheme itself (a self-referential property), then this somewhat homomorphic scheme becomes “bootstrappable”, and can be used to construct a fully homomorphic scheme that can evaluate circuits of arbitrary depth.” The description within the remaining text of the patent is essentially a distillation of Gentry’s 2009 thesis. So on the face of it, it does look like bootstrapping is covered by the patent. There is the lingering question of the applicability of this (or other) patents to the “leveled” scheme of Brakerski, Gentry, Vaikuntanathan (BGV) published in 2011. This is a fully homomorphic scheme based on the learning with errors (LWE) or Ring-LWE (RLWE) problem/s. Rather than bootstrapping as a noise management technique, BGV starts with somewhat homomorphic encryption and uses tensor products along with a process called re-linearization and modulus switching to maintain the relative magnitude of ciphertext noise during homomorphic evaluation, but does propose bootstrapping as an optimization for evaluating deeper circuits that otherwise in the purely leveled scheme may require very large parameter choices for “aggregate” plaintext / ciphertext rings. Concerning references and the attribution of results – the work discussed spans multiple groups and individuals including Craig Gentry, Vinod Vaikuntanathan, Shai Halevi, Zvika Brakerski, Nigel Smart, F. Vercauteren, Adriana Lopez-Alt and Eran Tromer to name a few. This post picks up at the point of considering BGV (Brakerski, Gentry, Vaikuntanathan) encryption variants based on Ring-LWE. Thought it may seem natural to start by considering the BGV enciphering function, I will avoid discussing this since is actually quite simple in principle and essentially consists of polynomial mixing and sampling from an error distribution. Also, it is described well in a post on Windows on Theory by Boaz Barak and Zvika Brakerski that can be found here: http://windowsontheory.org/2012/05/02/building-the-swiss-army-knife/ Instead, I’d like to delve into the plaintext algebra of the “Ring-LWE” BGV variant in order to understand the encoding / representation of plaintext bits in the scheme . In this post, we consider the idea of batched homomorphic operations and so called ‘plaintext slots’, along with some of the basic algebra that supports the specification of plaintext slots. In particular, we consider the Ring-LWE BGV variant defined over rings of form: $\mathbb{A} = \mathbb{Z}[X] / \Phi_m(X)$ where $m$ is a parameter and $\Phi_m(X)$ is the $m' th$ cyclotomic polynomial whose roots are the primitive $m 'th$ roots of unity. The native plaintext space for this variant is taken to be the ring: $\mathbb{A}_2 = \mathbb{A}/2\mathbb{A}$ Note that the cyclotomic polynomial factors into irreducible factors mod 2 (which follows partially from the fact that the cyclotomic extension $\mathbb{Q}(\zeta_m)/\mathbb{Q}$ is Galois) : $\Phi_m(X) = F_1(X)\cdot F_2(X) \cdot \cdot \cdot F_r(X)$ mod 2 Lets cover some background around Chinese Remaindering and homomorphisms of rings – Consider $F(X) \in \mathbb{F}_2[X]$, a monic polynomial of degree that splits into distinct irreducible factors of degree d = N/r, viz. $F(X)=\prod_{i = 1}^{r}F_i(X)$. (In the case of $F(X) = \Phi_m(X)$ mod 2 we have $d \times r = \phi(m))$. Let $A$ denote the ring $A = \mathbb{F}_2[X]/(F(X))$ which is the quotient of $\mathbb{F}_2[X]$ modulo the ideal generated by $F(X)$ By the Chinese remainder theorem, there is the isomorphism: $A \cong \Bbb{F}_2[X]/F_1(X) \otimes \Bbb{F}_2[X]/F_2(X) \otimes\dots\otimes \Bbb{F}_2[X]/F_r(X)$ $\cong \Bbb{F}_{2^d} \otimes \dots \otimes \Bbb{F}_{2^d}$ So that $A$ is isomorphic to $r$ copies of $\Bbb{F}_{2^d}$ (fields of $d$ bit strings). This captures central idea around the of batching homomorphic operations -We start by encoding individual bits in the $r$ distinct plaintext fields $\Bbb{F}_{2^d}$. More precisely, we are not just confined to representing bits, but any element in the ring. For example, if 8 divides d, then we can represent a byte since in this case there is an embedding of $\mathbb{F}_{2^8}$ in $\mathbb{F}_{2^d}$. Then, rather than evaluating functions on the individual bits or ring elements, we use the isomorphism above and work in the ring $A$. Evaluating a function once in $A$ implicitly evaluates the function in parallel over the $r$ distinct plaintext spaces $\Bbb{F}_{2^d}$. The ring $\mathbb{A}_2$ is typically referred to as the “aggregate” plaintext space in BGV, and the sub-rings $\mathbb{Z}_2[X]/F_i(X)$ are referred to as the “embedded” plaintext spaces or “slots”. Some notes on the Brakerski and Vaikuntanathan’s (BV) homomorphic encryption scheme published in 2011, two years following Gentry’s publication of a fully homomorphic encryption scheme. BV achieves fully homomorphic properties from the standard learning with errors problem (LWE). Gentry’s original scheme was based on ideal lattices. This was partly motivated through natural homomorphisms that arise from using ideals in rings given the fact that they’re closed under addition and multiplication. BV, however, contend that the security of schemes based on ideal lattices rests on relatively untested cryptographic assumptions, in addition to which, the bootstrapping procedure in Gentry’s formulation requires “squashing” or simplifying the decryption circuit, which introduces an additional hardness assumption in the form of the sparse subset sum problem. The BV scheme, in contrast, is based on known worst case, classical hardness assumptions of standard problems on arbitrary lattices – the specific problem employed in this scheme is learning with errors (LWE), which states that given an n dimensional secret vector over an integer field, any polynomial number of “noisy”, random linear combinations of the coefficients of the vector are indistinguishable from uniformly random elements in the underlying field. Best known algorithms for LWE are almost exponential in dimension n of the secret vector. An artifact of cipher text multiplication in BV is that the resulting polynomial expression contains terms quadratic in the secret key. Therefore a key aspect of the scheme is so called “re-linearization” which involves publishing encryptions of linear and quadratic terms in the secret key in the resulting polynomial under a new secret key. The substituted expression is then linear in the new secret key. The scheme additionally relies on creating a chain of L secret keys along with encryptions of the quadratic terms of a given key in the chain under the next key. This allows for L levels of multiplications under the scheme. Similar to Gentry’s original blueprint, BV starts with a somewhat homomorphic scheme based on standard LWE, and then creates a bootstrappable scheme that inherits initial scheme’s homomorphic properties and utilizes dimension modulus reduction for managing cipher text noise during successive evaluations up to the prescribed multiplicative depth. From a security standpoint, dimension modulus reduction does not significantly affect Regev’s reduction of Decision LWE (DLWE) to approximate, worst case short vector problems (SVP) on reduced dimensional lattices, thereby maintaining security of the initial scheme. So what’s the relationship to existing patent’s for Gentry’s scheme? Not sure. Next we will discuss aspects of the Brakerski, Gentry, Vaikuntanathan (BGV) scheme that among other things, is a “Ring-LWE” adaptation of BV that incorporates Smart Vercauteren batch operations (“cipher text packing”) by defining the message space to consist of elements in a polynomial ring. ‘Entropy’, in the context of the Linux Random Number Generator (‘LRNG’), is an accretional quantity arising the uncertainty of hardware generated interrupt time stamps. If you’re already familiar with the mechanism, bear with me as I provide some intuition – consider timestamp parity and think of the kernel appending 0 to a stored register value if the millisec counter on an interrupt timestamp is even parity, or 1 if it’s odd, and then generating a string of random bits this way after witnessing many interrupts. The string of resulting bits can then be read out of the register for applications that require randomness, at which point the value is reset to zero. If we admit that even parity on timestamps is as likely as odd, it’s clear that the uncertainty (i.e., the entropy) in bitstrings from this construction is a function of the length of the strings : The greater the number of observed interrupts -> the longer the string -> the higher the entropy. Here is a simple illustration – assume that I create a 3 bit random string by observing the parity of three successive interrupt time values (measured in milliseconds since system boot). If these time values are $5000090, 5000301$ and $5001403$, I create the 3 bit random string by taking each of the residues of these values modulo 2, i.e I take the values$5000090 mod 2 = 0$,$latex 5000301 \mod 2 = 1$,$latex 5001403 mod 2 = 1$to generate the string$latex 011\$ which I then store in a secret register. Without any other information, it’s reasonable to say that your chances of guessing this string are 1 in 8 (i.e. 1 in $2^{3}$). In the parlance of Information Theory, we say this string has 3 ‘bits’ of ‘entropy’. In a similar vein, the entropy of a four bit string generated as such would be 4 ‘bits’ owing to greater uncertainty in it’s value.

The LRNG does not actually generate random bits this way (that is, by concatenating interrupt millisec time values modulo 2). Roughly, the actual method is based on uncertainty in interrupt times represented as ‘jiffies’, 32 bit words representing time of an interrupt in millisecs since system boot, along with a ‘high resolution’ time value obtained from the CPU. These values are ‘mixed’ into the entropy ‘pools’ from where an extraction function creates sequences of pseudo-random bits on demand. This is concretely described in Gutterman, Pinkas et. al’s description of breaks to backtracking resistance of the LRNG [GPR 2006]. The goal of this piece, however, is not to address weaknesses like the analytic breaks discussed above, it’s to try and address looming confusion on a couple of fronts including (i) general suitability of the LRNG pools, including the non blocking ‘urandom’ pool, for ciphering (ii) issues with the quality of randomness arising from virtualization, including arbitration and biasing of otherwise random processes by a hypervisor hosting multiple guest systems on shared hardware (iii) the conflation of virtualization and biasing effects mentioned above, with the effects of app containerization on a single VM.

Rather than expound on all the above, let’s address the suitability of the LRNG for ciphering, after which, perhaps we outline a discussion on real or perceived biasing of timestamp values due to a hypervisor interfering with these values. We will also touch upon the impact and potential for entropy loss from interrupt event correlation across multiple VMs sharing hardware. I don’t think we need to address containers, there’s a good post describing those effects here.

First, let’s talk about the LRNG’s entropy estimation ‘heuristic’, this becomes important when we consider the problem in light of NIST requirements for seeding pseudo random number generators (PRNGs) – The LRNG consists of three pools, the primary, secondary and urandom pools. Concretely, the term ‘pool’ refers to m-bit arrays sized at 4096, 1024 and 1024 bits respectively. Entropy is ‘mixed’ into the primary and sometimes to the secondary pool. Based on this, an entropy count for each of the pools is incremented. It should be clear that the maximal entropy possible for the primary pool is 4096 bits. To expand on intuition provided earlier, If we think of each bit in a 4096 bit string as being equally likely (i.e., each bit is uniformly generated), then each of the 2^4096 variants of this string is equally likely giving us a maximal uncertainty, or ‘entropy’, of 4096 bits for this string. When we say that a string has 4096 ‘bits’ of entropy, we mean to say there is a hidden string that can take on 2^4096 distinct values and your chance of guessing the actual value is 1 in 2^4096 . Importantly, quantifying entropy requires that (i) we can estimate a distribution on the string (as we did above by assuming each bit in the string is uniformly generated), (ii) we choose an estimate of entropy such as so called Shannon entropy, or alternatively, a more conservative estimate favored by NIST, the ‘min’ entropy.

The next thing to consider is how do we estimate or model a distribution on the events used to generate random bits in this string. In the case of the toy example above where we consider parity of interrupt times, a natural model may be to use the uniformity assumption, i.e., it’s just as likely that a timestamp has even or odd parity. But the LRNG doesn’t consider uncertainty in the parity, rather the random variable considered is a 32 bit word representing time of an interrupt in millisecs since system boot along with a time value provided by the ‘high resolution’ CPU timer, instances of which are mixed into the entropy pools along with auxiliary data. What is a reasonable distribution for this variable (actually, a more important question is what is the distribution of the **low order bits** of this variable), and is it something that exhibits short and long range uncertainty? Hard to say, it would depend on a number of factors including the type of device and it’s  usage pattern – it’s a little deceptive to think of the ‘exact’ entropy of a random variable of this nature – this is something that can’t truly be known owing to its inherent uncertainty. Since we lack a priori knowledge of its distribution, it’s something that must be estimated by collecting a sufficiently large sample of data and fitting a distribution or using related min entropy estimation methods (‘binning’, collision estimates etc.). Based on this, the GPR2006  authors estimate about 1.03 bits of entropy per interrupt event based on a collection of 140K events (though the exact entropy estimate used is unclear).  Later, we will look at other studies that sample block device I/O events for the purpose of estimating the entropy of these events in a VM running on a variety of hypervisors.

From a concrete standpoint, the LRNG does not use min entropy estimation, rather, the estimation used is a heuristic based on taking the logarithm of inter-arrival times of interrupts along with their ‘finite derivatives’. The details are not important, the point is that the heuristic potentially runs afoul of NIST recommendations which require min entropy estimates of PRNG seeds. On the bright side, GPR2006 and other studies allude to the fact that the RNG’s estimation heuristic is likely quite conservative. Other than entropy estimation, a couple of other things to consider are (i) periodicity, structural and algorithmic aspects of the LRNG state and entropy extraction functions and comparison to other NIST functions for generating pseudo random sequences (ii) initial seeding and use of the generators, especially during critical periods following system boot / reset, along with NIST seeding requirements vis a vis the ‘instantiation strength’ (Section 12.1) of generators.

First, lets consider the structural and algorithmic aspects of the LRNG. Though the LRNG is known to provide good random data for cryptographic applications, a head to head comparison with NIST PRNGs (section 12.1) is not truly possible, and it’s difficult to impute suitability of the LRNG for ciphering functions by looking at comparable guidance for the NIST PRNGs. For the non blocking urandom pool in particular, we can say that (i) it has no direct interaction with noise sources (ii) has a ‘mixing function that incorporates a linear feedback shift register (LFSR) generating elements in field extensions (we describe this later) (iii) uses SHA-1 for output extraction from its internal pool. This makes it structurally somewhat different from the NIST SP 800 90A generators as well as some of the other Linux generators such as the ‘jitterPRNG’ which is based on variation in execution times for a fixed set of CPU instructions.

We gain a  coarse intuition for thinking of the periods of deterministic (pseudo random) generators by considering their outputs as elements in large finite fields. Specifically, consider a field of order $2^{1024} - 1$  we can think of 1024 bit strings as elements in this field. To do this, however, we need to fix a ‘representation’. Concretely, we do this by thinking of the field as an extension formed by the quotient of the ring of all univariate binary polynomials over $GF(2)$ (i.e. with coefficients in ${0,1}$) by the ideal of a “primitive” binary polynomial of degree 1024 (which we can call $f(x)$ ), that is, $\mathbb{Z}[X] / f(x)$. The ‘raw’ output of the generator is then simply an element in this field (of which there are $2^{1024} -1$ unique non zero elements). Now, we need a way to generate elements, and since we’ve fixed a representation with the primitive polynomial $f(x)$, we can use the obvious cyclic generator; powers of $x \mod f(x)$. From an implementation standpoint, if $z$  is a root of $f(x)$ in the extension, we can fix a representation as binary polynomials in $z$ of maximal degree 1023 (that is, “adjoining” powers of $z$ to the base field). If we refer to elements in this representation as $y_i$, then starting with an arbitrary element, the following recurrence $y_{j} = z \times y_{j - 1} \mod f(z)$ generates all unique $2^{1024} - 1$ non zero elements in the extension. The following example illustrates the process with $f(z) = z^{4} + z + 1$, a degree 4 primitive trinomial evaluated at $z$. Field elements are represented as 4 bit strings. In this representation, the polynomial $1$ is represented as 0001, $z$ as $0010$, $z^{2}$ as $0100$, $z^{3}$ as $1000$, $z+1$ as $0011$ etc. We use the (above) recurrence $y_{j} = z \times y_{j - 1} \mod f(z)$, for $f(z) = z^{4} + z + 1$ to generate 5 successive elements starting with $y_{1} = 0001$  :

$y_2 = z \times y_1 \mod (z^{4} + z +1) = z \mod (z^{4} + z +1) = 0010$

$y_3 = z \times y_2 \mod (z^{4} + z +1) = z \times z \mod (z^{4} + z +1) = z^2 \mod (z^4 + z + 1) = 0100$

$y_4 = z \times y_3 \mod (z^{4} + z +1) = z \times z^2 \mod (z^{4} + z +1) = z^3 \mod (z^{4} + z + 1) = 1000$

$y_5 = z \times y_4 \mod (z^{4} + z +1) = z \times z^3 \mod (z^{4} + z +1) = z^{4} \mod (z^{4 }+ z + 1) = z +1 = 0011$

$y_6 = z \times y_5 \mod (z^{4} + z +1) = z \times ( z+1) \mod (z^{4} + z +1) = z^{2} + z \mod (z^{4} + z + 1) = 0110$

Clearly the period for this method is $2^{n} - 1$ for unique non zero n bit sequences (where $n = 4$ above), but there is a question around sub – periods within the larger sequence, for example, sub – periods within parts of the sequence that represent low order coefficients in the y_i. Rather than delve into this, I’ll state that directly taking such a sequence as output is not secure. Rather, the sequence should be processed through an extraction function that typically hashes and ‘folds’ all of the bits. So the ‘Galois’ RNG just described is capable of generating a long series of n bit sequences, but the reader may ask why not just monotonically increment a n-bit counter – the obvious answer being that this method is totally predictable. But this observation leads us to a derived construction that exploits so called ‘Tausworthe’ sequences of zeros and ones. These are sequences that are obtained by evaluating successive elements in an extension field where evaluation is done in GF(2). So in the example above, we obtain the 5 bit Tausworthe sequence 1 1 1 0 0 by summing all the ‘1’s modulo 2 in each of the $y_{1}, …, y_{6}$. As it turns out, sequences generated in this way have desirable statistical properties and a long period. An earlier generator based on such sequences was described in 1973 by Lewis and Payne with period typically equal to the degree of a primitive ‘trinomial’ used to define its underlying field, and a variant of this described by Matsumoto and Kurita (and used by the LRNG) the ‘Twisted GFSR’ generator, extends this period and has better statistical  uniformity of output – which brings us to the entropy input or the ‘seed’ for such a generator. A way to think about the seed of a generator like this is as an ‘index’ into the sequence that determines the initial position for generating the next sequence such that the uncertainty around the selection of this initial position is 2^n where the bit entropy of the seed is given by n. Since the output of the generator has good randomization properties (such as uniformity of output), the method has the benefit getting rid of any biases that exist in the seed. The LRNG utilizes a variant of the Twisted GFSR generator mentioned above, and roughly speaking, mixes entropy into its pools by continuously re-seeding the generator and mixing its output into the pools (this fact modifies its period in an indeterminate way). Another way to think of it is that the LRNG uses this mechanism for debiasing raw entropy input coming from interrupt events prior to mixing these events into the pool.

Finally we consider the effects of virtualization on the collection and entropy along, with the initial seeding of the generator. On the subject of seeding, we cite guidance from Stephan Muller in a proposal for a new generator motivated by a study on the effects of virtualization on entropy collection. In this proposal, Muller states that –    “[the] DRBG is seeded with full security strength of 256 bits during the first steps of the initramfs time after about 1.3 seconds after boot. That measurement was taken within a virtual machine with very few devices attached where the legacy /dev/random implementation initializes the nonblocking_pool after 30 seconds or more since boot with 128 bits of entropy”.

It should be explicitly said that the term ‘DRBG’ above refers not to the existing LRNG, but to the entropy collection mechanism for a proposed kernel generator based on well studied NIST SP 800 90A algorithms (but excluding the NIST ‘Dual-EC DRBG’ algorithm). Whereas the proposed generator collects entropy quite rapidly, based on this study, rate of collection is much lower for the LRNG which, strictly speaking, does not have sufficient entropy for applications requiring more than 128 bits of entropy (eg. AES 256 bit key generation, secret key generation on the p384 elliptic curve etc. ) at 30 seconds past boot. This situation can be alleviated in some cases by configuring systems to persist a seed at system shutdown that is re-read during boot (and this requires that the seed is kept secret), but persisting a seed is obviously not an option with strategies for continuous ‘repaving’ of systems.

When thinking about virtualization and the accumulation of entropy, we note that hypervisors typically broker access to hardware, and in some cases, abstract underlying hardware to VMs. Brokering of access is typically done either by serializing access requests from multiple guests to a hardware resource, or in some cases, by exclusively assigning access to a single VM. Exclusive or serialized access to hardware is desirable from a cryptographic standpoint since it implies that two guests on the same hardware cannot arrive at a shared view of the state and events generated by this hardware. This is obviously a good thing since from the definition of entropy, any sharing and reuse of timing information among more than one party has the effect of negating all the uncertainty in this information. This BSI study (also authored by Muller) hypothetically poses scenarios where errors in estimation of entropy through the estimation heuristic may occur. The thinking here is that this can arise due to delays added by a hypervisor’s handling of hardware timing events. Shortly, I’ll cite some results from this study around actual entropy measurements obtained by sampling numerous block device events on a variety of hypervisors.

In theory, a hypervisor can emulate most devices, such as a block I/O device with spinning platters (so called ‘rotational’ devices), by converting a VM’s request to such a device to another type of request (eg. a network call) that bears no similarity with the original hardware. It’s important to note this – the kernel does not, for example, consider events generated from flash storage for entropy collection owing to their higher predictability  – whereas this is a safeguard that could be obviated by a hypervisor that provides block device emulation of flash storage or a ramdisk. In other cases, the hypervisor may return data corresponding to a block I/O request by retrieving it from a cache rather than accessing the device, a fact that may materially impact uncertainty of related timing events.

The BSI study referenced above provides concrete numbers around the statistical properties of entropy sources in a virtual environment. A quick recap of some of the results for block I/O devices are summarized below:

Fig. 11 in the reference shows 0.43 bits of entropy in the ‘jiffies delta’, i.e., the difference between successive block I/O interrupt time stamps (measured in millisecs) based on 100K samples for a block I/O device on the KVM hypervisor without buffer caching. This figure shows that the majority of the jiffies deltas are either zero or one therefore containing a high degree of predictability. Fig 14, on the other hand, assigns a Shannon entropy of 16 bits to the 22 low bits of the high-resolution timestamp delta, showing that the low bits of the differences of high resolution timestamps provide the bulk of entropy for block devices. The author also points out that this is greater than the max. assignable value of 11 bits of entropy to a single delta by the estimation heuristic, which supports the statement that the latter method is indeed a conservative estimate for a virtualized environment of this type. Remarkably, figure 19 in this reference shows that the low 22 bits of the block I/O high resolution timestamp difference with hypervisor caching enabled contain about 15 bits of Shannon entropy, which is almost the same as the entropy conveyed in the previous statistic with caching disabled. This implies that buffer caching of data by the KVM hypervisor does not seem to alter inherent uncertainty of this phenomenon. Figures 23 and 28 show similar distributions for block device I/O on Oracle’s VirtualBox with Shannon entropy of the low 22 bits at 13.7 and 14.95 bits respectively for buffer caching disabled vs. enabled scenarios where caching in this case has slightly increased the uncertainty of the events. The Block I/O experiments with caching enabled do not, however, discuss the level to which device reads actually result in cache hits which has material impact on access rates to the hardware. Similarly, fig. 32 shows 15.12 bits of entropy per event on a Microsoft hypervisor, whereas fig. 39 measures 13.61 bits of Shannon entropy per delta on VMWare ESX. Another point I should make is that with this study, the experiments measure timing on a single VM, they do not look at possible correlation of timestamp events on many VMs (for eg., two VMs on shared hardware requesting access to the same block in near simultaneity) – though theoretically, serialized access to the hardware will likely result in different timestamp values for the access.

In concluding, we’ve gone over the basics of entropy measurement and collection in the LRNG, and started a discussion on the effects of virtualization on timing events that provide the raw uncertainty for this process. Future work could include (i) a more detailed discussion of the LRNG mixing function (ii) more analysis on measurements of initial seed entropy of the non blocking urandom pool immediately after system boot. Based on my current understanding, this pool does not wait for sufficient entropy accumulation before transferring bits to calling applications, thereby violating NIST requirements around the entropy of seeds vis a vis the ‘instantiation strength’ of a PRNG (In my view this is the main issue with urandom rather than its periodicity) (iii) more analysis on the impact of virtualization on timing events not related to block device access (iv) my memory is hazy on this point, but I recall that RedHat may have obtained FIPS certification for a kernel RNG, it may be worthwhile to look into this. It was almost certainly not the LRNG, but another generator based on NIST SP 800 90A deterministic algorithms and NIST 800 90B entropy tests.