Deep Learning
and
Virtual Tensor Machines

June 11, 2017

Žiga Sajovic, Xlab

Von Neumann languages do not have useful properties for reasoning about programs. Axiomatic and denotational semantics are precise tools for describing and understanding conventional programs, but they only talk about them and cannot alter their ungainly properties. Unlike von Neumann languages, the language of ordinary algebra is suitable both for stating its laws and for transforming an equation into its solution, all within the language.

- John Backus


Introduction

When John Backus was awarded the Turing Award in 1977 for his work on the Fortran compiler, he followed it with a lecture that became known as his apology for having done it. In the lecture, he condemned von Neumann languages (a set Fortran definitely belongs to) and the very architecture they are implemented for. He delved deeper and touched on the algebra of programs, or rather the lack of it, calling for languages able to reason about the programs they implement.

With recent developments in hardware catering to deep learning (e.g. see Nvidia and Google), his von Neumann bottleneck concern is being thoroughly addressed by the field. Yet his concern pertaining to algebra and reasoning still smolders, as the mathematical treatment of our models is usually thought of as separate from the language that implements them. A theory that captured the mechanics of deep learning as an algebraic language would not only assist the development of tensor processing units, but also enhance our ability to perform the desired reasoning.

The development of an algebraic language in which deep learning is contained has been a recurring subject of our research. As the field is populated by gradients and Jacobians, we are faced with the need for a language that allows analytic deductions through algebraic means; a need to be met by operational calculus and tensor algebra. We present a high-level overview of some of our findings, inviting those interested in an in-depth treatment of the mechanics to read Operational Calculus on Programming Spaces. We will not proceed by treating deep learning directly, yet, by seeking the simplest and most efficient extension to existing formalisms, we will arrive precisely at the calculus deep learning is contained in.

Programs as maps on the memory space

We begin by thinking of our model as an abstract concept; it will shortly be made clear how it is to be manifested, through examples of state-of-the-art projects that employ some aspects of it.

If we focus only on the real-valued variables (of type float or double), the set of all possible states of the program’s memory can be modeled by a finite dimensional real vector space \(\mathcal{V}\), a familiar concept to those in touch with current research on differentiable computing. As such, the effects of a program on its memory space \(\mathcal{V}\) can be described by a map \[P:\mathcal{V}\to\mathcal{V},\] motivating our starting definition.

The tuple \((\mathcal{V},\mathcal{F})\) is a Euclidean virtual machine, where
  • \(\mathcal{V}\) is a finite dimensional vector space over the field \(K\), serving as memory
  • \(\mathcal{F}<\mathcal{V}^\mathcal{V}\) is a subspace of the space of maps \(\mathcal{V}\to\mathcal{V}\), called programming space, serving as actions on the memory.

At first glance, the Euclidean machine seems to be a description of functional programming, with its compositions inherited from \(\mathcal{F}\). An intended impression, as we wish for our construction to be afforded the many fruits functional programming has gifted computer science. To that effect, we assume the standard model of \(\mathcal{F}\), with differentiability being the only restriction imposed on it, for now, as we turn our glance to the memory space \(\mathcal{V}\).
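As a toy illustration (our own, with made-up variable names, not part of the original material), the float-valued state of a small program can be flattened into a vector in \(\mathcal{V}=\mathbb{R}^3\), and the program itself then acts as a map \(P:\mathcal{V}\to\mathcal{V}\) on that state.

```python
import numpy as np

def P(state: np.ndarray) -> np.ndarray:
    # the program's three float variables, read out of the memory vector
    x, y, z = state
    y = x * z + 1.0                    # an assignment alters one coordinate
    return np.array([x, y, z])         # the resulting memory state

print(P(np.array([1.0, 2.0, 3.0])))    # [1. 4. 3.]
```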

Virtual memory space

The memory space of a program is rarely treated as more than storage. But to endow the Euclidean machine with extra structure, this is precisely the place to look, while retaining backward compatibility with the standard model of functional programming. Loosely speaking, functional programming is described by monoids, so a (multi)linear algebra description of the memory space would seem to be the step to take in attaining the extra structure.

We desire our memory space to be the most general algebra containing \(\mathcal{V}\), in the sense of the corresponding universal property. We start with the memory space \(\mathcal{V}\) and its dual \(\mathcal{V^*}:\mathcal{V}\to K\), the space of linear maps from the memory to the field over which it is defined, and let the virtual memory space be defined by \[\mathcal{V}_k=\mathcal{V}_{k-1}+(\mathcal{V}_{k-1}\otimes\mathcal{V^*})\tag{1}\label{eq:memRec},\] with the initial condition \(\mathcal{V}_0=\mathcal{V}\). The space that satisfies the relation \(\eqref{eq:memRec}\) is \[\mathcal{V}_k=\mathcal{V}\otimes T_k(\mathcal{V^*}),\tag{2}\label{eq:memSpace}\] where \(T_k(\mathcal{V^*})\) is the subspace of the tensor algebra of the dual of the memory space \(\mathcal{V}\), consisting of linear combinations of tensors of rank less than or equal to \(k\). As such, an element of the virtual memory space \eqref{eq:memSpace} is a map from the memory to itself, \[\mathcal{V}_n:\mathcal{V}\to \mathcal{V},\] which, for \(W\in\mathcal{V}_n\), we define to be the sum of multiple contractions \[W(v)=w_0+w_1\cdot v+\cdots +w_n\cdot(v)^{\otimes n}\tag{3}\label{contraction}\] (noting that the notational convenience of \(W(v)=w_0+w_1\cdot v\) has already been adopted in contemporary research). By such a construction, expansions and contractions of the memory space, reminiscent of the breathing of the stack, hold meaning parallel to storing values, motivating a new definition.

Let \((\mathcal{V},\mathcal{F})\) be a Euclidean machine and let \[\mathcal{V}_\infty=\mathcal{V}\otimes T(\mathcal{V^*}),\] where \(T(\mathcal{V^*})\) is the tensor algebra of the dual memory space. We call \(\mathcal{V}_\infty\) the virtual memory space of the virtual computing machine \((\mathcal{V},\mathcal{F})\).

By seeking a natural extension of functional programming, we converged on the very objects in terms of which deep learning is formulated. Such a virtual memory space is by itself the most efficient algebra of the memory, its multi-linear mappings being ripe for parallel execution on the GPU, which is already thoroughly exploited. Yet the manner in which it was derived seems to imply that there is something to be discovered about such machines, if we examine their memory for what it is, beyond being an \(n\)-dimensional array.
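To make the contraction \eqref{contraction} concrete, here is a small numpy sketch (our own illustration; the dimension and random weights are arbitrary) of an element \(W\in\mathcal{V}\otimes T_2(\mathcal{V}^*)\) acting on the memory \(\mathcal{V}=\mathbb{R}^3\) by multiple contractions.

```python
# W(v) = w_0 + w_1·v + w_2·(v ⊗ v), eq. (3), for a rank-2 multitensor W
import numpy as np

n = 3
rng = np.random.default_rng(0)
w0 = rng.standard_normal(n)            # element of V
w1 = rng.standard_normal((n, n))       # element of V ⊗ V*
w2 = rng.standard_normal((n, n, n))    # element of V ⊗ V* ⊗ V*

def W(v):
    # contract each dual slot of w_k with a copy of v
    return w0 + w1 @ v + np.einsum('ijk,j,k->i', w2, v, v)

v = rng.standard_normal(n)
print(W(v))                            # a new memory state in V
```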

Memory as a space

Treating the space itself as an object, the stored values can be seen as properties at a point (e.g. what the particular basis element is being tensored with). These properties are best described by distributions over the space, where the deterministic behavior of the memory is recovered by simply imposing the Dirac distribution at that point. Such concepts are already familiar in contemporary research (see the handling of register values in Kurach et al.). Reading is done by flipping a coin (or working with the distribution itself) and writing by modifying the distributions; operations that can be made differentiable by various reparametrization tricks (e.g. see section 2.4 in Jaderberg et al.). Moreover, we are able to come to terms with enumerable memory locations upon transitioning from theory to practice, by having patches of space share distributions.
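A minimal sketch of such differentiable reading and writing (our own; the soft-addressing scheme and names are illustrative assumptions, not taken from the cited works), with an address represented as a distribution over memory cells:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

memory = np.zeros(8)                                              # 8 memory locations
address = softmax(np.array([0.1, 2.0, 0.1, 0., 0., 0., 0., 0.]))  # a distribution

def read(memory, address):
    return address @ memory                   # expectation under the distribution

def write(memory, address, value):
    # blend the new value in, in proportion to the distribution
    return (1.0 - address) * memory + address * value

memory = write(memory, address, 3.14)
print(read(memory, address))                  # both operations are differentiable
```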

Viewing the memory as a space affords us a framework in which we can manipulate it through established methods from attention models, enhanced by methods from computer vision. A similar approach has already been used in contemporary research for manipulating inputs by Spatial Transformer Networks, that have been further integrated in other projects for ROI selection to great success (see DenseCap).

Figure 1: Spatial transformation

Example

A simple experiment to construct is to apply these methods to the memory itself. As a toy demonstration, we construct a network with a shared external memory, used by all layers for reading and writing. As all layers have the ability to read from and write to any part of the memory, this has an effect similar to the skip and residual connections frequently used today. When used with recurrent networks, it even allows gradients to flow directly between different time steps, skipping the steps in between. Moreover, as layers can decide to read and write to different parts of the memory when processing conceptually different inputs, this may lead to the formation of clusters in the memory, similarly to how this is observed with word2vec in a slightly different environment.

For the sake of simplicity, we modify well known models to achieve our construction. Let \(M\) be a rank \(2\) tensor serving as memory and \(C\) be a word from it (similarly to how a byte is treated as a word of memory). With \(x\) standing for the input, we construct a layer from two LSTM layers (hereon denoted by \(L_i\)) as follows. The first part is the controller, which produces the transformation parameters \(\Theta\) \[\Theta^t,C_1^t=L_1(\Theta^{t-1}, x, C_1^{t-1})\tag{4}\label{L1}\] where \(C_1\) is the hidden cell of \(L_1\). We then feed \(\Theta\) to a spatial transformer, along with the memory \(M\), which selects our word \(C_2\) \[C^{\prime t}_2=STN(\Theta^t,M^{t-1}),\tag{5}\label{wordSel}\] which serves as the cell of \(L_2\) at the current time step \[h^t,C_2^t=L_2(h^{t-1},x,C^{\prime t}_2),\tag{6}\label{L2}\] producing the output \(h^t\) and the updated cell state \(C^t_2\). Now, instead of simply holding the state, we write it to the same location we read it from in \eqref{wordSel}, updating the memory state \(M^{t-1}\to M^t\). We denote such a layer spatially transformed memory \[STM: (\Theta^{t-1},x,C^{t-1}_1, h^{t-1},M^{t-1})\to(\Theta^t,C^{t}_1, h^t, M^{t})\tag{7}\label{STM}\] in honor of the LSTM and Spatial Transformer Networks it is derived from.
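A minimal PyTorch sketch of such a layer follows. It is our own illustration of equations \eqref{L1}-\eqref{STM}, not the code used in the experiments below: the class and argument names, the sizes, and the additive write-back through the inverse affine transform are all assumptions made for the sake of a self-contained example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def invert_affine(A):
    # invert a batch of 2x3 affine matrices [R | t] -> [R^{-1} | -R^{-1} t]
    # (assumes the 2x2 part R is invertible, which holds generically)
    R, t = A[:, :, :2], A[:, :, 2:]
    R_inv = torch.inverse(R)
    return torch.cat([R_inv, -torch.bmm(R_inv, t)], dim=2)

class STM(nn.Module):
    def __init__(self, input_size, mem_size=20, word_size=10):
        super().__init__()
        self.mem_size, self.word_size = mem_size, word_size
        self.L1 = nn.LSTMCell(input_size, 6)                    # controller, 6 affine params
        self.L2 = nn.LSTMCell(input_size, word_size * word_size)

    def forward(self, x, state):
        theta, c1, h, M = state
        # (4) the controller produces the transformation parameters Theta
        theta, c1 = self.L1(x, (theta, c1))
        A = theta.view(-1, 2, 3)
        # (5) a spatial transformer selects the word C_2' from the memory M
        grid = F.affine_grid(A, (x.size(0), 1, self.word_size, self.word_size),
                             align_corners=False)
        word = F.grid_sample(M.unsqueeze(1), grid, align_corners=False)
        # (6) L_2 uses the selected word as its cell state
        h, c2 = self.L2(x, (h, word.flatten(1)))
        # (7) write the updated cell back to the location it was read from
        # (here: paste it with the inverse transform and blend it into M)
        grid_w = F.affine_grid(invert_affine(A),
                               (x.size(0), 1, self.mem_size, self.mem_size),
                               align_corners=False)
        patch = F.grid_sample(c2.view(-1, 1, self.word_size, self.word_size),
                              grid_w, align_corners=False).squeeze(1)
        M = M + patch
        return h, (theta, c1, h, M)

# usage: one time step on a batch of 4 inputs of size 32
layer = STM(input_size=32)
x = torch.randn(4, 32)
state = (torch.zeros(4, 6), torch.zeros(4, 6),
         torch.zeros(4, 100), torch.zeros(4, 20, 20))
h, state = layer(x, state)
```

Several such layers sharing a single \(M\) can then be stacked and unrolled over time, as in the experiments below.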

Experiments

We trained a small \(3\)-layer \(STM\) \eqref{STM} on the simple task of character generation (trained on the corpus of Nietzsche), to observe the activity in the memory. The memory \(M\) was a \(20\) by \(20\) grid and the word \(C\) a \(10\) by \(10\) grid (which is later flattened, to be used in the LSTM). The network generates \(10\) characters at a time.

After a few hours of training (Adam + scheduling was used), we see that each layer treats the memory \(M\) differently, with its own pattern of reading and writing. The first layer spreads its reading and writing across most of \(M\), as indicated by the blue colored cells in the figure below.

Figure 2: Memory activity at layer \(1\)

The second layer performs its operations on a more concentrated patch of memory, and moves across it through time, as the network generates characters.

Figure 3: Memory activity at layer \(2\)

The final, third layer (preceding the softmax) can be seen to operate on multiple contiguous blocks of memory, which are no longer spread out like they were in the previous two layers. This seems to indicate clustering of the data in the memory, mentioned before in reference to word2vec.

Figure 4: Memory activity at layer \(3\)

We observe that the sentences generated by \(STM\) \eqref{STM} are more sensible and connected than those produced by a regular LSTM (with a comparable number of parameters), and conjecture this is due to the ability of the network to cluster "remembered" data in the memory \(M\) and the fact that said locations can be shared between different layers at different time steps.

Code

For those interested in playing with the example, we include this code snippet to help get you started. It includes the \(STM\) layer and the code that produces the visualizations.

A note on training

Spatial transformers are usually employed to sample patches from existing images (i.e. non-zero tensors), while with \(STM\) we are starting with an empty memory \(M\) (i.e. zero tensors). Thus special care must be taken with initializations, so the memory \(M\) is thoroughly filled from the start, to facilitate information and gradient flow between different layers at different time steps.

Figure 5: Naive training converges to an LSTM

Thus, when training \(STM\), naive methods can cause the network to always read and write to the same locations in the memory \(M\), effectively converging to an LSTM network, as seen in Figure 5. The choice of the activation functions used when producing the parameters \(\Theta\) affects the way the network distributes data in the memory. This means they can serve as a means of control for the programmer, but naive choices can again lead to unwanted behavior in said distributions.
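One common way to realize the above (our suggestion, in the spirit of the initialization used for Spatial Transformer Networks, not a prescription from the experiments) is to emit \(\Theta\) through a small linear head whose weights start at zero and whose bias starts at the identity transform, so that early reads and writes cover the whole memory \(M\) rather than collapsing onto a single location.

```python
import torch
import torch.nn as nn

theta_head = nn.Linear(64, 6)     # 64 = assumed controller hidden size
nn.init.zeros_(theta_head.weight)
with torch.no_grad():
    # identity affine transform: reads/writes initially span all of M
    theta_head.bias.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))
```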

Programming spaces

Motivated by our definition of the virtual memory space, we define the following function spaces \[\mathcal{F}_n=\{f:\mathcal{V}\to\mathcal{V}\otimes T_n(\mathcal{V}^*)\},\] where the Fréchet derivative is hereon denoted by \(\partial\), with higher derivatives simply being the images of its higher powers \[\partial^k:\mathcal{F}_n\to\mathcal{F}_{n+k}.\] We are now ready to restrict our set \(\mathcal{F}\) to best suit our needs. As the derivatives need to be implementable in our machine, we must impose closure of the set under the differentiation operator \(\partial\).

A differentiable programming space \(\mathcal{P}_0\) is any subspace of \(\mathcal{F}_0\) such that \[\partial\mathcal{P}_0\subset\mathcal{P}_0\otimes T(\mathcal{V}^*)\tag{8}\label{progSpace}.\] The space \(\mathcal{P}_n<\mathcal{F}_n\) spanned by \(\{\partial^k\mathcal{P}_0;\quad 0\leq k\leq n\}\) over \(K\) is a differentiable programming space of order \(n\).
The definition of higher order differentiable programming spaces is justified by the following theorem.
Any differentiable programming space \(\mathcal{P}_0\) is an infinitely differentiable programming space, meaning that \[\partial^k\mathcal{P}_0\subset\mathcal{P}_0\otimes T_k(\mathcal{V}^*)\tag{9}\label{difProgSpace},\] for any \(k\in\mathbb{N}\).
A simple proof by induction can be found here.
This means that any \(n\)-differentiable programming space \(\mathcal{P}_n\) can be embedded into the tensor product of the function space \(\mathcal{P}_0\) and the space \(T_n(\mathcal{V}^*)\) of multitensors of order less than or equal to \(n\) \[\mathcal{P}_n<\mathcal{P}_0\otimes T_n(\mathcal{V}^*).\tag{10}\label{embbedding}\] In plain tongue, the theorem states that if first derivatives are implementable with some set, all higher derivatives are as well.
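The statement can be made tangible with automatic differentiation. The following sketch (our own, using JAX; the toy program \(P\) is an arbitrary choice) iterates the first derivative of a program \(P:\mathcal{V}\to\mathcal{V}\) to obtain higher derivatives as multitensors of increasing rank, elements of \(\mathcal{V}\otimes T_k(\mathcal{V}^*)\).

```python
import jax
import jax.numpy as jnp

def P(v):                              # a toy program acting on V = R^3
    return jnp.tanh(jnp.array([v[0] * v[1], v[1] + v[2], v[2] ** 2]))

v0 = jnp.array([0.1, 0.2, 0.3])
J = jax.jacfwd(P)(v0)                  # element of V ⊗ V*        (rank 2)
H = jax.jacfwd(jax.jacfwd(P))(v0)      # element of V ⊗ V* ⊗ V*   (rank 3)
print(J.shape, H.shape)                # (3, 3) (3, 3, 3)
```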

Virtual Tensor Machines

As a result of the infinite differentiability theorem, the tuple \(\langle\mathcal{V},\mathcal{P}_0\rangle\) and the structure of the tensor algebra are sufficient to construct a differentiable programming space \(\mathcal{P}_n\) of any order \(n\) by linear combinations of elements of \(\mathcal{P}_0\otimes T(\mathcal{V}^*)\). This allows a concise definition of a framework for the analytic study of programs through algebraic means.

The tuple \(M=\langle\mathcal{V},\mathcal{P}_0\rangle\) is a virtual tensor machine, where
  • \(\mathcal{V}\) is a finite dimensional vector space, with \(\mathcal{V}_\infty=\mathcal{V}\otimes T(\mathcal{V}^*)\) serving as the virtual memory space
  • \(\mathcal{P}_0\) is a differentiable programming space, serving as actions on the memory.
By composing contractions \eqref{contraction} with activation functions \(\phi\in\mathcal{P}\), we see that tensor networks \[ \mathcal{N}(v)=\phi_k\circ W_k\circ\cdots\circ\phi_0\circ W_0(v),\tag{11}\label{tensorNetwork} \] are basic programs in the virtual tensor machine, where standard neural networks are retained by imposing \(\forall_i(W_i\in\mathcal{V}_1)\). The formulation also covers convolutional models upon generalization, which will be noted where appropriate.
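A small numpy sketch (our own illustration, with arbitrary sizes and weights) of a tensor network \eqref{tensorNetwork} restricted to \(W_i\in\mathcal{V}_1\), which is exactly an ordinary dense network:

```python
import numpy as np

def contraction(w0, w1, v):
    # W(v) = w_0 + w_1·v, the first two terms of eq. (3)
    return w0 + w1 @ v

rng = np.random.default_rng(0)
layers = [(rng.standard_normal(4), rng.standard_normal((4, 3))),
          (rng.standard_normal(2), rng.standard_normal((2, 4)))]
phi = np.tanh                          # the activation functions phi_i

v = rng.standard_normal(3)
for w0, w1 in layers:                  # N(v) = phi_1 ∘ W_1 ∘ phi_0 ∘ W_0 (v)
    v = phi(contraction(w0, w1, v))
print(v)
```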
It can be stated that libraries such as TensorFlow are implementations of virtual tensor machines containing certain subsets of the differentiable programming space \(\mathcal{P}_1<\mathcal{P}_0\otimes T(\mathcal{V}^*)\).

Operational calculus on programming spaces

The next thing our machine needs is a way to shift functions from their initial values, which can later be used to implement iterators and composers. We proceed similarly to how we shifted functions in a previous post on recurrence.

Tensor series expansion

We define the generalized shift operator to be \[e^{h\partial}=\sum\limits_{n=0}^\infty\frac{(h\partial)^n}{n!}.\] In coordinates, the operator \(e^{h\partial}\) can be written as a series over all multi-indices \(\alpha\) \[ e^{h\partial}=\sum\limits_{n=0}^{\infty}\frac{h^n}{n!}\sum_{\forall_{i,\alpha}}\frac{\partial^n}{\partial x_{\alpha_1}\ldots \partial x_{\alpha_n}}e_i\otimes dx_{\alpha_1}\otimes\ldots \otimes dx_{\alpha_n}, \] an overflow of indices, which can luckily be thought of simply as \(e^{h\partial}\), due to its nice properties.
For a program \(P\in\mathcal{P}\) the expansion into an infinite tensor series at the point \(v_0\in \mathcal{V}\) is expressed by multiple contractions \[ P(v_0+hv)=\Big((e^{h\partial}P)(v_0)\Big)(v). \tag{12}\label{tenSeries} \]
The proof can be found here.
The above theorem can be extended to convolutions, by redefining the \(\partial\) operator and employing the Volterra series.
As the operator \(e^{h\partial}\) defines a map \[e^{h\partial}:\mathcal{P}\to\mathcal{P}_n,\] it thus, by \eqref{embbedding}, also defines a map \[ e^{h\partial}:\mathcal{P}\times\mathcal{V}\to\mathcal{V}\otimes T(\mathcal{V}^*), \] from the virtual machine tuple to the virtual memory space, meaning that any program \(P\in\mathcal{P}\) can be expressed as a series of contractions of the memory, as defined in \eqref{contraction}, \[ P(v_0+v)=W(v)\tag{13}\label{wTenSeries}, \] for some \(W\in\mathcal{V}\otimes T(\mathcal{V}^*)\), as \[ e^{h\partial}(\mathcal{P}_0)\subset\mathcal{P_0}\otimes T(\mathcal{V}^*),\tag{14}\label{contained} \] which opens many interesting directions to explore in how deep learning can benefit general computer science and program analysis.
To cope with the finiteness of any implementable machine, it is common to use the projection of the operator \(e^{h\partial}\) to the first \(n\) basis vectors \(\{\partial^i\}\), hereon denoted by \(e^{h\partial}_n\).
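A small JAX check (our own; the toy program and the truncation order are arbitrary choices) of the tensor series expansion \eqref{tenSeries}: the projected operator \(e^{h\partial}_n\) is built by iterating the derivative, its multitensors are contracted with copies of \(v\) as in \eqref{contraction}, and the result matches \(P(v_0+hv)\).

```python
import jax
import jax.numpy as jnp
from math import factorial

def P(v):                                        # a toy program on V = R^2
    return jnp.array([jnp.sin(v[0] * v[1]), v[0] + jnp.cos(v[1])])

def shift(P, v0, v, h=1.0, order=6):
    # ((e_n^{h∂} P)(v_0))(v): the k-th derivative lies in V ⊗ (V*)^{⊗k};
    # its k dual slots are contracted with copies of v.
    total, D = 0.0, P
    for k in range(order + 1):
        T = D(v0)
        for _ in range(k):
            T = jnp.tensordot(T, v, axes=([-1], [0]))
        total = total + (h ** k) / factorial(k) * T
        D = jax.jacfwd(D)                        # next derivative of P
    return total

v0, v = jnp.array([0.3, -0.2]), jnp.array([0.1, 0.05])
print(shift(P, v0, v))                           # W(v), as in eq. (13)
print(P(v0 + v))                                 # the two agree to high precision
```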

Program composition

It can be shown that not only all maps \(P\in\mathcal{P}\) are natural elements of the virtual memory, but so are all their compositions, showing this view to be complete.

Composition of maps in \(\mathcal{P}\) is expressed as \[ e^{h\partial}(f\circ g)=\exp(\partial_fe^{h\partial_g})(g,f), \] where \(\exp(\partial_fe^{h\partial_g}):\mathcal{P}\times\mathcal{P}\to\mathcal{P}_\infty\) is an operator on pairs of maps \((g,f)\), with \(\partial_g\) the differentiation operator applied to the first component \(g\), and \(\partial_f\) applied to the second component \(f\).
The proof can be found here.
The above theorem can be extended to convolutions, by redefining the \(\partial\) operator and employing the Volterra series, as it was done in the Convolutional series expansion remark.
Both the forward and the reverse mode of automatic differentiation can be implemented by the compositional operator by fixing the appropriate mapping in the tuple. Those interested in a more in-depth explanation are invited to find it here, as we will focus on the corollary \[ e^{h\partial}\vert_{v_0}(p_n\circ\cdots\circ p_0)=e^{h\partial}(p_n)\vert_{v_n}\circ\cdots\circ e^{h\partial}\vert_{v_0}(p_0)\tag{15}\label{compositions} \] implied by the program composition theorem, which translates to \[ e^{h\partial}\vert_{v_0}(p_n\circ\cdots\circ p_0)(v)=W_n\cdots W_0(v)\tag{16}\label{WWW}, \] upon evaluating according to \eqref{wTenSeries}.
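The corollary can again be checked numerically. The sketch below (our own, using JAX; the programs, the truncation order, and the explicit displacement bookkeeping between expansion points are our simplifying assumptions) composes the truncated expansions of \(p_0\) around \(v_0\) and of \(p_1\) around \(v_1=p_0(v_0)\), and compares them with the expansion of the composite, in the spirit of \eqref{compositions} and \eqref{WWW}.

```python
import jax
import jax.numpy as jnp
from math import factorial

def expand(P, v0, order=6):
    # return W = (e_n^{h∂} P)(v_0) as a callable series of contractions (h = 1)
    derivs, D = [], P
    for _ in range(order + 1):
        derivs.append(D(v0))
        D = jax.jacfwd(D)
    def W(v):
        total = 0.0
        for k, T in enumerate(derivs):
            for _ in range(k):
                T = jnp.tensordot(T, v, axes=([-1], [0]))
            total = total + T / factorial(k)
        return total
    return W

p0 = lambda v: jnp.array([jnp.tanh(v[0] + v[1]), v[0] * v[1]])
p1 = lambda v: jnp.array([jnp.sin(v[0]), v[0] + jnp.cos(v[1])])

v0, v = jnp.array([0.2, -0.1]), jnp.array([0.05, 0.04])
v1 = p0(v0)
W0, W1 = expand(p0, v0), expand(p1, v1)
W = expand(lambda u: p1(p0(u)), v0)
print(W(v))                    # expansion of the composite at v_0
print(W1(W0(v) - v1))          # composed expansions, shifted to their points
print(p1(p0(v0 + v)))          # direct evaluation of p_1 ∘ p_0
```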

Program basis transformations

The commonly used Fourier transform is a change of basis operation that expresses a given function as a sum of sine and cosine functions. Now suppose a set of functions \(F_1=\{f_i:\mathcal{V}\to\mathcal{V}\}\) was used to write the program \(P\), and that it would have been more suitable for us, whether for convenience in some theoretical reasoning or otherwise, to have it expressed with a set of functions \(F\). Then the presented formalism affords us a similar feat for programs that are implementable in the virtual computing machine. The operator \(e^{h\partial}\) maps both sets of functions to the same basis \(\mathcal{X}\). Thus only the transformation tensor \(T_{F\mathcal{X}}:\mathcal{X}\to F\) is needed to perform the feat. Then the program \(P\) expressed in the basis \(F\) is simply \[ P_F=T_{F\mathcal{X}}\cdot e^{h\partial}(P).\tag{17}\label{trans} \] The details and the derivation of such tensors can be found here. For the purpose of this post, it is enough for the reader to think of it as something analogous to the Fourier transform, as when the dust settles, the same mechanics of linear algebra are at play.

When using the projected operator \(e_n^{h\partial}\), the equality \eqref{trans} does not hold, but we are still left with the best possible approximation on the first \(n\) basis vectors, similarly to what one is used to from the Fourier transform.

Putting it all together

As the image of the operator \(e^{h\partial}\) is an element of the virtual tensor machine, so is the image of the transformation tensor \eqref{trans}. Suppose a program \(P\in\mathcal{P}_0\) is a composition \(P_3\circ P_2 \circ P_1\); then the following equality holds \[ P=e^{h\partial}(P)=T_{F\mathcal{X}}\cdot e^{h\partial}(P), \] which distributes over compositions \eqref{compositions}. The implied step to take is to consider their mutual compositions with regular programs \(P\in\mathcal{P}_0\), resulting in \[ P=e^{h\partial}(P_3)\circ P_2\circ T_{F\mathcal{X}}\cdot e^{h\partial}(P_1), \] or any other combination of operator application, as summarized by the following figure.

Figure 6: Transformation diagram
As any path in the directed graph of Figure 6 can be taken at execution time, it affords us great flexibility in reasoning about programs, given that any sequential part of a program can be viewed as a composition of its actions on the memory.

Connection to related work

The model fits snugly with contemporary developments in the field. For example's sake, assume we always take the rightmost path in the graph of Figure 6. As noted in the remark on projections, in practice one always works with a finite projection \(e^{h\partial}_n\) of the operator, and the equality \eqref{trans} becomes an approximation. Thus, instead of pre-computing the transformation tensor and having it be constant, we can make it a function of the input, where a neural network (usually a variant of an LSTM) performs the calculation of the transformation tensor. Then the described mechanics become a variant of Neural Program Interpreters, with pre-computed values of individual transformation tensors \eqref{trans} becoming meaningful initializations of the training process.

Conclusions

We have shown that virtual tensor machines have the ability to reason about the programs they implement; not only talk about them, but also alter their properties. Operational calculus offers a step towards the goal of an algebraic language in which programs can not only be written, but solved for. By reflecting the nature of deep learning, virtual tensor machines present a way for deep learning to assist in analysis of other branches of computer science. As most theories allowing analytic insights are bound to involve some limiting process, which must be cut off at some point by any implementation, deep learning can take the place of improving approximations, facilitating the move from theory to practice when reasoning about programs.