<p>Deriving the Neural Tangent Kernel, by Bryn Elesedy, 2019-04-02 (https://brynhayder.github.io/jekyll/update/2019/04/02/neural-tangent-kernel)</p>
<h2 id="introduction">Introduction</h2>
<p>I recently came across the paper <a href="https://arxiv.org/abs/1806.07572">Neural Tangent Kernel: Convergence and Generalization in Neural Networks</a>, published at NeurIPS 2018.
I was super excited by this paper and I think it is a great execution of a very natural idea.
I thought I would try to reproduce the ideas of the paper myself.
Hence, this post is a reconstruction of the general idea of the paper in a way that seems natural to me.
\[
\newcommand{\params}{\vec{\theta}}
\newcommand{\net}{f^\params}
\newcommand{\F}{\mathcal{F}}
\newcommand{\X}{\mathcal{X}}
\]</p>
<h2 id="roadmap">Roadmap</h2>
<p>We have a neural network that we will train using gradient descent on some loss function.
We are interested in characterising the behaviour of the optimisation. In particular, we want to know what the cost does under gradient descent.</p>
<p>We will explore this question directly and find an object, the neural tangent kernel (NTK), which is a kernel on the space of functions in which our network lives. We will see that our optimisation procedure results in a network that performs well on our task if the NTK satisfies a certain condition (positive definiteness).
The central contributions of the paper are to provide this insight into the dynamics of the cost and to prove the aforementioned property.
I’m not going to go through the proofs (at least in the first version of this post) but I will hopefully give some insight into the general idea.</p>
<h2 id="networks-as-functions">Networks as Functions</h2>
<p>Neural networks are a special class of parameterised functions.
As we vary the parameters $\params$, the network function $\net$ varies on a family of functions defined by the architecture.
The choice of $\params$ determines the behaviour of the network and we will typically choose $\params \in \R^P$ to (approximately) minimise some <em>cost</em> function $c(\params) = C[\net]$ (here we say cost rather than loss for consistency with the paper).
Note that we have defined the cost in terms of a functional $C$. We will be interested in the behaviour of $C$ on the space of functions $\F = \{\net | \params \in \R^P\}$ attainable by our network. (There are some technical requirements for $\F$ that we will ignore.)
For the purposes of this post, a functional is just a mapping from functions (so from a space like $\F$) to $\R$.
If we have a functional $J$ and a function $g$ then $J$ evaluated on $g$ is a real number written $J[g]$ with square brackets.</p>
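<p>To make the notation concrete, here is a minimal sketch of a functional as a higher-order Python function. The dataset <code>DATA</code>, the squared-error form of the cost and the two-parameter "network" are illustrative assumptions, not something fixed by the paper.</p>

```python
# Hypothetical toy dataset of (x, y) pairs, used only for illustration.
DATA = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]


def cost_functional(g):
    """C[g]: maps a function g to a real number (mean squared error on DATA)."""
    return sum((g(x) - y) ** 2 for x, y in DATA) / len(DATA)


def make_net(theta):
    """A tiny stand-in for a network: f^theta(x) = theta[0] * x + theta[1] * x**2."""
    return lambda x: theta[0] * x + theta[1] * x ** 2


def cost(theta):
    """c(theta) = C[f^theta]: the same cost, viewed as a function of the parameters."""
    return cost_functional(make_net(theta))
```

<p>Here <code>cost_functional</code> plays the role of $C$ and accepts any function, while <code>cost</code> plays the role of $c$ and only ever sees functions in $\F$.</p>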
<h2 id="the-cost-functional">The Cost Functional</h2>
<p>We have made a change in perspective from thinking of our network in terms of parameter space to thinking of it living in the function space $\F$.
Correspondingly, we are going to study the dynamics of $C$ on $\F$ rather than of $\params$ on $\R^P$.</p>
<p>Suppose we perform gradient descent on the parameters, so that they vary on the trajectory
\[
\diff{\params}{t} = - \pdiff{C[\net]}{\params},
\]
where $t$ plays the role of “time”.
From now on we will have $\params = \params(t)$, but leave the time dependence implicit for notational ease.
We want to analyse the dynamics of the cost in this scheme, and in order to do that we will need the functional derivative of $C$.</p>
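<p>One way to connect this continuous-time trajectory to ordinary gradient descent is to discretise it with forward Euler steps, which recovers the usual update rule. The sketch below does this for a toy quadratic cost; the cost, the step size <code>lr</code> and the function names are assumptions made for illustration.</p>

```python
def cost(theta):
    """Toy cost c(theta) = 0.5 * sum(theta_i ** 2), chosen only for illustration."""
    return 0.5 * sum(t ** 2 for t in theta)


def grad_cost(theta):
    """Gradient of the toy cost above."""
    return list(theta)


def gradient_flow_euler(theta0, lr, n_steps):
    """Forward-Euler integration of d(theta)/dt = -grad c(theta).

    Each Euler step of size lr is exactly one gradient descent update.
    Returns the list of parameter vectors visited, including theta0.
    """
    theta = list(theta0)
    trajectory = [list(theta)]
    for _ in range(n_steps):
        g = grad_cost(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
        trajectory.append(list(theta))
    return trajectory
```

<p>For a small enough step size the cost decreases along the discrete trajectory, mirroring the continuous-time behaviour analysed in the rest of the post.</p>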
<h2 id="functional-derivatives">Functional Derivatives</h2>
<p>We want to measure the variation of a functional as we move a function in our function space.
To do this we will define a notion of a derivative analogous to the $f: \R \to \R$ case.</p>
<p>Given a (suitably nice) space of functions $\X \to \R^k$ which we’ll call $\mathcal{G}$, $g \in \mathcal{G}$ and a functional $J: \mathcal{G} \to \R$, the functional derivative of $J$ at $g$ is written
\[
\fdiff{J}{g}.
\]
This is the element in $\mathcal{G}$ such that for <em>any</em> $\phi \in \mathcal{G}$
\[
\int_\X \fdiff{J}{g}(x)^{T} \phi(x) \d x = \lim_{\epsilon \to 0} \frac{J[g + \epsilon \phi] - J[g]}{\epsilon}.
\]
(This exists and is unique by the <a href="https://en.wikipedia.org/wiki/Riesz-Markov-Kakutani_representation_theorem">Riesz-Markov-Kakutani Representation Theorem</a>.)
You can think of this as the change in $J$ from moving infinitesimally at $g$ in the direction of $\phi$, analogous to the familiar fact that the directional derivative of $f$ in the direction of $\vec{n}$ is $\vec{n}\cdot\nabla f$.</p>
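<p>As a worked example (an illustrative choice of functional, not one taken from the paper), take the squared error against a fixed target function $y \in \mathcal{G}$, that is $J[g] = \frac12 \int_\X \lVert g(x) - y(x) \rVert^2 \d x$. Expanding the square gives</p>
<div>
\begin{align*}
\lim_{\epsilon \to 0} \frac{J[g + \epsilon \phi] - J[g]}{\epsilon}
&= \lim_{\epsilon \to 0} \left( \int_\X (g(x) - y(x))^{T} \phi(x) \d x + \frac{\epsilon}{2} \int_\X \lVert \phi(x) \rVert^2 \d x \right) \\
&= \int_\X (g(x) - y(x))^{T} \phi(x) \d x,
\end{align*}
</div>
<p>so comparing with the definition we can read off $\fdiff{J}{g} = g - y$, the pointwise residual.</p>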
<h3 id="how-does-changing-the-parameters-change-the-cost">How Does Changing the Parameters Change the Cost?</h3>
<p>Suppose we vary the parameters of our network $\params \to \params + \epsilon \vec{\eta}$. How does the cost change?</p>
<div>
\begin{align*}
\lim_{\epsilon \to 0} \frac{C[f^{\params + \epsilon \eta}] - C[\net]}{\epsilon}
&= \lim_{\epsilon \to 0} \frac{C[\net + \epsilon \eta \cdot \pdiff{\net}{\params} + O(\epsilon^2)] - C[\net]}{\epsilon} \\
&= \sum_{i,j} \eta_i \int_\X \fdiff{C}{\net}(x)_j \pdiff{\net_j}{\theta_i}(x) \d x
\end{align*}
</div>
<p>Note that we have just calculated $\vec{\eta} \cdot \pdiff{C[\net]}{\params}$.
We can calculate $\diff{C[\net]}{t}$ (recalling that $\params = \params(t)$) by setting $\eta = \diff{\params}{t}$.</p>
<h2 id="the-neural-tangent-kernel">The Neural Tangent Kernel</h2>
<p>We can now turn back to our original question: what are the dynamics of the cost under gradient descent?
First recall that under gradient descent we have
\[
\diff{\params}{t} = - \pdiff{C[\net]}{\params}.
\]
Now we can calculate</p>
<div>
\begin{align*}
\diff{C[\net]}{t}
&= \sum_{i, j} \diff{\theta_j}{t} \int_\X \fdiff{C}{\net}(x)_i\pdiff{\net}{\theta_j}(x)_i \d x \\
&= -\sum_{i, j, k} \int_\X \fdiff{C}{\net}(x')_k \pdiff{\net}{\theta_i}(x')_k \d x' \int_\X \fdiff{C}{\net}(x)_j\pdiff{\net}{\theta_i}(x)_j \d x \\
&= - \sum_{j, k = 1}^{M} \int_\X \int_\X \fdiff{C}{\net}(x)_j \left( \sum_{i=1}^P \pdiff{\net}{\theta_i}(x)_j \pdiff{\net}{\theta_i}(x')_k\right) \fdiff{C}{\net}(x')_k \d x \d x' \\
&= - \int_\X \int_\X \fdiff{C}{\net}(x)^T K_{\text{NTK}}(x, x') \fdiff{C}{\net}(x') \d x \d x' \\
&= - \left\lVert{} \fdiff{C}{\net} \right\rVert{}^2_{K_\text{NTK}}.
\end{align*}
</div>
<p>Here the sums over $j$ and $k$ run over the $M$ output components of the network, and we have introduced the <em>neural tangent kernel</em>
\[
K_{\text{NTK}}(x, x') = \sum_{i=1}^{P} \pdiff{\net}{\theta_i}(x) \otimes \pdiff{\net}{\theta_i}(x')
\]
and the final line is the correspondingly induced norm.</p>
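<p>The identity above can be checked numerically on a toy problem. The sketch below is built on assumptions of my choosing: a scalar three-parameter network $a \tanh(wx + b)$, a squared-error cost, and a finite dataset standing in for the integral over $\X$ (so the functional derivative becomes the vector of residuals). It computes the rate of change of the cost once from the parameter gradient and once from the kernel.</p>

```python
import math


def net(theta, x):
    """Scalar toy network f^theta(x) = a * tanh(w * x + b) (illustrative only)."""
    a, w, b = theta
    return a * math.tanh(w * x + b)


def param_grad(theta, x):
    """Gradient of the network output with respect to (a, w, b) at input x."""
    a, w, b = theta
    t = math.tanh(w * x + b)
    return [t, a * (1.0 - t * t) * x, a * (1.0 - t * t)]


def ntk(theta, x1, x2):
    """K_NTK(x1, x2) = sum_i df/dtheta_i(x1) * df/dtheta_i(x2)."""
    g1, g2 = param_grad(theta, x1), param_grad(theta, x2)
    return sum(u * v for u, v in zip(g1, g2))


def cost_rate_two_ways(theta, xs, ys):
    """Return dC/dt under gradient flow, computed in two equivalent ways.

    Assumes C = 0.5 * sum_n (f(x_n) - y_n)^2, so dC/df(x_n) is the residual.
    """
    residuals = [net(theta, x) - y for x, y in zip(xs, ys)]
    grads = [param_grad(theta, x) for x in xs]
    # (i) dC/dt = -|dC/dtheta|^2, straight from the gradient descent trajectory.
    dc_dtheta = [sum(r * g[i] for r, g in zip(residuals, grads)) for i in range(3)]
    via_params = -sum(d * d for d in dc_dtheta)
    # (ii) dC/dt = -r^T K r, with K the NTK evaluated on all pairs of data points.
    via_kernel = -sum(
        ri * ntk(theta, xi, xj) * rj
        for ri, xi in zip(residuals, xs)
        for rj, xj in zip(residuals, xs)
    )
    return via_params, via_kernel
```

<p>The two numbers agree up to floating point error and are never positive, which is the finite-data analogue of the last line of the derivation.</p>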
<p>We see that if this kernel is positive definite, then the cost is strictly decreasing wherever $\fdiff{C}{\net} \neq 0$ and so, for a convex cost functional, will converge to a global optimum on $\F$.
In the paper it is shown that at (Gaussian) initialisation the kernel is indeed positive definite in the infinite width limit, and, also in the infinite width limit, that it remains approximately constant throughout training.</p>
<p>Notes on Gaussian Processes for Regression, 2018-03-07 (https://brynhayder.github.io/jekyll/update/2018/03/07/Gaussian-Process-Regression)</p>
<p><strong>Disclaimer</strong></p>
<p>These notes are mostly for my own purposes, so they may be a bit rubbish in some places.</p>
<hr />
<h2 id="motivation">Motivation</h2>
<p>We will be interested in regression problems with a single output dimension</p>
<div>
\begin{equation}
y = f(\mathbf{x}) + \varepsilon
\end{equation}
</div>
<p>where <script type="math/tex">\mathbf{x} \in \mathbb{R}^n</script>, <script type="math/tex">y \in \mathbb{R}</script> and <script type="math/tex">\varepsilon \sim \mathcal{N}(0, \sigma^2)</script>.</p>
<p>We will find that characterising <script type="math/tex">f</script> as a Gaussian process (GP) provides a flexible yet interpretable family of models for this problem.</p>
<p><img src="/images/Gaussian-Process-Regression/generic_fit.png" alt="generic_fit" /></p>
<h2 id="definition">Definition</h2>
<blockquote>
<p>A Gaussian process is a (possibly infinite) collection of random variables such that any finite subset has a joint multivariate Gaussian distribution</p>
</blockquote>
<p>We are interested in the case in which the GP is parameterised by <script type="math/tex">\mathbf{x} \in \mathbb{R}^n</script>. That is, our GPs will provide a measure over functions <script type="math/tex">f: \mathbb{R}^n \to \mathbb{R}</script>.</p>
<p>We denote a GP <script type="math/tex">f</script> by</p>
<div>
\begin{equation}
f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x'}))
\end{equation}
</div>
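<p>where <script type="math/tex">m: \mathbb{R}^n \to \mathbb{R}</script> is the mean function and <script type="math/tex">k</script> is the covariance kernel, which returns the covariance between the function values at a pair of inputs</p>
<div>
\begin{equation}
k: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}.
\end{equation}
</div>
<p>Evaluating <script type="math/tex">k</script> on every pair of points from a finite set <script type="math/tex">\mathbf{x}_1, \dots, \mathbf{x}_N</script> must give a valid covariance matrix, that is, a matrix <script type="math/tex">\Sigma \in \mathbb{R}^{N \times N}</script> with <script type="math/tex">\Sigma \succeq 0</script> and <script type="math/tex">\Sigma^{\mathsf{T}} = \Sigma</script>.</p>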
<p>These functions satisfy</p>
<div>
\begin{align}
m(\mathbf{x}) &= \mathrm{E}[f(\mathbf{x})] \\
k(\mathbf{x}, \mathbf{x}') &= \mathrm{cov}[f(\mathbf{x}), f(\mathbf{x}')].
\end{align}
</div>
<p>Evaluating <script type="math/tex">f</script> at a finite set of points <script type="math/tex">\mathbf{x}_1, \dots, \mathbf{x}_N</script> gives a Gaussian distribution</p>
<div>
\begin{equation}
\mathbf{f} \sim \mathcal{N}(\mathbf{m}, \Sigma)
\end{equation}
</div>
<p>where <script type="math/tex">\mathbf{f}_i = f(\mathbf{x}_i)</script>, <script type="math/tex">\mathbf{m}_i = m(\mathbf{x}_i)</script> and <script type="math/tex">\Sigma_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)</script>. As such, we can think of a Gaussian process (GP) as providing a distribution over functions via the evaluation functional.</p>
<p>Note that we have specified the GP in terms of a function that generates a covariance matrix, rather than a precision matrix. This is because the GP needs to satisfy the same marginalisation properties as the Gaussian distribution.</p>
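<p>As a rough illustration of the definition, the following sketch draws the values of a GP at a finite set of scalar inputs by building the covariance matrix and multiplying its Cholesky factor by standard normal noise. The RBF kernel, the small <code>jitter</code> added for numerical stability and all function names are assumptions of this sketch; the Cholesky routine is hand-written only to keep the example free of dependencies.</p>

```python
import math
import random


def rbf(x1, x2, length_scale=1.0):
    """RBF kernel on scalar inputs (an assumed choice of kernel)."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))


def gram(kernel, xs):
    """Covariance matrix Sigma_ij = k(x_i, x_j)."""
    return [[kernel(a, b) for b in xs] for a in xs]


def cholesky(A, jitter=1e-9):
    """Lower-triangular L with L L^T = A + jitter * I, for symmetric PSD A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] + jitter - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def sample_gp_prior(xs, mean_fn, kernel, rng):
    """Draw f = m + L z with z ~ N(0, I), so that f ~ N(m, Sigma)."""
    L = cholesky(gram(kernel, xs))
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    return [
        mean_fn(x) + sum(L[i][k] * z[k] for k in range(i + 1))
        for i, x in enumerate(xs)
    ]
```

<p>Calling <code>sample_gp_prior(xs, lambda x: 0.0, rbf, random.Random(0))</code> returns one function draw evaluated at the points <code>xs</code>; repeated calls with the same generator give further independent draws.</p>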
<h2 id="the-big-picture">The Big Picture</h2>
<p>The idea is that a GP gives a way of specifying a prior distribution over functions (in the Bayesian sense), by choosing the mean and covariance functions.</p>
<p>Then, using some training data, we can calculate a posterior distribution over functions for our regression problem. This allows us to make predictions and gives confidence intervals in a natural way.</p>
<p>Next we will talk about specifying the prior mean and covariance functions, then we will discuss how to make predictions using GPs.</p>
<h2 id="the-mean-function">The Mean Function</h2>
<p>As we saw above, the mean function specifies the mean of the draws from the GP.</p>
<p>After this section, we will assume that our GP is specified to have prior mean of zero. There are a couple of reasons for this.</p>
<p>First, if we wanted to model using a deterministic mean function, we could always just model <script type="math/tex">y - m = f - m + \varepsilon</script> and put the mean back in after prediction.</p>
<p>Second, the mean can be marginalised out. That is</p>
<div>
\begin{align}
f|a &\sim \mathcal{GP}(a\mu(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')) \quad\textrm{with}\quad a \sim \mathcal{N}(0, 1) \\
\implies\quad f &\sim \mathcal{GP}(0, \mu(\mathbf{x})\mu(\mathbf{x}') + k(\mathbf{x}, \mathbf{x}')).
\end{align}
</div>
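<p>To see where this comes from, one can apply the laws of total expectation and total covariance, using <script type="math/tex">\mathrm{E}[a] = 0</script> and <script type="math/tex">\mathrm{E}[a^2] = 1</script>. This is only a short sketch; the marginal is Gaussian because <script type="math/tex">f</script> is the sum of a zero-mean GP and the independent Gaussian term <script type="math/tex">a\mu</script>.</p>
<div>
\begin{align}
\mathrm{E}[f(\mathbf{x})] &= \mathrm{E}\big[\mathrm{E}[f(\mathbf{x}) | a]\big] = \mathrm{E}[a]\,\mu(\mathbf{x}) = 0 \\
\mathrm{cov}[f(\mathbf{x}), f(\mathbf{x}')] &= \mathrm{E}\big[\mathrm{cov}[f(\mathbf{x}), f(\mathbf{x}') | a]\big] + \mathrm{cov}\big[\mathrm{E}[f(\mathbf{x}) | a], \mathrm{E}[f(\mathbf{x}') | a]\big] \\
&= k(\mathbf{x}, \mathbf{x}') + \mathrm{E}[a^2]\,\mu(\mathbf{x})\mu(\mathbf{x}').
\end{align}
</div>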
<p>If you use a local kernel (one that decays quickly to zero as the covariates move apart in <script type="math/tex">\mathbb{R}^n</script>), then away from the training data you will predict the prior mean of your GP. This can be a downside of using a trivial prior mean <script type="math/tex">m = 0</script>, since you lose structure in your predictions away from the training data.</p>
<h2 id="the-covariance-kernel">The Covariance Kernel</h2>
<p>When drawing from a GP it is the kernel that determines how these random variables are correlated. The covariance kernel is what gives structure to the GP and the majority of modelling effort will generally go into choosing the kernel.</p>
<h4 id="definition-1">Definition</h4>
<p>The kernel is the key determinant of the structure of the GP. As such, this choice encodes your assumptions about the data generating process. Below we will list a few of the common choices for kernels, but you are by all means allowed to define your own kernel. The only requirement is that your function generates a valid covariance matrix.</p>
<blockquote>
<p>A function <script type="math/tex">k(x, x')</script> is a kernel if it is symmetric in its arguments and is positive semi-definite</p>
</blockquote>
<p>By positive semi-definite we mean that for all <script type="math/tex">f</script> in some space of square-integrable functions we have</p>
<div>
\begin{equation}
\int k(x, x') f(x)f(x') \, \mathrm{d}\mu(x) \mathrm{d}\mu(x') \geq 0.
\end{equation}
</div>
<h4 id="isotropy-and-stationarity">Isotropy and Stationarity</h4>
<ul>
<li>
<p>We say that a kernel is isotropic if it is a function of <script type="math/tex">\|\mathbf{x} - \mathbf{x}'\|</script>. That is, the kernel is invariant with respect to rotations.</p>
</li>
<li>
<p>We say that a kernel is stationary if it is a function of <script type="math/tex">\mathbf{x} - \mathbf{x}'</script>. That is, the kernel is invariant with respect to translations.</p>
</li>
</ul>
<p>You may have already heard of stationarity in the context of stochastic processes. A GP is weakly stationary if its mean function is constant and its kernel is stationary in the sense defined above. It is strictly stationary if all of its finite dimensional distributions are invariant to translation.</p>
<h4 id="examples-of-kernels">Examples of Kernels</h4>
<p>We list a few common examples of kernels. For more information see <a href="http://www.gaussianprocess.org/gpml/"><em>Gaussian Processes for Machine Learning</em></a>.</p>
<ul>
<li>
<p>Radial Basis Function (RBF): <script type="math/tex">k(\mathbf{x}, \mathbf{x'}) = \mathrm{exp}\left(- \frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2 l^2}\right)</script></p>
</li>
<li>
<p>Ornstein-Uhlenbeck (OU): <script type="math/tex">k(\mathbf{x}, \mathbf{x'}) = \mathrm{exp}\left(- \frac{\|\mathbf{x} - \mathbf{x}'\|}{l}\right)</script></p>
</li>
<li>
<p>Periodic (Per): <script type="math/tex">k(\mathbf{x}, \mathbf{x'}) = \mathrm{exp}\left(- \frac{2}{l^2} \mathrm{sin}^2\left(\frac12\|\mathbf{x} - \mathbf{x}'\| \right)\right)</script></p>
</li>
<li>
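<p>Rational Quadratic (RQ): <script type="math/tex">k(\mathbf{x}, \mathbf{x'}) = \left(1 + \frac{r^2}{2\alpha l^2} \right) ^ {-\alpha}</script> with <script type="math/tex">r = \|\mathbf{x} - \mathbf{x}'\|</script></p>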
</li>
</ul>
<p>Note that all of these examples are stationary. The RBF and Per kernels both give rise to processes that are infinitely differentiable (in the mean square sense).</p>
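<p>For concreteness, the four kernels can be written as plain Python functions of the distance <script type="math/tex">r = \|\mathbf{x} - \mathbf{x}'\|</script>. This is a sketch rather than a reference implementation: the function names and default hyperparameter values are my own, and the periodic kernel uses the standard form <script type="math/tex">\exp(-2\sin^2(r/2)/l^2)</script>.</p>

```python
import math


def distance(x1, x2):
    """Euclidean distance r = ||x1 - x2|| between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def rbf(x1, x2, l=1.0):
    """Radial basis function kernel with length scale l."""
    return math.exp(-distance(x1, x2) ** 2 / (2.0 * l ** 2))


def ornstein_uhlenbeck(x1, x2, l=1.0):
    """Ornstein-Uhlenbeck kernel with length scale l."""
    return math.exp(-distance(x1, x2) / l)


def periodic(x1, x2, l=1.0):
    """Periodic kernel; repeats every 2 * pi in the distance r."""
    return math.exp(-2.0 * math.sin(0.5 * distance(x1, x2)) ** 2 / l ** 2)


def rational_quadratic(x1, x2, l=1.0, alpha=1.0):
    """Rational quadratic kernel; approaches the RBF kernel as alpha grows."""
    r = distance(x1, x2)
    return (1.0 + r ** 2 / (2.0 * alpha * l ** 2)) ** (-alpha)
```

<p>Each function depends on its inputs only through <code>distance(x1, x2)</code>, which is the stationarity (in fact isotropy) noted above.</p>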
<p><img src="/images/Gaussian-Process-Regression/PriorDraws.png" alt="kernel_examples" /></p>
<p>You may have noticed that there are additional parameters in these kernel functions. These are known as hyperparameters and they form part of the model selection problem for GPs. We will talk about them later.</p>
<h4 id="kernel-algebra">Kernel Algebra</h4>
<p>It is straightforward to show that the sum of two kernels is also a kernel and the product of two kernels is again a kernel. This means that it is possible to build structure in your models hierarchically by composing structure from various kernels. For more on this, see the first few chapters of <a href="https://www.cs.toronto.edu/~duvenaud/thesis.pdf">David Duvenaud’s PhD thesis</a>.</p>
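<p>A small sketch of this kind of composition, on scalar inputs, might look as follows. The helper names, the two base kernels and the composite kernels at the bottom are illustrative assumptions; <code>quadratic_form</code> evaluates the finite-sample version of the positive semi-definiteness condition for a given coefficient vector.</p>

```python
import math


def rbf(x1, x2, l=1.0):
    """RBF kernel on scalar inputs."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * l ** 2))


def periodic(x1, x2, l=1.0):
    """Periodic kernel on scalar inputs, period 2 * pi."""
    return math.exp(-2.0 * math.sin(0.5 * abs(x1 - x2)) ** 2 / l ** 2)


def kernel_sum(k1, k2):
    """Pointwise sum of two kernels, itself a kernel."""
    return lambda x1, x2: k1(x1, x2) + k2(x1, x2)


def kernel_product(k1, k2):
    """Pointwise product of two kernels, itself a kernel."""
    return lambda x1, x2: k1(x1, x2) * k2(x1, x2)


def quadratic_form(kernel, xs, c):
    """sum_ij c_i c_j k(x_i, x_j), which is non-negative for every c if k is a kernel."""
    return sum(
        ci * kernel(xi, xj) * cj
        for ci, xi in zip(c, xs)
        for cj, xj in zip(c, xs)
    )


# Periodic structure whose shape is allowed to drift slowly.
locally_periodic = kernel_product(rbf, periodic)
# A smooth trend plus a periodic component.
trend_plus_seasonal = kernel_sum(rbf, periodic)
```

<p>Because the helpers return ordinary functions, composites can be nested freely, for example <code>kernel_sum(locally_periodic, rbf)</code>.</p>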
<h2 id="regression">Regression</h2>
<p>We will now explore how, given some training data <script type="math/tex">\mathcal{D} = (X, \mathbf{y})</script>, we can make predictions at test points <script type="math/tex">X_*</script> for the problem defined at the beginning</p>
<div>
\begin{equation}
y = f(\mathbf{x}) + \varepsilon \quad \mathrm{with} \quad \varepsilon \sim \mathcal{N}(0, \sigma^2).
\end{equation}
</div>
<p>We will follow the convention of Rasmussen and Williams and stack our training <script type="math/tex">\mathbf{x}</script> <em>horizontally</em> so that each column of <script type="math/tex">X</script> is a training data point. Note that this is the transpose of how you will often see the design matrix!</p>
<p>Denote the stacked predicted values for <script type="math/tex">f</script> as <script type="math/tex">\mathbf{f}_*</script>, then by the marginalisation property we have the following joint distribution of the training and test data</p>
<div>
\begin{align}
\begin{bmatrix}
\mathbf{y} \\
\mathbf{f}_* \\
\end{bmatrix}
& \sim \mathcal{N}\left(0,
\begin{bmatrix}
K(X, X) + \sigma^2 I & K(X, X_*)\\
K(X_*, X) & K(X_*, X_*)
\end{bmatrix}
\right).
\end{align}
</div>
<p>Now all we need to do to get the posterior predictive distribution for <script type="math/tex">\mathbf{f}_*</script> is to use the Gaussian conditioning formula to arrive at</p>
<div>
\begin{equation}
\mathbf{f}_* | \mathbf{y}, X, X_*
\sim
\mathcal{N} \left(
K(X_*, X)\left[K(X, X) + \sigma^2 I\right]^{-1}\mathbf{y}, \,\,
K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma^2 I\right]^{-1}K(X, X_*)
\right).
\end{equation}
</div>
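<p>For reference, the conditioning formula used here can be stated as follows, where the block names <script type="math/tex">A, B, C</script> are just labels for this note. For jointly Gaussian vectors with zero mean</p>
<div>
\begin{equation}
\begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix}
\sim \mathcal{N}\left(0, \begin{bmatrix} A & C \\ C^{\mathsf{T}} & B \end{bmatrix}\right)
\quad\implies\quad
\mathbf{b} | \mathbf{a} \sim \mathcal{N}\left(C^{\mathsf{T}} A^{-1} \mathbf{a}, \,\, B - C^{\mathsf{T}} A^{-1} C\right).
\end{equation}
</div>
<p>The predictive distribution above is the case <script type="math/tex">\mathbf{a} = \mathbf{y}</script>, <script type="math/tex">\mathbf{b} = \mathbf{f}_*</script>, <script type="math/tex">A = K(X, X) + \sigma^2 I</script>, <script type="math/tex">C = K(X, X_*)</script> and <script type="math/tex">B = K(X_*, X_*)</script>.</p>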
<p>In the Bayesian formulation we choose our predictions to minimize the expected value of some loss function, with the expectation taken against the distribution of the prediction points. Typically, one would choose a squared error loss function, which would result in predicting the mean of this distribution</p>
<div>
\begin{equation}
\mathbf{y}_* = K(X_*, X)\left[K(X, X) + \sigma^2 I\right]^{-1}\mathbf{y}.
\end{equation}
</div>
<p>The final point to note here is that the prediction at a single test point is a linear combination of kernel evaluations at the test point and on the training set, reminiscent of the representer theorem.</p>
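<p>The predictive equations can be sketched in a few lines of dependency-free Python. This is a simplified illustration rather than a production implementation: inputs are scalars held in plain lists, the kernel is assumed to be RBF, the prior mean is zero, and the linear systems are solved with a hand-written Cholesky factorisation instead of an explicit inverse. All function names are hypothetical.</p>

```python
import math


def rbf(x1, x2, l=1.0):
    """RBF kernel on scalar inputs (an assumed choice of kernel)."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * l ** 2))


def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L


def cho_solve(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(x_train, y_train, x_test, kernel, noise_var):
    """Posterior mean and variance of f_* at each test point (zero prior mean)."""
    Ky = [
        [kernel(a, b) + (noise_var if i == j else 0.0) for j, b in enumerate(x_train)]
        for i, a in enumerate(x_train)
    ]
    L = cholesky(Ky)
    alpha = cho_solve(L, y_train)  # [K(X, X) + sigma^2 I]^{-1} y
    means, variances = [], []
    for xs in x_test:
        k_star = [kernel(xs, a) for a in x_train]  # K(x_*, X)
        v = cho_solve(L, k_star)  # [K(X, X) + sigma^2 I]^{-1} K(X, x_*)
        means.append(sum(ks * al for ks, al in zip(k_star, alpha)))
        variances.append(kernel(xs, xs) - sum(ks * vi for ks, vi in zip(k_star, v)))
    return means, variances
```

<p>The vector <code>alpha</code> does not depend on the test point, so each predictive mean is the weighted sum of kernel evaluations described above. Far from the training data the mean returns to zero and the variance to the prior value <code>kernel(xs, xs)</code>.</p>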
<h2 id="model-selection">Model Selection</h2>
<h4 id="hyperparameters">Hyperparameters</h4>
<p>As stated above, kernel covariance functions often come in families parameterised by some vector of hyperparameters <script type="math/tex">\mathbf{\theta}</script></p>
<div>
\begin{equation}
k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}'; \mathbf{\theta}).
\end{equation}
</div>
<p>In the case of the RBF kernel, this was the length scale <script type="math/tex">l</script>. As we will see, the choice of hyperparameters forms a key part of the model selection process for GPs.</p>
<h4 id="bayesian-model-selection">Bayesian Model Selection</h4>
<p>Given some class of hypotheses for our problem <script type="math/tex">\{\mathcal{H}_i\}</script>, we find our distribution for <script type="math/tex">\mathbf{y}</script> as</p>
<div>
\begin{equation}
\mathsf{P}(\mathbf{y} | X, \mathbf{\theta}, \mathcal{H}_i)
= \int \mathsf{P}(\mathbf{y} | X, \mathbf{\theta}, \mathcal{H}_i, \mathbf{f}) \mathsf{P}(\mathbf{f} | \mathbf{\theta}, \mathcal{H}_i) \, \mathrm{d}\mathbf{f}
\end{equation}
</div>
<p>where we have integrated over the function values <script type="math/tex">\mathbf{f}</script> according to the measure given by the GP. This term is known as the marginal likelihood or the model evidence.</p>
<p>Ideally, we would proceed by first integrating out the hyperparameters</p>
<div>
\begin{equation}
\mathsf{P}(\mathbf{y} | X, \mathcal{H}_i)
= \int \mathsf{P}(\mathbf{y} | X, \mathbf{\theta}, \mathcal{H}_i) \mathsf{P}(\mathbf{\theta} | \mathcal{H}_i) \, \mathrm{d}\mathbf{\theta}
\end{equation}
</div>
<p>and then find the distribution for each hypothesis using Bayes’ rule</p>
<div>
\begin{equation}
\mathsf{P}(\mathcal{H}_i | \mathbf{y}, X)
= \frac{\mathsf{P}(\mathbf{y} | X, \mathcal{H}_i) \mathsf{P}(\mathcal{H}_i)}{\mathsf{P}(\mathbf{y} | X)}
\end{equation}
</div>
<p>with <script type="math/tex">\mathsf{P}(\mathbf{y}|X) = \sum_i \mathsf{P}(\mathbf{y} | X, \mathcal{H}_i)\mathsf{P}(\mathcal{H}_i)</script>.</p>
<p>Unfortunately, the integral over the hyperparameters is often intractable. In general, people sidestep this issue by approximating the integral by Laplace’s method. In turn, this means maximising the log marginal likelihood with respect to the hyperparameters</p>
<div>
\begin{equation}
l(\mathbf{\theta}) = \mathrm{log}(\mathsf{P}(\mathbf{y} | X, \mathbf{\theta}, \mathcal{H}_i)).
\end{equation}
</div>
<p>One would then use <script type="math/tex">\theta^* = \underset{\mathbf{\theta}}{\mathrm{argmax}} \, l</script> in the steps that follow.</p>
<p>Finally, one should note that at the training points</p>
<div>
\begin{equation}
\mathbf{y}| X, \mathbf{\theta}, \mathcal{H}_i \sim \mathcal{N}(0, K(X, X) + \sigma^2 I)
\end{equation}
</div>
<p>so you don’t need to do any fiddling to find <script type="math/tex">l(\mathbf{\theta})</script>.</p>
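<p>Concretely, writing <script type="math/tex">K_y = K(X, X) + \sigma^2 I</script> and <script type="math/tex">N</script> for the number of training points, the Gaussian density gives <script type="math/tex">l(\mathbf{\theta}) = -\frac12 \mathbf{y}^{\mathsf{T}} K_y^{-1} \mathbf{y} - \frac12 \log |K_y| - \frac{N}{2} \log 2\pi</script>. The sketch below evaluates this for an RBF kernel on scalar inputs and picks a length scale by grid search. The grid search stands in for a proper optimiser, and the function names and the restriction to a single hyperparameter are assumptions made to keep the example short.</p>

```python
import math


def rbf(x1, x2, l):
    """RBF kernel on scalar inputs with length scale l."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * l ** 2))


def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L


def log_marginal_likelihood(x_train, y_train, length_scale, noise_var):
    """log N(y | 0, K(X, X) + noise_var * I) for the RBF kernel."""
    n = len(x_train)
    Ky = [
        [rbf(a, b, length_scale) + (noise_var if i == j else 0.0)
         for j, b in enumerate(x_train)]
        for i, a in enumerate(x_train)
    ]
    L = cholesky(Ky)
    # Solve L z = y, so that y^T Ky^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((y_train[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(zi * zi for zi in z)
    # log |Ky| = 2 * sum_i log L_ii.
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)


def select_length_scale(x_train, y_train, candidates, noise_var):
    """Grid search: the candidate length scale with the largest log marginal likelihood."""
    return max(
        candidates,
        key=lambda l: log_marginal_likelihood(x_train, y_train, l, noise_var),
    )
```

<p>The quadratic term rewards fitting the data and the log-determinant term penalises overly flexible kernels, so the maximiser trades the two off without a separate validation set.</p>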
<!--
## An Example Application
## Where to look next
- Duvenaud paper
- Duvenaud's thesis
- GPML book
- etc.
-->
<h2 id="acknowledgements">Acknowledgements</h2>
<p>These notes borrow heavily from the bible on GPs, <em>Gaussian Processes for Machine Learning</em> by Rasmussen and Williams, see <a href="http://www.gaussianprocess.org/gpml/">here</a>.</p>
<p>Any errors are mine. (Obviously.)</p>