Why is Training Error Smaller than Test Error?

Written on December 24, 2020

Tags:   statistical-ml   ols   regression   iid-assumption   optimism   model-selection

Introduction

The statement should be intuitive: a model fitted on a specific set of (training) data is expected to perform better on that data than on another set of (test) data. Proving the statement is nevertheless useful because (i) it reminds us of the assumptions we employ and the randomness we incur during model fitting, and (ii) it introduces the concept of optimism (i.e. the difference between the expected test error and the expected training error assuming fixed inputs). We will first mathematically formulate the problem, and then prove the statement in two ways, each corresponding to one of these two points.

Formulation

$S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ is the set of training samples and $\tilde{S} = \{(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_m, \tilde{y}_m)\}$ is the set of test samples, both taking values in $\mathbb{R}^d \times \mathbb{R}$. We are assuming that all data points in the two sets are i.i.d.; $S \sim D$ and $\tilde{S} \sim D$ where $D$ is the base distribution from which data is generated. $\DeclareMathOperator{\argmax}{arg\,max} \DeclareMathOperator{\argmin}{arg\,min}$

Let $\hat{w}$ denote the squared training error minimizing parameters of a linear regression model such that

\[\hat{w} = \underset{w \in \mathbb{R}^{d}}{\argmin} \: R_S(w) = \underset{w \in \mathbb{R}^{d}}{\argmin}\sum_{i=1}^{n} (y_i - w^\top x_i)^2\]

The ordinary least squares (OLS) method estimates $\hat{w}$ as

\[\hat{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top y\]

where $\mathbf{X}$ is the $n \times d$ matrix with $i$th row equal to $x_i$, and $y$ is the $n$-dimensional vector with $i$th component equal to $y_i$.
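For concreteness, here is a minimal NumPy sketch of this closed-form estimate on synthetic data. The data-generating process (Gaussian features, a linear signal plus noise) and all variable names are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 5                                   # training sample size and number of features
X = rng.normal(size=(n, d))                     # design matrix X, i-th row is x_i
w_true = rng.normal(size=d)                     # arbitrary "true" parameters (illustrative)
y = X @ w_true + rng.normal(scale=0.5, size=n)  # noisy outcomes y_i

# OLS estimate w_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

train_mse = np.mean((y - X @ w_hat) ** 2)
print(w_hat)
print(train_mse)
```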

A Note on Notation: The parameters of a linear regression model are usually denoted by $\beta$ in the statistics literature. Here we use $w$ instead, which is the notation used for the weights of a linear classifier in the machine learning literature. $ \DeclareMathOperator{\EX}{\mathbf{E}} % expected value $

We want to prove that the expected training error is always less than or equal to the expected test error, i.e. the following statement:

\[\EX_{S \sim D}{\left[\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{w}^\top x_i)^2 \right]} \leq \EX_{S \sim D,\: \tilde{S} \sim D}{\left[ \frac{1}{m} \sum_{i=1}^{m} (\tilde{y}_i - \hat{w}^\top \tilde{x}_i)^2 \right]}\]

The expectations are over all that is random: the training set for the left hand side expectation, and both the training set and the test set for the right hand side expectation.
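As a sanity check rather than a proof, the inequality can also be observed empirically. The sketch below assumes a simple synthetic base distribution $D$ (Gaussian inputs, linear signal plus Gaussian noise) and averages the training and test MSE of the OLS fit over many independent draws of $S$ and $\tilde{S}$; all names and constants are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d, trials = 30, 30, 5, 2000
w_true = rng.normal(size=d)

def sample_set(k):
    """Draw k i.i.d. points from the same assumed base distribution D."""
    X = rng.normal(size=(k, d))
    y = X @ w_true + rng.normal(size=k)
    return X, y

train_errs, test_errs = [], []
for _ in range(trials):
    X, y = sample_set(n)                            # training set S
    X_test, y_test = sample_set(m)                  # test set S~
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)       # fit on the training set only
    train_errs.append(np.mean((y - X @ w_hat) ** 2))
    test_errs.append(np.mean((y_test - X_test @ w_hat) ** 2))

# The averaged training MSE should come out below the averaged test MSE.
print(np.mean(train_errs), np.mean(test_errs))
```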

(Optional) Lifting

Oftentimes we use the lifted $n \times (d + 1)$ matrix \(\mathbf{X}^{'}\) and the $(d+1)$-dimensional vectors \(x_i^{'}\) instead of the $\mathbf{X}$ and $x_i$ denoted above. \(\mathbf{X}^{'}\) has $d$ columns corresponding to the features of the input data, plus an extra column filled with $1$s. This yields least squares parameters \(\hat{w}^{'} \in \mathbb{R}^{d+1}\) where $d$ dimensions are the coefficients and the extra dimension is the bias (intercept) term of the linear regression model. The addition of the extra dimension is known as lifting, and it transforms the data points into homogeneous coordinates. This is a concept used in other models and problems as well. However, it can be omitted here as it has no effect on the formulation or the symbolic solution of this problem.
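In code, lifting is just the addition of a constant column; a minimal NumPy sketch with made-up numbers:

```python
import numpy as np

X = np.array([[1.5, 2.0],
              [0.3, 4.1],
              [2.2, 0.7]])    # a made-up 3 x 2 design matrix (n = 3, d = 2)

# Lift to homogeneous coordinates by appending a column of ones;
# the fitted w' in R^{d+1} then carries the bias term in its last entry.
X_lifted = np.hstack([X, np.ones((X.shape[0], 1))])
print(X_lifted.shape)          # (3, 3)
```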

(Optional) Derivation of Ordinary Least Squares (OLS)

$R_S(w)$ is the error (i.e. loss) function as denoted previously, and it computes the residual sum of squares (RSS) on the training set $S$ where the residual for the $i$th observation is given by $y_i - w^\top x_i$. We can (i) rewrite it using the $\ell_2$-norm, (ii) differentiate it with respect to the parameters $w$, and (iii) set the first derivative to $0$ in order to find the ordinary least squares (OLS) estimate: $ \DeclareMathOperator{\d}{\mathrm{d} \!} $ \(\begin{align*} R_S(w) &= \sum_{i=1}^{n} (y_i - w^\top x_i)^2 = \Vert y - \mathbf{X} w \Vert_2^2 \\ &= (y- \mathbf{X} w)^\top (y - \mathbf{X} w) \quad \text{by } \ell_2\text{-norm}: \Vert x \Vert_2 = \sqrt{x_1^2 + \cdots + x_n^2} = \sqrt{x^\top x} \\ &= (y^\top - w^\top \mathbf{X}^\top) (y - \mathbf{X} w) \\ &= y^\top y - w^\top \mathbf{X}^\top y - y^\top \mathbf{X}w + w^\top \mathbf{X}^\top \mathbf{X} w \\ &= y^\top y - 2 w^\top \mathbf{X}^\top y + w^\top \mathbf{X}^\top \mathbf{X} w \\ & \quad \: \text{by } y^\top \mathbf{X}w = (y^\top \mathbf{X}w)^\top = w^\top \mathbf{X}^\top y \text{ because it is a ($1 \times 1$) scalar} \\ \newline \frac{\d R_S(w)}{\d w} &= \frac{\d \: \big(y^\top y - 2 w^\top \mathbf{X}^\top y + w^\top \mathbf{X}^\top \mathbf{X} w\big)}{\d w} \\ &= -2 \mathbf{X}^\top y + \big(\mathbf{X}^\top \mathbf{X} + (\mathbf{X}^\top \mathbf{X})^\top \big) w \\ &= -2 \mathbf{X}^\top y + 2 \mathbf{X}^\top \mathbf{X} w \quad \text{by } (\mathbf{X}^\top \mathbf{X})^\top = \mathbf{X}^\top \mathbf{X} \\ \newline \frac{\d R_S(w)}{\d w} \bigg\rvert_{w = \hat{w}} = 0 &\implies \mathbf{X}^\top \mathbf{X} \hat{w} = \mathbf{X}^\top y \\ &\implies (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{X} \hat{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top y \\ & \qquad \: \: \text{assuming $n \geq d$ and rank of $\mathbf{X}$ is $d$ $\therefore$ $\mathbf{X}^\top \mathbf{X}$ is invertible}\\ &\implies \hat{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top y \end{align*}\)
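As a quick numerical check of the closed form, the sketch below (on assumed synthetic data) verifies that the normal-equation solution agrees with NumPy's `np.linalg.lstsq` solver and that the derivative above vanishes at $\hat{w}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations from the derivation
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # library least-squares solver

gradient = -2 * X.T @ y + 2 * X.T @ X @ w_closed   # dR_S/dw evaluated at w_hat

print(np.allclose(w_closed, w_lstsq))              # True
print(np.allclose(gradient, 0.0, atol=1e-8))       # True up to numerical error
```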

Proof 1: Randomness and the i.i.d. Assumption

The expected test error is the same whether we have $m$ test points or just $1$ test point:

\[\begin{align*} (1) \quad & \EX_{S \sim D,\: \tilde{S} \sim D}{\left[ \frac{1}{m} \sum_{i=1}^{m} (\tilde{y}_i - \hat{w}^\top \tilde{x}_i)^2 \right]} = \EX_{S \sim D,\: \tilde{S} \sim D}{\left[(\tilde{y}_1 - \hat{w}^\top \tilde{x}_1)^2 \right]} \\ & \text{by linearity of expectation and the identical distribution of the test points: } \EX\left[\frac{1}{k} \sum_{i=1}^{k} z_i \right] = \frac{1}{k} \sum_{i=1}^{k} \EX[z_i] = \EX[z_1] \end{align*}\]

This means that it wouldn’t matter if we had $n$ test points as well, and hence we can assume that the train and test sets have the same number of samples, $m = n$, for the rest of the proof without loss of generality.

Let’s consider the two random variables, $A$ and $B$, which respectively correspond to the mean squared error (MSE) on the training set and on the test set:

\[\begin{align*} A &= \overline{R_S(\hat{w})} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{w}^\top x_i)^2 \\ B &= \overline{R_{\tilde{S}}(\tilde{w})} = \frac{1}{n} \sum_{i=1}^{n} (\tilde{y}_i - \tilde{w}^\top \tilde{x}_i)^2 \\ \end{align*}\]

where $\hat{w}$ is the same squared-training-error-minimizing parameter vector from before, and $\tilde{w}$ is its analogue on the test set, i.e. the squared-test-error-minimizing parameter vector.

We can compare the expectations of the two random variables with our prior knowledge:

\[\begin{align*} \EX_{S \sim D}[A] &= \EX_{S \sim D}{\left[(y_1 - \hat{w}^\top x_1)^2 \right]} \quad \text{by } (1) \\ \EX_{\tilde{S} \sim D}[B] &= \EX_{\tilde{S} \sim D}{\left[(\tilde{y}_1 - \tilde{w}^\top \tilde{x}_1)^2 \right]} \quad \text{by } (1) \\ (2) \quad \EX_{S \sim D}[A] &= \EX_{\tilde{S} \sim D}[B] \quad \text{by the i.i.d. assumption} \\ \end{align*}\]

From the definition of the ordinary least squares estimate (i.e. that it minimizes the residual sum of squares), we know that the parameters $\hat{w}$ and $\tilde{w}$ yield errors less than or equal to the errors we would have gotten from any other parameter vector $w \in \mathbb{R}^d$ on their respective sets. We can thus deduce:

\[\begin{align*} B &\leq \frac{1}{n} \sum_{i=1}^{n} (\tilde{y}_i - \hat{w}^\top \tilde{x}_i)^2 \quad \text{by the minimality of } \tilde{w}, \text{ taking } w = \hat{w} \text{ as the competitor} \\ &\implies \EX_{\tilde{S} \sim D}[B] \leq \EX_{S \sim D,\: \tilde{S} \sim D}{\left[ \frac{1}{n} \sum_{i=1}^{n} (\tilde{y}_i - \hat{w}^\top \tilde{x}_i)^2 \right]} \quad \text{since } B \text{ does not depend on } S \\ &\implies \EX_{S \sim D}[A] \leq \EX_{S \sim D,\: \tilde{S} \sim D}{\left[ \frac{1}{n} \sum_{i=1}^{n} (\tilde{y}_i - \hat{w}^\top \tilde{x}_i)^2 \right]} \quad \text{by } (2) \\ &\therefore \EX_{S \sim D}{\left[\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{w}^\top x_i)^2 \right]} \leq \EX_{S \sim D,\: \tilde{S} \sim D}{\left[ \frac{1}{n} \sum_{i=1}^{n} (\tilde{y}_i - \hat{w}^\top \tilde{x}_i)^2 \right]} \\ & \tag*{Q.E.D.} \\ \end{align*}\]
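The key step, that $\tilde{w}$ cannot do worse than $\hat{w}$ on the test set, can also be seen directly in a small simulation; the data-generating process below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 40, 6
w_true = rng.normal(size=d)

def sample_set(k):
    X = rng.normal(size=(k, d))
    return X, X @ w_true + rng.normal(size=k)

X, y = sample_set(n)                                    # training set S
X_test, y_test = sample_set(n)                          # test set S~ (same size, WLOG)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)               # minimizes the training RSS
w_tilde = np.linalg.solve(X_test.T @ X_test, X_test.T @ y_test)  # minimizes the test RSS

B = np.mean((y_test - X_test @ w_tilde) ** 2)           # test MSE under w_tilde
test_mse_w_hat = np.mean((y_test - X_test @ w_hat) ** 2)

print(B <= test_mse_w_hat)                              # True by the minimality of w_tilde
```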

Proof 2: Optimism

Optimism, or the optimism bias, is defined as the difference between the in-sample error and the training error. The in-sample error is the error observed when the model fitted on the original training data ($x_i$, $y_i$) is used to predict new outcome values \(y^{'}_i\) sampled at each of the original training inputs $x_i$. Therefore, if we assume fixed inputs for simplicity, the expected optimism becomes the difference between the expected test error and the expected training error. Because of this extra assumption, the first proof is the more general one.

We can adopt the previously defined random variable $A$ for the training error, and define the in-sample error with a new variable $C$

\[C = \overline{R_{S^{'}}(\hat{w})} = \frac{1}{n} \sum_{i=1}^{n} (y^{'}_i - \hat{w}^\top x_i)^2 \\\]

where \(S^{'} = \{(x_1, y^{'}_1), \ldots, (x_n, y^{'}_n)\}\) is the in-sample set as defined above, and $\hat{w}$ is the same squared-training-error-minimizing parameter vector from before (fitted on $S$, not on $S^{'}$).

Optimism is then mathematically defined as

\[(3) \quad \text{optimism} = C - A\]

Average optimism is the expectation of the $\text{optimism}$ over the sampling of the outcome values (with the training inputs held fixed) and is defined as $ \DeclareMathOperator{\Cov}{\mathrm{Cov}} % covariance $

\[\omega = \EX_{y, \: y^{'}}{[\text{optimism}]}\]

where training inputs $x_i$ are fixed, and the randomness comes from the sampling of the outcome values $y_i$ and \(y^{'}_i\) as opposed to the entire sets (i.e. \(S \sim D, \: S^{'} \sim D\)).

We can think of the outcome values as being drawn according to the following observed relationships $ \DeclareMathOperator{\Var}{\mathrm{Var}} % variance $

\[\begin{align*} y_i &= f(x_i) + \epsilon_i \\ y^{'}_i &= f(x_i) + \epsilon^{'}_i \\ \end{align*}\]

where $f$ is the true (unknown) population relationship, and $\epsilon_i$ and \(\epsilon^{'}_i\) are the intrinsic noises associated with the observed input $x_i$.

$(4)$ We can further assume that the noises $\epsilon_i$ and \(\epsilon^{'}_i\) are i.i.d. with mean \(0\) and variance \(\sigma_{\epsilon}^{2}\); consequently, the outcome values \(y_i\) and \(y^{'}_i\) are independent and identically distributed with mean \(f(x_i)\) and variance \(\sigma_{\epsilon}^{2}\).

Let’s now derive a simplified expression for $\omega$: $\newcommand{\indep}{\perp \!\!\! \perp} % independence $

\[\begin{align*} \omega &= \EX_{y, \: y^{'}}{[\text{optimism}]} \\ &= \EX_{y, \: y^{'}}{[C - A]} \quad \text{by } (3) \\ &= \EX_{y^{'}}{[C]} - \EX_{y}{[A]} \quad \text{by linearity of expectation} \\ &= \EX{\left[\frac{1}{n} \sum_{i=1}^{n} (y^{'}_i - \hat{w}^\top x_i)^2\right]} - \EX{\left[\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{w}^\top x_i)^2\right]} \\ &= \EX{\left[\frac{1}{n} \sum_{i=1}^{n} (y^{'}_i - \hat{y}_i)^2\right]} - \EX{\left[\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\right]} \quad \text{by } \hat{w}^\top x_i = \hat{y}_i \\ &= \frac{1}{n} \sum_{i=1}^{n} \left( \EX{\left[(y^{'}_i - \hat{y}_i)^2 - (y_i - \hat{y}_i)^2\right]} \right) \quad \text{by linearity of expectation} \\ &= \frac{1}{n} \sum_{i=1}^{n} \left( \EX{\left[(y^{'}_{i})^{2} - 2 y^{'}_i \hat{y}_i + \hat{y}^2_i - y^2_i + 2 y_i \hat{y}_i - \hat{y}^2_i \right]} \right) \quad \text{by square expansion} \\ &= \frac{1}{n} \sum_{i=1}^{n} \left( \EX{\left[(y^{'}_{i})^{2} - y^2_i - 2 y^{'}_i \hat{y}_i + 2 y_i \hat{y}_i \right]} \right) \\ &= \frac{1}{n} \sum_{i=1}^{n} \left( \EX{\left[(y^{'}_{i})^{2}\right]} - \EX{\left[y^2_i\right]} - 2 \EX{\left[y^{'}_i \hat{y}_i \right]} + 2 \EX{\left[y_i \hat{y}_i \right]} \right) \quad \text{by linearity of expectation} \\ &= \frac{1}{n} \sum_{i=1}^{n} \left(2 \EX{\left[y_i \hat{y}_i \right]} - 2 \EX{\left[y^{'}_i \hat{y}_i \right]} \right) \quad \text{by } (4) \therefore \EX{[(y^{'}_{i})^{2}]} = \EX{[y_{i}^{2}]} \\ &= \frac{2}{n} \sum_{i=1}^{n} \left(\EX{\left[y_i \hat{y}_i \right]} - \EX{\left[y^{'}_i \hat{y}_i \right]} \right) \quad \text{by factoring out the constant } 2 \\ &= \frac{2}{n} \sum_{i=1}^{n} \left(\EX{\left[y_i \hat{y}_i \right]} - \EX{\left[y^{'}_i\right]} \EX{\left[\hat{y}_i \right]} \right) \quad \text{by } (4) \therefore y^{'}_i \indep (y_1, \ldots, y_n) \therefore y^{'}_i \indep \hat{y}_i \implies \EX{[y^{'}_i \hat{y}_i]} = \EX{[y^{'}_i]} \EX{[\hat{y}_i]} \\ &= \frac{2}{n} \sum_{i=1}^{n} \left(\EX{\left[y_i \hat{y}_i \right]} - \EX{\left[y_i\right]} \EX{\left[\hat{y}_i \right]} \right) \quad \text{by } (4) \therefore \EX{[y^{'}_{i}]} = \EX{[y_{i}]} \\ &= \frac{2}{n} \sum_{i=1}^{n} \Cov_{y}(y_i, \hat{y}_i) \quad \text{by } \Cov(X,Y) = \EX{[XY]} - \EX{[X]} \EX{[Y]} \end{align*}\]

There are two things to note here: (i) optimism increases as the covariance between the ground truth label $y_i$ and the model's prediction $\hat{y}_i$ increases, or in other words as $y_i$ influences its own prediction more and more strongly, and (ii) this relationship holds approximately for any model class and any parameters $w$ that are well fitted on the training data.
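A quick Monte Carlo check of this covariance identity, under an assumed simulation setup with fixed inputs: the empirical average of $C - A$ should match $\frac{2}{n} \sum_{i} \Cov(y_i, \hat{y}_i)$ estimated across replications. All names and constants below are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, sigma, trials = 40, 5, 1.0, 4000
X = rng.normal(size=(n, d))            # fixed inputs across all replications
f = X @ rng.normal(size=d)             # fixed (assumed linear) true means f(x_i)

ys, y_hats, optimism = [], [], []
for _ in range(trials):
    y = f + rng.normal(scale=sigma, size=n)        # training outcomes y_i
    y_new = f + rng.normal(scale=sigma, size=n)    # in-sample outcomes y'_i
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ w_hat
    ys.append(y)
    y_hats.append(y_hat)
    optimism.append(np.mean((y_new - y_hat) ** 2) - np.mean((y - y_hat) ** 2))

ys, y_hats = np.array(ys), np.array(y_hats)
# Sample covariance between y_i and y_hat_i, estimated across the replications
covs = [np.cov(ys[:, i], y_hats[:, i])[0, 1] for i in range(n)]

print(2 / n * np.sum(covs), np.mean(optimism))     # the two estimates should agree
```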

Here we are assuming a linear regression model, and we are using the least squares estimate $\hat{w}$, which lets us simplify $\omega$ further.

First, we can introduce the $n \times n$ hat matrix $\mathbf{H} = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ which projects orthogonally onto the column space of $\mathbf{X}$ and is directly derived from the definition of the ordinary least squares estimate of $\hat{w}$ from before. Then, we can show:

\[\begin{align*} (5) \quad \Cov_y(y, \hat{y}) &= \Cov_y(y, \mathbf{H} y) \quad \text{by } \hat{y} = \mathbf{H} y \\ &= \Cov_y(y, y) \, \mathbf{H}^\top \\ & \quad \: \text{by } \mathbf{X} \text{ is constant} \therefore \mathbf{H} \text{ is constant, and } \Cov(X, \mathbf{A} Y) = \Cov(X,Y) \mathbf{A}^\top \text{ for a constant matrix } \mathbf{A} \\ &= \Var_{y}(y) \, \mathbf{H}^\top \quad \text{by } \Cov(X,X) = \Var(X) \\ &= \sigma_{\epsilon}^{2} I_{n \times n} \mathbf{H}^\top \\ & \quad \: \text{by } (4) \therefore \Var(y_i) = \sigma_{\epsilon}^{2} \text{ and } y_i \indep y_j \implies \Cov(y_i, y_j) = 0 \text{ for } i \neq j \\ &= \sigma_{\epsilon}^{2} \mathbf{H} \quad \text{by } \mathbf{H}^\top = \mathbf{H} \end{align*}\]

where $\Cov_y(y, \hat{y})$ is the $n \times n$ cross-covariance matrix, $\Cov_y(y, y) = \Var_{y}(y)$ is the $n \times n$ variance-covariance matrix, and $y$ and $\hat{y}$ are $n$-dimensional vectors with $i$th components respectively equal to $y_i$ and $\hat{y}_i$. $ \DeclareMathOperator{\Tr}{\mathrm{Tr}} % trace $

The cross-covariance matrix $\Cov(y, \hat{y})$ consists of the components:

\[[\Cov(y, \hat{y})]_{i, j} = \Cov(y_i, \hat{y}_j) \quad \text{for } i, j = 1, \ldots, n\]

We can now rewrite $\omega$ using the diagonal (i.e. when $i = j$) of the cross-covariance matrix $\Cov(y, \hat{y})$:

\[\begin{align*} \omega &= \frac{2}{n} \sum_{i=1}^{n} \Cov(y_i, \hat{y}_i) = \frac{2}{n} \Tr(\Cov(y, \hat{y})) = \frac{2}{n} \Tr(\sigma_{\epsilon}^{2} \mathbf{H}) \quad \text{by } (5) \\ &= \frac{2}{n} \sigma_{\epsilon}^{2} \Tr{(\mathbf{H})} \quad \text{by } \Tr{(k A)} = k \Tr{(A)} \\ &= \frac{2}{n} \sigma_{\epsilon}^{2} \Tr{(\mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top)} \\ &= \frac{2}{n} \sigma_{\epsilon}^{2} \Tr{(\mathbf{X}^\top \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1})} \quad \text{by the cyclic property of } \Tr(\cdot) \text{: } \Tr(ABC) = \Tr(BCA) = \Tr(CAB) \\ &= \frac{2}{n} \sigma_{\epsilon}^{2} \Tr{(I_{d \times d})} \quad \text{by } A A^{-1} = I \\ &= \frac{2 d}{n} \sigma_{\epsilon}^{2} \end{align*}\]
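As a quick numerical sanity check (on an assumed random design matrix), the hat matrix is indeed a symmetric, idempotent projection whose trace equals $d$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 60, 7
X = rng.normal(size=(n, d))

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X^T X)^{-1} X^T

print(np.allclose(H, H.T))              # symmetric
print(np.allclose(H @ H, H))            # idempotent: an orthogonal projection
print(np.isclose(np.trace(H), d))       # Tr(H) = d, as used above
```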

We have now proven that $\omega \geq 0$, as $d$ and $n$ are positive integers corresponding respectively to the number of features and the training sample size (i.e. the number of data points), and $\sigma_{\epsilon}^{2} \geq 0$. This in turn implies that the expected test error is greater than or equal to the expected training error when the inputs are fixed:

\[\begin{align*} \EX_{y^{'}}{[C]} &= \EX_{y}{[A]} + \omega \\ \EX{\left[\frac{1}{n} \sum_{i=1}^{n} (y^{'}_i - \hat{w}^\top x_i)^2\right]} &= \EX{\left[\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{w}^\top x_i)^2\right]} + \frac{2 d}{n} \sigma_{\epsilon}^{2} \\ \frac{2 d}{n} \sigma_{\epsilon}^{2} \geq 0 &\implies \EX{\left[\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{w}^\top x_i)^2\right]} \leq \EX{\left[\frac{1}{n} \sum_{i=1}^{n} (y^{'}_i - \hat{w}^\top x_i)^2\right]} \\ & \tag*{Q.E.D.} \\ \end{align*}\]

Two observations follow: (i) optimism increases linearly with the number of features $d$, and (ii) it decreases as the training sample size $n$ increases.
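The closed form can also be checked empirically. The sketch below fixes the inputs, repeatedly resamples $y$ and $y^{'}$ under an assumed noise level $\sigma_{\epsilon} = 1$, and compares the empirical average optimism to $\frac{2 d}{n} \sigma_{\epsilon}^{2}$; the setup and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma, trials = 50, 8, 1.0, 5000
X = rng.normal(size=(n, d))        # fixed inputs across all replications
f = X @ rng.normal(size=d)         # fixed (assumed linear) true means f(x_i)

optimism = []
for _ in range(trials):
    y = f + rng.normal(scale=sigma, size=n)        # training outcomes y_i
    y_new = f + rng.normal(scale=sigma, size=n)    # in-sample outcomes y'_i
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    A = np.mean((y - X @ w_hat) ** 2)              # training MSE
    C = np.mean((y_new - X @ w_hat) ** 2)          # in-sample MSE
    optimism.append(C - A)

print(np.mean(optimism), 2 * d * sigma**2 / n)     # both should be close to 0.32
```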

Versions of $\omega = \frac{2 d}{n} \sigma_{\epsilon}^{2}$ hold approximately true for other model classes and error functions. We should also note that the first proof is more general, and the in-sample error is not usually of interest as the inputs across different sets are not likely to coincide with each other. Therefore, out-of-sample error estimation using cross-validation or bootstrapping is often more useful in real-world applications. Nevertheless, estimating the in-sample error helps with model selection, and solving for optimism assuming fixed inputs helps prove our initial statement for this special case.
