The bias-variance tradeoff is a basic but important topic in machine learning. It tells us what makes up the generalization error of a hypothesis, or even of a learner, and the tradeoff involved between the bias and the variance. Knowing this, we can better choose the learner and hypothesis that achieve a low generalization error.
1. Assumptions
All examples are drawn i.i.d. from the joint distribution $$P(X, Y)$$. In this article, uppercase letters represent random variables, while lowercase letters represent constant values.
2. Three levels
- Prediction: Given a dataset $$D$$, the hypothesis $$h_D$$ learned from this dataset, and the value of $$X = x$$, the prediction is $$\hat{y} = h_D(x)$$. The prediction can be evaluated by the error $$\mathrm{err}(\hat{y}, y)$$.
- Hypothesis: A hypothesis $$h$$ maps $$X$$ to $$Y$$. It can be evaluated by the quality of its predictions, meaning that we want to measure the expected error of the hypothesis learned from some training set over the distribution $$P(X, Y)$$.
- Learner: A learner maps a dataset $$D$$ to a hypothesis $$h_D$$. It can be evaluated by the quality of its hypotheses, meaning that we want to measure the expectation of the performance of these hypotheses over different training datasets $$D$$. (The three error measures are written out in the sketch after this list.)
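To keep the three levels apart, here is one way to write down the corresponding error measures with the notation above (a sketch; the exact notation in the original notes may differ):

$$\begin{aligned} \text{prediction:} \quad & \mathrm{err}(\hat{y}, y) = \mathrm{err}(h_D(x), y), \\ \text{hypothesis:} \quad & E_{X, Y}\big[\mathrm{err}(h_D(X), Y)\big], \\ \text{learner:} \quad & E_D\, E_{X, Y}\big[\mathrm{err}(h_D(X), Y)\big]. \end{aligned}$$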
3. Error function
So let us focus on the case with the squared error function $$\mathrm{err}(\hat{y}, y) = (\hat{y} - y)^2$$.
4. Analysis for evaluation of predictions
For a given $$x$$ and a prediction $$\hat{y}$$, we measure our prediction over the conditional distribution $$P(Y \mid X = x)$$:

$$E_{Y \mid x}\big[(\hat{y} - Y)^2\big] = \big(\hat{y} - E_{Y \mid x}[Y]\big)^2 + \mathrm{Var}_{Y \mid x}[Y]$$

- First term: the square of the bias.
- Second term: the variance of the variable $$Y$$ given $$X = x$$, which is a property of the distribution and thus totally out of our control.

And from the above derivation, we know that the hypothesis that minimizes the expected error of our prediction is $$h(x) = E_{Y \mid x}[Y] = E[Y \mid X = x]$$.
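For completeness, a sketch of the standard expansion behind this decomposition, with all expectations taken over $$Y \mid X = x$$:

$$\begin{aligned} E\big[(\hat{y} - Y)^2\big] &= \hat{y}^2 - 2\hat{y}\, E[Y] + E[Y^2] \\ &= \big(\hat{y} - E[Y]\big)^2 + \big(E[Y^2] - E[Y]^2\big) \\ &= \big(\hat{y} - E[Y]\big)^2 + \mathrm{Var}[Y], \end{aligned}$$

which is minimized exactly by choosing $$\hat{y} = E[Y]$$.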
5. Analysis for evaluation of hypotheses
Given the training set $$D$$, we measure the hypothesis $$h_D$$ learned from this set with the expected error of its predictions over the distribution $$P(X, Y)$$ of all possible sample pairs $$(X, Y)$$.
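Applying the prediction-level decomposition pointwise and averaging over $$X$$ gives the following sketch of the hypothesis-level error (same notation as above):

$$E_{X, Y}\big[(h_D(X) - Y)^2\big] = E_X\Big[\big(h_D(X) - E[Y \mid X]\big)^2\Big] + E_X\big[\mathrm{Var}[Y \mid X]\big],$$

i.e. the hypothesis is penalized for how far it is from $$E[Y \mid X]$$ on average, plus the irreducible noise.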
6. Analysis for evaluation of the learner
For the learner, we measure the expected error over both the training set $$D$$ and a future sample $$(X, Y)$$:

$$E_D\, E_{X, Y}\big[(h_D(X) - Y)^2\big] = E_X\big[\mathrm{Var}_D[h_D(X)]\big] + E_X\Big[\big(E_D[h_D(X)] - E[Y \mid X]\big)^2\Big] + E_X\big[\mathrm{Var}[Y \mid X]\big]$$

- The first two terms together make up the approximation error.
- First term: the variance of our prediction over different training sets $$D$$.
- Second term: the bias, i.e., how well our prediction approximates the true $$E[Y \mid X]$$.
- Third term: the expected variance of $$Y \mid X$$ over $$P(X)$$, i.e. the random noise. (A numerical check of this decomposition is sketched below.)
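As a sanity check, the following Python sketch estimates the three terms by simulation. The concrete setup (a sinusoidal target, Gaussian noise, and a degree-1 polynomial fitted by least squares) is my own illustrative assumption, not something taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: Y = f(X) + Gaussian noise, learner = degree-1 least-squares polynomial.
def f(x):
    return np.sin(2 * np.pi * x)

noise_std, n_train, n_trials = 0.3, 30, 1000
x_test = rng.uniform(0, 1, 2000)                     # future samples X ~ P(X)

# Predictions of h_D on the test points, over many training sets D
preds = np.empty((n_trials, x_test.size))
for t in range(n_trials):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = f(x_tr) + rng.normal(0, noise_std, n_train)
    preds[t] = np.polyval(np.polyfit(x_tr, y_tr, 1), x_test)

variance = preds.var(axis=0).mean()                        # E_X[ Var_D[h_D(X)] ]
bias_sq = ((preds.mean(axis=0) - f(x_test)) ** 2).mean()   # E_X[ (E_D[h_D(X)] - E[Y|X])^2 ]
noise = noise_std ** 2                                     # E_X[ Var[Y|X] ]

# Direct estimate of the learner's expected test error
y_test = f(x_test) + rng.normal(0, noise_std, x_test.size)
test_mse = ((preds - y_test) ** 2).mean()

print(f"variance {variance:.4f} + bias^2 {bias_sq:.4f} + noise {noise:.4f} "
      f"= {variance + bias_sq + noise:.4f}  vs  test MSE {test_mse:.4f}")
```

The printed sum of variance, squared bias, and noise should roughly match the directly estimated test MSE.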
7. Further analysis
Let

$$\widehat{\mathrm{mse}}_D(h) = \frac{1}{n} \sum_{i=1}^{n} \big(h(x_i) - y_i\big)^2,$$

where $$D = \{(x_1, y_1), \dots, (x_n, y_n)\}$$, be the error of a certain hypothesis $$h$$ on the training set $$D$$, and let $$\mathrm{mse}(h) = E_{X, Y}\big[(h(X) - Y)^2\big]$$ be its expected error on a future sample.

Now suppose we have a hypothesis space $$\mathcal{H}$$. $$\hat{h}_D$$ is the hypothesis that minimizes the $$\widehat{\mathrm{mse}}$$ on the training dataset $$D$$, i.e.,

$$\hat{h}_D = \arg\min_{h \in \mathcal{H}} \widehat{\mathrm{mse}}_D(h),$$

and $$h^*$$ is the hypothesis that minimizes the expected $$\mathrm{mse}$$ over $$P(X, Y)$$, i.e.,

$$h^* = \arg\min_{h \in \mathcal{H}} \mathrm{mse}(h).$$

It is easy to see that, for every training set $$D$$,

$$\mathrm{mse}(\hat{h}_D) \ge \mathrm{mse}(h^*) \quad \text{and} \quad \widehat{\mathrm{mse}}_D(\hat{h}_D) \le \widehat{\mathrm{mse}}_D(h^*).$$
Then we consider the expected error of this learner:

$$E_D\big[\mathrm{mse}(\hat{h}_D)\big] = E_D\big[\mathrm{mse}(\hat{h}_D) - \mathrm{mse}(h^*)\big] + \mathrm{mse}(h^*)$$

- First term: the variance of our learner, as it measures how much $$\mathrm{mse}(\hat{h}_D)$$ exceeds $$\mathrm{mse}(h^*)$$ on average over different training sets $$D$$.
- Second term: the combination of the bias and the noise, which also represents the lowest possible MSE on a future sample, as $$\mathrm{mse}(h^*) = \min_{h \in \mathcal{H}} \mathrm{mse}(h)$$.
As $$h^*$$ does not depend on the training dataset $$D$$, we have

$$E_D\big[\widehat{\mathrm{mse}}_D(h^*)\big] = \mathrm{mse}(h^*),$$

which means that the training error of $$h^*$$ is an unbiased estimate of its test error.

Then for $$\hat{h}_D$$, we have

$$E_D\big[\widehat{\mathrm{mse}}_D(\hat{h}_D)\big] \le E_D\big[\widehat{\mathrm{mse}}_D(h^*)\big] = \mathrm{mse}(h^*).$$

So we have

$$E_D\big[\mathrm{mse}(\hat{h}_D)\big] = E_D\big[\mathrm{mse}(\hat{h}_D) - \mathrm{mse}(h^*)\big] + E_D\big[\widehat{\mathrm{mse}}_D(h^*) - \widehat{\mathrm{mse}}_D(\hat{h}_D)\big] + E_D\big[\widehat{\mathrm{mse}}_D(\hat{h}_D)\big]$$
- First term: the variance of the learner on future samples, as said above;
- Second term: the variance of the learner on the training dataset;
- Third term: the learner's $$\widehat{\mathrm{mse}}$$ on the training set, i.e. the expected training error. Since the first two terms are both non-negative, we can conclude that for a learner, the expected test error $$\ge$$ the optimal test error $$\ge$$ the expected training error (restated compactly below).
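In the notation above, this conclusion reads:

$$\underbrace{E_D\big[\mathrm{mse}(\hat{h}_D)\big]}_{\text{expected test error}} \;\ge\; \underbrace{\mathrm{mse}(h^*)}_{\text{optimal test error}} = E_D\big[\widehat{\mathrm{mse}}_D(h^*)\big] \;\ge\; \underbrace{E_D\big[\widehat{\mathrm{mse}}_D(\hat{h}_D)\big]}_{\text{expected training error}}.$$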
8. The Tradeoff
Increase the size of the training set
- the optimal test error does not change
- the expected training error increases
- the training variance decreases
- the variance on future samples decreases
Thus the overall expected test error decreases; there is no tradeoff in this part (see the sketch below).
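To illustrate, a small Python sketch reusing the same toy setup as before (sinusoidal target, Gaussian noise, polynomial least squares; the setup is an illustrative assumption, not part of the notes), fixing the hypothesis space and growing the training set:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed hypothesis space: degree-3 polynomials; vary the training set size.
def f(x):
    return np.sin(2 * np.pi * x)

noise_std, n_trials, degree = 0.3, 500, 3
for n_train in (10, 30, 100, 300):
    train_err, test_err = [], []
    for _ in range(n_trials):
        x_tr = rng.uniform(0, 1, n_train)
        y_tr = f(x_tr) + rng.normal(0, noise_std, n_train)
        coef = np.polyfit(x_tr, y_tr, degree)
        train_err.append(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
        x_te = rng.uniform(0, 1, 1000)
        y_te = f(x_te) + rng.normal(0, noise_std, 1000)
        test_err.append(np.mean((np.polyval(coef, x_te) - y_te) ** 2))
    print(f"n = {n_train:3d}: expected training error {np.mean(train_err):.3f}, "
          f"expected test error {np.mean(test_err):.3f}")
```

As the training set grows, the expected training error creeps up toward the optimal test error while the expected test error comes down toward it.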
Increase the expressiveness of the hypothesis space
- the optimal test error decreases
- the expected training error decreases
- the training variance increases
- the variance on future samples increases
We do not know whether the overall expected test error decreases or increases; this is the tradeoff (see the sketch below).
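And a matching sketch for the expressiveness side, with the same caveat that the concrete setup is only an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed training set size; vary the expressiveness of the hypothesis space (polynomial degree).
def f(x):
    return np.sin(2 * np.pi * x)

noise_std, n_train, n_trials = 0.3, 30, 500
for degree in (1, 3, 5, 9):
    train_err, test_err = [], []
    for _ in range(n_trials):
        x_tr = rng.uniform(0, 1, n_train)
        y_tr = f(x_tr) + rng.normal(0, noise_std, n_train)
        coef = np.polyfit(x_tr, y_tr, degree)
        train_err.append(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
        x_te = rng.uniform(0, 1, 1000)
        y_te = f(x_te) + rng.normal(0, noise_std, 1000)
        test_err.append(np.mean((np.polyval(coef, x_te) - y_te) ** 2))
    print(f"degree {degree}: expected training error {np.mean(train_err):.3f}, "
          f"expected test error {np.mean(test_err):.3f}")
```

The expected training error keeps falling as the hypothesis space grows, but the expected test error typically falls at first and eventually rises once the variance terms dominate.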
9. Reference
[1] Prof. Schuurmans’s notes on Bias-Variance