The bias-variance tradeoff is a basic but important topic in machine learning. It tells us what makes up the generalization error of a hypothesis, or even of a learner, and the tradeoff involved between the bias and the variance. Knowing this, we can better choose the learner and hypothesis that achieve a low generalization error.
1. Assumptions
All examples are drawn i.i.d. from the joint distribution $$P(X, Y)$$. In this article, uppercase letters represent random variables, while lowercase letters represent constant values.
2. Three levels
- Prediction: Given a dataset $$D$$, the hypothesis $$h_D$$ learned from this dataset, and the value of $$X = x$$, the prediction is $$\hat{y} = h_D(x)$$. The prediction can be evaluated by the error $$\mathrm{err}(\hat{y}, y)$$.
- Hypothesis: A hypothesis $$h$$ maps $$X$$ to $$Y$$. It can be evaluated by the quality of its predictions, meaning that we want to measure the expected error of the hypothesis learned from some training set over the distribution $$P(X, Y)$$.
- Learner: A learner maps a dataset $$D$$ to a hypothesis $$h_D$$. It can be evaluated by the quality of its hypotheses, meaning that we want to measure the expectation of the performance of these hypotheses over different training datasets $$D$$. (The three error measures are written out in the sketch after this list.)
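To keep the three levels apart, here is one way to write down the corresponding error measures with the notation above (a sketch; the exact notation in the original notes may differ):

$$\begin{aligned} \text{prediction:} \quad & \mathrm{err}(\hat{y}, y) = \mathrm{err}(h_D(x), y), \\ \text{hypothesis:} \quad & E_{X, Y}\big[\mathrm{err}(h_D(X), Y)\big], \\ \text{learner:} \quad & E_D\, E_{X, Y}\big[\mathrm{err}(h_D(X), Y)\big]. \end{aligned}$$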
3. Error function
So let us focus on the case with the squared error function $$\mathrm{err}(\hat{y}, y) = (\hat{y} - y)^2$$.
4. Analysis for evaluation of predictions
For a given $$x$$ and a prediction $$\hat{y}$$, we measure our prediction over the conditional distribution $$P(Y \mid X = x)$$:

$$E_{Y \mid x}\big[(\hat{y} - Y)^2\big] = \big(\hat{y} - E_{Y \mid x}[Y]\big)^2 + \mathrm{Var}_{Y \mid x}[Y]$$

- First term: the square of the bias.
- Second term: the variance of the variable $$Y$$ given $$X = x$$, which is a property of the distribution and thus totally out of our control.

And from the above derivation, we know that the hypothesis that minimizes the expected error of our prediction is $$h(x) = E_{Y \mid x}[Y] = E[Y \mid X = x]$$.
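For completeness, a sketch of the standard expansion behind this decomposition, with all expectations taken over $$Y \mid X = x$$:

$$\begin{aligned} E\big[(\hat{y} - Y)^2\big] &= \hat{y}^2 - 2\hat{y}\, E[Y] + E[Y^2] \\ &= \big(\hat{y} - E[Y]\big)^2 + \big(E[Y^2] - E[Y]^2\big) \\ &= \big(\hat{y} - E[Y]\big)^2 + \mathrm{Var}[Y], \end{aligned}$$

which is minimized exactly by choosing $$\hat{y} = E[Y]$$.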
5. Analysis for evaluation of hypotheses
Given the training set $$D$$, we measure the hypothesis $$h_D$$ learned from this set with the expected error of its predictions over the distribution $$P(X, Y)$$ of all possible sample pairs $$(X, Y)$$.
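Applying the prediction-level decomposition pointwise and averaging over $$X$$ gives the following sketch of the hypothesis-level error (same notation as above):

$$E_{X, Y}\big[(h_D(X) - Y)^2\big] = E_X\Big[\big(h_D(X) - E[Y \mid X]\big)^2\Big] + E_X\big[\mathrm{Var}[Y \mid X]\big],$$

i.e. the hypothesis is penalized for how far it is from $$E[Y \mid X]$$ on average, plus the irreducible noise.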
6. Analysis for evaluation of the learner
For the learner, we measure the expected error over both the training set $$D$$ and a future sample $$(X, Y)$$:

$$E_D\, E_{X, Y}\big[(h_D(X) - Y)^2\big] = E_X\big[\mathrm{Var}_D[h_D(X)]\big] + E_X\Big[\big(E_D[h_D(X)] - E[Y \mid X]\big)^2\Big] + E_X\big[\mathrm{Var}[Y \mid X]\big]$$

- The first two terms together make up the approximation error.
- First term: the variance of our prediction over different training sets $$D$$.
- Second term: the bias, i.e., how well our prediction approximates the true $$E[Y \mid X]$$.
- Third term: the expected variance of $$Y \mid X$$ over $$P(X)$$, i.e. the random noise. (A numerical check of this decomposition is sketched below.)
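As a sanity check, the following Python sketch estimates the three terms by simulation. The concrete setup (a sinusoidal target, Gaussian noise, and a degree-1 polynomial fitted by least squares) is my own illustrative assumption, not something taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: Y = f(X) + Gaussian noise, learner = degree-1 least-squares polynomial.
def f(x):
    return np.sin(2 * np.pi * x)

noise_std, n_train, n_trials = 0.3, 30, 1000
x_test = rng.uniform(0, 1, 2000)                     # future samples X ~ P(X)

# Predictions of h_D on the test points, over many training sets D
preds = np.empty((n_trials, x_test.size))
for t in range(n_trials):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = f(x_tr) + rng.normal(0, noise_std, n_train)
    preds[t] = np.polyval(np.polyfit(x_tr, y_tr, 1), x_test)

variance = preds.var(axis=0).mean()                        # E_X[ Var_D[h_D(X)] ]
bias_sq = ((preds.mean(axis=0) - f(x_test)) ** 2).mean()   # E_X[ (E_D[h_D(X)] - E[Y|X])^2 ]
noise = noise_std ** 2                                     # E_X[ Var[Y|X] ]

# Direct estimate of the learner's expected test error
y_test = f(x_test) + rng.normal(0, noise_std, x_test.size)
test_mse = ((preds - y_test) ** 2).mean()

print(f"variance {variance:.4f} + bias^2 {bias_sq:.4f} + noise {noise:.4f} "
      f"= {variance + bias_sq + noise:.4f}  vs  test MSE {test_mse:.4f}")
```

The printed sum of variance, squared bias, and noise should roughly match the directly estimated test MSE.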
7. Further analysis
Let

$$\widehat{\mathrm{mse}}_D(h) = \frac{1}{n} \sum_{i=1}^{n} \big(h(x_i) - y_i\big)^2,$$

where $$D = \{(x_1, y_1), \dots, (x_n, y_n)\}$$, be the error of a certain hypothesis $$h$$ on the training set $$D$$, and let $$\mathrm{mse}(h) = E_{X, Y}\big[(h(X) - Y)^2\big]$$ be its expected error on a future sample.

Now suppose we have a hypothesis space $$\mathcal{H}$$. $$\hat{h}_D$$ is the hypothesis that minimizes the $$\widehat{\mathrm{mse}}$$ on the training dataset $$D$$, i.e.,

$$\hat{h}_D = \arg\min_{h \in \mathcal{H}} \widehat{\mathrm{mse}}_D(h),$$

and $$h^*$$ is the hypothesis that minimizes the expected $$\mathrm{mse}$$ over $$P(X, Y)$$, i.e.,

$$h^* = \arg\min_{h \in \mathcal{H}} \mathrm{mse}(h).$$

It is easy to see that, for every training set $$D$$,

$$\mathrm{mse}(\hat{h}_D) \ge \mathrm{mse}(h^*) \quad \text{and} \quad \widehat{\mathrm{mse}}_D(\hat{h}_D) \le \widehat{\mathrm{mse}}_D(h^*).$$
Then we consider the expected error of this learner:

$$E_D\big[\mathrm{mse}(\hat{h}_D)\big] = E_D\big[\mathrm{mse}(\hat{h}_D) - \mathrm{mse}(h^*)\big] + \mathrm{mse}(h^*)$$

- First term: the variance of our learner, as it measures how much $$\mathrm{mse}(\hat{h}_D)$$ exceeds $$\mathrm{mse}(h^*)$$ on average over different training sets $$D$$.
- Second term: the combination of the bias and the noise, which also represents the lowest possible MSE on a future sample, as $$\mathrm{mse}(h^*) = \min_{h \in \mathcal{H}} \mathrm{mse}(h)$$.
As $$h^*$$ does not depend on the training dataset $$D$$, we have

$$E_D\big[\widehat{\mathrm{mse}}_D(h^*)\big] = \mathrm{mse}(h^*),$$

which means that the training error of $$h^*$$ is an unbiased estimate of its test error.

Then for $$\hat{h}_D$$, we have

$$E_D\big[\widehat{\mathrm{mse}}_D(\hat{h}_D)\big] \le E_D\big[\widehat{\mathrm{mse}}_D(h^*)\big] = \mathrm{mse}(h^*).$$

So we have

$$E_D\big[\mathrm{mse}(\hat{h}_D)\big] = E_D\big[\mathrm{mse}(\hat{h}_D) - \mathrm{mse}(h^*)\big] + E_D\big[\widehat{\mathrm{mse}}_D(h^*) - \widehat{\mathrm{mse}}_D(\hat{h}_D)\big] + E_D\big[\widehat{\mathrm{mse}}_D(\hat{h}_D)\big]$$
- First term: the variance of the learner on future samples, as said above;
- Second term: the variance of the learner on the training dataset;
- Third term: the learner's $$\widehat{\mathrm{mse}}$$ on the training set, i.e. the expected training error. Since the first two terms are both non-negative, we can conclude that for a learner, the expected test error $$\ge$$ the optimal test error $$\ge$$ the expected training error (restated compactly below).
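In the notation above, this conclusion reads:

$$\underbrace{E_D\big[\mathrm{mse}(\hat{h}_D)\big]}_{\text{expected test error}} \;\ge\; \underbrace{\mathrm{mse}(h^*)}_{\text{optimal test error}} = E_D\big[\widehat{\mathrm{mse}}_D(h^*)\big] \;\ge\; \underbrace{E_D\big[\widehat{\mathrm{mse}}_D(\hat{h}_D)\big]}_{\text{expected training error}}.$$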
8. The Tradeoff
Increase the size of the training set
- the optimal test error does not change
- the expected training error increases
- the training variance decreases
- the variance on future samples decreases
Thus the overall expected test error decreases; there is no tradeoff in this part (see the sketch below).
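To illustrate, a small Python sketch reusing the same toy setup as before (sinusoidal target, Gaussian noise, polynomial least squares; the setup is an illustrative assumption, not part of the notes), fixing the hypothesis space and growing the training set:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed hypothesis space: degree-3 polynomials; vary the training set size.
def f(x):
    return np.sin(2 * np.pi * x)

noise_std, n_trials, degree = 0.3, 500, 3
for n_train in (10, 30, 100, 300):
    train_err, test_err = [], []
    for _ in range(n_trials):
        x_tr = rng.uniform(0, 1, n_train)
        y_tr = f(x_tr) + rng.normal(0, noise_std, n_train)
        coef = np.polyfit(x_tr, y_tr, degree)
        train_err.append(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
        x_te = rng.uniform(0, 1, 1000)
        y_te = f(x_te) + rng.normal(0, noise_std, 1000)
        test_err.append(np.mean((np.polyval(coef, x_te) - y_te) ** 2))
    print(f"n = {n_train:3d}: expected training error {np.mean(train_err):.3f}, "
          f"expected test error {np.mean(test_err):.3f}")
```

As the training set grows, the expected training error creeps up toward the optimal test error while the expected test error comes down toward it.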
Increase the expressiveness of the hypothesis space
- the optimal test error decreases
- the expected training error decreases
- the training variance increases
- the variance on future samples increases
We do not know whether the overall expected test error decreases or increases; this is the tradeoff (see the sketch below).
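And a matching sketch for the expressiveness side, with the same caveat that the concrete setup is only an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed training set size; vary the expressiveness of the hypothesis space (polynomial degree).
def f(x):
    return np.sin(2 * np.pi * x)

noise_std, n_train, n_trials = 0.3, 30, 500
for degree in (1, 3, 5, 9):
    train_err, test_err = [], []
    for _ in range(n_trials):
        x_tr = rng.uniform(0, 1, n_train)
        y_tr = f(x_tr) + rng.normal(0, noise_std, n_train)
        coef = np.polyfit(x_tr, y_tr, degree)
        train_err.append(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
        x_te = rng.uniform(0, 1, 1000)
        y_te = f(x_te) + rng.normal(0, noise_std, 1000)
        test_err.append(np.mean((np.polyval(coef, x_te) - y_te) ** 2))
    print(f"degree {degree}: expected training error {np.mean(train_err):.3f}, "
          f"expected test error {np.mean(test_err):.3f}")
```

The expected training error keeps falling as the hypothesis space grows, but the expected test error typically falls at first and eventually rises once the variance terms dominate.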
9. Reference
[1] Prof. Schuurmans’s notes on Bias-Variance