2.1 Introduction

Supervised Learning: predict output values from input values.

Some equivalent terminologies:

  • Input: predictors, independent variables, features
  • Output: responses, dependent variables

2.2 Variable Types and Terminology

Three variable types

  • Quantitative (continuous)
    • ratio defined
    • ratio not defined
  • Qualitative (discrete, categorical, factors)
  • Ordered categorical

We usually apply some coding to qualitative variables (a dummy-coding sketch follows this list):

  • target: for binary variables, we use 0-1 or (-1)-(+1) coding
  • dummy variables: for a variable with $K$ possible values, we use a vector of $K$ binary dummy variables to code it, exactly one of which equals 1 at a time
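
A minimal sketch of dummy-variable (one-hot) coding in NumPy; the variable levels and observations below are made up purely for illustration.

```python
import numpy as np

# A qualitative variable with K = 3 possible values (the levels are hypothetical).
levels = np.array(["red", "green", "blue"])
g = np.array(["green", "red", "blue", "green"])   # observed qualitative values

# Dummy coding: one binary indicator per level; exactly one indicator equals 1 per observation.
dummies = (g[:, None] == levels[None, :]).astype(float)
print(dummies)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```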

Notations:

Generally, upper case letters represent the random variables, while lower case ones indicate some observed values, which are regarded as fixed or constant.

  • $X$: input variable or vector, the $j$-th component of which is written as $X_j$, while an observed value of $X$ is written in lower case as $x$
  • $Y$ / $G$: output variable; $Y$ is used for quantitative outputs and $G$ for qualitative outputs
  • $\mathbf{X}$: the $N \times p$ matrix of training data, consisting of the rows $x_i^T$, $i = 1, \dots, N$; its $(i, j)$ component is $x_{ij}$

learning task

Given the value of an input vector $X$, make a good prediction of the output $Y$, denoted by $\hat{Y}$ (and $\hat{G}$ for a qualitative output $G$).

2.3 Least Squares and Nearest Neighbors

  • Linear model
    • huge assumptions about the structure
    • stable but maybe not very accurate predictions
  • $k$-nearest neighbor
    • few assumptions, if any, about the structure
    • accurate but not stable predictions (sensitive to small changes in training dataset)

2.3.1 Linear Models and Least Squares

Linear Models

Generally, we use a “hat” over a symbol to represent an estimate of it, e.g. using $\hat{\beta}$ to estimate $\beta$.

Given an input vector $X^T = (X_1, X_2, \dots, X_p)$, we predict the output via the model
$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j.$$

$\hat{\beta}_0$ is known as the intercept or bias. If we include the constant $1$ as a feature in $X$, then the model can be written as an inner product,
$$\hat{Y} = X^T \hat{\beta}.$$

If the output is a scalar, then $\hat{\beta}$ is a $p$-vector. Generally, the output can be a $K$-vector, in which case $\hat{\beta}$ should be a $p \times K$ matrix of coefficients.

In the $(p+1)$-dimensional input-output space, $(X, \hat{Y})$ is a hyperplane. If the constant is included in $X$, the hyperplane includes the origin and is thus a subspace; otherwise it is an affine set cutting the $Y$-axis at the point $(0, \hat{\beta}_0)$.

Least Squares

We write $\hat{\beta}$ instead of $\beta$, because we assume that there is a “perfect” $\beta$. Our task is to estimate this $\beta$ by fitting the data.

One way to do that is the least squares fit. Define the residual sum of squares
$$\operatorname{RSS}(\beta) = \sum_{i=1}^{N} \left(y_i - x_i^T \beta\right)^2 = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta).$$

Then we do the fit as follows:
$$\hat{\beta} = \arg\min_{\beta} \operatorname{RSS}(\beta).$$

To get the solution, we take the derivative of $\operatorname{RSS}(\beta)$ w.r.t. $\beta$ and get
$$\frac{\partial \operatorname{RSS}}{\partial \beta} = -2\,\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta).$$

Then we set the derivative to zero, and if $\mathbf{X}^T\mathbf{X}$ is nonsingular, we have
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.$$

And then we have the fitted value at an input $x$ to be
$$\hat{y}(x) = x^T \hat{\beta}.$$

So the fitted surface is actually the hyperplane $x^T\hat{\beta}$, parameterized by the $p$ components of $\hat{\beta}$.
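
As a concrete illustration, here is a minimal NumPy sketch of the closed-form least squares fit derived above; the simulated data, the sizes, and the use of np.linalg.solve (rather than an explicit matrix inverse) are choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data: N observations of a p-dimensional input, plus a constant-1 feature.
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([0.5, 1.0, -2.0, 0.3])        # hypothetical "true" coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Normal equations X^T X beta = X^T y, solved without forming an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values y_hat = X beta_hat and the residual sum of squares RSS(beta_hat).
y_hat = X @ beta_hat
rss = np.sum((y - y_hat) ** 2)
print(beta_hat, rss)
```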

Simulation Experiment

The input $X$ is a two-dimensional vector and the output is a binary variable coded as $Y \in \{0, 1\}$. We make the prediction as
$$\hat{G} = \begin{cases} 1 & \text{if } \hat{Y} > 0.5, \\ 0 & \text{if } \hat{Y} \le 0.5, \end{cases}$$
where $\hat{Y} = x^T\hat{\beta}$, and the decision boundary is $\{x : x^T\hat{\beta} = 0.5\}$.

Let us consider how the data is generated.

  • Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means, i.e. several independent bivariate Gaussian distributions.

[put a figure here]

  • Scenario 2: For each class we have a mixture of ten low-variance Gaussian distributions, whose means are themselves drawn from a Gaussian distribution. To generate a sample, first select one of the ten Gaussians at random and then draw a point from that distribution.

For scenario 1, a linear boundary is the best we can do. For scenario 2, a nonlinear one is better.
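
A rough sketch of how such data could be simulated and then classified by the linear model with the 0.5 threshold; the means, variances, and sample sizes below are made-up stand-ins, not the exact settings behind the book's figures.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100  # samples per class (arbitrary)

# Scenario 1: one bivariate Gaussian per class, uncorrelated components, different means.
x0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))   # class 0
x1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(n, 2))   # class 1

# Scenario 2: for each class, ten low-variance Gaussians whose means are themselves Gaussian.
means0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(10, 2))
means1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(10, 2))
pick0 = rng.integers(0, 10, size=n)                        # first select one of the ten Gaussians
pick1 = rng.integers(0, 10, size=n)
z0 = means0[pick0] + rng.normal(scale=0.2, size=(n, 2))    # then draw a point from it
z1 = means1[pick1] + rng.normal(scale=0.2, size=(n, 2))

# Fit the linear model to scenario 1 with 0/1-coded output and classify at the 0.5 threshold.
X = np.column_stack([np.ones(2 * n), np.vstack([x0, x1])])
y = np.concatenate([np.zeros(n), np.ones(n)])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
g_hat = (X @ beta_hat > 0.5).astype(int)
print("training error rate:", np.mean(g_hat != y))
```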

2.3.2 Nearest-Neighbor Methods

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i,$$

where $N_k(x)$ is the set of the $k$ training samples that are nearest to $x$.

If the output variable is categorical, we can still put a threshold (decision boundary) on $\hat{Y}(x)$ to decide $\hat{G}(x)$. Majority vote corresponds to the decision boundary $\{x : \hat{Y}(x) = 0.5\}$. Unlike the linear boundary, the $k$NN decision boundary is irregular.

Though it seems we only have one parameter $k$ in $k$NN instead of the $p$ parameters in the linear model, the effective number of parameters of $k$NN is actually $N/k$, which is generally greater than $p$ and decreases as $k$ increases.
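
A minimal sketch of $k$NN prediction with a brute-force distance search; the function name, the 0/1 coding of the labels, and the toy data are assumptions made only for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=15):
    """Average the 0/1 responses of the k training points nearest to x; classify at 0.5."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    neighbors = np.argsort(dists)[:k]             # indices of the k closest points, N_k(x)
    y_hat = y_train[neighbors].mean()             # \hat{Y}(x): the local average
    return y_hat, int(y_hat > 0.5)                # majority vote via the 0.5 threshold

# Tiny made-up example: two clusters of 2-D points with 0/1 labels.
rng = np.random.default_rng(2)
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y_train = np.concatenate([np.zeros(50), np.ones(50)])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=15))
```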

2.3.3 From Least Squares to Nearest Neighbors

The least squares method relies heavily on the linearity assumption, which makes it a simple model. This simplicity makes it less sensitive to changes in the training dataset, so it enjoys a low variance, but potentially a high bias. $k$NN hardly relies on any strict assumption, so it enjoys a low bias but a high variance, i.e. it is quite sensitive to changes in the training set (for small $k$).

For scenario 1 the linear model is more appropriate, while $k$NN works better for scenario 2.

Many of the most popular techniques in use today come from these two simple procedures. Below are some popular ways of enhancing them:

  • Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors (see the sketch after this list).
  • In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
  • Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
  • Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
  • Projection pursuit and neural network models consist of sums of non-linearly transformed linear models.
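
As an illustration of the first item above, here is a rough sketch of a Gaussian-kernel weighted average (a Nadaraya-Watson style smoother): the weights decay smoothly with distance from the target point instead of the 0/1 neighborhood weights of $k$NN. The kernel width and the toy 1-D data are arbitrary choices.

```python
import numpy as np

def kernel_smoother(X_train, y_train, x, width=0.5):
    """Weighted local average with Gaussian weights that decrease smoothly with distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    weights = np.exp(-0.5 * (dists / width) ** 2)        # smooth, distance-based weights
    return np.sum(weights * y_train) / np.sum(weights)   # weighted average of the responses

# Made-up 1-D data: a noisy sine curve.
rng = np.random.default_rng(3)
X_train = rng.uniform(0, 2 * np.pi, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.2, size=200)
print(kernel_smoother(X_train, y_train, np.array([np.pi / 2]), width=0.3))
```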

2.4 Statistical Decision Theory

Let $X \in \mathbb{R}^p$ be a real-valued random input vector and $Y \in \mathbb{R}$ a real-valued random output variable, with underlying joint distribution $\Pr(X, Y)$. Suppose that our loss function is $L(Y, f(X)) = (Y - f(X))^2$, i.e. the squared error loss. Then we have a measure for a function $f$, the expected (squared) prediction error, which we can use as a criterion to select $f$:
$$\operatorname{EPE}(f) = E\big[(Y - f(X))^2\big] = \int \big(y - f(x)\big)^2 \Pr(dx, dy).$$

Then we can condition it on $X$:
$$\operatorname{EPE}(f) = E_X\, E_{Y \mid X}\big[(Y - f(X))^2 \mid X\big].$$

By minimizing it pointwise, we get the solution
$$f(x) = \arg\min_c E_{Y \mid X}\big[(Y - c)^2 \mid X = x\big] = E[Y \mid X = x].$$

This is also known as the regression function. It is the best prediction of $Y$ at any point $X = x$, given perfect knowledge of the joint distribution $\Pr(X, Y)$, w.r.t. the average squared error.

Note that if we change the loss function, the solution is different. For example, if we use the absolute loss $L(Y, f(X)) = |Y - f(X)|$, the solution will be the conditional median, $\hat{f}(x) = \operatorname{median}(Y \mid X = x)$, instead of the conditional mean. This is actually more robust than least squares, as it is much less sensitive to outliers than the least squares fit. However, the derivative of the absolute loss is not continuous at zero, which makes it harder to optimize and loses the closed-form solution.
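
A small numerical check of this point: on a made-up sample (standing in for the conditional distribution of $Y$ at a fixed $x$, with one outlier), the squared-loss minimizer is the mean, which is pulled toward the outlier, while the absolute-loss minimizer is the median, which is not.

```python
import numpy as np

# A made-up sample standing in for the conditional distribution of Y given X = x,
# including one outlier.
y = np.array([1.0, 1.2, 0.9, 1.1, 10.0])

candidates = np.linspace(0, 10, 2001)
sq_loss  = [np.mean((y - c) ** 2) for c in candidates]    # average squared loss at each c
abs_loss = [np.mean(np.abs(y - c)) for c in candidates]   # average absolute loss at each c

print("squared-loss minimizer ~", candidates[np.argmin(sq_loss)],  "| mean   =", y.mean())
print("absolute-loss minimizer ~", candidates[np.argmin(abs_loss)], "| median =", np.median(y))
```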

Nearest-Neighbor

The nearest-neighbor methods follow this thought directly, in an approximate way:
$$\hat{f}(x) = \operatorname{Ave}\big(y_i \mid x_i \in N_k(x)\big),$$
where $\operatorname{Ave}$ denotes the average and $N_k(x)$ is the neighborhood containing the $k$ training points nearest to $x$.

And the two approximations used here are

  • expectation is approximated by averaging over sample data;
  • conditioning at a point is relaxed to conditioning on some region “close” to the target point.

It is easy to see that as $N, k \to \infty$ with $k/N \to 0$, we have $\hat{f}(x) \to E[Y \mid X = x]$. However, as the dimension $p$ increases, the data samples we have will be scattered very sparsely in the input space, so the nearest neighborhood is no longer a good surrogate for conditioning at a point, i.e. the second approximation fails. The rate of convergence also decreases.
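
A short simulation of this sparsity effect: with the sample size fixed, the distance from a target point to its nearest neighbor grows quickly as the dimension $p$ increases, so the "neighborhood" stops being local. The uniform design, sample size, and dimensions below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 1000  # fixed sample size

for p in [1, 2, 5, 10, 20, 50]:
    # 100 repetitions of: draw N points uniformly on [-1, 1]^p and measure the
    # distance from the origin (the target point) to its nearest neighbor.
    X = rng.uniform(-1, 1, size=(100, N, p))
    nearest = np.linalg.norm(X, axis=2).min(axis=1)
    print(f"p = {p:2d}: median nearest-neighbor distance = {np.median(nearest):.3f}")
```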

Linear regression

Linear regression is a model-based approach. It means that we have the belief/assumption that the regression function is approximately linear in its arguments:
$$f(x) \approx x^T \beta.$$

If this assumption is satisfied, substituting the linear model into the EPE criterion gives
$$\operatorname{EPE}(\beta) = E\big[(Y - X^T \beta)^2\big].$$

Then by differentiating it w.r.t. $\beta$ and setting the gradient to zero, we can get the closed-form solution
$$\beta = \big(E[X X^T]\big)^{-1} E[X Y].$$
The least squares estimate $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ amounts to replacing these expectations by averages over the training data.

Thus actually both nearest neighbors and the linear model approximate conditional expectations by averages over the training data.

2.5 Local Methods in High Dimensions

Comments