Elements of Statistical Learning
Supervised Learning
- Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors (see the sketch after this list).
- In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
- Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
- Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
- Projection pursuit and neural network models consist of sums of non-linearly transformed linear models.
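A minimal numpy sketch of the first and third points (the kernel choice, bandwidth, and data are my own, not from the book): a k-NN average with 0/1 weights, a Nadaraya-Watson average with smoothly decaying Epanechnikov weights, and a locally weighted linear fit.

```python
import numpy as np

def epanechnikov(t):
    # Smooth weights that fall to zero at |t| = 1, unlike k-NN's 0/1 weights.
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def knn_average(x0, x, y, k=10):
    # k-nearest neighbors: average the responses of the k closest points.
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

def kernel_average(x0, x, y, bandwidth=0.2):
    # Nadaraya-Watson: weights decay smoothly to zero with distance from x0.
    w = epanechnikov((x - x0) / bandwidth)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, bandwidth=0.2):
    # Local regression: locally weighted least-squares line, evaluated at x0.
    sw = np.sqrt(epanechnikov((x - x0) / bandwidth))
    X = np.column_stack([np.ones_like(x), x - x0])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
print(knn_average(0.5, x, y), kernel_average(0.5, x, y), local_linear(0.5, x, y))
```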
K-nearest neighbors and least squares
Both k-nearest neighbors and least squares end up approximating conditional expectations by averages.
Differences: least squares is stable but biased; k-nearest neighbors is less stable but less biased.
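A small simulation sketch of this trade-off (the nonlinear truth and all parameter values are arbitrary choices): over repeated datasets, the least-squares prediction at a fixed point has low variance but noticeable bias, while the k-NN average has higher variance and lower bias.

```python
import numpy as np

# Repeatedly simulate data from a nonlinear truth and compare the two
# estimators of E[Y | X = x0] at a fixed point x0.
rng = np.random.default_rng(1)
x0, truth = 1.0, np.sin(4 * 1.0)
ls_preds, knn_preds = [], []
for _ in range(200):
    x = rng.uniform(0, 2, 100)
    y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
    # Least squares: global linear fit, then evaluate at x0.
    beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(100), x]), y, rcond=None)
    ls_preds.append(beta[0] + beta[1] * x0)
    # k-NN: average of the 15 responses nearest to x0.
    knn_preds.append(y[np.argsort(np.abs(x - x0))[:15]].mean())

for name, preds in [("least squares", ls_preds), ("15-NN", knn_preds)]:
    p = np.array(preds)
    print(f"{name}: bias={p.mean() - truth:+.3f}, std={p.std():.3f}")
```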
Linear Methods for Regression
Linear Methods for Classification
Model selection: find a subset of the variables that are sufficient for explaining their joint effects on the outcome variable.
- Method 1: drop the least significant coefficient and refit the model, repeating until no further terms can be dropped (sketched after this list).
- Method 2: refit each of the models with one variable removed, then perform an analysis of deviance to decide which variable to exclude.
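A rough sketch of Method 1 using statsmodels logistic regression (the synthetic data, the 0.05 cutoff, and the column names are illustrative, not from the book):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic data: only x1 and x2 actually influence the binary outcome.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["x1", "x2", "x3", "x4"])
p = 1 / (1 + np.exp(-(1.5 * X["x1"] - 1.0 * X["x2"])))
y = rng.binomial(1, p)

features = list(X.columns)
while True:
    model = sm.Logit(y, sm.add_constant(X[features])).fit(disp=0)
    pvals = model.pvalues.drop("const")    # never drop the intercept
    worst = pvals.idxmax()
    if pvals[worst] < 0.05:                # all remaining terms significant: stop
        break
    features.remove(worst)                 # drop the least significant term, refit

print("selected features:", features)
```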
Generally, logistic regression is a safer and more robust bet than the LDA model, relying on fewer assumptions.
Basis Expansions and Regularization
A kernel may be chosen because it represents an expansion in polynomial basis functions and allows high-dimensional inner products to be computed conveniently.
A kernel may also be chosen because of its functional form in the representation \(f(x) = \sum_{i=1}^N \alpha_i K(x,x_i)\).
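A small sketch of that representation via kernel ridge regression with a polynomial kernel (the kernel degree, ridge penalty, and data are my own choices); the fitted function has exactly the form \(f(x) = \sum_i \alpha_i K(x,x_i)\).

```python
import numpy as np

def poly_kernel(a, b, degree=3):
    # K(a, b) = (1 + <a, b>)^d computes an inner product in a high-dimensional
    # polynomial feature space without ever forming that space explicitly.
    return (1.0 + a @ b.T) ** degree

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 1))
y = X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.1, size=80)

lam = 0.1
K = poly_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # kernel ridge coefficients

def f(x_new):
    # Fitted function in the representer form f(x) = sum_i alpha_i K(x, x_i).
    return poly_kernel(np.atleast_2d(x_new), X) @ alpha

print(f(np.array([[0.5]])))
```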
Modern image compression is often performed using two-dimensional wavelet representations.
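To make this concrete, a one-level 2D Haar transform written directly in numpy (a toy sketch, not a real codec): compression comes from keeping the coarse sub-band and cheaply coding or discarding the small detail coefficients.

```python
import numpy as np

def haar2d_level(img):
    # One level of a 2D Haar transform: averages/differences along rows, then
    # along columns, giving four sub-bands (LL, LH, HL, HH).
    lo = (img[:, 0::2] + img[:, 1::2]) / 2.0
    hi = (img[:, 0::2] - img[:, 1::2]) / 2.0
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0
    return ll, lh, hl, hh

g = np.linspace(0, 1, 64)
img = np.outer(np.sin(np.pi * g), np.sin(np.pi * g))   # a smooth test "image"
ll, lh, hl, hh = haar2d_level(img)
# For a smooth image almost all of the energy sits in the LL sub-band; the
# detail sub-bands are near zero and can be thresholded or coded cheaply.
print([float(np.abs(b).mean()) for b in (ll, lh, hl, hh)])
```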
Kernel Smoothing Methods
Model Assessment and Selection
Model selection: estimating the performance of different models in order to choose the best one
Model assessment: having chosen a final model, estimating its prediction error on new data
Because it is the relative (rather than absolute) size of the error that matters when comparing models, in-sample error is convenient and often leads to effective model selection.
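A compact sketch of the selection-then-assessment workflow, assuming scikit-learn is available (the ridge candidates and split sizes are placeholders): select by cross-validation on the training data, then assess the chosen model on held-out test data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Model selection: compare candidate models by cross-validated error on the
# training data only, and pick the best one.
candidates = {alpha: Ridge(alpha=alpha) for alpha in [0.01, 1.0, 100.0]}
cv_errors = {a: -cross_val_score(m, X_train, y_train,
                                 scoring="neg_mean_squared_error", cv=5).mean()
             for a, m in candidates.items()}
best_alpha = min(cv_errors, key=cv_errors.get)

# Model assessment: refit the chosen model and estimate its prediction error
# on data it has never seen.
final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, mean_squared_error(y_test, final.predict(X_test)))
```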
Model Inference and Averaging
The EM algorithm is popular for simplifying difficult maximum likelihood problems.
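As a sketch of the idea, EM for a two-component Gaussian mixture in one dimension (the starting values and data are arbitrary): the E-step computes responsibilities, the M-step re-estimates the parameters by weighted maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
# Data drawn from two Gaussians; the component labels are hidden.
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.5, 100)])

# Arbitrary starting values for (pi, mu1, sigma1, mu2, sigma2).
pi, mu1, s1, mu2, s2 = 0.5, -1.0, 1.0, 1.0, 1.0

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(100):
    # E-step: responsibility of component 2 for each observation.
    p1 = (1 - pi) * normal_pdf(x, mu1, s1)
    p2 = pi * normal_pdf(x, mu2, s2)
    gamma = p2 / (p1 + p2)
    # M-step: weighted means, standard deviations, and mixing proportion.
    mu1 = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * x) / np.sum(gamma)
    s1 = np.sqrt(np.sum((1 - gamma) * (x - mu1) ** 2) / np.sum(1 - gamma))
    s2 = np.sqrt(np.sum(gamma * (x - mu2) ** 2) / np.sum(gamma))
    pi = gamma.mean()

print(mu1, s1, mu2, s2, pi)
```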
Additive Models, Trees, and Related Methods
A generalized additive model has the form
\[E(Y|X_1,X_2,\ldots,X_p) = \alpha + f_1(X_1)+f_2(X_2)+\cdots+f_p(X_p)\]
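A minimal backfitting sketch for such an additive model (the book uses smoothing splines; a crude running-mean smoother stands in here, and the data are made up): each \(f_j\) is repeatedly re-estimated by smoothing the partial residuals against \(X_j\).

```python
import numpy as np

def running_mean_smoother(x, r, window=21):
    # A crude stand-in for a smoothing spline: for each point, average the
    # partial residuals of its nearest neighbors along x.
    order = np.argsort(x)
    fitted = np.empty_like(r)
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - window // 2), min(len(x), rank + window // 2 + 1)
        fitted[i] = r[order[lo:hi]].mean()
    return fitted

rng = np.random.default_rng(0)
n, p = 300, 3
X = rng.uniform(-2, 2, size=(n, p))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=n)

alpha = y.mean()
f = np.zeros((n, p))
for _ in range(20):                                      # backfitting iterations
    for j in range(p):
        partial = y - alpha - f.sum(axis=1) + f[:, j]    # residual excluding f_j
        f[:, j] = running_mean_smoother(X[:, j], partial)
        f[:, j] -= f[:, j].mean()                        # center for identifiability

print(alpha, f.std(axis=0))
```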
Approaches to features that are missing completely at random:
- Discard observations with any missing values; this can be used if the relative amount of missing data is small.
- Rely on the learning algorithm to deal with missing values in its training phase; CART is one learning algorithm that deals effectively with missing values through surrogate splits.
- Impute all missing values before training; this is necessary for most learning methods. The simplest tactic is to impute the missing value with the mean or median of the non-missing values for that feature (see the sketch after this list).
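A tiny sketch of that simplest tactic (the data are illustrative): replace each missing entry with the median of the observed values in the same feature.

```python
import numpy as np

# A feature matrix with some entries missing completely at random (NaN).
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 5.0, 3.0]])

col_medians = np.nanmedian(X, axis=0)           # per-feature median, ignoring NaN
X_imputed = np.where(np.isnan(X), col_medians, X)
print(X_imputed)
```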
Boosting and Additive Trees
For data mining applications, it is not enough to simply produce predictions. It is also desirable to have information providing qualitative understanding of the relationship between joint values of the input variables and the resulting predicted response value.
Black box methods such as neural networks, which can be quite useful in purely predictive settings such as pattern recognition, are far less useful for data mining.
Neural Networks
Support Vector Machines and Flexible Discriminants
Prototype Methods and Nearest-Neighbors
Unsupervised Learning
The k-means clustering algorithm is a key tool in the apparently unrelated area of image and signal compression, particularly in vector quantization.
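A sketch of that connection (the block size and codebook size are arbitrary): vector quantization runs k-means on small image blocks and stores, for each block, only the index of its nearest codebook centroid.

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.linspace(0, 1, 64)
img = np.outer(g, g) + rng.normal(scale=0.05, size=(64, 64))   # toy grayscale image

# Cut the image into non-overlapping 4x4 blocks -> one 16-dim vector per block.
blocks = img.reshape(16, 4, 16, 4).transpose(0, 2, 1, 3).reshape(-1, 16)

# Plain Lloyd's (k-means) iterations: the k centroids form the codebook.
k = 8
centroids = blocks[rng.choice(len(blocks), k, replace=False)]
for _ in range(20):
    labels = np.argmin(((blocks[:, None, :] - centroids) ** 2).sum(-1), axis=1)
    centroids = np.array([blocks[labels == j].mean(axis=0) if np.any(labels == j)
                          else centroids[j] for j in range(k)])

# Vector quantization: store only the codebook plus one small index per block,
# then reconstruct every block from its assigned centroid.
reconstructed = centroids[labels].reshape(16, 16, 4, 4).transpose(0, 2, 1, 3).reshape(64, 64)
print("mean absolute reconstruction error:", float(np.abs(img - reconstructed).mean()))
```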