From Pigbert Wiki
Feature Extraction
To find k new dimensions that are combinations of the original d dimensions (k < d) with minimum loss of information.
Principal Components Analysis (PCA)
- PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
- Unsupervised learning. The criterion to maximize is the variance.
- PCA can be used for dimensionality reduction in a data set by retaining those characteristics of the data set that contribute most to its variance, by keeping lower-order principal components and ignoring higher-order ones. Such low-order components often contain the "most important" aspects of the data.
- The eigenvector of the data covariance matrix with the largest eigenvalue is the direction of highest variance. (It is possible to calculate the eigenvectors and eigenvalues directly from the data without explicitly computing the covariance matrix, e.g. via a singular value decomposition of the centered data matrix.)
- Uses the spectral (eigen) decomposition of the covariance matrix.
- Ways to cut dimensionality:
- Discard the eigenvectors whose eigenvalues are less than the average input variance.
- Plot the scree graph, which shows the variance explained as a function of the number of eigenvectors kept, and cut at the "elbow", the point past which adding another eigenvector explains little additional variance.
- Applications:
- Image and speech processing tasks where nearby inputs (in space or time) are highly correlated.
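The PCA steps listed above can be sketched as follows. This is a minimal illustration using NumPy, not a reference implementation: the function names `pca` and `eigvals_via_svd` and the synthetic data are assumptions made for the example. It centers the data, eigendecomposes the covariance matrix, applies the "discard eigenvalues below the average input variance" rule when no k is given, and returns the cumulative proportion of variance explained (the values a scree graph would plot).

```python
import numpy as np

def pca(X, k=None):
    """Project X (n samples x d features, d >= 2) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # d x d sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if k is None:
        # The average input variance equals the mean eigenvalue (trace / d);
        # keep only the components at or above it.
        k = int(np.sum(eigvals >= eigvals.mean()))
    explained = np.cumsum(eigvals) / eigvals.sum()   # scree-graph values
    return Xc @ eigvecs[:, :k], eigvals, explained

def eigvals_via_svd(X):
    """Same eigenvalues, computed without forming the covariance matrix."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    return s ** 2 / (len(X) - 1)

# Synthetic data (an assumption for illustration): three features with
# very different variances, so one direction dominates.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
Z, eigvals, explained = pca(X)
print(Z.shape, explained)
```

Passing an explicit `k` instead of relying on the average-variance rule corresponds to reading the cut-off from the scree graph by eye.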
Factor Analysis (FA)
Multidimensional Scaling (MDS)
Linear Discriminant Analysis (LDA)
Latent Semantic Analysis (LSA)
Feature Selection
To find k of the d dimensions that give the most information and discard the others.
Subset Selection Method
- Purpose: select the best (smallest and most informative) subset of the set of features.
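- Complexity: there are 2^d possible subsets of d variables, so exhaustive search is infeasible unless d is small.
- Applications:
- Not suited to tasks such as face recognition at the pixel level, where individual pixels carry little discriminative information on their own; it is the combination of many pixel values that matters.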
(sequential) forward selection
- start with no variables and add them one by one, at each step adding the one that decreases the error the most.
- uses a separate validation set to check the error.
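- stop if adding any feature does not decrease the error significantly.
- Complexity: O(d^2) model trainings in the worst case (d + (d-1) + ... + 1 = d(d+1)/2), compared with 2^d for exhaustive search.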
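The forward selection loop described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `error_fn` is a hypothetical callable that trains a model on the given feature indices and returns its error on the separate validation set, `min_gain` is an illustrative threshold for "significant", and `toy_error` is a made-up stand-in for a real model.

```python
def forward_selection(n_features, error_fn, min_gain=1e-3):
    """Greedy sequential forward selection over feature indices 0..n_features-1."""
    selected = []
    remaining = set(range(n_features))
    best_err = error_fn(tuple(selected))       # error with no features
    while remaining:
        # Evaluate adding each remaining feature; keep the lowest-error candidate.
        err, j = min((error_fn(tuple(selected + [j])), j) for j in sorted(remaining))
        if best_err - err < min_gain:          # no significant improvement: stop
            break
        selected.append(j)
        remaining.remove(j)
        best_err = err
    return selected, best_err

def toy_error(subset):
    """Made-up validation error: features 0 and 2 help, every feature costs a little."""
    return 1.0 - 0.4 * (0 in subset) - 0.3 * (2 in subset) + 0.01 * len(subset)

features, err = forward_selection(4, toy_error)
print(features, err)
```

Each pass of the loop evaluates every remaining feature once, which is where the quadratic number of model trainings comes from.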
(sequential) backward selection
- start with all variables and remove them one by one, at each step removing the one whose removal decreases the error the most (or increases it the least).
- uses a separate validation set to check the error.
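- stop when no removal decreases the error or, if a slight increase is tolerated in exchange for a smaller model, when every removal would increase the error significantly.
- Complexity: O(d^2) model trainings, but each training is more expensive than in forward selection because it starts from models with many features.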
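Backward selection mirrors the forward procedure. The sketch below makes the same assumptions as any such illustration: `error_fn` is a hypothetical callable returning the validation error for a tuple of feature indices, `max_loss` is an illustrative tolerance for how much the error may rise when a feature is dropped, and `toy_error` is a made-up stand-in for a real model.

```python
def backward_selection(n_features, error_fn, max_loss=0.0):
    """Greedy sequential backward selection over feature indices 0..n_features-1."""
    selected = list(range(n_features))
    best_err = error_fn(tuple(selected))       # error with all features
    while len(selected) > 1:
        # Evaluate dropping each feature; keep the candidate with the lowest error.
        err, j = min(
            (error_fn(tuple(f for f in selected if f != j)), j) for j in selected
        )
        if err - best_err > max_loss:          # removal hurts too much: stop
            break
        selected.remove(j)
        best_err = err
    return selected, best_err

def toy_error(subset):
    """Made-up validation error: features 0 and 2 help, every feature costs a little."""
    return 1.0 - 0.4 * (0 in subset) - 0.3 * (2 in subset) + 0.01 * len(subset)

features, err = backward_selection(4, toy_error)
print(features, err)
```

Setting `max_loss` above zero trades a small increase in error for a smaller feature set.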
floating search
- combines forward and backward steps; the number of features added and removed can vary at each step.
- Complexity: higher than sequential forward or backward selection.
Motivations
- Reduce computational and storage complexity.
- Reduce noise in the data.
- Improve interpretability.
- Make it easier for manual analysis and knowledge extraction.