Principal component analysis (PCA) is a standard technique for visualizing high-dimensional data and for data pre-processing. PCA reduces the dimensionality (the number of variables) of a data set while retaining as much of its variance as possible.
Illustrated are three-dimensional gene expression data which are mainly located
within a two-dimensional subspace. PCA is used to visualize these data by
reducing the dimensionality of the data:
The three original variables (genes) are reduced to
two new variables termed principal components (PCs).
Left: Using PCA, we can identify the two-dimensional plane that captures
the largest share of the variance of the data.
This two-dimensional subspace can then be rotated and presented as a two-dimensional
component space (right).
Such a two-dimensional visualization of the samples allows
us to draw qualitative conclusions about the separability
of experimental conditions (marked by different colors).
Principal component analysis (PCA) rotates the original data space such
that the axes of the new coordinate system point in the directions of highest variance of the data.
The axes or new variables are termed principal components (PCs) and are ordered by variance:
The first component, PC 1, represents the direction of the
highest variance of the data. The second component, PC 2, represents the direction of the highest
remaining variance among all directions orthogonal to the first component. This extends naturally
to any required number of components, which together span a component space covering the desired amount of variance.
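The rotation described above can be sketched with NumPy as an eigendecomposition of the covariance matrix. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `pca` and the synthetic data (three variables lying mostly in a two-dimensional plane, loosely mirroring the figure) are hypothetical and chosen only for this example.

```python
import numpy as np

def pca(X):
    """PCA of X (rows = samples, columns = variables).

    Returns the component directions (as columns), the variance along
    each component, and the data expressed in the rotated coordinates.
    """
    X_centered = X - X.mean(axis=0)            # PCA works on mean-centered data
    cov = np.cov(X_centered, rowvar=False)     # covariance between variables
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder: highest variance first
    variances = eigvals[order]
    components = eigvecs[:, order]             # column j is the direction of PC j+1
    scores = X_centered @ components           # samples in the rotated space
    return components, variances, scores

# Hypothetical data: 200 samples, 3 variables, mostly within a 2D plane.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * [3.0, 1.5]
plane = np.linalg.qr(rng.normal(size=(3, 3)))[0][:, :2]   # orthonormal basis of the plane
X = latent @ plane.T + 0.1 * rng.normal(size=(200, 3))    # small off-plane noise

components, variances, scores = pca(X)
print("variance per component:", variances)
```

In this setup the third variance should come out far smaller than the first two, which is what "mainly located within a two-dimensional subspace" means numerically. The components are mutually orthogonal, and the rotated data are uncorrelated along the new axes.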
Since components describe specific directions in the data space, each component depends to a varying degree on each of the original variables: each component is a linear combination of all original variables.
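To make the linear combination concrete, the short sketch below computes the PC 1 value of one sample twice: once as an explicit weighted sum of its (mean-centered) variables and once as a matrix product. The data and the names `w1`, `score_explicit` and `score_matrix` are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical data: 50 samples, 3 correlated variables.
rng = np.random.default_rng(1)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(50, 3)) @ mixing
Xc = X - X.mean(axis=0)                        # mean-center each variable

# Direction of PC 1: eigenvector with the largest eigenvalue
# (eigh sorts eigenvalues in ascending order, so it is the last column).
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
w1 = eigvecs[:, -1]                            # one weight per original variable

# PC 1 value of the first sample, written out as a weighted sum ...
sample = Xc[0]
score_explicit = w1[0] * sample[0] + w1[1] * sample[1] + w1[2] * sample[2]

# ... and the same value obtained by projecting all samples at once.
score_matrix = (Xc @ w1)[0]

print("weights of PC 1:", w1)
print(score_explicit, score_matrix)
```

The weights in `w1` show how strongly each original variable contributes to the component. The sign of an eigenvector is arbitrary, so only the relative signs and magnitudes of the weights carry meaning.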
Low variance can often be assumed to represent undesired
background noise.
Under this assumption, the dimensionality of the data can be reduced with little loss of relevant information by extracting a lower-dimensional component space covering the highest variance. Using a small number of principal components instead of the high-dimensional original data is a common pre-processing step that often improves results of subsequent analyses such as classification.
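As a rough sketch of this pre-processing step, the function below keeps the smallest number of components whose cumulative variance reaches a chosen fraction. The function name `reduce_dimensionality`, the 95% threshold, and the synthetic data are assumptions for illustration only; a suitable threshold depends on the data and the subsequent analysis.

```python
import numpy as np

def reduce_dimensionality(X, variance_to_keep=0.95):
    """Project X onto the fewest leading components that together
    cover at least `variance_to_keep` of the total variance."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the component directions,
    # squared singular values are proportional to the component variances.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)            # variance fraction per component
    cumulative = np.cumsum(explained)
    # First position where the cumulative fraction reaches the threshold.
    k = int(np.searchsorted(cumulative, variance_to_keep)) + 1
    k = min(k, len(s))                         # guard against round-off at 1.0
    return Xc @ Vt[:k].T, explained[:k]

# Hypothetical data: 10 observed variables driven by 3 underlying factors
# plus low-variance noise.
rng = np.random.default_rng(2)
factors = rng.normal(size=(300, 3)) * [5.0, 3.0, 2.0]
X = factors @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

X_reduced, explained = reduce_dimensionality(X, variance_to_keep=0.95)
print("components kept:", X_reduced.shape[1])
print("variance fraction per kept component:", explained)
```

The reduced matrix `X_reduced` can then be passed to a subsequent analysis in place of the original variables. Note that discarding low-variance directions only helps when those directions really are noise; this is the assumption stated in the text, not something the code can verify.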
For visualization, the first and second components can be
plotted against each other to obtain a two-dimensional representation of the data that
captures most of the variance (assumed to be most of the relevant information), which is useful for analyzing and interpreting the structure of a data set.
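A minimal sketch of preparing such a two-dimensional view is given below. It assumes hypothetical data with two experimental conditions and computes the PC 1 / PC 2 coordinates of every sample, together with the fraction of the total variance that the two-dimensional view retains. All names and numbers are illustrative.

```python
import numpy as np

# Hypothetical data: two conditions, 20 samples each, 5 variables.
rng = np.random.default_rng(3)
shift = np.array([4.0, -3.0, 2.0, 0.0, 0.0])   # assumed difference between conditions
group_a = rng.normal(size=(20, 5))
group_b = rng.normal(size=(20, 5)) + shift
X = np.vstack([group_a, group_b])
labels = np.array([0] * 20 + [1] * 20)         # condition of each sample

# Project the centered data onto the first two components.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
coords_2d = Xc @ Vt[:2].T                      # column 0 = PC 1, column 1 = PC 2

# Fraction of the total variance retained by the two-dimensional view.
fraction_2d = np.sum(s[:2] ** 2) / np.sum(s ** 2)

# Condition centroids in component space give a first impression of separability.
centroid_a = coords_2d[labels == 0].mean(axis=0)
centroid_b = coords_2d[labels == 1].mean(axis=0)

print("variance fraction shown in 2D:", fraction_2d)
print("centroids:", centroid_a, centroid_b)
```

The two columns of `coords_2d` can be handed to any plotting library as x and y coordinates, with `labels` determining the color of each point. Reporting `fraction_2d` alongside the plot indicates how much of the data's variance the picture leaves out.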