23rd October 2020 • Michael Ng
In Machine Learning, many algorithms face a problem called “The Curse of Dimensionality”, where the data has too many features. As the number of features grows, the volume of data needed to cover the feature space increases exponentially, and that much data may not always be available. One solution to this problem is to use dimensionality reduction techniques such as Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA).
PCA is an unsupervised learning method that groups variables which are correlated with each other while retaining as much of the dataset’s variance as possible. It does not require any of the variables to have class labels. A good resource can be found here.
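As a quick illustration, here is a minimal PCA sketch using scikit-learn; the toy data and the choice of keeping two components are assumptions made purely for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples with 5 correlated features (purely illustrative)
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 3 extra features built from the first 2

# Keep the 2 directions that retain the most variance; no class labels are needed
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of total variance kept by each component
```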
LDA, on the other hand, is a supervised learning method that instead focuses on maximizing the separability between classes. It does so by creating a new linear axis and projecting the data points onto that axis.
For instance, suppose we plotted the relationship between two variables, where each color represents a different class. If we were to project onto X1 or onto X2, i.e. collapse the points onto one axis and ignore the other, there would be an overlap between the two classes.
Figure 1. If we were to project onto either axis, there would be overlap
Intuitively, we can draw a line that separates the classes only by combining both X1 and X2. Hence both X1 and X2 must be included in our formula to separate the two classes well. Using LDA, we can create a new axis and project the data points onto it such that the two classes are separated from each other. The figure below shows the new line that LDA will create, and a small code sketch of the projection step follows after Figure 2.
Figure 2. The two classes are now separate
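To make the “project onto a new axis” step concrete, here is a tiny sketch with made-up points and an arbitrarily chosen axis w (both are assumptions for illustration): projecting a 2D point onto w is just a dot product.

```python
import numpy as np

# A handful of made-up 2D points from two classes (illustration only)
class_a = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 2.0]])
class_b = np.array([[4.0, 1.0], [5.0, 2.0], [5.0, 1.0]])

# A candidate axis that combines both X1 and X2 (normalized to unit length)
w = np.array([1.0, -1.0])
w = w / np.linalg.norm(w)

# Projecting = dot product with w; each 2D point collapses to a single number on the new axis
proj_a = class_a @ w
proj_b = class_b @ w

print(proj_a)  # projections of the first class
print(proj_b)  # projections of the second class, which land well away from the first class
```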
What is the mathematics behind the line? There are two metrics that are needed to ensure separability. The first is the distance between the mean vectors of each class. In other words, we want to maximize the distance between the ‘centers’ of each class (represented by the class means μ₁ and μ₂).
As the points in each class are scattered around the graph, there could be some outliers that overlap with the other class if we were solely focused on maximizing the distance between the ‘centers’. Ideally, we would want the distance (μ₁ − μ₂)² to be large and the scatter or ‘variance’ to be small (represented by the within-class scatters s₁² and s₂²).
Once we have both the class centers and their respective scatters, we can find the optimal axis w by maximizing the ratio of the squared distance between the projected means to the sum of the projected scatters, (μ₁ − μ₂)² / (s₁² + s₂²). This ratio is known as Fisher’s criterion, and maximizing it ensures separability.
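As a minimal sketch of that idea for the two-class, two-feature case (reusing the made-up points from the earlier projection sketch), the direction that maximizes Fisher’s criterion can be computed as w ∝ S_w⁻¹(μ₁ − μ₂), where S_w is the within-class scatter matrix:

```python
import numpy as np

# The same made-up classes as in the earlier projection sketch
class_a = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 2.0]])
class_b = np.array([[4.0, 1.0], [5.0, 2.0], [5.0, 1.0]])

# 'Centers' of each class
mu_a = class_a.mean(axis=0)
mu_b = class_b.mean(axis=0)

# Within-class scatter: how spread out each class is around its own center
S_a = (class_a - mu_a).T @ (class_a - mu_a)
S_b = (class_b - mu_b).T @ (class_b - mu_b)
S_w = S_a + S_b

# For two classes, the axis maximizing Fisher's criterion is proportional to S_w^-1 (mu_a - mu_b)
w = np.linalg.solve(S_w, mu_a - mu_b)
w = w / np.linalg.norm(w)

print(w)            # the new axis found by LDA
print(class_a @ w)  # projected class A
print(class_b @ w)  # projected class B, well separated from class A along the new axis
```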
There you have it! You can reduce the data from two dimensions down to one. For higher dimensions, the technique is similar but requires knowledge of matrices and eigenvectors. In fact, LDA can also be used to predict classes under certain assumptions, namely that all classes share the same covariance and that the inputs follow a Gaussian or Normal distribution.
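As a hedged sketch of that classification use (the Iris dataset and the train/test split are just convenient choices for illustration), scikit-learn’s LinearDiscriminantAnalysis can both reduce dimensions and predict classes:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Iris: 4 features, 3 classes (chosen here purely as a convenient example)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2)  # at most (n_classes - 1) components
lda.fit(X_train, y_train)

X_train_2d = lda.transform(X_train)  # dimensionality reduction: 4 features down to 2
print(X_train_2d.shape)
print(lda.score(X_test, y_test))     # LDA used directly as a classifier
```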
When do we use LDA instead of PCA? LDA is better suited for classification, where the output consists of different classes. The features that discriminate between the classes are kept, ensuring separability between them. Class labels are therefore essential for identifying such features, which makes LDA a useful tool for supervised learning problems. This contrasts with PCA, where the directions with the highest variance are kept instead, regardless of class.
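To make the contrast concrete, here is a small side-by-side sketch on the same labelled data (again using Iris purely as an example): PCA never sees the labels, while LDA uses them to decide which directions to keep.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, so the labels y are never seen
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, so the labels y drive which directions are kept
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # both reduce 4 features down to 2, but by different criteria
```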
If you would like to know more, here are some useful resources to check out!
LDA implementation in Python: https://sebastianraschka.com/Articles/2014_python_lda.html
Tutorial video on LDA: https://www.youtube.com/watch?v=azXCzI57Yfc