LDA is a classification and dimensionality reduction technique that can be interpreted from two perspectives. The first interpretation is probabilistic; the second, more procedural interpretation is due to Fisher. The probabilistic interpretation is useful for understanding the assumptions of LDA, while Fisher’s interpretation gives a better understanding of how LDA performs dimensionality reduction.

The probabilistic interpretation

Each class (k \in \{1, \ldots, K\}) is assigned a prior (\hat{\pi}_k) such that (\sum_{k=1}^{K} \hat{\pi}_k = 1). According to Bayes’ rule, the posterior probability is

[\rm{Pr}(G = k \mid X = x) = \frac{f_k(x) \pi_k}{\sum_{l=1}^K f_l(x) \pi_l}]

where (f_k(x)) is the density of (X) conditioned on (k). The maximum-a-posteriori (MAP) estimator simplifies to

[G(x) = \arg \max_k \rm{Pr}(G = k \mid X = x) = \arg \max_k f_k(x) \pi_k]

because the denominator is identical for all classes.

LDA assumes that the density is Gaussian:

[f_k(x) = |2 \pi \Sigma_k|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)]
where (\Sigma_k) is the covariance matrix for the samples from class (k) and (|\cdot|) denotes the determinant. LDA assumes that all classes have the same covariance matrix, i.e. (\Sigma_k = \Sigma\,, \forall k).

Plugging (f_k) into the MAP estimator, taking logarithms, and dropping all terms that are identical across classes, we arrive at the classification function:

[G(x) = \arg \max_k \delta_k(x)]

where

[\delta_k(x) = x^T \Sigma^{-1} \mu_k -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k ]

is the discriminant function for class (k). So, now that we have a classifier, how can we estimate its parameters?
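
To make this concrete, the discriminant functions can be evaluated with a few lines of R. The following is a minimal sketch that assumes estimates Sigma.hat (the common covariance matrix), mu.hat (a (K \times p) matrix of centroids), and pi.hat (the priors) are already available; how these estimates are obtained is described next. The variable names are only illustrative.

# minimal sketch: evaluate delta_k(x) for a single observation x (numeric vector)
# Sigma.hat (p x p), mu.hat (K x p) and pi.hat (length K) are assumed estimates
discriminant_scores <- function(x, Sigma.hat, mu.hat, pi.hat) {
  Sigma.inv <- solve(Sigma.hat)
  sapply(seq_len(nrow(mu.hat)), function(k) {
    mu.k <- mu.hat[k, ]
    drop(t(x) %*% Sigma.inv %*% mu.k) -
      0.5 * drop(t(mu.k) %*% Sigma.inv %*% mu.k) +
      log(pi.hat[k])
  })
}
# the MAP class is the one with the largest score, e.g.
# which.max(discriminant_scores(x, Sigma.hat, mu.hat, pi.hat))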

To estimate the common covariance matrix, we compute

[\hat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^K \sum_{g_i = k} (x_i - \hat{\mu}_k) (x_i - \hat{\mu}_k)^T\,.]

Note that the sum of squared deviations from the class means is divided by (N - K), the number of degrees of freedom, in order to obtain an unbiased estimator.

The means of the classes, which are also called centroids, are defined by

[\hat{\mu}_k = \frac{1}{N_k} \sum_{g_i = k} x_i\,.]

The priors (\hat{\pi}_k) are set to the relative frequencies of the classes:

[\hat{\pi}_k = \frac{N_k}{N}\,.]

With this, we have defined all parameters required for the classifier.
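
For illustration, these estimates could also be computed by hand. The following sketch assumes a numeric feature matrix X with (N) rows and a factor y of class labels (both names are placeholders); in practice, the lda function from the MASS package performs this estimation for us.

# illustrative estimation of the LDA parameters from a feature matrix X (N x p)
# and a factor y of class labels
estimate_lda_params <- function(X, y) {
  N <- nrow(X)
  K <- nlevels(y)
  pi.hat <- as.vector(table(y)) / N   # priors N_k / N, in the order of levels(y)
  # class centroids (K x p)
  mu.hat <- t(sapply(levels(y), function(k) colMeans(X[y == k, , drop = FALSE])))
  # pooled within-class covariance with N - K degrees of freedom
  Sigma.hat <- Reduce(`+`, lapply(levels(y), function(k) {
    Xc <- scale(X[y == k, , drop = FALSE], center = mu.hat[k, ], scale = FALSE)
    t(Xc) %*% Xc
  })) / (N - K)
  list(pi = pi.hat, mu = mu.hat, Sigma = Sigma.hat)
}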

The dimensionality reduction procedure of LDA involves the within-class variance, (W = \hat{\Sigma}), and the between-class variance, (B). The between-class variance indicates the deviation of centroids from the overall mean, (\hat{\mu} = \sum_{k=1}^K \hat{\pi}_k \hat{\mu}_k), and is defined as:

[B = \sum_{k=1}^K \hat{\pi}_k(\hat{\mu}_k - \hat{\mu}) (\hat{\mu}_k - \hat{\mu})^T\,.]

Finding the sequence of optimal subspaces for dimensionality reduction involves three steps:

  1. Compute the (K \times p) matrix (M) containing the centroids, (\mu_k), and determine the common covariance matrix (W).

  2. Compute (M^{\ast} = MW^{-\frac{1}{2}}) using the eigen-decomposition of W.

  3. Compute (B^{\ast}), the between-class covariance of the transformed centroids (M^{\ast}), and its eigen-decomposition (B^{\ast} = V^{\ast} D_B {V^{\ast}}^T). The columns (v^{\ast}_l) of (V^{\ast}) define the coordinates of the reduced subspace.

The (l)-th discriminant variable (one of the (K-1) new dimensions) is determined by (Z_l = v_l^T X) with (v_l = W^{-\frac{1}{2}} v^{\ast}_l).
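
These three steps can be sketched in R as follows, reusing the notation from above: W is the estimated common covariance (\hat{\Sigma}), M is the (K \times p) centroid matrix, and pi.hat is the numeric vector of priors. Again, the variable names are placeholders for illustration.

# sketch of the three-step procedure for obtaining the discriminant directions
lda_directions <- function(M, W, pi.hat) {
  # step 2: whiten the centroids using the eigen-decomposition of W
  eW <- eigen(W)
  W.inv.sqrt <- eW$vectors %*% diag(1 / sqrt(eW$values)) %*% t(eW$vectors)
  M.star <- M %*% W.inv.sqrt
  # step 3: between-class covariance of the whitened centroids and its eigenvectors
  mu.star <- colSums(pi.hat * M.star)
  B.star <- Reduce(`+`, lapply(seq_len(nrow(M.star)), function(k) {
    d <- M.star[k, ] - mu.star
    pi.hat[k] * tcrossprod(d)
  }))
  V.star <- eigen(B.star)$vectors
  V <- W.inv.sqrt %*% V.star            # discriminant directions v_l = W^{-1/2} v*_l
  V[, 1:(nrow(M) - 1), drop = FALSE]    # only the first K - 1 directions are informative
}
# observations are projected onto the discriminant variables via X %*% lda_directions(M, W, pi.hat)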

Fisher’s interpretation

Fisher’s LDA optimization criterion states that the centroids of the groups should be spread out as far as possible. This amounts to finding a linear combination (Z = a^T X) such that (a) maximizes the between-class variance relative to the within-class variance.

As before, the within-class variance (W) is the pooled covariance matrix, (\hat{\Sigma}), which indicates the deviation of all observations from their class centroids. The between-class variance is defined by the deviation of the centroids from the overall mean, as introduced earlier. For (Z), the between-class variance is (a^T B a) and the within-class variance is (a^T W a). Thus, LDA can be optimized through the Rayleigh quotient

[\max_a \frac{a^T B a}{a^T W a}\,, ]

which defines an optimal mapping of (X) to the new variable (Z). Note that every observation is mapped to a single scalar, so the first discriminant coordinate is one-dimensional. To obtain additional dimensions, we solve the optimization problem for directions (a_1, \ldots, a_{K-1}), where each successive (a_l) is constructed to be orthogonal in (W) to the previous discriminant coordinates. Collecting the directions in a matrix (G = (a_1, \ldots, a_{K-1}) \in \mathbb{R}^{p \times (K-1)}), we can map the data from (p) to (K-1) dimensions via (XG). Why do we consider at most (K-1) projections? This is because the affine subspace spanned by the (K) centroids has a rank of at most (K-1).
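
Maximizing the Rayleigh quotient is a generalized eigenvalue problem: the stationary points satisfy (B a = \lambda W a), so the discriminant directions are the leading eigenvectors of (W^{-1} B). A minimal sketch, assuming (W) and (B) have been estimated as described above:

# sketch: the directions a_1, ..., a_{K-1} are the leading eigenvectors of W^{-1} B
fisher_directions <- function(W, B, K) {
  ev <- eigen(solve(W) %*% B)
  # eigenvalues beyond the first K - 1 are numerically zero; drop tiny imaginary parts
  Re(ev$vectors[, 1:(K - 1), drop = FALSE])
}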

Using Fisher’s formulation of LDA, classification involves two steps:

  1. Sphere the data using the common covariance matrix (\hat{\Sigma} = UDU^T) (eigendecomposition) such that (X^{\ast} = D^{-\frac{1}{2}} U^T X). In this way, the covariance of (X^{\ast}) becomes the identity matrix. By eliminating the covariance between the variables in this way, it becomes much easier to separate the classes in the transformed space.

  2. Classify observations (x_i) to the closest class centroid in the transformed space, taking into account the class priors (\pi_k). Here, the intuition is that an observation with equal distance to two centroids would be assigned to the class with the greater prior.
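
For illustration, these two steps could be sketched as follows, assuming the estimates Sigma.hat, mu.hat, and pi.hat from above (this is a sketch of the idea, not the implementation used by MASS::lda):

# sketch of Fisher-style classification: sphere the data, then assign each
# observation to the closest centroid, adjusted by the log priors;
# mu.hat is assumed to carry the class labels as row names
classify_sphered <- function(X, Sigma.hat, mu.hat, pi.hat) {
  e <- eigen(Sigma.hat)                              # Sigma.hat = U D U^T
  sphere <- e$vectors %*% diag(1 / sqrt(e$values))   # maps x to D^{-1/2} U^T x
  X.star <- as.matrix(X) %*% sphere
  M.star <- mu.hat %*% sphere
  # squared distance to each centroid, penalized by -2 log(pi_k)
  scores <- sapply(seq_len(nrow(M.star)), function(k) {
    rowSums(sweep(X.star, 2, M.star[k, ])^2) - 2 * log(pi.hat[k])
  })
  rownames(mu.hat)[apply(scores, 1, which.min)]
}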

Reduced-rank LDA

LDA performs classification in a reduced subspace. When performing classification, we do not need to use all (K-1) dimensions and can instead choose a smaller subspace (H_l) with (l < K-1). When (l < K-1) is used, this is called reduced-rank LDA. The motivation for reduced-rank LDA is that classification based on a reduced number of discriminant variables can improve performance on the test set when the model is overfitted.

Complexity of the LDA model

The number of effective parameters of LDA can be derived in the following way. There are (K) means, (\hat{\mu}_k), that are estimated. The covariance matrix does not require additional parameters because it is already defined by the centroids. Since we need to estimate (K) discriminant functions (to obtain the decision boundaries), this gives rise to (K) calculations involving the (p) feature dimensions. Additionally, we have (K-1) free parameters for the (K) priors. Thus, the number of effective LDA parameters is (Kp + (K-1)).
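
For example, for the phoneme data set used below, with (K = 5) classes and (p = 256) features, this amounts to (5 \cdot 256 + (5 - 1) = 1284) effective parameters.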

Summary of LDA

Here, I summarize the two perspectives on LDA and list the main properties of the model.

Probabilistic view

LDA uses Bayes’ rule to determine the posterior probability that an observation (x) belongs to class (k). Due to the normality assumption of LDA, the posterior is defined by a multivariate Gaussian whose covariance matrix is assumed to be identical for all classes. New points are classified by computing the discriminant function (\delta_k) (the numerator of the posterior probability) and returning the class (k) with maximal (\delta_k). The discriminant variables can be obtained through eigen-decompositions of the within-class and between-class variance.

Fisher’s view

According to Fisher, LDA can be understood as a dimensionality reduction technique where each successive transformation is orthogonal and maximizes the between-class variance relative to the within-class variance. This procedure transforms the feature space to an affine space with (K-1) dimensions. After sphering the input data, new points can be classified by determining the closest centroid in the affine space under consideration of the class priors.

Properties of LDA

LDA has the following properties:

  • LDA assumes that the data are Gaussian. More specifically, it assumes that all classes share the same covariance matrix.

  • LDA finds linear decision boundaries in a (K-1) dimensional subspace. As such, it is not suited if there are higher-order interactions between the independent variables.

  • LDA is well-suited for multi-class problems but should be used with care when the class distribution is imbalanced because the priors are estimated from the observed counts. Thus, observations will rarely be classified to infrequent classes.

  • Similar to PCA, LDA can be used as a dimensionality reduction technique. Note that the transformation of LDA is inherently different from that of PCA because LDA is a supervised method that takes the outcomes into account.

The phoneme data set

To exemplify linear discriminant analysis, we will use the phoneme speech recognition data set. This data set is useful for showcasing discriminant analysis because it involves five different outcomes.

library(RCurl)
f <- getURL('https://www.datascienceblog.net/data-sets/phoneme.csv')
df <- read.csv(textConnection(f), header=T)
print(dim(df))
## [1] 4509 259

The data set contains samples of digitized speech for five phonemes: aa (as the vowel in dark), ao (as the first vowel in water), dcl (as in dark), iy (as the vowel in she), and sh (as in she). In total, 4509 speech frames of 32 msec were selected. For each speech frame, a log-periodogram of length 256 was computed, on whose basis we want to perform speech recognition. The 256 columns labeled x.1 to x.256 identify the speech features, while the columns g and speaker indicate the phonemes (labels) and speakers, respectively.

For evaluating models later, we will assign each sample either into the training or the test set:

# logical vector: TRUE if the entry belongs to the training set, FALSE otherwise
train <- grepl("^train", df$speaker)
# remove non-feature columns
to.exclude <- c("row.names", "speaker", "g")
feature.df <- df[, !colnames(df) %in% to.exclude]
test.set <- subset(feature.df, !train)
train.set <- subset(feature.df, train)
train.responses <- subset(df, train)$g
test.responses <- subset(df, !train)$g

Fitting an LDA model in R

We can fit an LDA model in the following way:

library(MASS)
lda.model <- lda(train.set, grouping = train.responses)

Let us take a moment to investigate the relevant components of the model:

  • prior contains the group priors (\pi_k) and counts contains the corresponding group counts, (N_k).

  • means is the centroid matrix (M \in \mathbb{R}^{K \times p}) whose components are the mean vectors, (\mu_k).

  • scaling is the (p \times (K-1)) matrix that transforms samples to the space defined by the (K-1) discriminant variables.

  • svd are the singular values, which indicate the ratio of between- and within-group standard deviations on the linear discriminant variables.
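
For instance, we can quickly inspect these components of the fitted model (the exact values depend on the data):

# quick inspection of the model components
lda.model$prior          # estimated priors pi_k
lda.model$counts         # group counts N_k
dim(lda.model$means)     # K x p matrix of centroids
dim(lda.model$scaling)   # p x (K - 1) transformation matrix
lda.model$svd            # singular values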

LDA as a visualization technique

We can transform the training data to the canonical coordinates by applying the transformation matrix on the scaled data. To obtain the same results as returned by the predict.lda function, we need to center the data around the weighted means first:

# 1. Manual transformation
# center data around weighted means & transform
means <- colSums(lda.model$prior * lda.model$means)
train.mod <- scale(train.set, center = means, scale = FALSE) %*% 
 lda.model$scaling
# 2. Use the predict function to transform:
lda.prediction.train <- predict(lda.model, train.set)
all.equal(lda.prediction.train$x, train.mod)
## [1] TRUE

We can use the first two discriminant variables to visualize the data:

# visualize the features in the two LDA dimensions
plot.df <- data.frame(train.mod, "Outcome" = train.responses)
library(ggplot2)
ggplot(plot.df, aes(x = LD1, y = LD2, color = Outcome)) + geom_point()

Plotting the data in the two LDA dimensions reveals three clusters:

  • Cluster 1 (left) consists of aa and ao phonemes

  • Cluster 2 (bottom right) consists of dcl and iy phonemes

  • Cluster 3 (top right) consists of sh phonemes

This indicates that two dimensions are not sufficient for differentiating all 5 classes. However, the clustering indicates that phonemes that are sufficiently different from one another can be differentiated very well.

We can also plot the mapping of training data onto all pairs of discriminant variables using the plot.lda function where the dimen parameter can be used to specify the number of considered dimensions:

library(RColorBrewer)
colors <- brewer.pal(8, "Accent")
my.cols <- colors[match(lda.prediction.train$class, levels(df$g))]
plot(lda.model, dimen = 4, col = my.cols)

Plotting the training data for all dimension pairs demonstrates that, by construction, the leading discriminant variables capture most of the variation between the classes. Using the plot, we can obtain an intuition about the number of dimensions we should select for reduced-rank LDA. Remember that LD1 and LD2 confused aa with ao and dcl with iy. Thus, we need additional dimensions for differentiating these groups. Looking at the plots, it seems that we indeed need all four dimensions because dcl and iy are only well-separated in LD1 vs LD3, while aa and ao are only well-separated when LD4 is combined with any of the other dimensions.

To visualize the centroids of the groups, we can create a custom plot:

plot_lda_centroids <- function(lda.model, train.set, response) {
  library(RColorBrewer)
  # centroids and training observations in the space of the discriminant variables
  centroids <- predict(lda.model, lda.model$means)$x
  pred <- predict(lda.model, train.set)
  my.points <- pred$x
  colors <- brewer.pal(8, "Accent")
  my.cols <- colors[match(pred$class, levels(response))]
  no.classes <- length(levels(response))
  par(mfrow = c(no.classes - 1, no.classes - 1), mar = c(1, 1, 1, 1), oma = c(1, 1, 1, 10))
  for (i in 1:(no.classes - 1)) {
    for (j in 1:(no.classes - 1)) {
      y <- my.points[, i]
      x <- my.points[, j]
      cen <- cbind(centroids[, j], centroids[, i])
      if (i == j) {
        # diagonal panels: only label the discriminant variable
        plot(x, y, type = "n")
        max.both <- max(c(max(x), max(y)))
        min.both <- min(c(min(x), min(y)))
        center <- min.both + ((max.both - min.both) / 2)
        text(center, center, colnames(my.points)[i], cex = 3)
      } else {
        # off-diagonal panels: observations (letters) and centroids (filled circles)
        plot(x, y, col = my.cols, pch = as.character(response), xlab = "", ylab = "")
        points(cen[, 1], cen[, 2], pch = 21, col = "black", bg = colors, cex = 3)
      }
    }
  }
  # legend in the outer right margin
  par(xpd = NA)
  legend(x = par("usr")[2] + 1, y = mean(par("usr")[3:4]) + 20,
         legend = rownames(centroids), col = colors,
         pch = rep(20, length(colors)), cex = 3)
}
plot_lda_centroids(lda.model, train.set, train.responses)

Interpreting the posterior probabilities

Besides the transformation of the data to the discriminant variables, which is provided by the component x, the predict function also gives the posterior probabilities, which can be used for further interpretation of the classifier. For example:

posteriors <- lda.prediction.train$posterior # N x K matrix
# MAP classification for sample 1:
pred.class <- names(which.max(posteriors[1,])) # <=> lda.prediction.train$class[1]
print(paste0("Posterior of predicted class '", pred.class, 
 "' is: ", round(posteriors[1,pred.class], 2)))
## [1] "Posterior of predicted class 'sh' is: 1"
# what are the mean posteriors for the individual groups?
res <- do.call(rbind, lapply(levels(train.responses), function(x) {
  apply(posteriors[train.responses == x, ], 2, mean)
}))
rownames(res) <- levels(train.responses)
print(round(res, 3))
##        aa    ao   dcl    iy    sh
## aa  0.797 0.203 0.000 0.000 0.000
## ao  0.123 0.877 0.000 0.000 0.000
## dcl 0.000 0.000 0.985 0.014 0.002
## iy  0.000 0.000 0.001 0.999 0.000
## sh  0.000 0.000 0.000 0.000 1.000

The table of posteriors for individual classes demonstrates that the model is most uncertain about the phonemes aa and ao, which is in agreement with our expectations from the visualizations.

LDA as a classifier

As mentioned before, LDA has the benefit that we can select the number of canonical variables that are used for classification. Here, we will showcase reduced-rank LDA by performing classification with one to four canonical variables.

dims <- 1:4 # number of canonical variables to use
accuracies <- rep(NA, length(dims))
for (i in seq_along(dims)) {
 lda.pred <- predict(lda.model, test.set, dimen = dims[i])
 acc <- length(which(lda.pred$class == test.responses))/length(test.responses)
 accuracies[i] <- acc
}
reduced.df <- data.frame("Rank" = dims, "Accuracy" = round(accuracies, 2))
print(reduced.df)
##   Rank Accuracy
## 1    1     0.51
## 2    2     0.71
## 3    3     0.86
## 4    4     0.92

As expected from the visual exploration of the transformed space, the test accuracy increases with each additional dimension. Since LDA with four dimensions obtains the maximal accuracy, we would decide to use all of the discriminant coordinates for classification.

To interpret the model, we can visualize the performance of the full-rank classifier:

lda.pred <- predict(lda.model, test.set)
plot.df <- data.frame(lda.pred$x[,1:2], "Outcome" = test.responses, 
 "Prediction" = lda.pred$class)
ggplot(plot.df, aes(x = LD1, y = LD2, color = Outcome, shape = Prediction)) +
 geom_point()

In the plot, the expected phonemes are shown in different colors, while the model predictions are shown by different symbols. A model with 100% accuracy would assign a single symbol to each color. Thus, incorrect predictions are revealed when a single color exhibits different symbols. Using the plot, we quickly see that most confusions occur when observations labeled as aa are incorrectly classified as ao, and vice versa.