MULTI-VARIATE ANALYSIS

III. CLUSTERING

Multi-Variate analysis has very wide application in unsupervised learning, and clustering is where multi-variate understanding and visualization are used the most. Many times we prefer to perform clustering before applying regression algorithms, so that we get more accurate predictions within each cluster.

We will do k-means clustering for our case study, using the following steps:

1. Separating the columns to be analyzed

Let’s get a sample dataset comprising all the items whose expenditure is to be analyzed, i.e. all columns except Channel and Region: Fresh, Milk, Grocery, Frozen, etc.
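
The data object itself is loaded in the earlier posts of this series; a minimal sketch of the load step, assuming the wholesale customers CSV (the file name here is hypothetical):

# assumed load step, covered earlier in the series; the file name is hypothetical
data <- read.csv("Wholesale_customers_data.csv", header=TRUE)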

names(data)

## [1] "Channel" "Region" "Fresh" 
## [4] "Milk" "Grocery" "Frozen" 
## [7] "Detergents_Paper" "Delicassen"

sample <- data[,3:8]
names(sample)

## [1] "Fresh" "Milk" "Grocery" 
## [4] "Frozen" "Detergents_Paper" "Delicassen"

2. Scaling the data to get all the columns onto the same scale, which is done by calculating z-scores:

sample_scale <- scale(sample, center=TRUE, scale=TRUE)  # z-score each column
sample <- cbind(sample, sample_scale)                   # append the scaled columns (7:12) to the raw data
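
A quick sanity check one can run here (not part of the original steps): each scaled column should now have mean 0 and standard deviation 1.

round(apply(sample_scale, 2, mean), 10)  # means should all print as 0
apply(sample_scale, 2, sd)               # standard deviations should all be 1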

3. Identifying the appropriate number of clusters for k-means clustering

library(NbClust)
noculs <- NbClust(sample_scale, distance = "euclidean", 
                  min.nc = 2, max.nc = 12, method = "kmeans") 
![](http://dimensionless.in/wp-content/uploads/2017/03/download-1-300x214.png)


## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a 
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot. 
## 
![](http://dimensionless.in/wp-content/uploads/2017/03/download-300x214.png)


## *** : The D index is a graphical method of determining the number of clusters. 
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure. 
## 
## ******************************************************************* 
## * Among all indices: 
## * 6 proposed 2 as the best number of clusters 
## * 4 proposed 3 as the best number of clusters 
## * 3 proposed 4 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 4 proposed 10 as the best number of clusters 
## * 3 proposed 12 as the best number of clusters 
## ***** Conclusion ***** 
## * According to the majority rule, the best number of clusters is 2 
## 
## 
## *******************************************************************

table(noculs$Best.nc[1,])

## 
## 0 2 3 4 5 7 10 12 
## 2 6 4 3 3 1 4 3

barplot(table(noculs$Best.nc[1,]), xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen")
![](http://dimensionless.in/wp-content/uploads/2017/03/download-2-1-300x214.png)


Though 2 and 3 clusters receive the most votes from the criteria, in this case study we are dividing the data into 10 clusters to get more specific results, visualizations and targeting strategies.

We can also use the within-sum-of-squares (wss) method to find the number of clusters.


4. Finding the most suitable number of clusters through the wss method

wss <- 1:15
for (i in 1:15)
{
  # columns 7:12 are the scaled columns appended in step 2
  wss[i] <- kmeans(sample[,7:12], i)$tot.withinss
}
wss
## [1] 2634.0000 1949.3479 1638.5026 1353.8722 1240.8142 1125.9104 862.4485
## [8] 773.6659 689.2003 597.5906 564.9160 526.6801 482.9415 481.6230
## [15] 468.0860

5. Plotting wss using the ggplot2 Library

We will plot the within-sum-of-squares values against the number of clusters using the ggplot2 library:

number <- 1:15
library(ggplot2)
dat <- data.frame(wss, number)
dat 
## wss number
## 1 2634.0000 1
## 2 1949.3479 2
## 3 1638.5026 3
## 4 1353.8722 4
## 5 1240.8142 5
## 6 1125.9104 6
## 7 862.4485 7
## 8 773.6659 8
## 9 689.2003 9
## 10 597.5906 10
## 11 564.9160 11
## 12 526.6801 12
## 13 482.9415 13
## 14 481.6230 14
## 15 468.0860 15

p <- ggplot(dat, aes(x=number, y=wss))
# color is a fixed aesthetic, so it belongs in the geom, not in ggplot()
p + geom_point(color="red") + scale_x_continuous(breaks=seq(1,20,1)) + scale_y_continuous(breaks=seq(500,3000,500))
![](http://dimensionless.in/wp-content/uploads/2017/03/download-3-300x214.png)


We notice that beyond 10 clusters the wss values barely decrease. So we can choose 10 clusters.
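
6. Fitting the k-means Model

The kmeans call that produces the fit.km object used below is not shown in the listing; a minimal sketch of the presumed fit, assuming 10 centers on the scaled columns (the seed and the nstart value are assumptions):

set.seed(1234)  # assumed seed; k-means initialisation is random
fit.km <- kmeans(sample[,7:12], centers=10, nstart=25)  # nstart=25 (an assumption) re-runs from several random starts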

7. Checking the Attributes of the k-means Object

We will check the centers and sizes of the clusters:

attributes(fit.km)

## $names
## [1] "cluster" "centers" "totss" "withinss" 
## [5] "tot.withinss" "betweenss" "size" "iter" 
## [9] "ifault" 
## 
## $class
## [1] "kmeans"

fit.km$centers

## Fresh Milk Grocery Frozen Detergents_Paper
## 1 0.7918828 0.5610464 -0.01128859 9.24203651 -0.4635194
## 2 0.7108765 -0.2589259 -0.27203848 -0.20929524 -0.3530279
## 3 1.9645810 5.1696185 1.28575327 6.89275382 -0.5542311
## 4 1.0755395 5.1033075 5.63190631 -0.08979632 5.6823687
## 5 -0.4634131 -0.4587350 -0.52211820 -0.28548405 -0.4501496
## 6 -0.5526133 0.4194314 0.51079792 -0.30950259 0.4915742
## 7 0.2370851 -0.2938321 -0.42497950 1.43457516 -0.4926072
## 8 3.0061448 1.6650889 0.98706324 1.09668928 0.1840989
## 9 2.7898027 -0.3572603 -0.37008492 0.46561695 -0.4540498
## 10 -0.5039508 1.4484898 1.97086631 -0.27885432 2.2070700
## Delicassen
## 1 0.932103121
## 2 0.051489721
## 3 16.459711293
## 4 0.419817401
## 5 -0.269307346
## 6 -0.010691096
## 7 -0.019489863
## 8 4.237120807
## 9 -0.008907628
## 10 0.207301087

fit.km$size

## [1] 2 78 1 5 167 93 43 4 20 27

8. Population-Wise Summaries

options(scipen=999)  # avoid scientific notation in the printed summaries
popln_mean <- apply(sample[,1:6], 2, mean)  # population mean of each raw column
popln_sd <- apply(sample[,1:6], 2, sd)      # population standard deviation of each raw column
popln_mean
## Fresh Milk Grocery Frozen 
## 12000.298 5796.266 7951.277 3071.932 
## Detergents_Paper Delicassen 
## 2881.493 1524.870

popln_sd

## Fresh Milk Grocery Frozen 
## 12647.329 7380.377 9503.163 4854.673 
## Detergents_Paper Delicassen 
## 4767.854 2820.106

9. Z-Value Normalisation

z-score = (cluster_mean - population_mean) / population_sd
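
The cmeans data frame used below holds the cluster id in column 1 and the per-cluster means of the six raw spend columns; its construction is not shown, so here is a minimal sketch, assuming an aggregate over the fitted cluster labels:

# assumed construction of cmeans: raw per-cluster means, cluster id in column 1
cmeans <- aggregate(sample[,1:6], by=list(cluster=fit.km$cluster), FUN=mean)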

list <- names(cmeans)
# z-score each of the six spend columns against the population;
# column 1 of cmeans is the cluster id, so the data columns start at 2
for(i in 1:(length(list)-1))
{
  y <- (cmeans[,i+1] - popln_mean[i]) / popln_sd[i]
  cmeans <- cbind(cmeans, y)
  names(cmeans)[i+length(list)] <- paste("z", list[i+1], sep="_")
}
cmeans[8:length(names(cmeans))]
## z_Fresh z_Milk z_Grocery z_Frozen z_Detergents_Paper
## 1 0.7918828 0.5610464 -0.01128859 9.24203651 -0.4635194
## 2 0.7108765 -0.2589259 -0.27203848 -0.20929524 -0.3530279
## 3 1.9645810 5.1696185 1.28575327 6.89275382 -0.5542311
## 4 1.0755395 5.1033075 5.63190631 -0.08979632 5.6823687
## 5 -0.4634131 -0.4587350 -0.52211820 -0.28548405 -0.4501496
## 6 -0.5526133 0.4194314 0.51079792 -0.30950259 0.4915742
## 7 0.2370851 -0.2938321 -0.42497950 1.43457516 -0.4926072
## 8 3.0061448 1.6650889 0.98706324 1.09668928 0.1840989
## 9 2.7898027 -0.3572603 -0.37008492 0.46561695 -0.4540498
## 10 -0.5039508 1.4484898 1.97086631 -0.27885432 2.2070700
## z_Delicassen
## 1 0.932103121
## 2 0.051489721
## 3 16.459711293
## 4 0.419817401
## 5 -0.269307346
## 6 -0.010691096
## 7 -0.019489863
## 8 4.237120807
## 9 -0.008907628
## 10 0.207301087

Wherever we have a very high z-score, it indicates that the cluster is different from the population:

* Very high z-score for fresh in clusters 8 and 9
* Very high z-score for milk in clusters 3 and 4
* Very high z-score for grocery in clusters 4 and 10
* Very high z-score for frozen products in clusters 1 and 3
* Very high z-score for detergents paper in clusters 4 and 10
* Very high z-score for delicassen in clusters 3 and 8

We would now like to find out why these clusters are so different from the population.
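
As a starting point (a sketch going one step beyond this post), we can pull out the raw rows of an extreme cluster and compare them against the population means, e.g. cluster 3, which has the extreme z-score for delicassen:

# hypothetical follow-up: inspect the raw spends of the members of cluster 3
sample[fit.km$cluster == 3, 1:6]
popln_mean  # population means, for comparison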