III. CLUSTERING
Multi-variate analysis is widely used in unsupervised learning, and clustering is one of its most common applications for understanding and visualizing several variables at once. We often perform clustering before applying regression algorithms, so that we get more accurate predictions for each cluster.
We will do k-means clustering for our case study, using the following steps:
1. Separating the columns to be analyzed
Let's take a sample of the data comprising all the items whose expenditure is to be analyzed, i.e., all columns except Channel and Region: fresh, milk, grocery, frozen, etc.
2. Scaling the data to bring all the columns onto the same scale. This is done by calculating the z-score of each value:
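As a minimal sketch of this scaling step, written in Python rather than the R used in the case study: the `zscore_columns` helper and the spending figures below are illustrative assumptions, not the case-study data. It uses the sample standard deviation (dividing by n - 1), which is also the convention of R's `scale()`.

```python
from statistics import mean, stdev

def zscore_columns(rows):
    """Scale each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))               # transpose rows into columns
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]    # sample SD (n - 1 divisor)
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical spending figures, one row per customer: (fresh, milk, grocery)
sample = [
    [12000.0, 9000.0, 7000.0],
    [7000.0, 9800.0, 9500.0],
    [6300.0, 8800.0, 7600.0],
    [13000.0, 1200.0, 4200.0],
]
scaled = zscore_columns(sample)
```

After scaling, every column has mean 0 and standard deviation 1, so no single item dominates the distance calculations in k-means just because its values are larger.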
3. Identifying the appropriate number of clusters for k-means clustering
Although 2 or 3 clusters show the maximum variance, in this case study we are dividing the data into 10 clusters to get more specific results, visualizations and target strategies.
We can also use the within-sum-of-squares (wss) method to find the number of clusters.
4. Finding the most suitable number of clusters through the wss method
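The wss idea can be sketched as follows, in Python rather than the case study's R. This is a plain Lloyd's-algorithm k-means under stated assumptions: the helper names (`sq_dist`, `kmeans`, `wss`, `best_wss`), the toy points and the number of restarts are all hypothetical. In R, the corresponding quantity is the `tot.withinss` component of the object returned by `kmeans()`.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # k distinct data points as starting centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                      # an empty cluster keeps its old center
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, labels

def wss(points, centers, labels):
    """Total within-cluster sum of squares."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

def best_wss(points, k, restarts=10):
    """Lowest wss over several random starts, to reduce the effect of bad starts."""
    return min(wss(points, *kmeans(points, k, seed=s)) for s in range(restarts))

# Toy 2-D data with three visible groups (illustrative only)
points = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
    [20.0, 0.0], [20.0, 1.0], [21.0, 0.0],
]
curve = {k: best_wss(points, k) for k in range(1, 6)}
for k, value in curve.items():
    print(k, round(value, 2))
```

The number of clusters is then read off the curve at the point where adding another cluster no longer reduces wss by much.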
5. Plot wss using the ggplot2 Library
We will plot the within-sum-of-squares distance using the ggplot2 library:
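We notice that the wss distance stops improving at 10 clusters and, in this run, increases drastically beyond that point. Since wss normally falls as more clusters are added, a rise like this reflects the random starting centers of individual k-means runs. So we can choose 10 clusters.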
6. Checking the Attributes of k-means Object
We will check the centers and sizes of the clusters.
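In R, the object returned by `kmeans()` exposes these values as its `centers` and `size` components. As a language-neutral illustration of what they contain, here is a small Python sketch that derives both from a list of rows and their cluster labels. The helper names and the toy data are assumptions made for the example.

```python
from collections import Counter, defaultdict

def cluster_sizes(labels):
    """Number of observations assigned to each cluster."""
    return dict(Counter(labels))

def cluster_centers(rows, labels):
    """Column-wise mean of the rows in each cluster."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in groups.items()
    }

# Toy scaled data with hypothetical cluster assignments
rows = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
labels = [1, 1, 2]

sizes = cluster_sizes(labels)            # {1: 2, 2: 1}
centers = cluster_centers(rows, labels)  # {1: [2.0, 3.0], 2: [10.0, 10.0]}
```

Very small clusters are worth a second look, since their centers are averages over only a handful of observations.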
8. Population-Wise Summaries
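A minimal Python sketch of such a summary, assuming it means the mean and standard deviation of every item across all customers, regardless of cluster. The `population_summary` helper, the column names and the figures are illustrative assumptions.

```python
from statistics import mean, stdev

def population_summary(rows, names):
    """Mean and sample standard deviation of each column over all rows."""
    return {
        name: {"mean": mean(col), "sd": stdev(col)}
        for name, col in zip(names, zip(*rows))
    }

# Hypothetical spending figures: (fresh, milk)
rows = [[2.0, 10.0], [4.0, 20.0], [6.0, 30.0]]
summary = population_summary(rows, ["fresh", "milk"])
print(summary["fresh"])  # mean 4.0, sd 2.0
```

These population values serve as the baseline against which each cluster's mean is compared.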
9. Z-Value Normalisation
z score = (cluster_mean-population_mean)/population_sd
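The formula above can be sketched in Python for a single item. The `cluster_z_scores` helper, the values and the cluster labels are hypothetical, and the sample standard deviation is assumed for `population_sd`.

```python
from statistics import mean, stdev

def cluster_z_scores(values, labels):
    """z = (cluster_mean - population_mean) / population_sd for each cluster."""
    population_mean = mean(values)
    population_sd = stdev(values)            # sample SD of the whole population
    z = {}
    for label in sorted(set(labels)):
        cluster_mean = mean(v for v, l in zip(values, labels) if l == label)
        z[label] = (cluster_mean - population_mean) / population_sd
    return z

# Hypothetical spending on one item, with a cluster label per customer
values = [1.0, 3.0, 5.0, 7.0, 9.0]
labels = [1, 1, 1, 2, 2]
z = cluster_z_scores(values, labels)
for label, score in z.items():
    print(label, round(score, 3))
```

A z-score near 0 means the cluster spends about as much as the population on that item. A large positive or negative value marks a cluster that stands apart.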
Wherever we have a very high z-score, it indicates that the cluster is different from the population.
* Very high z-score for fresh in clusters 8 and 9
* Very high z-score for milk in clusters 5, 6 and 9
* Very high z-score for grocery in clusters 5 and 6
* Very high z-score for frozen products in clusters 7, 9 and 10
* Very high z-score for detergents paper in clusters 5 and 6
We would like to find out why these clusters are so different from the population.