- Advanced Modeling
Tags
- Linear Regression
- Principal Component Analysis
- R Programming
In this post, I am going to build a statistical learning model based upon the plant leaf datasets introduced in part one of this tutorial. Three datasets are available, each providing sixteen samples for each of one hundred plant species.
The features are margin, shape and texture.
Specifically, I will take advantage of Discriminant Analysis for classification purposes, and I will compare my results with those reported in table 2 of ref. [1].
The authors of ref. [1] show the accuracy reached by KNN (proportional and weighted proportional kernel density) for every combination of the datasets in use; their top accuracy, 96%, is achieved when all three datasets are used.
Packages
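A minimal sketch of the package loading step; the exact set of packages is an assumption, but caret drives the whole workflow and MASS supplies the underlying lda implementation.

```r
# caret handles partitioning, training and evaluation;
# MASS provides the lda method used by caret under the hood
suppressPackageStartupMessages({
  library(caret)
  library(MASS)
})
```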
Classification Models
We load the environment from the previous post, where the datasets are available.
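A sketch of restoring the saved workspace; the file name is an assumption, as the actual name was set in part one of this tutorial.

```r
# Restore the workspace saved at the end of part one;
# "plant_leaf_env.RData" is a hypothetical file name
load("plant_leaf_env.RData")
```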
We define a utility function to gather all the steps needed for model building with the caret package (ref. [3]), specifically:
- train partition of 70% training and 30% test samples
- train control specifying repeated cross-validation with ten folds
- feature set: all features besides species and id
- train dataset
- test dataset
- linear discriminant analysis fit, with preprocessing to compensate for spatial distribution asymmetries and to center and scale values
- prediction computation
- confusion matrix computed on the test dataset
- result made of the linear discriminant analysis fit and the test dataset confusion matrix
As a pre-processing step, PCA (Principal Component Analysis) is also specified. It allows for slightly better accuracy than the non-PCA preprocessing scenario (see also ref. [4] for further insights into combining PCA and LDA).
Among all the Discriminant Analysis models available within the caret package, lda appears to be the most time-efficient while also providing good results, as we will see at the end. For details about Linear Discriminant Analysis, please see refs. [5] and [6].
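The steps listed above can be sketched as the following utility function. Names (`build_model`, the BoxCox transform as the asymmetry compensation, three cross-validation repeats) are assumptions; the function assumes each dataset carries a `species` factor column and an `id` column alongside its numeric features.

```r
library(caret)

# Sketch of the model-building utility described above (names are assumptions)
build_model <- function(dataset, seed = 1023) {
  set.seed(seed)

  # 70% training / 30% test partition, stratified by species
  train_idx <- createDataPartition(dataset$species, p = 0.7, list = FALSE)
  train_set <- dataset[train_idx, ]
  test_set  <- dataset[-train_idx, ]

  # repeated cross-validation with ten folds (repeat count is an assumption)
  ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

  # all features besides species and id
  features <- setdiff(names(dataset), c("species", "id"))

  # LDA fit; BoxCox compensates for distribution asymmetries,
  # then values are centered, scaled and projected by PCA
  fit <- train(x = train_set[, features],
               y = train_set$species,
               method = "lda",
               preProcess = c("BoxCox", "center", "scale", "pca"),
               trControl = ctrl)

  # predictions and confusion matrix on the test dataset
  pred <- predict(fit, test_set[, features])
  cm   <- confusionMatrix(pred, test_set$species)

  list(fit = fit, cm = cm)
}
```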
Margin dataset model
We build our model for margin dataset only.
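Assuming a utility function along the lines of `build_model()` returning the fit and the confusion matrix, the margin-only model could be built as follows (object names are hypothetical):

```r
# Fit on the margin dataset only and report test-set accuracy
model_margin <- build_model(margin)
model_margin$cm$overall["Accuracy"]
```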
Model fit results.
Further details can be printed out by:
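Assuming the utility returns a list with `fit` and `cm` components stored in, say, `model_margin` (a hypothetical name), the details could be inspected with:

```r
print(model_margin$fit)  # cross-validation summary of the LDA fit
print(model_margin$cm)   # full confusion matrix on the test dataset
```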
whose output we do not show for brevity.
Shape dataset model
We build our model for shape dataset only.
Model fit results.
Texture dataset model
We build our model for texture dataset only.
Model fit results.
Margin+Shape datasets model
We build our model for margin and shape datasets.
Model fit results.
Margin+Texture datasets model
We build our model for margin and texture datasets.
Model fit results.
Shape+Texture datasets model
We build our model for shape and texture datasets.
Model fit results.
Margin+Shape+Texture datasets model
We build our model for all three datasets.
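Combining datasets amounts to joining them sample by sample before model building. A sketch, assuming each dataset carries `id` and `species` columns and a `build_model()`-style utility (all names are assumptions):

```r
# Join margin, shape and texture features for each sample,
# then fit on the merged feature set
full <- Reduce(function(a, b) merge(a, b, by = c("id", "species")),
               list(margin, shape, texture))
model_all <- build_model(full)
```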
Model fit results.
Final Results
Finally, we gather all results in a data frame and show them for comparison. The V symbol indicates when a dataset is selected for model building.
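A sketch of the comparison table, assuming each model object (hypothetical names) holds the confusion matrix returned by the utility function; no accuracy values are hard-coded, they are extracted from the fitted models.

```r
# Extract test-set accuracy from a model result list(fit, cm)
acc <- function(m) as.numeric(m$cm$overall["Accuracy"])

# "V" marks the datasets used by each model; model names are assumptions
results <- data.frame(
  margin   = c("V", "",  "",  "V", "V", "",  "V"),
  shape    = c("",  "V", "",  "V", "",  "V", "V"),
  texture  = c("",  "",  "V", "",  "V", "V", "V"),
  accuracy = sapply(list(model_margin, model_shape, model_texture,
                         model_margin_shape, model_margin_texture,
                         model_shape_texture, model_all), acc)
)
results
```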
Here are ref. [1] results compared to ours.
As we can see, with the only exception of the shape dataset scenario, Linear Discriminant Analysis achieves higher accuracy than the K-Nearest-Neighbor based models of ref. [1] for every combination of the datasets in use.
References
- Charles Mallah, James Cope and James Orwell, "Plant Leaf Classification Using Probabilistic Integration of Shape, Texture and Margin Features" link
- Leaf Dataset
- Caret package – train models by tag
- Combining PCA and LDA
- Linear Discriminant Analysis
- Linear Discriminant Analysis – Bit by Bit