Tags
- Data Management
- Data Visualisation
- Exploratory Analysis
- R Programming
In this post, I run an exploratory analysis of the plant leaf dataset made available by the UCI Machine Learning Repository at this link. The dataset is expected to comprise sixteen samples each of one-hundred plant species. Its analysis was introduced in ref. [1], which describes a method designed to work with small training set sizes and possibly incomplete feature extraction.
This motivated the separate processing of three feature types:
- shape
- texture
- margin
These are then combined to provide an overall indication of the species (with an associated probability). For an accurate description of these features, please see ref. [1], where the classification is implemented by a K-Nearest-Neighbor density estimator. The authors of ref. [1] report the accuracy reached by K-Nearest-Neighbor classification for each combination of the datasets in use (see ref. [1], Table 2).
Packages
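The original code chunk did not survive extraction; a plausible set of packages for this analysis (my assumption, not the original list) would be:

```r
# Assumed package list: dplyr for data manipulation, Hmisc for
# correlation matrices with p-values, corrplot for correlation plots
library(dplyr)
library(Hmisc)
library(corrplot)
```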
Getting Data
We can download the leaf dataset as a zip file from the following UCI Machine Learning Repository URL.
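The original chunk is missing; here is a sketch of the download step, where the archive path (the 00241 dataset id and the zip file name) is my assumption about the repository layout:

```r
# Hypothetical URL: the '00241' archive id and zip name are assumptions
url <- paste0("https://archive.ics.uci.edu/ml/machine-learning-databases/",
              "00241/100%20leaves%20plant%20dataset.zip")
download.file(url, destfile = "100-leaves-plant-dataset.zip", mode = "wb")
```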
The files of interest are:
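The original list did not survive extraction; the file names below are my assumption based on the archive's usual layout, one 64-feature file per feature type:

```r
# Assumed file names inside the zip archive
files <- c("data_Mar_64.txt",   # margin features
           "data_Sha_64.txt",   # shape features
           "data_Tex_64.txt")   # texture features
```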
They can be extracted as follows.
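A minimal extraction step, assuming the zip and file names introduced above (both are my assumptions):

```r
# Extract the three assumed feature files from the downloaded archive
unzip("100-leaves-plant-dataset.zip",
      files = c("data_Mar_64.txt", "data_Sha_64.txt", "data_Tex_64.txt"))
```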
We read them as CSV files. No header is originally provided.
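A sketch of the reading step; the dataset names follow the margin_data naming used later in the post, while the column names (species plus margin1 through margin64, and so on) are my reconstruction:

```r
# Read the raw files; the first column holds the species label
margin_data  <- read.csv("data_Mar_64.txt", header = FALSE)
shape_data   <- read.csv("data_Sha_64.txt", header = FALSE)
texture_data <- read.csv("data_Tex_64.txt", header = FALSE)

colnames(margin_data)  <- c("species", paste0("margin",  1:64))
colnames(shape_data)   <- c("species", paste0("shape",   1:64))
colnames(texture_data) <- c("species", paste0("texture", 1:64))
```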
We check the number of rows and columns of the resulting datasets.
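For example (a sketch; if the archive matches expectations, the margin and shape sets should have 16 x 100 = 1600 rows each, and the texture set one row fewer):

```r
dim(margin_data)
dim(shape_data)
dim(texture_data)
```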
We count the number of entries for each species within each dataset.
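A base-R sketch of the per-species counts:

```r
# 100 species, expected 16 samples each
table(margin_data$species)
table(shape_data$species)
table(texture_data$species)
```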
This allows us to identify which species is associated with the missing entry in the texture dataset.
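One way to spot the species with fewer than sixteen texture samples (a sketch, assuming the texture_data name from the reading step):

```r
# The species whose count differs from 16 owns the missing entry
tab <- table(texture_data$species)
names(tab)[tab != 16]
```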
Imputation
In the following, we fix the missing entry with a median-based imputation technique. We suppose the missing entry corresponds to the 16th sample of the Acer Campestre texture data, the first plant species in our datasets. To that end, we build a temporary dataset made of the first fifteen entries and append a new row holding the column medians. Afterwards, we row-bind this temporary dataset with the rest of the original texture samples.
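A sketch of the imputation described above, assuming texture_data is ordered by species with Acer Campestre occupying the first fifteen rows:

```r
# Temporary dataset: the 15 available Acer Campestre texture samples
acer <- texture_data[1:15, ]

# New 16th row: copy the species label, fill features with column medians
new_row <- acer[1, ]
new_row[, -1] <- apply(acer[, -1], 2, median)
acer <- rbind(acer, new_row)

# Row-bind the completed block with the remaining species' samples
texture_data <- rbind(acer, texture_data[16:nrow(texture_data), ])
```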
The correlation plot is not so easy to interpret. Therefore, we implement a procedure capable of filtering out the significant and most relevant correlations. For this purpose, we use a helper function named flattenCorrMatrix, which can be found in ref. [3].
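Here is the helper as given in ref. [3]: it flattens a correlation matrix cormat and its companion matrix of p-values pmat into a long-format data frame:

```r
# Turn a correlation matrix and its p-value matrix into a data frame
# with one row per variable pair (upper triangle only)
flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row    = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor    = cormat[ut],
    p      = pmat[ut]
  )
}
```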
The following utility function extracts a given feature from one of the available datasets and reports the flattened correlation matrix, keeping only the significant correlations whose absolute value is above a certain threshold, as specified by the threshold parameter.
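A sketch of such a function (the name feature_correlation and its signature are mine, not the original); it relies on Hmisc::rcorr for correlations with p-values and on the flattenCorrMatrix helper from ref. [3]:

```r
library(Hmisc)  # rcorr() returns both correlations and p-values

# Keep only significant (p < 0.05) correlations involving `feature`
# whose absolute value exceeds `threshold`
feature_correlation <- function(dataset, feature, threshold = 0.7) {
  res  <- rcorr(as.matrix(dataset[, -1]))   # drop the species column
  flat <- flattenCorrMatrix(res$r, res$P)
  subset(flat,
         (row == feature | column == feature) &
           p < 0.05 & abs(cor) >= threshold)
}
```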
Here is what we get as the correlation matrix for the margin_data dataset and the margin2 feature, with a threshold equal to 0.7.
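For instance (assuming the hypothetical feature_correlation helper sketched above):

```r
feature_correlation(margin_data, "margin2", threshold = 0.7)
```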
Let us have a look at one correlation matrix, taken as an item of this list.
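A possible reconstruction of that list: collect one flattened matrix per margin feature and inspect a single item (the corr_list name and the lapply loop are my assumptions):

```r
# Hypothetical reconstruction: one list entry per margin feature
corr_list <- lapply(colnames(margin_data)[-1], function(f)
  feature_correlation(margin_data, f, threshold = 0.7))
names(corr_list) <- colnames(margin_data)[-1]
corr_list[["margin2"]]
```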
Boxplots are shown to highlight differences in features among species. For this purpose, we define the following utility function.
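A sketch of such a utility (the function name feature_boxplot is mine, not from the original):

```r
# Boxplot of one feature grouped by species
feature_boxplot <- function(dataset, feature) {
  boxplot(dataset[[feature]] ~ dataset$species,
          main = feature, xlab = "", ylab = feature, las = 2)
}
```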
Margin feature boxplot
For each margin feature, a boxplot such as the one shown below can be generated. Here is the boxplot associated with the margin1 feature.
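Assuming the hypothetical feature_boxplot helper sketched earlier:

```r
feature_boxplot(margin_data, "margin1")
```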
If you are interested in having a summary report, you may take advantage of the following line of code.
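One way to produce such a report (a sketch, taking margin1 as the example feature):

```r
# Per-species summary statistics of the margin1 feature
tapply(margin_data$margin1, margin_data$species, summary)
```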
Shape feature boxplot
We show the boxplot for shape features by considering shape20 as an example.
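Again assuming the hypothetical feature_boxplot helper:

```r
feature_boxplot(shape_data, "shape20")
```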
Texture feature boxplot
We show the boxplot for texture features by considering texture31 as an example.
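And once more with the hypothetical feature_boxplot helper:

```r
feature_boxplot(texture_data, "texture31")
```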
Saving the current environment for further analysis.
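A sketch of the save step, where the file name is my own choice:

```r
# Persist the workspace for the follow-up analysis
save.image(file = "plant-leaf-eda.RData")
```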
References
1. Charles Mallah, James Cope and James Orwell, "Plant Leaf Classification Using Probabilistic Integration of Shape, Texture and Margin Features", link
2. 100 Plant Leaf Dataset
3. Correlation Matrix: a quick start guide