Scatterplot matrices (pair plots) with cdata and ggplot2

In my previous post, I showed how to use the cdata package along with ggplot2's faceting facility to compactly plot two related graphs from the same data. This got me thinking: can I use cdata to produce a ggplot2 version of a scatterplot matrix, or pairs plot?

A pairs plot compactly plots every (numeric) variable in a dataset against every other one. In base R graphics, you would use the pairs() function. Here is the base version of the pairs plot of the iris dataset:

pairs(iris[1:4], 
 main = "Anderson's Iris Data -- 3 species",
 pch = 21, 
 bg = c("#1b9e77", "#d95f02", "#7570b3")[unclass(iris$Species)])

There are other ways to do this, too:

# not run

library(ggplot2)
library(GGally)
ggpairs(iris, columns=1:4, aes(color=Species)) + 
 ggtitle("Anderson's Iris Data -- 3 species")

library(lattice)
splom(iris[1:4], 
 groups=iris$Species, 
 main="Anderson's Iris Data -- 3 species")

But I wanted to see if cdata was up to the task. So here we go.

First, load the packages:

library(ggplot2)
library(cdata)

To create the pairs plot in ggplot2, I need to reshape the data appropriately. For cdata, I need to specify what shape I want the data to be in, using a control table. See the last post for how the control table works. For this task, creating the control table is slightly more involved.

Here, I specify the variables I want to plot.

meas_vars <- colnames(iris)[1:4]

The function expand.grid() returns a data frame of all combinations of its arguments; in this case, I want all pairs of variables.

# the data.frame() call strips the attributes from
# the frame returned by expand.grid()
controlTable <- data.frame(expand.grid(meas_vars, meas_vars, 
 stringsAsFactors = FALSE))
# rename the columns
colnames(controlTable) <- c("x", "y")

# add the key column
controlTable <- cbind(
 data.frame(pair_key = paste(controlTable[[1]], controlTable[[2]]),
 stringsAsFactors = FALSE),
 controlTable)

controlTable
##                     pair_key            x            y
## 1  Sepal.Length Sepal.Length Sepal.Length Sepal.Length
## 2   Sepal.Width Sepal.Length  Sepal.Width Sepal.Length
## 3  Petal.Length Sepal.Length Petal.Length Sepal.Length
## 4   Petal.Width Sepal.Length  Petal.Width Sepal.Length
## 5   Sepal.Length Sepal.Width Sepal.Length  Sepal.Width
## 6    Sepal.Width Sepal.Width  Sepal.Width  Sepal.Width
## 7   Petal.Length Sepal.Width Petal.Length  Sepal.Width
## 8    Petal.Width Sepal.Width  Petal.Width  Sepal.Width
## 9  Sepal.Length Petal.Length Sepal.Length Petal.Length
## 10  Sepal.Width Petal.Length  Sepal.Width Petal.Length
## 11 Petal.Length Petal.Length Petal.Length Petal.Length
## 12  Petal.Width Petal.Length  Petal.Width Petal.Length
## 13  Sepal.Length Petal.Width Sepal.Length  Petal.Width
## 14   Sepal.Width Petal.Width  Sepal.Width  Petal.Width
## 15  Petal.Length Petal.Width Petal.Length  Petal.Width
## 16   Petal.Width Petal.Width  Petal.Width  Petal.Width

The control table specifies that for every row of iris, sixteen new rows get produced, one for each possible pair of variables. The column pair_key will be the key column in the new data frame; there’s one key level for every possible pair of variables. The columns x and y will be the value columns in the new data frame — these will be the columns that we plot.
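
To make the expansion concrete, here is a minimal base-R sketch of what the control table asks for, using only two variables and a single record. The helper expand_record() is hypothetical and written purely for illustration; it is not how cdata implements rowrecs_to_blocks(), just a way to see which value lands in which cell.

```r
# Two measurement variables give 2 x 2 = 4 (x, y) pairs.
vars <- c("Sepal.Length", "Sepal.Width")
ct <- expand.grid(x = vars, y = vars, stringsAsFactors = FALSE)
ct <- cbind(data.frame(pair_key = paste(ct$x, ct$y),
                       stringsAsFactors = FALSE),
            ct)

# Hypothetical helper (illustration only): expand one record into
# one output row per control-table row, looking up each value by
# the column name stored in the control table.
expand_record <- function(rec, ct) {
  data.frame(pair_key = ct$pair_key,
             x = unlist(rec[ct$x], use.names = FALSE),
             y = unlist(rec[ct$y], use.names = FALSE),
             stringsAsFactors = FALSE)
}

one_row <- iris[1, ]   # Sepal.Length = 5.1, Sepal.Width = 3.5
toy <- expand_record(one_row, ct)
toy
```

Each control-table row becomes one output row: the names in the x and y columns of the control table are replaced by the record's values for those variables, and pair_key records which pair the row belongs to.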

Now I can create the new data frame, using rowrecs_to_blocks(). I’ll also carry along the Species column to color the points in the plot.

iris_aug <- rowrecs_to_blocks(
 iris,
 controlTable,
 columnsToCopy = "Species")

head(iris_aug)
##   Species                  pair_key   x   y
## 1  setosa Sepal.Length Sepal.Length 5.1 5.1
## 2  setosa  Sepal.Width Sepal.Length 3.5 5.1
## 3  setosa Petal.Length Sepal.Length 1.4 5.1
## 4  setosa  Petal.Width Sepal.Length 0.2 5.1
## 5  setosa  Sepal.Length Sepal.Width 5.1 3.5
## 6  setosa   Sepal.Width Sepal.Width 3.5 3.5

Note that the data is now sixteen times larger, which I admit is perverse.
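
As a quick sanity check on that blow-up, the expected row count can be computed from the inputs alone. This is a small sketch in base R; the names n_pairs and expected_rows are mine, not part of cdata.

```r
meas_vars <- colnames(iris)[1:4]

# Every ordered pair of variables, diagonal included: 4 * 4 = 16
n_pairs <- length(meas_vars)^2

# One output row per (input row, pair) combination: 150 * 16
expected_rows <- nrow(iris) * n_pairs
expected_rows
```

If the reshaping worked as intended, nrow(iris_aug) should equal expected_rows.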

If I didn’t care about how the individual subplots were arranged, I’d be done: I’d plot y vs x, and facet_wrap on pair_key. But I want the subplots arranged in a grid. To do this I use facet_grid, which will require two key columns:

splt <- strsplit(iris_aug$pair_key, split = " ", fixed = TRUE)
iris_aug$xv <- vapply(splt, function(si) si[[1]], character(1))
iris_aug$yv <- vapply(splt, function(si) si[[2]], character(1))
head(iris_aug)
##   Species                  pair_key   x   y           xv           yv
## 1  setosa Sepal.Length Sepal.Length 5.1 5.1 Sepal.Length Sepal.Length
## 2  setosa  Sepal.Width Sepal.Length 3.5 5.1  Sepal.Width Sepal.Length
## 3  setosa Petal.Length Sepal.Length 1.4 5.1 Petal.Length Sepal.Length
## 4  setosa  Petal.Width Sepal.Length 0.2 5.1  Petal.Width Sepal.Length
## 5  setosa  Sepal.Length Sepal.Width 5.1 3.5 Sepal.Length  Sepal.Width
## 6  setosa   Sepal.Width Sepal.Width 3.5 3.5  Sepal.Width  Sepal.Width
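
The split above works because pair_key was built with paste(), which joins the two names with a single space, and because none of the iris column names contain a space themselves. Here is a small self-contained sketch of the same split on a made-up vector of keys (the vector keys is an assumption for illustration), with a guard that would catch variable names containing spaces:

```r
# Toy keys in the same "xname yname" format that paste() produced
keys <- c("Sepal.Width Sepal.Length", "Petal.Length Petal.Width")

parts <- strsplit(keys, split = " ", fixed = TRUE)

# Guard: every key must split into exactly two pieces; a variable
# name containing a space would break this assumption.
stopifnot(all(lengths(parts) == 2))

xv <- vapply(parts, function(p) p[[1]], character(1))
yv <- vapply(parts, function(p) p[[2]], character(1))

data.frame(pair_key = keys, xv = xv, yv = yv, stringsAsFactors = FALSE)
```

For data whose column names might contain spaces, one option would be to build the key with a separator that cannot occur in a name and split on that instead.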

And now I can produce the graph, using facet_grid.

# reorder the key columns to be the same order
# as the base version above
iris_aug$xv <- factor(as.character(iris_aug$xv),
 meas_vars)
iris_aug$yv <- factor(as.character(iris_aug$yv),
 meas_vars)


ggplot(iris_aug, aes(x=x, y=y)) +
 geom_point(aes(color=Species, shape=Species)) + 
 facet_grid(yv~xv, labeller = label_both, scales = "free") +
 ggtitle("Anderson's Iris Data -- 3 species") +
 scale_color_brewer(palette = "Dark2") +
 ylab(NULL) + 
 xlab(NULL)

This pair plot has x = y plots on the diagonals instead of the names of the variables, but you can confirm that it is otherwise the same as the pair plot produced by pairs().
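
The factor() calls in the plotting code matter because facet_grid lays out panels in factor-level order, and a character column converted to a factor without explicit levels gets its levels sorted alphabetically. A minimal base-R sketch of the difference (the names default_order and explicit_order are mine):

```r
meas_vars <- colnames(iris)[1:4]

# Without explicit levels, factor() sorts the levels alphabetically,
# which would put the Petal panels before the Sepal panels.
default_order <- levels(factor(meas_vars))

# Passing meas_vars as the levels keeps the original column order,
# matching the layout of the base pairs() plot.
explicit_order <- levels(factor(meas_vars, levels = meas_vars))

default_order
explicit_order
```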

Of course, calling pairs() (or ggpairs(), or splom()) is a lot easier than all this, but now I’ve proven to myself that cdata with ggplot2 can do the job. This version does have a few advantages. It comes with a legend by default, which is nice. And it’s not obvious how to change the color palette in ggpairs() — I prefer the Brewer Dark2 palette, myself.

Luckily, this code is straightforward to wrap as a function, so if you like the cdata version, I’ve now added the PairPlot() function to WVPlots. Now it’s a one-liner, too.

library(WVPlots) 

PairPlot(iris, 
 colnames(iris)[1:4], 
 "Anderson's Iris Data -- 3 species", 
 group_var = "Species")
