Someone writes:
Care to comment on this paper’s Figure 4? I found it a bit misleading to do scatter plots after averaging over multiple individuals. Most scatter plots could be “improved” this way to make things look much cleaner than they are. People are already advertising the paper using this figure.
The article, Genetic analysis of social-class mobility in five longitudinal studies, by Daniel Belsky et al., is all about how socioeconomic status relates to a genetics-based score, and here’s the figure in question:
These graphs (representing data from four different surveys, plotting SES vs. gene score, with separate panels for family SES during childhood) show impressive correlations, but they’re actually graphs of binned averages. I agree with my correspondent that it would be better to show one dot per person here rather than each dot representing an average of 10 or 50 people. Binned residual plots can be useful in revealing systematic departures from a fitted model, but if you’re plotting the data it’s cleaner to plot the individual points, not averages. Averaging within bins removes most of the person-to-person variation, so plotting averages makes the correlation look larger than it is.
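To get a feel for how much binning can matter, here’s a minimal simulation sketch in Python. None of this uses the paper’s data: the sample size, the slope of 0.2, the bin size of 50, and the helper names pearson and binned_means are all assumptions I made up for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def binned_means(xs, ys, bin_size):
    """Sort people by x, group them into consecutive bins of bin_size,
    and return the mean x and mean y of each bin."""
    pairs = sorted(zip(xs, ys))
    bx, by = [], []
    for i in range(0, len(pairs) - bin_size + 1, bin_size):
        chunk = pairs[i:i + bin_size]
        bx.append(mean(x for x, _ in chunk))
        by.append(mean(y for _, y in chunk))
    return bx, by


# Hypothetical data: a weak linear relation plus lots of individual noise.
rng = random.Random(0)   # fixed seed so the run is reproducible
n = 2000                 # assumed number of people
true_slope = 0.2         # assumed effect of the score on the outcome
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [true_slope * x + rng.gauss(0, 1) for x in xs]

r_raw = pearson(xs, ys)                      # one dot per person
bx, by = binned_means(xs, ys, bin_size=50)   # one dot per 50 people
r_binned = pearson(bx, by)

print(f"correlation, individual points: {r_raw:.2f}")
print(f"correlation, binned averages:   {r_binned:.2f}")
```

The underlying relation is the same in both calculations. The only thing that changes is that averaging 50 people per dot shrinks the noise in each dot, so the dots hug the line much more tightly than the individual points do.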
My only concern is that the socioeconomic index (the y-axis on these graphs) might be discrete, in which case if you plot the raw data they will fall along horizontal bands, making the graph harder to interpret. You could then add jitter to the points, but then that’s introducing a new level of confusion.
So no easy answer, perhaps? Binned averages misleadingly make the pattern look too clean, but raw data might be too discrete to plot.
Other than the concerns about binning, I think this graph is just great. Excellent use of small multiples, informative, clean, etc. A wonderful example of scientific communication.