R Tip: Put Your Values in Columns

Today’s R tip is: put your values in columns.

Some R users use different seemingly clever tricks to bring data to an analysis.

Here is an (artificial) example.

Notice: one of the variables came from a vector in the environment, not from the primary data.frame. chamber_sizes was first looked for in the data.frame, and then in the environment the formula was defined (which happens to be the global environment), and (if that hadn’t worked) in the executing environment (which is again the global environment).

Our advice is: do not do that. Place all of your values in columns. Make it unambiguous all variables are names of columns in your data.frame of interest. This allows you to write simple code that works over explicit data. The style we recommend looks like the following.

The only difference is we took the time to place the derived vector into the data frame we are working with (assigned to mtcars$chamber_sizes instead of the global environment in the first line). This is a very organized way to work, and as you see it does not take much effort.

Or use only existing values, as we show below.

This is something we teach: with some care you can reliably treat variables as strings, and this is in no way inferior to complex systems such as stats::formula or rlang::quosure. The fact that these objects cary around an environment in addition the names is in fact a barrier to reliable code, not an unmitigated advantage.

I am not alone in this opinion.

If the formula was typed in by the user interactively, then the call came from the global environment, meaning that variables not found in the data frame, or all variables if the data argument was missing, will be looked up in the same way they would in ordinary evaluation. But if the formula object was precomputed somewhere else, then its environment is the environment of the function call that created it. That means that arguments to that call and local assignments in that call will define variables for use in the model parent (that is, enclosing) environment of the call, which may be a package namespace. These rules are standard for R, at least once one knows that an environment attribute has been assigned to the formula. They are similar to the use of closures described in Section 5.4, page 126.

Where clear and trustworthy software is a priority, I would personally avoid such tricks. Ideally, all the variables in the model frame should come from an explicit, verifiable data source, typically a data frame object that is archived for future inspection (or equivalently, some other equally well-defined source of data, either inside or outside R, that is used explicitly to construct the data for the model).