Our group has done a lot of work with non-standard calling conventions in R
.
Our tools work hard to eliminate non-standard calling (as is the purpose of wrapr::let()
), or at least make it cleaner and more controllable (as is done in the wrapr dot pipe). And even so, we still get surprised by some of the side-effects and mal-consequences of the over-use of non-standard calling conventions in R
.
Please read on for a recent example.
Consider the following calls to stats::lm()
. And notice the third example fails (throws an error).
According the stats::lm()
documentation (help(lm)
) the first argument must be:
an object of class “formula” (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.
A string appears to be coerce-able into a formula, so all three examples should work. However, typing “print(lm)
” reveals the issue: stats::lm()
doesn’t take the “weights
” argument in a standard way (as the value of a function parameter). It instead grabs it through a sequence of match.call()
and eval()
steps. It is a complicated way to get the value, which works until it does not work. Somehow passing in the formula as a string interferes with how the value of weights
is found. I think we can now see the benefits of isolation and independence of concerns in code.
This over-use of direct environment copying and manipulation is what leads to a great many data-leaks in stats::lm()
and stats::glm()
. This is in addition to their weird habit of keeping a copy of all of the training data (which loses quite a few of the merits of these methods). Our group dealt with these issues a long time ago, so we are somewhat familiar with stats::lm()
and stats::glm()
.
Of course, one could (as the stats::lm()
documentation mentions) call stats::lm.fit()
. However, stats::lm.fit()
does not seem to accept weights and its own documentation (help(lm.fit)
) starts ominously:
These are the basic computing engines called by lm used to fit linear models. These should usually not be used directly unless by experienced users.
Having just finished teaching a four day intensive course covering data science in Python, I can’t help but remark that users of sklearn.linear_model.LinearRegression()
don’t need to worry about issues such as the above. Some of the notational flair of R
comes at the cost of significant opportunities for user confusion.
Related