Introduction
Yesterday, Uri Simonsohn published a blog post called “Interactions in Logit Regressions: Why Positive May Mean Negative”. I love Uri’s blog in general, but I was pretty confused by this latest post. I found it very hard to understand exactly what was being claimed in the absence of formal mathematical definitions of the terms being used.
Thankfully Uri’s post linked to the original papers that inspired his post. Those papers provided clear formal definitions of what they meant when they used the term “interaction effect”. This post expands on those definitions a bit.
Defining the Existence of Interactions and the Measurement of Interaction Effects
Let’s assume that we have an outcome variable \(y\) that is a function of two predictor variables, \(x_1\) and \(x_2\). These assumptions allow us to write \(y = f(x_1, x_2)\) for some \(f\).
We can then define the existence of interactions between \(x_1\) and \(x_2\) in a negative way. We’ll do that by specifying an invariant that holds whenever there are no interactions. The invariant we’ll use is this: no interaction between \(x_1\) and \(x_2\) exists whenever the partial derivative of \(f\) with respect to one variable isn’t a function of the other variable.
To make this formal, let’s define a new function, \(g\), as follows:

\[
g(x_1, x_2) = \frac{\partial}{\partial x_1} f(x_1, x_2)
\]
Note that we’ve chosen to focus on the partial derivatives of \(f\) with respect to \(x_1\), but that a full treatment would need to apply this same approach to other variables.
Given this definition of \(g\), we can say that interactions do not exist whenever

\[
g(x_1, x_2) = g(x_1, x_2^{\prime})
\]

for all values of \(x_1\), \(x_2\), and \(x_2^{\prime}\).
In other words, the partial derivative of (f) with respect to (x_1) does not change when we evaluate it at different values of (x_2). In looser English, the way that our outcome changes in response to the first predictor does not depend upon the specific value of the second predictor.
Instead of working with first derivatives, we could also define interactions in terms of second derivatives, which is what’s done in the papers Uri linked to. Those papers define a quantitative measure called the “interaction effect”, \(I\), as follows:

\[
I(x_1, x_2) = \frac{\partial^2}{\partial x_1 \partial x_2} f(x_1, x_2)
\]
Note that this is a function of both \(x_1\) and \(x_2\) and is generally not a constant, so it cannot generally be summarized by a single number without engaging in lossy data compression.
Given this quantitative definition of the “interaction effect” function, we can say that no interaction exists when this mixed second derivative is zero everywhere. That is,

\[
I(x_1, x_2) = 0
\]

for all values of \(x_1\) and \(x_2\).
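Since \(I\) is nothing more than a mixed partial derivative, it’s easy to probe numerically. Here’s a minimal Python sketch that approximates \(I(x_1, x_2)\) with a standard central finite difference; the helper’s name and step size are my own choices rather than anything from the papers, and the examples below reuse it:

```python
def interaction_effect(f, x1, x2, h=1e-4):
    """Approximate I(x1, x2), the mixed partial derivative
    d^2 f / (dx1 dx2), with a central finite difference."""
    return (
        f(x1 + h, x2 + h) - f(x1 + h, x2 - h)
        - f(x1 - h, x2 + h) + f(x1 - h, x2 - h)
    ) / (4 * h * h)
```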
Some Examples of Interactions Measured by First and Second Derivatives
To see how these definitions play out in practice, let’s consider some simple models.
Two-Variable Linear Regression without an Interaction Term
First, let’s assume that \(y = \beta_1 x_1 + \beta_2 x_2\). This is a linear regression model in which there is no “interaction term”. In that case,

\[
\frac{\partial y}{\partial x_1} = \beta_1
\]
This derivative is a constant and is therefore clearly invariant to the value of \(x_2\), so we can conclude there will be no “interaction effects”: the mixed second derivative is a derivative of a constant, which is always zero.
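As a quick numerical check of this example, the sketch below (reusing the hypothetical interaction_effect helper from above, with arbitrary coefficients) confirms that the “interaction effect” vanishes:

```python
beta1, beta2 = 1.5, -0.7  # arbitrary illustrative coefficients

def linear(x1, x2):
    return beta1 * x1 + beta2 * x2

print(interaction_effect(linear, 0.3, 1.2))  # ≈ 0.0: no interaction effect
```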
Two-Variable Linear Regression with an Interaction Term
In contrast, if we assume that \(y = \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2\), we’ll observe a non-zero “interaction effect”. In this changed model, we’ve introduced an “interaction term”, \(x_1 x_2\). That change to the model implies that

\[
\frac{\partial y}{\partial x_1} = \beta_1 + \beta_{12} x_2
\]
This derivative is clearly a function of \(x_2\), and so we can reasonably expect to see a non-zero “interaction effect”.
To generalize from this example: introducing what is often called “an interaction term” into a linear regression model introduces an “interaction effect” according to the definition given in the previous section. Moreover, the coefficient on the “interaction term” is exactly equal to the “interaction effect”:

\[
I(x_1, x_2) = \frac{\partial^2 y}{\partial x_1 \partial x_2} = \beta_{12}
\]
But note that this equivalence of coefficients on “interaction terms” and “interaction effects” is a special property of linear regression models and does not generally hold.
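Numerically, the same check recovers the coefficient on the “interaction term” exactly (again reusing the interaction_effect helper and arbitrary coefficients):

```python
beta1, beta2, beta12 = 1.5, -0.7, 2.0  # arbitrary illustrative coefficients

def with_interaction_term(x1, x2):
    return beta1 * x1 + beta2 * x2 + beta12 * x1 * x2

print(interaction_effect(with_interaction_term, 0.3, 1.2))  # ≈ 2.0 == beta12
```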
But also note that, even though linear regression models have unusually simple derivatives, the special properties of linear regression are not required to prevent the introduction of non-zero “interaction effects”. The absence of interactions has nothing to do with linearity: it’s driven instead by a form of additivity. To see why, let’s consider another example in which the model is non-linear, but exhibits no “interaction effects”.
Two-Variable Generalized Additive Model
Let’s change things again and assume that \(y = f_1(x_1) + f_2(x_2)\) for some univariate functions \(f_1\) and \(f_2\). In that case,

\[
\frac{\partial y}{\partial x_1} = f_1^{\prime}(x_1)
\]
This derivative is not a constant everywhere (which makes this model very different from the linear regression model), but it is also not a function of \(x_2\), so there will be no “interaction effects”. In short, linearity is not necessary for “interaction effects” to vanish.
We can also see this by explicitly calculating the “interaction effect”:

\[
I(x_1, x_2) = \frac{\partial^2}{\partial x_1 \partial x_2} \left( f_1(x_1) + f_2(x_2) \right) = \frac{\partial}{\partial x_2} f_1^{\prime}(x_1) = 0
\]
As long as a regression model is additively separable, in the sense that the outcome variable can be written as a sum of functions that each depend upon only one predictor variable, “interaction effects” will never exist.
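Here’s a quick non-linear but additively separable example (once more reusing the interaction_effect helper; the choice of sine and exponential is arbitrary):

```python
import math

def additive(x1, x2):
    # Non-linear in each argument, but additively separable.
    return math.sin(x1) + math.exp(x2)

print(interaction_effect(additive, 0.3, 1.2))  # ≈ 0.0: still no interaction effect
```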
Examples of Non-Zero Interaction Effects in GLMs
It is important to note that only a special form of additivity is required to prevent “interaction effects”. But the point of Uri’s post, derived from the papers he cites, is that “interaction effects” exist in logistic regression models and that they are generally not equal to the coefficients on the “interaction terms”.
He’s totally right about this. I’d even argue that he’s understating the issue by focusing on logistic regression models: the argument applies to all GLMs that aren’t identical to linear regression. The problem is that GLMs use an inverse link function, and the inverse link function per se introduces a deviation from additive separability.
To see why, let’s assume that \(y = f(\beta_1 x_1 + \beta_2 x_2)\) and that \(f\) is an inverse link function of the sort used in a GLM. By applying the chain rule to this GLM-style model, we see that

\[
\frac{\partial y}{\partial x_1} = \beta_1 f^{\prime}(\beta_1 x_1 + \beta_2 x_2)
\]
This derivative is clearly a function of \(x_2\): it changes with the value of \(x_2\) whenever \(f^{\prime}\) is non-constant. In other words, an “interaction effect” must be non-zero for some values of \(x_1\) and \(x_2\).
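Differentiating once more, this time with respect to \(x_2\), makes this explicit:

\[
I(x_1, x_2) = \frac{\partial^2 y}{\partial x_1 \partial x_2} = \beta_1 \beta_2 f^{\prime\prime}(\beta_1 x_1 + \beta_2 x_2)
\]

This quantity is non-zero wherever \(f^{\prime\prime}\) is non-zero, provided that \(\beta_1\) and \(\beta_2\) are themselves non-zero.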
Note how weak the assumptions are that I’ve had to make to introduce an “interaction effect”: beyond non-zero coefficients, I just need \(f^{\prime}\) to be non-constant, which is true of every inverse link except the identity link (and other affine links) used in linear regression. The inverse logit link function will introduce “interaction effects”, as will the exponential inverse link function used in Poisson regression. The linear regression model is, again, totally non-representative of the other GLMs.
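To see this numerically for the inverse logit link, here’s one more sketch reusing the interaction_effect helper (the coefficients and evaluation points are arbitrary):

```python
import math

beta1, beta2 = 1.5, -0.7  # arbitrary illustrative coefficients

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def logit_model(x1, x2):
    # Note: no interaction term anywhere in the linear predictor.
    return inv_logit(beta1 * x1 + beta2 * x2)

# The "interaction effect" is non-zero and varies across the input space:
print(interaction_effect(logit_model, 1.0, 0.0))  # ≈ 0.099
print(interaction_effect(logit_model, 0.5, 0.5))  # ≈ 0.050
```

Both printed values agree with the closed-form expression \(\beta_1 \beta_2 f^{\prime\prime}(\beta_1 x_1 + \beta_2 x_2)\) derived above.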
Conclusion
I’m glad to see Uri making a point to distinguish “interaction terms” from “interaction effects”, although I’m not sure I understand his other point about “conceptual interactions” and “mechanical interactions”. Given the importance of the first point, I thought it was worth making the ideas he was addressing more mathematically explicit.
With that in mind, I’m generally inclined to think of “interaction terms” as an old-fashioned form of feature engineering. Introducing “interaction terms” can substantially improve the predictive power of mathematical models, but they are extremely difficult to interpret as statements about “interaction effects”, because implausibly strong assumptions are required to make “interaction terms” line up with “interaction effects”. In general, I think people who are testing hypotheses about “interaction terms” are actually trying to test for the existence of non-zero “interaction effects”, but I’m skeptical that testing for non-zero coefficients on “interaction terms” is generally a useful way to do that.
In contrast, I think “interaction effects” are a much deeper and more interesting property of mathematical models, but they’re also functions of all of the original predictors, and are therefore almost never the constants that occur in linear regression. There’s simply no reason to think that the way a model’s first derivatives respond to changes in the other predictors can be summarized by a single number. If you’re interested in understanding “interaction effects”, you should probably model them directly.