Someone pointed to this post from a couple years ago by Uri Simonsohn, who correctly wrote:
Robustness checks involve reporting alternative specifications that test the same hypothesis. Because the problem is with the hypothesis, the problem is not addressed with robustness checks.
Simonsohn followed up with an amusing story:
To demonstrate the problem I [Simonsohn] conducted exploratory analyses on the 2010 wave of the General Social Survey (GSS) until discovering an interesting correlation. If I were writing a paper about it, this is how I may motivate it:
Based on the behavioral priming literature in psychology, which shows that activating one mental construct increases the tendency of people to engage in mentally related behaviors, one may conjecture that activating “oddness” may lead people to act in less traditional ways, e.g., seeking information from non-traditional sources. I used data from the GSS and examined whether respondents who were randomly assigned an odd respondent ID (1, 3, 5…) were more likely to report reading horoscopes.
The first column in the table below shows that this implausible hypothesis was supported by the data, p<.01 (STATA code).
People are about 11 percentage points more likely to read the horoscope when they are randomly assigned an odd number by the GSS. Moreover, this estimate barely changes across alternative specifications that include more and more covariates, despite the notable increase in R².
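To make the shape of the exercise concrete, here is a minimal sketch of the kind of robustness-check table Simonsohn describes. This is not his actual Stata code: the data are simulated and the variable names are invented. The same coefficient on an odd-ID indicator is re-estimated under specifications that add more and more covariates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for a survey wave (variable names are made up):
# by construction, horoscope reading is unrelated to the respondent ID.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "id": np.arange(1, n + 1),
    "age": rng.integers(18, 90, n),
    "female": rng.integers(0, 2, n),
    "educ": rng.integers(8, 21, n),
    "horoscope": rng.integers(0, 2, n),  # 1 = reports reading horoscopes
})
df["odd_id"] = df["id"] % 2              # 1 if respondent ID is odd

# A robustness-check table in miniature: the same coefficient, re-estimated
# with a growing set of covariates. Stability of the estimate across columns
# says nothing about whether the hypothesis made sense in the first place.
specs = [
    "horoscope ~ odd_id",
    "horoscope ~ odd_id + age",
    "horoscope ~ odd_id + age + female",
    "horoscope ~ odd_id + age + female + educ",
]
for formula in specs:
    fit = smf.ols(formula, data=df).fit()
    print(f"{formula:45s} b(odd_id) = {fit.params['odd_id']:+.3f}"
          f"  (se = {fit.bse['odd_id']:.3f}, R2 = {fit.rsquared:.3f})")
```

In this simulation the coefficient sits near zero because the outcome is generated independently of the ID; the point is only the structure of the table, in which the headline estimate barely moves as covariates are added.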
Simonsohn titled his post, “P-hacked Hypotheses Are Deceivingly Robust,” but really the point here has nothing to do with p-hacking (or, more generally, forking paths).
Part of the problem is that robustness checks are typically done for the purpose of confirming one’s existing beliefs, and that’s usually a bad game to be playing. More generally, the statistical properties of these methods are not well understood. Researchers often have a deterministic attitude, identifying statistical significance with truth (as for example here).