Every Data Scientist must have come across a question, What is P-value and how do we use it in our statistical analysis?
At-least one question of every data science interview is about P-value and its purpose. So in this article I will be talking about P-value: Context, Process, and Purpose. Misinterpretation and abuse of statistical tests, confidence intervals and statistical power have been decried for decades but yet remain rampant. As these concept need a high attention and time, this high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so — and yet these misinterpretations dominate much of the scientific literature.
Statistical Testing
In most applications of statistical testing, one assumption in the model is a hypothesis that a particular effect has a specific size, and has been targeted for statistical analysis. This targeted assumption is called the study hypothesis or test hypothesis, and the statistical methods used to evaluate it are called statistical hypothesis tests. Most often, the targeted effect size is a “null” value representing zero effect (e.g., that the study treatment makes no difference in average outcome), in which case the test hypothesis is called the null hypothesis. Nonetheless, it is also possible to test other effect sizes. We may also test hypotheses that the effect does or does not fall within a specific range; for example, we may test the hypothesis that the effect is no greater than a particular amount, in which case the hypothesis is said to be a one-sided hypothesis.
 Source: Google images
Source: Google images
Much statistical teaching and practice has developed a strong (and unhealthy) focus on the idea that the main aim of a study should be to test null hypotheses. In fact, most descriptions of statistical testing focus only on testing null hypotheses, and the entire topic has been called “Null Hypothesis Significance Testing”. This exclusive focus on null hypotheses contributes to misunderstanding of tests. Adding to the misunderstanding is that many authors use “null hypothesis” to refer to any test hypothesis, even though this usage is at odds with other authors and with ordinary English definitions of “null” — as are statistical usages of “significance” and “confidence.”
A more refined goal of statistical analysis is to provide an evaluation of certainty or uncertainty regarding the size of an effect. We express such certainty in terms of “probabilities” of hypotheses. In conventional statistical methods, however, “probability” refers not to hypotheses, but to quantities that are hypothetical frequencies of data patterns under an assumed statistical model. These methods are thus called frequentist methods, and the hypothetical frequencies they predict are called “frequency probabilities” not the hypothesis probabilities (misconception).
P-Value
Hypothetical frequency called the P-value, also known as the “observed significance level” for the test hypothesis. The traditional definition of P-value and statistical significance has revolved around null hypotheses, and we treat all other assumptions that are used to calculate P-value as if they are all correct. As we are not sure about these assumptions, we will learn about a more general view of the P-value as a statistical summary of the compatibility between the observed data and what we would predict or expect to see if we knew the entire statistical model were correct.
The distance between the data and the model prediction is measured using a test statistic (such as a t-statistic or a chi-squared statistic). The P-value is then the probability that the chosen test statistic would have been at least as largeas its observed value if every model assumption were correct, including the test hypothesis. This definition embodies a crucial point lost in traditional definitions: In logical terms, the P-value tests all the assumptions about how the data were generated (the entire model), not just the targeted hypothesis it is supposed to test (such as a null hypothesis).
By getting a small P-value we can say that the data is more unusual if all the assumptions are correct; but a very small P-value does not tell us anything about the assumptions validity. Lets take an example, when a P-value is very small because of the false hypothesis target; but it can be small because of the study protocol violation, or maybe it was analyzed with incorrect data. And Conversely, a large P-value indicates that the data are not unusual under the statistical model, but it does not tell us anything about the model validity and assumptions. It can be large because of the study protocol violation, or maybe it was analyzed with incorrect data or just for presentation purpose to make a valid point.
A best way to build a good statistical model is by calculating confidence intervals and nowadays many journals requires confidence intervals.
This exclusive focus on null hypotheses in testing not only contributes to misunderstanding of tests and under appreciation of estimation , but also obscures the close relationship between P-values and confidence intervals, as well as the weaknesses they share.
References
I have made short notes from following resources.
The ASA’s Statement on p-Values: Context, Process, and Purpose
Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations