Statistical Assessments of AUC

In the scorecard development, the area under ROC curve, also known as AUC, has been widely used to measure the performance of a risk scorecard. Given everything else equal, the scorecard with a higher AUC is considered more predictive than the one with a lower AUC. However, little attention has been paid to the statistical analysis of AUC itself during the scorecard development.

While it might be less of a concern to rely on a simple comparison of AUC for the model selection in the development stage and then to pick the scorecard with a higher AUC, more attention should be called for on AUC analysis in the post-development stage. For instance, the senior management would need to decide whether it is worthy to retire a legacy scorecard that might be still performing and to launch the full-scale deployment of a new scorecard just for an increase in AUC that might not even be statistically significant. While the claim of certain business benefits can always be used as an argument in favor of the new scorecard, the justification would become even more compelling with a solid statistical evidence. What’s more, the model validation analyst might also want to leverage the outcome of AUC analysis to ensure the statistical soundness of new scorecards.

In the example below, two logistic regressions were estimated with AUC = 0.6554 and BIC = 6,402 for the model with 6 variables and AUC = 0.6429 and BIC = 6,421 for the model with 3 variables.

Both AUC and BIC statistics seemed to favor the first model. However, is a 2% difference in AUC significant enough to infer a better model? Under the Null Hypothesis of no difference in AUC, three statistical tests were employed to assess the difference in AUC / ROC between two models.

Based upon the above output, there is no strong statistical evidence against the Null Hypothesis.

In addition, a Vuong test is also performed, supporting no difference between two models after corrected for the Schwarz penalty.

Related