On the consistency of ordinal regression methods

My latest work (with Francis Bach and Alexandre Gramfort) is on the consistency of ordinal regression methods. It has the wildly imaginative title of “On the Consistency of Ordinal Regression Methods” and is currently under review, but you can read the draft on arXiv. If you have any thoughts about it, please leave me a comment!

Update July 2017: this paper was published in the Journal of Machine Learning Research. The published version can be found here.

Ordinal what?

The problem of ordinal regression is an old one in supervised learning. Its roots can be traced back to the work of McCullagh[] in the 1980s. It is a supervised learning problem that shares properties with both multiclass classification and regression, yet is fundamentally different from both. It can be seen as the problem of predicting a target variable from labeled observations, where the target consists of discrete and ordered labels. As in multiclass classification, the target variable is discrete; as in regression (but unlike the multiclass setting), there is a meaningful order between the classes.

The most popular examples of ordinal regression arise when the target variable is a human-generated rating. For example, in a movie recommendation system the target variable can take the values “do-not-bother” ≺ “only-if-you-must” ≺ “good” ≺ “very-good” ≺ “run-to-see”. Using multiclass classification to predict this target would yield a suboptimal classifier, since it ignores the natural ordering between the labels. On the other hand, a regression algorithm assumes that the target variable is continuous, while here it is clearly discrete. Ordinal regression is the natural model for this kind of target.
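To make the “ordering matters” point concrete, here is a small toy example in NumPy. The integer encoding of the ratings and the choice of mean absolute error as the evaluation metric are my own illustration for this post, not something taken from the paper:

```python
import numpy as np

# The five ordered ratings, encoded by their rank:
# 0: do-not-bother, 1: only-if-you-must, 2: good, 3: very-good, 4: run-to-see
y_true = np.array([4, 4, 2, 0])

y_pred_close = np.array([3, 4, 2, 1])  # mistakes are off by one level
y_pred_far = np.array([0, 4, 2, 4])    # mistakes are off by several levels

for name, y_pred in [("close", y_pred_close), ("far", y_pred_far)]:
    zero_one = np.mean(y_pred != y_true)          # what plain multiclass accuracy sees
    abs_error = np.mean(np.abs(y_pred - y_true))  # takes the ordering into account
    print(f"{name}: 0-1 error = {zero_one}, absolute error = {abs_error}")

# Both predictors make two mistakes, so their 0-1 error is the same (0.5),
# but the absolute error distinguishes "off by one level" from "way off".
```

A plain multiclass classifier treats all mistakes as equally bad; ordinal regression methods instead exploit the fact that some mistakes are milder than others.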

Fisher consistency

The notion of Fisher consistency is also an old one in statistics, and goes back to the work of Fisher at the beginning of the 20th century. The rigorous definition is stated in the paper, so here I’ll just give an intuition.

In supervised learning, we observe random samples (in the form of (target, sample) pairs, usually denoted $(y_i, X_i)$) from a population (let's call it P) and build a model that predicts the target when it sees a new sample. Fisher consistency can be seen as a sanity check on the learning method: it states that if, instead of seeing a random sample, we had access to the full population P (which in real life never happens), then our classifier would achieve the best possible accuracy (such a classifier is usually called the Bayes rule or Bayes predictor).
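For readers who want something slightly more formal, the definition has roughly the following shape. This is a simplified sketch in my own notation, not the exact statement from the paper: $\psi$ denotes the surrogate loss that is actually minimized and $\ell$ the loss we care about,

$$
f^\ast \in \operatorname*{arg\,min}_{f\ \text{measurable}} \; \mathbb{E}_P\big[\psi(f(X), y)\big]
\quad \Longrightarrow \quad
\mathbb{E}_P\big[\ell(\operatorname{pred}(f^\ast(X)), y)\big] \;=\; \min_{g} \; \mathbb{E}_P\big[\ell(g(X), y)\big] \,,
$$

where $\operatorname{pred}$ maps the real-valued output of $f^\ast$ to one of the discrete labels. The right-hand side is precisely the risk of the Bayes predictor mentioned above.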

Fisher consistency is an important property that “allows us to design good loss functions with desirable properties”[]. Because of this, the Fisher consistency of the most widely used supervised learning methods has been investigated over the last decade. It has been shown (see e.g. []) that most common methods for binary classification are consistent. For the multiclass and ranking settings, the situation is more interesting, with some methods known to be inconsistent, such as one-vs-all SVM in multiclass classification and RankSVM in ranking.

To the best of my knowledge, this paper is the first to study the Fisher consistency of ordinal regression methods. It proves that, despite the negative results in multiclass classification[][] and ranking[][], common ordinal regression methods are Fisher consistent. In this respect, ordinal regression is closer to binary classification than to multiclass classification, and in fact some results in the paper can be seen as generalizations of known results for binary classification.

Highlights

In the paper we study the Fisher consistency of some popular ordinal regression methods. The methods that we analyze are the following (see Table 1 in the paper for their definitions): all threshold, cumulative link, immediate threshold and least absolute deviation. In the paper we present the following results:

  • We characterize the consistency of the all threshold and immediate threshold methods in terms of the derivative at zero of an auxiliary function (a sketch of the all threshold loss appears after this list).

  • We provide an excess risk bound for the all threshold loss.

  • We prove consistency of the least absolute deviation method. This was already done for the case of three classes by Ramaswamy et al.[]; here we extend the proof to an arbitrary number of classes.

  • We prove consistency of cumulative link models (a family that includes the venerable proportional odds model of McCullagh).

  • The consistency analysis suggests a novel loss function when the metric being optimized is the squared error. We test this novel loss on several datasets and report that it performs very competitively.
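To give a flavour of the surrogate losses involved, here is a minimal NumPy sketch of the all threshold loss referenced in the list above. The sign conventions and the choice of the logistic function as the surrogate are one common instantiation and are my own for this post; see Table 1 of the paper for the precise definitions.

```python
import numpy as np

def logistic_phi(t):
    """A decreasing convex surrogate of the 0-1 step function."""
    return np.log1p(np.exp(-t))

def all_threshold_loss(score, y, theta):
    """All threshold surrogate for a single sample.

    score : real-valued prediction f(x)
    y     : label in {0, ..., k-1}
    theta : array of k-1 non-decreasing thresholds

    The score should lie above the thresholds with index smaller than y
    and below the remaining ones; every threshold contributes a penalty
    through the surrogate, so badly violated thresholds dominate the loss.
    """
    signs = np.where(np.arange(len(theta)) < y, 1.0, -1.0)
    return np.sum(logistic_phi(signs * (score - theta)))

# Toy example with 5 ordered labels and 4 thresholds.
theta = np.array([-1.5, -0.5, 0.5, 1.5])
print(all_threshold_loss(score=0.2, y=3, theta=theta))   # only the threshold at 0.5 is on the wrong side
print(all_threshold_loss(score=-2.0, y=4, theta=theta))  # the score sits below all four thresholds
```

The immediate threshold loss has the same structure but only penalizes the threshold(s) adjacent to the true label.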