Collaborative Filtering using Alternating Least Squares

Collaborative filtering is commonly used in recommender systems. The idea is that if you have a large set of item-user preferences, you can use collaborative filtering techniques to predict the missing ones. For example, given the purchase history of all users on an eCommerce website, you can use collaborative filtering to recommend which products a user might purchase next. The key assumption is that people who agreed in the past (purchased the same products) will agree in the future.

In this post I will look at how to perform Matrix Factorisation via Alternating Least Squares as a method of Collaborative Filtering. This method produced some of the best single-model results on the Netflix dataset.

The basic idea behind any matrix factorisation method is to find a smaller set of latent factors that can be used to predict missing entries.

[Figure: a user × product purchases matrix decomposed into a user latent-factor matrix and a product latent-factor matrix]

In the above example, we want to decompose the purchases matrix into a set of user latent factors and a set of product latent factors.

Usually we are not able to perform this factorisation using direct methods (e.g. SVD), as the input matrix contains mostly missing values. To get around this, we treat the latent factors as parameters to be learned and frame the factorisation as an optimisation problem.

One major distinction is the type of data we have available, as this will affect how we model it. Broadly speaking, it is split into explicit and implicit data.

Examples of explicit data collection include the following:

  • Asking a user to rate an item on a sliding scale.

  • Asking a user to rank a collection of items from favourite to least favourite.

  • Presenting two items to a user and asking them to choose the better of the two.

  • Asking a user to create a list of items that they like.

Examples of implicit data collection include the following:

  • Page views in an online store

  • Online purchase records.

  • Lists of items a user has listened to or watched, or apps they have installed.

Generally, companies wishing to solve these sorts of problems have large amounts of implicit data available. I will present how the Alternating Least Squares (ALS) method can be used for explicit data and then extend it to implicit data.

Let's assume we have (m) users and (n) items. We refer to users with (u) and (v), and to items with (i) and (j).

The users and items are associated through (r_{u,i}) which we will call observations. For explicit datasets this would be the ratings a user has given an item. Higher numbers mean stronger preference. For implicit datasets these would indicate user actions. For instance it might be the number of times a user has purchased an item or viewed an item online.

We will associate each user (u) with a user-factor vector (x_u) and each item with an item-factor vector (y_i). Our aim is to predict our observations from the latent factors:
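(\hat{r}_{u,i} = x_u^{\top}y_i)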

We want our predictions to be as close to the ground truth as possible. Therefore, to find the latent vectors, we frame this as an optimisation problem using a standard squared loss with regularisation:
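(\mathcal{L} = \sum_{u,i \in \text{observed}}(r_{u,i} - x_u^{\top}y_i)^2 + \lambda\left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right))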

Let’s group all the user-factors in an (m \times f) matrix (X) and all the item-factors in an (n \times f) matrix (Y). If we assume either the user-factors or the item-factors are fixed, this looks exactly like a regularised least squares problem. So, first let's assume (Y) is fixed and solve for (x_u):
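(x_u = (Y^{\top}Y + \lambda I)^{-1}Y^{\top}r_u)

where (r_u) is the vector of user (u)'s observations and, for explicit data, the rows of (Y) are restricted to the items user (u) has actually rated.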

Likewise, if we assume (X) is fixed, we solve for (y_i):
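(y_i = (X^{\top}X + \lambda I)^{-1}X^{\top}r_i)

where (r_i) is the vector of observations for item (i), restricted analogously.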

The idea is to iterate these two steps until some stopping criterion is reached. The original paper suggested 10 iterations.

For further ways to model explicit data, see this paper on the Netflix prize.

Example python implementation
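The post originally linked out to a full implementation. As a rough self-contained sketch (the function name, dense-matrix representation and default values here are my own, not the linked code), the explicit updates above look like:

```python
import numpy as np

def als_explicit(R, f=10, lam=0.1, n_iters=10, seed=0):
    """Minimal explicit-feedback ALS sketch. R is an (m, n) ratings
    matrix where zero marks a missing entry."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    X = rng.standard_normal((m, f)) * 0.01  # user factors
    Y = rng.standard_normal((n, f)) * 0.01  # item factors
    observed = R > 0
    for _ in range(n_iters):
        # Fix Y and solve a regularised least squares problem for each x_u,
        # using only the items user u has rated.
        for u in range(m):
            idx = observed[u]
            Yu = Y[idx]
            X[u] = np.linalg.solve(Yu.T @ Yu + lam * np.eye(f), Yu.T @ R[u, idx])
        # Fix X and solve for each y_i, using only the users who rated item i.
        for i in range(n):
            idx = observed[:, i]
            Xi = X[idx]
            Y[i] = np.linalg.solve(Xi.T @ Xi + lam * np.eye(f), Xi.T @ R[idx, i])
    return X, Y
```

Predicted ratings are then given by (XY^{\top}), i.e. X @ Y.T.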

Using the raw observations directly has been found not to work as well on implicit data. Instead we define a set of binary observation variables:
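(p_{u,i} = \begin{cases} 1 & \text{if } r_{u,i} > 0 \\ 0 & \text{if } r_{u,i} = 0 \end{cases})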

Now we want to define some confidence levels for each (p_{u,i}). When (r_{u,i} = 0) we have low confidence: it may be that the user has never been exposed to that item, or that it was unavailable at the time. When (r_{u,i}) is small, it could be explained by the user buying a gift for someone else, etc., so we should still have low confidence. When (r_{u,i}) is larger, we should have much more confidence.

We have complete freedom in how we define confidence. The original proposal was
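(c_{u,i} = 1 + \alpha r_{u,i})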

This way we have some minimal level of confidence for every (p_{u,i}), which grows as we observe more evidence. The (\alpha) parameter is tuneable to the dataset via cross-validation.

Our new optimisation function is
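(\mathcal{L} = \sum_{u,i} c_{u,i}(p_{u,i} - x_u^{\top}y_i)^2 + \lambda\left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right))

Note the sum now runs over all ((u, i)) pairs, not just the observed ones.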

We define the diagonal matrix (C^i), where (C^i_{u,u}=c_{u,i}) (and analogously (C^u), where (C^u_{i,i}=c_{u,i})). The parameter learning is then the same as a standard weighted least squares setup:
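(x_u = (Y^{\top}C^{u}Y + \lambda I)^{-1}Y^{\top}C^{u}p_u)

(y_i = (X^{\top}C^{i}X + \lambda I)^{-1}X^{\top}C^{i}p_i)

where (p_u) and (p_i) are the vectors of binary variables for user (u) and item (i) respectively.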

Computing (X^{\top}C^iX) can be expensive. One optimisation that can be used is the fact that (X^{\top}C^iX=X^{\top}X +X^{\top}(C^i - I)X). You can pre-compute (X^{\top}X) once per step, and notice that ((C^i - I)) only has non-zero entries where (r_{u,i}) is non-zero.

You might wonder: why bother with the confidence weights, and why not just scale the ratings instead? The problem there is that you turn a very sparse problem into a very dense one, and you will have trouble even numerically computing the MSE on the dense problem.

Example python implementation.
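Again, the post originally linked out to a full implementation. A minimal dense-matrix sketch of the implicit updates, using the pre-computation trick above (the function name and defaults are my own), might look like:

```python
import numpy as np

def als_implicit(R, f=10, lam=0.1, alpha=40.0, n_iters=10, seed=0):
    """Minimal implicit-feedback ALS sketch. R is an (m, n) matrix of
    non-negative observation counts (views, purchases, ...)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    X = rng.standard_normal((m, f)) * 0.01
    Y = rng.standard_normal((n, f)) * 0.01
    P = (R > 0).astype(float)   # binary variables p_ui
    C = 1.0 + alpha * R         # confidence levels c_ui
    I = np.eye(f)
    for _ in range(n_iters):
        # Pre-compute Y^T Y once per sweep and use
        # Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y; with sparse data only the
        # items where r_ui > 0 contribute to the second term.
        YtY = Y.T @ Y
        for u in range(m):
            cu = C[u]  # diagonal of C^u
            A = YtY + (Y * (cu - 1.0)[:, None]).T @ Y + lam * I
            X[u] = np.linalg.solve(A, Y.T @ (cu * P[u]))
        XtX = X.T @ X
        for i in range(n):
            ci = C[:, i]  # diagonal of C^i
            A = XtX + (X * (ci - 1.0)[:, None]).T @ X + lam * I
            Y[i] = np.linalg.solve(A, X.T @ (ci * P[:, i]))
    return X, Y
```

A production implementation would store (R) as a sparse matrix and only iterate over its non-zero entries when building each (A) and right-hand side.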

In the last section we will discuss a number of practical points to consider when applying this algorithm to real problems.

Convergence

Let’s look at the loss function again
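(\mathcal{L} = \sum_{u,i} c_{u,i}(p_{u,i} - x_u^{\top}y_i)^2 + \lambda\left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right))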

The loss is non-convex in ((X, Y)). When we assume (Y) is fixed, (\mathcal{L}) becomes convex in (X), and vice versa. Each step will therefore move us toward a minimum, but not necessarily a global minimum.

This means that each time you run the algorithm you will potentially get a different result, depending on how you initialise (X) or (Y). It is worth running the algorithm several times when testing different hyperparameter values.

Also note, if you make (\lambda) large enough you will converge to the same solution each time. This is because you are constraining your solution space.

Initialisation

As the problem is non-convex, initialising the parameters in a good way can help you converge to a solution faster.

One suggested approach, currently used in Spark:

Choose a unit vector uniformly at random from the unit sphere, but from the “first quadrant” where all elements are nonnegative. This can be done by choosing elements distributed as Normal(0,1) and taking the absolute value, and then normalising.
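A quick NumPy sketch of that initialisation (the helper name is my own):

```python
import numpy as np

def init_factors(n_rows, f, seed=0):
    """Each row: |Normal(0, 1)| samples, normalised to unit length."""
    rng = np.random.default_rng(seed)
    M = np.abs(rng.standard_normal((n_rows, f)))
    return M / np.linalg.norm(M, axis=1, keepdims=True)
```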

One interesting thought here is that better initialisations could start you off closer to a global minimum. I would welcome any comments if anyone is aware of better approaches; I've not seen any literature on this.

Hyperparameters

Exact values of (\lambda) and (\alpha) are data dependent and should be determined by cross-validation.

Intercepts

Interestingly, the original paper does not suggest using any form of user or item biases. A better approach could be to define:
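(\hat{r}_{u,i} = \mu + \beta_u + \beta_i + x_u^{\top}y_i)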

where (\mu) is a global average, (\beta_u) represents a user bias and (\beta_i) represents an item bias. These biases help us account for differences between users and items, e.g. some users buying lots of products.

The global average (\mu) makes sense with explicit data (the overall average rating). It may not make sense to use with implicit data (where most (r_{u,i}=0)).

Regularisation

Zhou et al. suggest weighting each term in the regularisation by (n_u) and (n_i), the number of observations for user (u) and item (i) respectively. The only reason I can see to do this is that it makes (\lambda) less dependent on the scale of the dataset, so a (\lambda) found on a sampled dataset should work well on the full dataset.
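With this weighting, the regularisation term becomes

(\lambda\left(\sum_u n_u\|x_u\|^2 + \sum_i n_i\|y_i\|^2\right))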

Stopping Criteria

When to stop the algorithm is usually governed by:

  • If a maximum number of iterations is met

  • If the difference between the measured MSE of the current iteration and previous iteration goes below some epsilon

Confidence Functions

You are free to use any function you like for the confidence weights. Straightforward linear scaling might not be what you want. For instance, in an eCommerce scenario, you might use weights along the lines of:

  • 3 for a product page view

  • 10 for an add to basket

  • 40 for a product purchase

  • etc.

ALS versus SGD

The loss function (\mathcal{L}) could also be optimised via stochastic gradient descent. This would potentially be easier and faster than ALS. However, ALS offers two advantages:

  • ALS is much easier to parallelise.

  • In my experience, ALS converges very quickly. Generally I find that fewer than 10 iterations will converge to a decent solution.

Hopefully this post provides an overall introduction to ALS and highlights some of the considerations you should take into account when implementing this algorithm.