Explicit Matrix Factorization: ALS, SGD, and All That Jazz

In my last post, I described user- and item-based collaborative filtering which are some of the simplest recommendation algorithms. For someone who is used to conventional machine learning classification and regression algorithms, collaborative filtering may have felt a bit off. To me, machine learning almost always deals with some function which we are trying to maximize or minimize. In simple linear regression, we minimize the mean squared distance between our predictions and the true values. Logistic regression involves maximizing a likelihood function. However, in my post on collaborative filtering, we randomly tried a bunch of different parameters (distance function, top-k cutoff) and watched what happened to the mean squared error. This sure doesn’t feel like machine learning.

To bring some technical rigor to recommender systems, I would like to talk about matrix factorization where we do, in fact, learn parameters to a model in order to directly minimize a function. Explanations of matrix factorization often start with talks of “low-rank matrices” and “singular value decomposition”. While these are important for a fundamental understanding of this topic, I don’t find math-speak to be too helpful in understanding the basic concepts of various algorithms. Let me simply state the assumptions that basic matrix factorization makes.

Initial assumptions¶

We will use the same example data from the previous post, so recall that we had a user-by-item matrix where nonzero elements of the matrix are ratings that a user has given an item. Matrix factorization assumes that:

  • Each user can be described by k attributes or features. For example, feature 1 might be a number that says how much each user likes sci-fi movies.

  • Each item (movie) can be described by an analagous set of k attributes or features. To correspond to the above example, feature 1 for the movie might be a number that says how close the movie is to pure sci-fi.

  • If we multiply each feature of the user by the corresponding feature of the movie and add everything together, this will be a good approximation for the rating the user would give that movie.

That’s it. The beauty is that we do not know what these features are. Nor do we know how many (k) features are relevant. We simply pick a number for k and learn the relevant values for all the features for all the users and items. How do we learn? By minimizing a loss function, of course!

We can turn our matrix factorization approximation of a k-attribute user into math by letting a user u take the form of a k-dimensional vector $\textbf{x}{u}$. Similarly, an item *i* can be represented by a *k*-dimensional vector $\textbf{y}{i}$. User u’s predicted rating for item i is just the dot product of their two vectors

where $\hat r_{ui}$ represents our prediction for the true rating $r_{ui}$, and $\textbf{y}{i}$ ($\textbf{x}{u}^{\intercal}$) is assumed to be a column (row) vector. These user and item vectors are often called latent vectors or low-dimensional embeddings in the literature. The k attributes are often called the latent factors. We will choose to minimize the square of the difference between all ratings in our dataset ($S$) and our predictions. This produces a loss function of the form

Note that we’ve added on two $L_{2}$ regularization terms at the end to prevent overfitting of the user and item vectors. Our goal now is to minimize this loss function. Derivatives are an obvious tool for minimizing functions, so I’ll cover the two most popular derivative-based methods. We’ll start with Alternating Least Squares (ALS).

Alternating Least Squares¶

For ALS minimiztion, we hold one set of latent vectors constant. For this example, we’ll pick the item vectors. We then take the derivative of the loss function with respect to the other set of vectors (the user vectors). We set the derivative equal to zero (we’re searching for a minimum) and solve for the non-constant vectors (the user vectors). Now comes the alternating part: With these new, solved-for user vectors in hand, we hold them constant, instead, and take the derivative of the loss function with respect to the previously constant vectors (the item vectors). We alternate back and forth and carry out this two-step dance until convergence.

Derivation¶

To explain things with math, let’s hold the item vectors ($\textbf{y}{i}$) constant and take the derivative of the loss function with respect to the user vectors ($\textbf{x}{u}$)

A couple things happen above: let us assume that we have $n$ users and $m$ items, so our ratings matrix is $n \times m$. We introduce the symbol $Y$ (with dimensioins $m \times k$) to represent all item row vectors vertically stacked on each other. Also, the row vector $\textbf{r}_{u}$ just represents users u’s row from the ratings matrix with all the ratings for all the items (so it has dimension $1 \times m$). Lastly, $I$ is just the identity matrix which has dimension $k \times k$ here.

Just to make sure that everything works, let’s check our dimensions. I like doing this with Einstein notation. Although this seems like an esoteric physics thing, there was a reason it was invented - it makes life really simple! The basic tenant is that if one observes a variable for a matrix index more than once, then it is implicitly assumed that you should sum over that index. Including the indices in the last statement from the derivation above, this appears as the following for a single user’s dimension k

When you carry out all the summations over all the indices on the right hand side of the above statement, all that’s left are u’s as the rows and k’s as the columns. Good to go!

The derivation for the item vectors is quite similar

Now that we have our equations, let’s program this thing up!

Computation¶

Just like the last post, we’ll use the MovieLens dataset. I’ll gloss over this part because it’s been previously covered. Below, I import libraries, load the dataset into memory, and split it into train and test sets. I’ll also create a helper function in order to easily calculate our mean squared error (which is the metric we’re trying to optimize).