In this post we’re going to do a bunch of cool things following up on the last post introducing implicit matrix factorization. We’re going to explore Learning to Rank, a different method of optimizing implicit matrix factorization, and then use the library LightFM to incorporate side information into our recommender. Next, we’ll use scikit-optimize to be smarter than grid search for cross-validating hyperparameters. Lastly, we’ll see that we can move beyond simple user-to-item and item-to-item recommendations now that we have side information embedded in the same space as our users and items. Let’s go!
History lesson
When I started working at Birchbox, I was tasked with finding a way to incorporate our rich customer information into the implicit feedback matrix factorization recommendation system. I had no idea what I was doing, so I did a lot of Google searching. This proved a difficult problem because recommender systems commonly follow one of two paradigms: content-based approaches, like when one has demographic customer data and uses this data to find other similar customers, and “ratings-based” approaches, like when one has data on what each user rated each item they interacted with. The desire was for me to marry these two approaches.
Reading through the classic survey article, Matrix Factorization Techniques for Recommender Systems (pdf link), revealed a section titled “Additional Input Sources”. This contains one such approach to incorporating so-called “side information” into a recommender system. The idea was relatively simple (well, simple relative to the initial hurdle of wrapping one’s head around matrix factorization). Let’s say that in a regular matrix factorization model, a user $u$ is represented by a single vector $\textbf{x}_{u}$ containing that user’s latent factors (see previous posts for more background on this). Now, let’s say that we have some demographic information about that user like
| Feature | Value |
|---|---|
| gender | Female |
| age_bucket | 25-34 |
| income_bucket | $65-79K |
We can one-hot-encode each of these features into an “attribute-space” $A$ and assume that each attribute $a$ has its own latent vector $\textbf{s}_{a}$. Lastly, we make the assumption that our “total” user vector is the original user vector $\textbf{x}_{u}$ plus each of the relevant attribute vectors. If $N(u)$ represents the set of attributes that pertain to user $u$, then the total user vector is

$$
\textbf{x}_{u} + \sum\limits_{a \in N(u)} \textbf{s}_{a}
$$
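As a quick illustration, here is a toy numpy sketch of that sum (the dimensions and attribute names are made up for this example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 5

x_u = rng.normal(size=n_factors)   # the user's own latent factors
s = {                              # latent factors for each attribute in N(u)
    'gender=Female': rng.normal(size=n_factors),
    'age_bucket=25-34': rng.normal(size=n_factors),
    'income_bucket=$65-79K': rng.normal(size=n_factors),
}

# Total user vector: x_u plus the sum of the user's attribute vectors.
total_user_vector = x_u + np.sum(list(s.values()), axis=0)
```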
One can make the same set of assumptions for side information about the items, and now you can not only achieve better results with your recommendations, but you can also learn latent vectors for the side information and, consequently, similarities between those vectors. For a general overview of this, see a post I wrote for Dia’s tech blog.
Ok, the approach is clear, so presumably we just add this to last post’s implicit feedback objective function and solve, right? Well, I ran the math, and unfortunately this is not scalable. With the extra set of side information vectors, last post’s Alternating Least Squares (ALS) becomes a three-way alternating problem. There was a special trick in that ALS optimization that exploited the sparsity of the data to scale the computation. Alas, this trick cannot be used during the stage of ALS where one is solving for the side information vectors.
So what to do now?
Learning to Rank - BPR
It turns out that there is another method of optimizing implicit feedback matrix factorization problems which is neither ALS nor conventional stochastic gradient descent (SGD) on last post’s objective function. This method of optimization typically goes by the name Learning to Rank and originated in information retrieval theory.
A classic method of using Learning to Rank with implicit feedback was in the paper BPR: Bayesian Personalized Ranking from Implicit Feedback (pdf link) first-authored by Steffen Rendle, who is kind of a badass in all things implicit and factorized. The idea is centered around sampling positive and negative items and running pairwise comparisons. Let’s assume that for this example our dataset consists of the number of times users have clicked on various items on a website. BPR proceeds as follows (in simplified form):
1. Randomly select a user $u$ and then select a random item $i$ which the user has clicked on. This is our positive item.
2. Randomly select an item $j$ which the user has clicked on *fewer* times than item $i$ (this includes items that they have never clicked). This is our negative item.
3. Use whatever equation you want to predict the “score”, $p_{ui}$, for user $u$ and positive item $i$. For matrix factorization, this may be $\textbf{x}_{u} \cdot \textbf{y}_{i}$.
4. Predict the score for user $u$ and negative item $j$, $p_{uj}$.
5. Find the difference between the positive and negative scores: $x_{uij} = p_{ui} - p_{uj}$.
6. Pass this difference through a sigmoid and use it as a weighting for updating all of the model parameters via stochastic gradient descent (SGD).
This method seemed kind of radical to me when I first read it. Notably, we do not care about the actual value of the score that we are predicting. All we care about is that we rank items which the user has clicked on more frequently higher than items which the user has clicked on fewer times. And thus, our model “learns to rank” :)
Because this model employs a sampling-based approach, the authors show that it can be quite fast and scalable relative to other, slower methods. Additionally, the authors argue that BPR directly optimizes the area under the ROC curve (AUC), which can be a desirable characteristic. Most importantly for me, it allows one to easily add in side information without blowing up the computational cost.
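To make the update step concrete, here is a minimal numpy sketch of a single BPR SGD step for plain matrix factorization (the function name, learning rate, and regularization strength are my own choices, not from the paper):

```python
import numpy as np

def bpr_sgd_step(X, Y, u, i, j, lr=0.05, reg=1e-4):
    """One BPR update on user factors X and item factors Y for triple (u, i, j)."""
    x_u, y_i, y_j = X[u].copy(), Y[i].copy(), Y[j].copy()
    x_uij = x_u @ y_i - x_u @ y_j            # difference of predicted scores
    weight = 1.0 / (1.0 + np.exp(x_uij))     # sigmoid-derived weight for the update
    # Gradient ascent on ln(sigmoid(x_uij)) minus L2 regularization:
    X[u] += lr * (weight * (y_i - y_j) - reg * x_u)
    Y[i] += lr * (weight * x_u - reg * y_i)
    Y[j] += lr * (-weight * x_u - reg * y_j)
```

Note that the weight shrinks toward zero as the model ranks the positive item well above the negative item, so confident pairs produce ever smaller updates.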
Learning to Rank - WARP
A close relative of BPR is the Weighted Approximate-Rank Pairwise loss (WARP loss), first introduced in WSABIE: Scaling Up To Large Vocabulary Image Annotation by Weston et al. WARP is quite similar to BPR: you sample a positive and negative item for a user, predict for both, and take the difference. In BPR you then make the SGD update with this difference as a weight. In WARP, you only run the SGD update if you predict wrong, i.e. you predict that the negative item has a higher score than the positive item. If you do not predict wrong, then you keep drawing negative items until you either get the prediction wrong or reach some cutoff value.
The authors of the WARP paper claim that this process shifts the optimization from AUC (as in BPR) to precision@k. This seems more relevant for most recommender systems, and I usually have better luck optimizing WARP loss over BPR loss. WARP does introduce two hyperparameters, though. One is the margin, which determines how wrong your prediction must be to trigger the SGD update. In the paper the margin is 1, meaning the update fires whenever $p_{uj} \gt p_{ui} - 1$, i.e. whenever the negative item scores within the margin of the positive item. The other hyperparameter is the cutoff, which determines how many times you’re willing to draw negative samples trying to get a wrong prediction before you give up and move on to the next user.
This sampling approach causes WARP to run quickly when it starts training, because an untrained model easily gets predictions wrong. After training for a while, though, WARP runs slower, because it has to sample and score many items before it finds a wrong prediction and can update. Sounds like a good problem to have if it’s hard to get your model to predict wrong, though.
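Here is a similarly minimal, hypothetical sketch of WARP’s negative-sampling loop with the margin and cutoff made explicit; the log-based weight is a common approximation of the rank-based weighting in the paper, and the names are mine:

```python
import numpy as np

def sample_warp_negative(X, Y, u, i, pos_items, n_items,
                         margin=1.0, cutoff=50, seed=0):
    """Draw negatives for user u until one violates the margin vs. positive item i.

    pos_items: set of item indices the user has clicked on.
    Returns (negative item index, update weight), or (None, 0.0) on give-up.
    """
    rng = np.random.default_rng(seed)
    p_ui = X[u] @ Y[i]
    for n_draws in range(1, cutoff + 1):
        j = int(rng.integers(n_items))
        if j in pos_items:                    # only true negatives count
            continue
        if X[u] @ Y[j] > p_ui - margin:       # prediction is "wrong enough"
            # Estimate the positive item's rank from how many draws it took,
            # and weight the update by ~log(rank): fewer draws, bigger update.
            rank_estimate = (n_items - 1) // n_draws
            return j, np.log(rank_estimate + 1.0)
    return None, 0.0                          # model ranks i well; skip the update
```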
LightFM
Back to my Birchbox story. I originally implemented BPR with side information in pure numpy and scipy. This proved to be quite slow, and around that time the LightFM package was open sourced by the people at Lyst. LightFM is written in Cython and is parallelized via Hogwild! SGD. This blew my code out of the water, and I happily switched over to LightFM.
LightFM uses the same method as above to incorporate side information: it assumes that the total “user vector” is the sum of each of the user’s relevant side information vectors (which it calls user “features”) and treats the items analogously. Above, we assumed that we had two types of latent vectors: $\textbf{x}_{u}$ and $\textbf{s}_{a}$. LightFM treats everything as side information, or features. If we want to have a specific user vector for user $u$, then we must one-hot-encode that as a single feature for that user.
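As an illustration of how this looks in practice, here is a sketch of building such a user feature matrix with scipy sparse matrices; the sizes and density are made up, and the commented-out fit call assumes a hypothetical likes_matrix of interactions:

```python
import scipy.sparse as sp
from lightfm import LightFM

# Hypothetical sizes: 1,000 users with 20 one-hot demographic attributes.
# When you pass user_features, LightFM represents users purely through
# features, so we prepend an identity block to give each user their own
# "indicator" feature alongside the side information.
n_users, n_attrs = 1000, 20
user_side_info = sp.random(n_users, n_attrs, density=0.15, format='csr')
user_features = sp.hstack([sp.identity(n_users), user_side_info]).tocsr()

model = LightFM(loss='warp')
# model.fit(likes_matrix, user_features=user_features, epochs=10)
```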
Let’s go ahead and install LightFM, read in our good old Sketchfab “likes” data as well as the model tags, and see if we can get better results with LightFM than we did last post using classic ALS.
Installation
LightFM is on pypi, so the simplest way to install it is with pip:
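```bash
pip install lightfm
```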
If you are on a Mac, then you will unfortunately not be able to run your code in parallel out of the box. If you would like to use parallelization, you must first make sure you have gcc, which can be installed with brew:
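```bash
brew install gcc
```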
Be careful, this takes like 30 minutes. The simplest way I’ve then found to build LightFM is to trick it. First, clone the repository:
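```bash
git clone https://github.com/lyst/lightfm.git
```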
Then, open `setup.py`, go to where the variable `use_openmp` is defined, and hard-set it to `True`. Then, `cd lightfm && pip install -e .`
With all that done, let’s write some code to train some models.
Data preprocessing
I took a lot of the functions used last time for arranging the Sketchfab data into a matrix and placed them all in a `helpers.py` file in the rec-a-sketch repo.
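As a rough sketch of what that arranging entails (the column names and toy data below are my assumption for illustration, not necessarily what’s in `helpers.py`), user and model ids get mapped to matrix indices and packed into a sparse likes matrix:

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Toy likes data: each row is one (user, model) "like".
df = pd.DataFrame({'uid': ['a', 'a', 'b'], 'mid': ['m1', 'm2', 'm1']})

# Map each id to a row/column index of the interactions matrix.
uid_to_idx = {uid: i for i, uid in enumerate(df['uid'].unique())}
mid_to_idx = {mid: i for i, mid in enumerate(df['mid'].unique())}

rows = df['uid'].map(uid_to_idx)
cols = df['mid'].map(mid_to_idx)
likes = sp.csr_matrix((np.ones(len(df)), (rows, cols)),
                      shape=(len(uid_to_idx), len(mid_to_idx)))
```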