Introduction
There are lots of articles about training and evaluating recommenders, but few explain how to overcome the challenges involved in setting up a full-scale system.
Most libraries don’t support scalable production systems out of the box. The main challenges are:
- Predicting dynamically – when you have very large user/item dimensionality, it can be very inefficient, or impossible, to precompute all recommendations.
- Optimizing response times – when you create predictions dynamically, the time you need to retrieve them is very important.
- Frequently updating models – when the system needs to incorporate new data as it becomes available, it’s crucial to frequently update the models.
- Predicting on unseen data – how to deal with unseen users or items, and how to deal with continuously changing features.
We will show how you can modify a model to extend its functionality for a full-scale production environment.
Hybrid recommender models can deal better with real-world challenges
We use LightFM, a very popular Python recommendation library that implements a hybrid model. It is well suited for small to medium-sized recommender projects, where you don’t require distributed training.
A short recap of the different recommender approaches
There are two basic approaches to recommendation:
Collaborative models use only collaborative information – implicit or explicit interactions of users with items (e.g. movies watched, rated or liked). They use no information on the actual items (like movie category, genre, etc.).
Collaborative models can reach high precision with little data. However, they can’t handle unknown users or items (the cold start problem).
Content-based models work purely on the data available about items or users, entirely ignoring interactions between users and items. Thus, they approach recommendations very differently from collaborative models.
Content-based models usually:
- Require much more training data (examples need to be available for almost every single user/item combination), and
- Are much harder to tune than collaborative models.
However, they can make predictions for unseen items and usually have better coverage compared to collaborative models.
Hybrid Recommenders – like LightFM – combine both approaches, and overcome a lot of the challenges of each individual approach.
They can deal with new items or new users:
When you deploy a collaborative model to production, you’ll often run into the problem that you need to predict for unseen users or items. E.g. a new user registers or visits your website (new user), or your content team publishes a new article (new item).
Usually you have to wait at least until the next training cycle, or until the user interacts with some item, to be able to make recommendations for these users.
However, the hybrid model can make predictions even in this case: It will simply use the partially available features to compute the recommendations.
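With LightFM this amounts to passing a one-row feature matrix at prediction time. Here is a minimal sketch on toy data; all sizes and feature values are made up for illustration:

```python
import numpy as np
import scipy.sparse as sp
from lightfm import LightFM

# Toy stand-ins: 100 known users, 20 items, 5 user and 4 item features.
rng = np.random.default_rng(0)
interactions = sp.coo_matrix(rng.integers(0, 2, size=(100, 20)))
user_features = sp.csr_matrix(rng.integers(0, 2, size=(100, 5)))
item_features = sp.csr_matrix(rng.integers(0, 2, size=(20, 4)))

model = LightFM(no_components=16)
model.fit(interactions, user_features=user_features,
          item_features=item_features, epochs=5)

# A brand-new user: absent from training, but we know their features.
# We build a one-row feature matrix; user_ids=0 refers to that row,
# not to a trained user id.
new_user_row = sp.csr_matrix(np.array([[1.0, 0.0, 1.0, 0.0, 0.0]]))
scores = model.predict(user_ids=0, item_ids=np.arange(20),
                       user_features=new_user_row,
                       item_features=item_features)
print(np.argsort(-scores))  # items ranked for the unseen user
```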
They can deal with missing features
Often, some features are missing for some users and items (simply because you couldn’t collect them yet), which is a problem if you are relying on a content-based model.
Hybrid recommenders can perform well for returning users (known from training) as well as for new users/items, given you have features about them. This is especially useful for items, but also for users (you could ask users what they are interested in when they visit your site for the first time).
System Components
This system assumes that there are far fewer items than users, as it always retrieves predictions for all items. But it can serve as the base for more complex recommenders.
The core of the system is a Flask app that receives a user id and returns the relevant items for this user. It will (re)load the LightFM model and query a Redis instance for item and/or user features.
We will assume that user and item features are serialized and stored in a Redis database, so they can be retrieved at any time by the Flask app.
All applications will be deployed as microservices via Docker containers.
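As a preview of Part II, here is a minimal sketch of that endpoint. The Redis key layout (`user:<id>`), the model file name, and the `predict_online` method (which we develop below) are assumptions of this sketch:

```python
import pickle

import redis
from flask import Flask, jsonify

app = Flask(__name__)
feature_store = redis.Redis(host="redis", port=6379)

# Assumed: the pickled model is the OnlineLightFM subclass built below.
with open("lightfm_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/recommend/<user_id>")
def recommend(user_id):
    # User features are assumed to be stored as pickled sparse rows.
    raw = feature_store.get(f"user:{user_id}")
    user_features = pickle.loads(raw) if raw is not None else None
    item_ids, scores = model.predict_online(user_id, user_features)
    return jsonify(items=item_ids.tolist(), scores=scores.tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```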
How LightFM makes Predictions
But how does LightFM make predictions?
The LightFM paper is very informative for the academic reader but maybe a little brief for someone who is not very familiar with the domain. I’ll outline the LightFM prediction process in a more straightforward way below.
Explanation regarding the formulas:
- Lowercase letters refer to vectors and uppercase letters refer to matrices.
- The subscript $u$ refers to a single user and $U$ refers to the set of all users. Items are referred to in the same way.
Most of the naming is consistent with the LightFM paper.
Model Components
So LightFM combines the best of the collaborative and the content-based approaches. You might say it models one component for each of the two approaches. Both are needed to give us the properties we want from the recommender.
Collaborative component
The collaborative component allows you to fall back to a collaborative filtering algorithm in case you don’t have any features – or the features are not informative.
State-of-the-art collaborative filtering algorithms are implemented with matrix factorization. They estimate two latent (unobserved) matrix representations which, when multiplied with each other, reproduce the matrix of interactions for each item and user that the model saw during training. Of course there is an error term to allow for noise and avoid overfitting.
A simple analogy: try to factorize 12. We can do this with 2 and 6, 3 and 4, 1 and 12, etc. It works similarly for matrices.
We will call those matrices latent representations, as they are a compressed form of our interaction data.
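To make the analogy concrete, here is a toy NumPy sketch that recovers two latent matrices from an interaction matrix with plain gradient descent. LightFM itself optimizes a ranking loss, so this only illustrates the idea, not its actual procedure:

```python
import numpy as np

# A 4x5 interaction matrix, approximately reconstructed from two
# low-rank factors of width d=2 (the matrix version of 12 = 3 x 4).
rng = np.random.default_rng(42)
interactions = rng.integers(0, 2, size=(4, 5)).astype(float)

d = 2
user_latent = rng.normal(size=(4, d))   # one row per user
item_latent = rng.normal(size=(5, d))   # one row per item

# Gradient descent on the squared reconstruction error.
for _ in range(2000):
    error = user_latent @ item_latent.T - interactions
    user_latent -= 0.01 * (error @ item_latent)
    item_latent -= 0.01 * (error.T @ user_latent)

print(np.round(user_latent @ item_latent.T, 2))  # close to `interactions`
```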
Content-based component
The content-based component allows us to get predictions even when we have no interaction data.
LightFM incorporates user and item features by associating the features with the latent representations. The assumption is that features and latent representations are linearly related. Thus, in vector form:
$$q_u = f_u \cdot E_U + b_u$$
with $q_u$ being the latent user representation, $f_u$ a single user’s feature row vector, $E_U$ the estimated user feature embeddings, and $b_u$ the biases for the user features (we’ll leave them out from now on for simplicity).
Looks similar to linear regression, right? Except that $E_U$ is a matrix instead of $\beta$, which is usually a vector. In fact, this performs multiple regressions: one for each model component. Again, it’s analogous for items.
During training, the user embeddings as well as the item embeddings are estimated with the help of gradient descent algorithms. The embedding matrix will have a row for each feature. The columns of the embedding matrix are called components; their number is set as a model hyperparameter and from now on referred to as $d$.
The above image recaps this process for all users and all items. In Step I we have a matrix multiplication of the user feature matrix of shape $N_{users} \times N_{user\_features}$ with the embedding matrix of shape $N_{user\_features} \times d$. The same applies to the second multiplication of the item features with the item embeddings. The result of Step I are two matrices of shape $N_{users} \times d$ and $N_{items} \times d$ respectively, thus representing each user/item as a latent vector of size $d$.
In the final step those two matrices are multiplied, resulting in the final score for each user/item pair, a matrix of shape $N_{users} \times N_{items}$.
Now you can easily get all scores for a single user with the following term:
$$r_u = q_u \cdot Q_I^T$$
with $q_u$ being the row vector of the user’s latent representation, and $Q_I$ the matrix of all items’ latent representations.
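Expressed in NumPy, with random stand-ins for the trained embeddings, Steps I and II and the single-user term look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 3, 4, 8
n_user_features, n_item_features = 6, 5

# Stand-ins: binary feature matrices and already "trained" embeddings.
user_features = rng.integers(0, 2, size=(n_users, n_user_features))
item_features = rng.integers(0, 2, size=(n_items, n_item_features))
E_U = rng.normal(size=(n_user_features, d))  # user feature embeddings
E_I = rng.normal(size=(n_item_features, d))  # item feature embeddings

# Step I: latent representations (biases left out, as in the text).
Q_U = user_features @ E_U  # shape: (n_users, d)
Q_I = item_features @ E_I  # shape: (n_items, d)

# Step II: a score for every user/item pair.
scores = Q_U @ Q_I.T       # shape: (n_users, n_items)

# Scores for a single user u: r_u = q_u . Q_I^T
r_u = Q_U[0] @ Q_I.T       # shape: (n_items,)
```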
Enable Fallback to Collaborative Mode with Indicator Matrices
LightFM can generate models with collaborative information only.
It uses a very effective trick: if no user or item features are used at all, the model will accept an identity matrix of size $N_{users}$ or $N_{items}$ respectively. This is very effective, since the model then learns $d$ components for each user. This way the model can always fall back to the best pure collaborative method. You can think of these components as the model’s memory of the users and items it has already seen during training.
You can also force the model to fall back to collaborative mode even when you do have features: you can modify the feature matrix by appending an identity matrix to it, as sketched below. Sometimes you’ll need this to get your model to converge. However, this usually means your features are too noisy or don’t carry enough information for the model to converge to a minimum by itself.
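With scipy.sparse, both variants of the trick can be sketched like this (the feature matrix is a random stand-in):

```python
import numpy as np
import scipy.sparse as sp

n_users, n_user_features = 100, 5
user_features = sp.csr_matrix(
    np.random.default_rng(0).integers(0, 2, size=(n_users, n_user_features))
)

# Pure collaborative mode: the identity matrix gives every user their own
# indicator column, so the model learns a d-dimensional row per user.
indicators = sp.identity(n_users, format="csr")

# Forced fallback: append the indicators to the real features. The embedding
# matrix then has one row per feature plus one row per user.
hybrid_features = sp.hstack([user_features, indicators], format="csr")
```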
Finally, using this trick increases the effort of taking the model to production: during training, the index of a user is used to retrieve the correct row of the corresponding feature/identity matrix – and this information might not be available anymore in a production environment, since LightFM hands the responsibility for maintaining these id mappings off to the user.
Interesting Fact
The latent representations of similar items/users (in terms of collaborative information) that you obtain by using indicator features only will be close in Euclidean space. The model estimates them from collaborative information, so you can use them to compute similarities between your items or users.
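For example, assuming `model` is a LightFM instance fitted on indicator features only, its `item_embeddings` attribute holds one $d$-dimensional row per item, and nearest neighbours in that space are the most similar items:

```python
import numpy as np

# `model` is assumed to be a fitted LightFM instance trained with
# indicator (identity) features only, so each row of `item_embeddings`
# belongs to exactly one item.
emb = model.item_embeddings  # shape: (n_items, d)

# Euclidean distance from item 0 to every item; smallest = most similar.
dists = np.linalg.norm(emb - emb[0], axis=1)
print(np.argsort(dists)[1:6])  # five nearest neighbours, item itself skipped
```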
Recreating indicators and features on the fly
Let’s now implement a model that is able to fall back to collaborative mode, keeps track of ids, and is thus able to reconstruct the correct features and indicators.
We will focus on implementing a fairly complete approach. This is quite complex, because the model should be able to give predictions in most situations. We will subclass the LightFM class and add a special predict_online method, which is intended to be used in production.
This way we still use LightFM’s Cythonized prediction functions and avoid handling user and item id mappings separately.
It should satisfy these requirements:
1. Reconstruct indicator features if the user/item was seen during training
2. Make online predictions no matter what data is available for a certain user
3. Make those predictions as quickly as possible
ID Mappings
To achieve requirement 1, you will have to use the same class during training as well. You also need to adjust your subclass to accept only sparsity.SparseFrame objects during training, and to create and save id mappings from them.
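A minimal sketch of this idea; the class name and the way the external ids are passed in are our own choices, and with a sparsity.SparseFrame they would come from its index and columns:

```python
from lightfm import LightFM


class OnlineLightFM(LightFM):
    """Hypothetical subclass that remembers the id mappings used in training."""

    def fit(self, interactions, user_ids, item_ids, **kwargs):
        # `user_ids`/`item_ids` are the external ids in matrix order,
        # e.g. the index and columns of a sparsity.SparseFrame.
        self.user_mapping_ = {uid: row for row, uid in enumerate(user_ids)}
        self.item_mapping_ = {iid: col for col, iid in enumerate(item_ids)}
        return super().fit(interactions, **kwargs)
```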
Reconstructing features
In order to achieve requirement 2, you need to check what data is available every time a request comes in. There are 16 cases you have to handle:
Case | User known | User features | Item known | Item features |
---|---|---|---|---|
I | Yes | Yes | Yes | Yes |
II | Yes | No | Yes | Yes |
III | No | Yes | Yes | Yes |
IV | No | No | Yes | Yes |
V | Yes | Yes | Yes | No |
VI | Yes | No | Yes | No |
VII | No | Yes | Yes | No |
VIII | No | No | Yes | No |
IX | Yes | Yes | No | Yes |
X | Yes | No | No | Yes |
XI | No | Yes | No | Yes |
XII | No | No | No | Yes |
XIII | Yes | Yes | No | No |
XIV | Yes | No | No | No |
XV | No | Yes | No | No |
XVI | No | No | No | No |
In cases IV, VIII and XII we simply return our baseline predictions. For cases XIII – XVI we can’t give any predictions, because we don’t know enough about the items.
To summarize: we basically want to create a row vector which contains the user features, if they are available, and is otherwise all zeros at the respective indices. It also contains the user indicator set at the correct index if the user was seen during training.
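A sketch of a helper that builds such a row. The names and the exact layout (real features first, the indicator block appended last, mirroring the training matrix) are our assumptions:

```python
import scipy.sparse as sp

def build_user_row(user_id, known_features, user_mapping,
                   n_user_features, n_users):
    """Hypothetical helper: real features first, indicator block last."""
    row = sp.lil_matrix((1, n_user_features + n_users))
    if known_features is not None:            # None: nothing collected yet
        row[0, :n_user_features] = known_features
    internal_idx = user_mapping.get(user_id)  # None: unseen during training
    if internal_idx is not None:
        row[0, n_user_features + internal_idx] = 1.0
    return row.tocsr()
```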
The item features are handled analogously to the user features, except that we expect them to fit easily into memory, which allows for a cache. You might consider using a different caching strategy based on your use case (e.g. a TTLCache), or not caching at all.
We also want to support not adding indicators, or adding them only to the user or the item features, which makes the implementation a little more complex. Still, we tried to keep it as simple as possible.
Below you’ll find an example implementation of the approach described in this post. It should handle all cases up to VIII correctly. Not all item cases are implemented, though, as our application didn’t require them. E.g. predicting known items without item features is not possible, but would be very easy to add.
Example Implementation
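Here is a condensed sketch of such a class, putting the pieces from this post together. Everything beyond the LightFM API – the method names, the feature layout with indicators appended last, the cached item features – is our own illustration, and the baseline fallback is only hinted at:

```python
import numpy as np
import scipy.sparse as sp
from lightfm import LightFM


class OnlineLightFM(LightFM):
    """Sketch of a production-ready LightFM subclass (names are ours)."""

    def fit(self, interactions, user_ids, item_ids,
            user_features=None, item_features=None, **kwargs):
        # Requirement 1: remember the external-id -> row/column mappings.
        self.user_mapping_ = {u: i for i, u in enumerate(user_ids)}
        self.item_mapping_ = {it: j for j, it in enumerate(item_ids)}
        self.n_users_ = len(user_ids)
        self.n_user_features_ = (user_features.shape[1]
                                 if user_features is not None else 0)
        # Append indicator blocks to enable the collaborative fallback.
        user_features = self._add_indicators(user_features, self.n_users_)
        item_features = self._add_indicators(item_features, len(item_ids))
        # Requirement 3: cache the item features, they fit into memory.
        self._item_features = item_features
        return super().fit(interactions, user_features=user_features,
                           item_features=item_features, **kwargs)

    @staticmethod
    def _add_indicators(features, n):
        eye = sp.identity(n, format="csr")
        if features is None:
            return eye
        return sp.hstack([features, eye], format="csr")

    def predict_online(self, user_id, user_feature_row=None):
        """Requirement 2: score all items, whatever we know about the user.

        `user_feature_row` is a 1 x n_user_features sparse row, or None
        if no features have been collected for this user yet.
        """
        row = sp.lil_matrix((1, self.n_user_features_ + self.n_users_))
        if user_feature_row is not None:
            row[0, :self.n_user_features_] = user_feature_row.toarray().ravel()
        internal_idx = self.user_mapping_.get(user_id)
        if internal_idx is not None:
            # Known user: set the indicator for the collaborative fallback.
            row[0, self.n_user_features_ + internal_idx] = 1.0
        elif user_feature_row is None:
            # Unknown user without features: nothing to predict from.
            raise ValueError("No data for user – return baseline instead.")
        n_items = self._item_features.shape[0]
        scores = self.predict(
            user_ids=np.zeros(n_items, dtype=np.int32),
            item_ids=np.arange(n_items, dtype=np.int32),
            user_features=row.tocsr(),
            item_features=self._item_features,
        )
        ranking = np.argsort(-scores)
        return ranking, scores[ranking]
```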
In Part II of this blog post we will use this class, connect it to a Redis database, and serve its predictions dynamically with Flask. We’ll also show you how to update the model without downtime, using a background thread started from within the Flask application.