In my last article, I presented a brief overview of the three major kinds of recommendation systems: contentbased methods, collaborative filtering, and hybrid models. I also introduced a recent model by the fashion company Lyst called Light FM, which combines product metadata, customer information, and transaction history through a latent space to come up with recommendations for customers.
What I liked about Light FM is that it makes use of all available information from customers, from products, and from transactions to come up with a recommendation. Since the Lyst paper is a bit esoteric on the math (if you don’t have a mathy background), I think it makes sense to explain the model in an intuitive manner.
For simplicity’s sake, let’s assume that we want to build a recommendation system for movies (similar to Netflix). The Light FM model asks three things of us:

information on our movies. For example, year released, lead actor or actress, genre

information on our customers. For example, age, sex, civil status

information on movies the customers watched. Did Sally watch ‘mother!’?
Given these three pieces of information, Light FM is able to give a recommendation. The contentbased part here is due to the first two bullets. We represent the movies and customers with a set of distinct features. We use these features to describe the items, similar to a contentbased system. The third bullet is the collaborativefiltering (CF) component. CF models tend to work well even if we don’t have information on our products or customers, granted we have tons of transaction data. But in the case of a new website, there are only a few transactions, and CF filters tend to do bad (i.e. the cold start problem).
So how does the Light FM model integrate these three pieces of information?
In a usual CF model, we “discover” latent features which characterize each movie based on the customermovie interactions (which we collect into a vector called m), and we “discover” the affinity of each customer to each latent feature discovered (which we collect into a vector called c). This is done through a process called nonnegative matrix factorization (NMF). The pairing score given to a pair of customer and movie is measured as simply the cosine similarity (basically the overlap) between vectors m and c.
The NMF procedure is an example of an embedding model. An embedding model finds hidden (or latent) structure from the data. So NMF might discover that Deadpool is 30% action, 30% superhero and 40% comedy, and that Sally likes 10% action, 20% superhero and 70% comedy. Then the SallyDeadpool score is the overlap between the two vectors. Movie recommendations for Sally can be made by computing the similarity of Sally’s latent features to all movies and then picking the five movies with the highest similarity.
In the case of Light FM, instead of finding latent representations of movies and customers directly, we instead get the latent representation of each feature for every movie and customer. Following this idea, the latent representation of a movie is just the sum of the latent representations of the movie’s features. Similarly for customers, we just add the latent representations of the customer’s features to get the latent representation for a customer. The score for a moviecustomer pair is, again, the cosine similarity of the latent representations of the movie and the customer.
For example, if we only define three features for our movie set, say scariness, funniness, and “drama”ness (which are measured from 1 to 1), the Light FM model might discover the latent features that (1) relate to the color profile of the movie (say 1 is monochrome and 1 is super colorful) and (2) relate to the number of deaths in the movie (again, 1 for no deaths and 1 for many deaths).
Suppose that we feed the model the data that we have and wait for it to finish running. The model is then able to infer that the latent representation for scariness is (0.8, 0.9) since scary movies tend to have very muted colors and many deaths, for funniness (0.9, 0.7), and for dramaness (0.6, 0.9).
So for example, Before Sunrise in terms of its features could be given a representation (0, 0.5, 0.8). And so, its latent representation in terms of Light FM is the weighted sum of the latent vectors of its features. That is, 0(0.8, 0.9) + 0.5(0.9, 0.7) + 0.8*(0.6, 0.9) = (0.93, 0.37). Thus it seems that based on our representations, Before Sunrise is quite colorful and not really a lot of death.
We can do the same for the users. The features of each user are represented in terms of a latent vector that contains the affinity of the user to color and deaths. The user is then represented as the sum of the latent vectors of each of its features.
Thus, we can measure the affinity of the user to a specific movie by getting the cosine similarity of their latent representations.
As I mentioned in the last article, a very good reference for Recommender Systems is the book by Aggrawal.