Approximating Implicit Matrix Factorization with Shallow Neural Networks

Implicit Matrix Factorization

The Implicit Matrix Factorization (IMF) algorithm, as presented by Hu, Koren and Volinsky, is a popular and effective method for recommending items to users. This approach was born from necessity: while explicit feedback about our users’ tastes - a questionnaire completed at user signup, for example - makes building a recommendation engine straightforward, we often have nothing more than implicit feedback data - view count, click count, time spent on page, for example - which serves instead as a proxy for those preferences. Crucially, the latter feedback is asymmetric: while a high view count might indicate positive preference for a given item, a low view count cannot be said to indicate the opposite. Perhaps the user simply didn’t know the item was there.

IMF begins with a ratings matrix $R$, where $r_{u, i}$ gives the implicit feedback value observed for user $u$ and item $i$. Next, it constructs two other matrices, $P$ and $C$, defined as follows:

$P$ is a binary matrix indicating our belief in the existence of each user’s preference for each item. $C$ gives our confidence in the existence of that preference where, trivially, larger values of $r_{u, i}$ give us higher confidence that user $u$ indeed likes item $i$:

$$p_{u, i} = \begin{cases} 1 & r_{u, i} > 0 \\ 0 & r_{u, i} = 0 \end{cases} \qquad\qquad c_{u, i} = 1 + \alpha r_{u, i}$$
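To make this construction concrete, here is a minimal sketch, assuming the ratings live in a dense numpy array `R` (in practice this matrix is sparse, and the values below are purely illustrative):

```python
import numpy as np

# Toy streams-count matrix: 3 users x 4 items (illustrative values only)
R = np.array([
    [10, 0, 3, 0],
    [0, 7, 0, 1],
    [2, 0, 0, 5],
], dtype=float)

alpha = 40.0                  # confidence scaling, to be cross-validated

P = (R > 0).astype(float)     # binary preference matrix
C = 1.0 + alpha * R           # confidence matrix: grows with observed feedback
```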

Next, IMF outlines its goal: let’s embed each user and item into $\mathbb{R}^f$ such that their dot product approximates the former’s true preference for the latter. Naming user vectors $x_u \in \mathbb{R}^f$ and item vectors $y_i \in \mathbb{R}^f$, we then compute the argmin of the following objective:

$$\min_{x_*,\, y_*} \sum_{u, i} c_{u, i}\left(p_{u, i} - x_u^T y_i\right)^2 + \lambda\left(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\right)$$

Once this objective is sufficiently minimized, we can compute predicted preferences $\hat{p}_{u, i} = x_u^Ty_i$ for unobserved $\text{(user, item)}$ pairs; recommendation then becomes:

  1. For a given user $u$, compute predicted preferences $\hat{p}_{u, i} = x_u^Ty_i$ for all items $i$.

  2. Sort these predictions in descending order.

  3. Return the top $n$ items.
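As a sketch of these three steps, assuming the learned user and item factors are stored in numpy arrays `X` and `Y` (one row per user and per item, respectively - names chosen here for illustration):

```python
import numpy as np

def recommend(u, X, Y, n=10, seen=None):
    """Return the indices of the top-n items for user u, ranked by predicted preference."""
    p_hat = X[u] @ Y.T                  # step 1: predicted preferences for every item
    if seen is not None:
        p_hat[list(seen)] = -np.inf     # optionally exclude items the user has already consumed
    return np.argsort(-p_hat)[:n]       # steps 2 and 3: sort descending, keep the top n
```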

Shallow neural networks

IMF effectively gives a function $f: (u, i) \rightarrow \hat{p}_{u, i}$. As before, our goal is to minimize the objective above, which we can now rewrite as:

$$\min_{f} \sum_{u, i} c_{u, i}\left(p_{u, i} - f(u, i)\right)^2$$

with regularization applied, as before, to the parameters of $f$.

To approximate this function, I turn to our favorite universal function approximator: neural networks optimized with stochastic gradient descent.
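Concretely, training with stochastic gradient descent means flattening the matrices above into one example per $\text{(user, item)}$ pair, with $p_{u, i}$ as the regression target and $c_{u, i}$ as a per-example weight. A minimal sketch, with toy data and variable names of my own choosing:

```python
import numpy as np

# Toy ratings matrix and the derived P and C, as in the earlier sketch
R = np.array([[10, 0, 3, 0], [0, 7, 0, 1], [2, 0, 0, 5]], dtype=float)
alpha = 40.0
P, C = (R > 0).astype(float), 1.0 + alpha * R

n_users, n_items = R.shape

# One training example per (user, item) pair
user_idx, item_idx = np.meshgrid(np.arange(n_users), np.arange(n_items), indexing='ij')
user_idx, item_idx = user_idx.ravel(), item_idx.ravel()

targets = P.ravel()   # p_{u,i}: what the network should predict
weights = C.ravel()   # c_{u,i}: how heavily each squared error counts

# With a Keras model mapping (user, item) index pairs to p-hat, training is then, e.g.:
# model.fit([user_idx, item_idx], targets, sample_weight=weights, epochs=10)
```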

This work is built around a toy web application I authored long ago: dotify. At present, dotify:

  • Pulls data nightly from Spotify Charts. These data contain the number of streams for that day’s top 200 songs for each of 55 countries.

  • Computes an implicit matrix factorization nightly, giving vectors for both countries and songs.

  • Allows the user to input a “country-arithmetic” expression, e.g. “I want music like Colombia x Turkey - Germany.” It then performs this arithmetic with the chosen country vectors and recommends songs for the resulting composite vector.
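To illustrate that last step, here is a minimal sketch of what such an expression might reduce to, assuming the learned vectors live in a dict keyed by country name and that the “x” denotes element-wise multiplication - both assumptions of mine, not necessarily dotify’s implementation:

```python
import numpy as np

# Hypothetical learned vectors: one per country (keyed by name), one row per song
country_vectors = {name: np.random.randn(32) for name in ['Colombia', 'Turkey', 'Germany']}
song_vectors = np.random.randn(1000, 32)

# "Colombia x Turkey - Germany"
composite = (country_vectors['Colombia'] * country_vectors['Turkey']
             - country_vectors['Germany'])

# Recommend the songs whose vectors score highest against the composite
scores = song_vectors @ composite
top_songs = np.argsort(-scores)[:10]
```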

In this work, I first fit and cross-validate an implicit matrix factorization model, establishing the three requisite parameters: $f$, the dimensionality of the latent vectors; $\alpha$, the scalar multiple used in computing $C$; and $\lambda$, the regularization strength used in our loss function.
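This cross-validation amounts to a small grid search over the three values. A sketch of the loop, where `fit_imf` and `ranking_score` are hypothetical helpers standing in for fitting the factorization and scoring it on held-out data, and the candidate grids are illustrative:

```python
from itertools import product

# Illustrative candidate grids for the three parameters
f_grid = [16, 32, 64]
alpha_grid = [10, 40, 100]
lambda_grid = [0.01, 0.1, 1.0]

best_score, best_params = float('-inf'), None
for f, alpha, lam in product(f_grid, alpha_grid, lambda_grid):
    X, Y = fit_imf(R_train, f=f, alpha=alpha, lam=lam)    # hypothetical: fit user/item factors
    score = ranking_score(X, Y, R_validation)             # hypothetical: held-out ranking metric
    if score > best_score:
        best_score, best_params = score, (f, alpha, lam)
```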

Network architectures

Next, I explore five different shallow neural network architectures in an attempt to improve upon the observed results. These architectures are as follows:

  1. A trivially “Siamese” network which first embeds each country and song index into $\mathbb{R}^f$ in parallel then computes a dot-product of the embeddings. This is roughly equivalent to what is being done by IMF.

  2. Same as the previous, but with a bias embedding for each set, in $\mathbb{R}$, added to the dot-product (a Keras sketch of this architecture follows the list).

  3. Same as #1, but concatenate the vectors instead. Then, stack 3 fully-connected layers with ReLU activations, batch normalization after each, and dropout after the first. Finally, add a 1-unit dense layer on the end, and add bias embeddings to the result. (NB: I wanted to add the bias embeddings to the respective $\mathbb{R}^f$ embeddings at the outset, but couldn’t figure out how to do this in Keras.)

  4. Same as #2, except feed in the song title text as well. This text is first tokenized, then padded to a maximum sequence length, then embedded into a fixed-length vector by an LSTM, then reduced to a single value by a dense layer with a ReLU activation. Finally, this scalar is concatenated to the scalar output that #2 would produce, and the result is fed into a final dense layer with a linear activation - i.e. a linear combination of the two.

  5. Same as #4, except feed in the song artist index as well. This index is first embedded into a vector, then reduced to a scalar by a dense layer with a ReLU activation. Finally, this scalar is concatenated with the two scalars produced in the second-to-last layer of #4, then fed into a final dense layer with a linear activation. Like the previous, this is a linear combination of the three inputs.
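For concreteness, the following is a minimal Keras sketch of architecture #2 - parallel country and song embeddings, a dot-product, and per-country and per-song bias terms. Layer choices, names and sizes are my illustrative reading of the description above, not dotify’s actual code:

```python
from tensorflow.keras import layers, Model

n_countries, n_songs, f = 55, 10_000, 32   # illustrative sizes; f = latent dimensionality

country_in = layers.Input(shape=(1,), name='country_index')
song_in = layers.Input(shape=(1,), name='song_index')

# Embed each index into R^f, in parallel
country_vec = layers.Flatten()(layers.Embedding(n_countries, f)(country_in))
song_vec = layers.Flatten()(layers.Embedding(n_songs, f)(song_in))

# Bias embeddings in R, one per country and one per song
country_bias = layers.Flatten()(layers.Embedding(n_countries, 1)(country_in))
song_bias = layers.Flatten()(layers.Embedding(n_songs, 1)(song_in))

# Dot-product of the two embeddings, plus the biases
dot = layers.Dot(axes=1)([country_vec, song_vec])
p_hat = layers.Add()([dot, country_bias, song_bias])

model = Model(inputs=[country_in, song_in], outputs=p_hat)
# Passing c_{u,i} as `sample_weight` at fit time recovers IMF's confidence-weighted squared error
model.compile(optimizer='adam', loss='mse')
```

Dropping the two bias embeddings recovers architecture #1; replacing the dot-product with a concatenation followed by the stack of dense layers described above gives architecture #3.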

While the various network architectures are intriguing for both the ease with which we can incorporate multiple, disparate inputs into our function and the ability to then train this function end-to-end with respect to our main minimization objective, we find no reason to prefer them over the time-worn Implicit Matrix Factorization algorithm.