The proliferation of e-commerce websites such as Amazon and content-based subscription services like Netflix and Spotify has made it essential that the right product or item is delivered to the right customer. This is one area where big data shines. In a nutshell, we want to answer the following question: What product should we recommend to the customer?
In general, there are two approaches:
-
Look at the set of products and use product information to find similar products. Thus, when a user opens a link to a specific product, recommend the top 5 products most similar to the viewed product. This approach is called content-based since we are looking at the product, not at the customer.
-
Look at the transaction history of your customers. For a specific customer, find customers with similar transaction history and recommend the top 5 products that similar customers bought but the customer under study hasn’t yet. This approach is called collaborative filtering since we’re taking into account user-product interactions and collaboration among customers.
Content-Based Recommendation Systems
Content-based systems leverage product information for its recommendations. For example, if a person looks at the book Tuesdays with Morrie by Mitch Albom on your online bookstore, a content-based system would probably recommend The Five People You Meet in Heaven or For One More Day because it would look at the author field.
There are many ways we can build a content-based system. The limit is usually set by the amount of data you have on your products. For a given product, we build a numerical vector composed of features derived from the available metadata. One feature could be, for example, the category it belongs to (Appliance vs Fashion), the year it was published (if you are selling books), or the actors involved (if you’re recommending movies). Once feature vectors are defined for all products, we can get the (cosine) similarity among all products in the database. Given a specific product, we can find the top 5 most similar items and recommend these items once a customer looks at the product in question.
Usually, the tools of Natural Language Processing are very useful for content-based systems since products are usually accompanied by text descriptions. What we need to do is transform the text into features. There are two approaches: bag-of-words (BOW) models and embedding models. Bag-of-word models do not take into account the context of the word, so ‘good man’ and ‘man good’ are the same from the BOW point of view. On the other hand, embedding models like Word2Vec and Doc2Vec transform words an documents into dense vectors which encode context and meaning. The canonical example is the observation that Wor2Vec discovers the relation “king + woman – man = queen.”
More sophisticated models also take into account the image of the item. For this, tools in Computer Vision such as convolutional neural networks can be used.
The upside to the content-based methods is that we don’t really need a lot of transactions to build the models – we just need information on the products. The downside is that the model doesn’t learn from transactions, so the performance of content-based systems doesn’t really improve over time.
Collaborative Filtering
Collaborative filtering leverages product transactions to give recommendations. In this type of model, for a specific customer, we find similar customers based on transaction history and recommend items that the customer in question hasn’t purchased yet and which the similar customers tended to like.
The standard approach to carry out collaborative filtering is non-negative matrix factorization (NMF). We first construct the sparse interaction matrix I which has customers in its rows and products in its columns. We then try to decompose I to a product of two matrices C and P^T. Here, C is the customer matrix, wherein each row is a customer and each column is the affinity of the customer to the latent feature discovered by the NMF. Similarly, P is the product matrix, wherein each row is a product and each column is the extent the product embodies the latent feature.
These latent features are discovered by the NMF procedure. It is up to the modeler to decide how many latent features you want for the factorization. For example, suppose that we want 2 latent features in our movie recommendation engine. Latent feature discovered by NMF could be if whether the movie is a horror movie (that’s latent feature one) and whether the movie is a romance movie (latent feature two). A row of C then represents the customer’s affinity to horror and romance, while a row of P represents the extent the movie belongs to horror and romance. Similar to content-based systems, we can measure the similarity of these two vectors with cosine similarity to infer how much each customer likes each movie. Using that information, we can recommend movies to users.
When people mention recommendation systems, what they usually mean is the collaborative-filtering model. This type of model leverages transaction data, so its performance improves over time as more and more transactions are made. This is its advantage over content-based systems. The downside is that this method is very transaction-dependent. So if you have a new online store, you cannot expect to make good recommendations given a collaborative-filtering-type model. This is called the cold start problem.
Hybrid Systems
Hybrid systems leverage both item metadata and transaction data to give recommendations. There are many ways we can mix and match collaborative filtering models and content-based models. One such model is called Light FM, made by the fashion company Lyst.
Light FM uses item features, user features and transaction data to give recommendations. It does so by representing item features and user features in a latent embedding similar to Word2Vec. Then, each item and each user is represented by the vector sum of the latent features. Recommendations are made for a user by sorting the cosine similarity between the user’s vector representation and each item’s representation.
Feedback: Explicit vs. Implicit
One issue that e-commerce websites usually face is a lack of feedback. For Amazon, there are explicit 1- to 5-star reviews for each product, so making recommendation systems is a relatively straightforward task. This type of data is explicit feedback. What can one do if there aren’t a lot of reviews for your products? You can leverage implicit data, such as the number of orders made for a certain product or the number of clicks that the item receives. These constitute implicit feedback. LightFM has models that take into account either explicit and implicit feedback.
The Bottomline
The literature on Recommendation Systems is very large and there are still lots of open questions. To get started on the topic, the resource I recommend is Recommender Systems by Aggarwal. It is very comprehensive and perfect for new learners on the topic.