Likes Out! Guerilla Dataset!

– Zack de la Rocha

tl;dr -> I collected an implicit feedback dataset along with side-information about the items. This dataset contains around 62,000 users and 28,000 items. All the data lives here inside of this repo. Enjoy!

In a previous post, I wrote about how to use matrix factorization and explicit feedback data to build recommendation systems. Explicit feedback is data where a user has given a clear preference for an item, such as a star rating for an Amazon product or a numerical rating for a movie, as in the MovieLens data. A natural next step is to discuss recommendation systems for implicit feedback, which is data where a user has shown a preference for an item indirectly through their behavior, like “number of minutes listened” for a song on Spotify or “number of times clicked” for a product on a website.

Implicit feedback-based techniques likely constitute the majority of modern recommender systems. When I set out to write a post on these techniques, I found it difficult to find suitable data. This makes sense: most companies are loath to share users’ click or usage data (and for good reasons). A cursory Google search revealed a couple of datasets that people use, but I kept finding issues with them. For example, the Million Song Dataset was shown to have some issues with data quality, while many other people just repurposed the MovieLens or Netflix data as though it were implicit (which it is not).

This started to feel like one of those “fuck it, I’ll do it myself” things. And so I did.

All code for collecting this data is located on my GitHub. The collected data itself lives in this repo as well.

Sketchfab

Back when I was a graduate student, I thought for some time that maybe I would work in the hardware space (or at a museum, or the government, or a gazillion other things). I wanted to have public, digital proof of my (shitty) CAD skills, and I stumbled upon Sketchfab, a website which allows you to share 3D renderings that anybody else with a browser can rotate, zoom, or watch animate. It’s kind of like YouTube for 3D (and now VR!).

Users can “like” 3D models, which is an excellent implicit signal. It turns out you can actually see which user liked which model. This presumably allows one to reconstruct the classic recommendation system “ratings matrix”, with users as rows, 3D models as columns, and likes as the nonzero elements of the sparse matrix.
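To make the “ratings matrix” idea concrete, here is a minimal sketch of how (user, model) like pairs could be indexed into sparse-matrix coordinates. The function name build_likes_matrix and the toy user and model names are made up for illustration; they are not from the collected dataset, and a real pipeline would likely hand these indices to a sparse matrix library instead of keeping plain dicts.

```python
from collections import defaultdict


def build_likes_matrix(likes):
    """Index (user, model) like pairs into sparse-matrix coordinates.

    Returns the user -> row and model -> column mappings, plus a dict
    mapping each row to the set of columns that user has liked.
    """
    user_index = {}
    model_index = {}
    rows = defaultdict(set)
    for user, model in likes:
        # Assign the next free row/column the first time an id is seen.
        u = user_index.setdefault(user, len(user_index))
        m = model_index.setdefault(model, len(model_index))
        # A like is binary, so a repeated pair does not change the matrix.
        rows[u].add(m)
    return user_index, model_index, dict(rows)


# Hypothetical toy data: the last pair is a deliberate duplicate.
likes = [
    ("alice", "teapot"),
    ("alice", "robot"),
    ("bob", "teapot"),
    ("alice", "teapot"),
]
user_index, model_index, rows = build_likes_matrix(likes)

# Fraction of the users-by-models matrix that is nonzero.
n_users, n_models = len(user_index), len(model_index)
density = sum(len(cols) for cols in rows.values()) / (n_users * n_models)
```

With tens of thousands of users and items, nearly every entry of this matrix is zero, which is why only the nonzero coordinates are stored.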

Okay, I can see the likes on the website, but how do I actually get the data?

Crawling with Selenium

When I was at Insight Data Science, I built an ugly script to scrape a tutoring website. This was relatively easy. The site was largely static, so I used BeautifulSoup to simply parse through the HTML.

Sketchfab is a more modern site with extensive JavaScript. One must wait for the JavaScript to render the HTML before parsing through it. One way to automate this is to use Selenium, which essentially lets you write code to drive an actual web browser.

To get up and running with Selenium, you must first download a driver to run your browser. I went here to get a Chrome driver. The Python Selenium package can then be installed using Anaconda from the conda-forge channel (conda install -c conda-forge selenium).

Opening a browser window with Selenium is quite simple.
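As a rough sketch (not the exact code used for the crawl), opening a page could look something like the following. The helper name open_browser and its make_driver parameter are my own inventions for illustration. The default assumes that Selenium is installed and that it can find a Chrome driver, for example one sitting on your PATH.

```python
def open_browser(url, make_driver=None):
    """Open `url` in a browser window and return the driver object.

    `make_driver` is any zero-argument callable that returns a
    WebDriver-like object. It defaults to Selenium's Chrome driver.
    """
    if make_driver is None:
        # Imported lazily so that Selenium is only required when the
        # default Chrome driver is actually used.
        from selenium import webdriver

        make_driver = webdriver.Chrome
    browser = make_driver()
    # Navigate to the page. Content rendered by JavaScript after the
    # initial load may still need an explicit wait before parsing.
    browser.get(url)
    return browser
```

Calling open_browser("https://sketchfab.com") should pop up a Chrome window pointed at the site. From there, browser.title gives the page title, and browser.quit() closes the window when you are done.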