How to mine newsfeed data and extract interactive insights in Python

There appears to be one dominant cluster scattered all over the space: this is mainly due to the general news category.

By hovering over each point you can see the corresponding description. At first sight, the descriptions within a cluster deal with approximately the same topic. This makes sense, since we built our clusters from similarities between relevant keywords.

You can look up a topic in the dataframe above (based on its keywords), then switch to this plot to see the corresponding articles.

K-means separates the documents into disjoint clusters. The assumption is that each cluster corresponds to a single topic.
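To make the "disjoint clusters" point concrete, here is a minimal sketch assuming scikit-learn is installed. The four descriptions are made up for illustration, and the vectorizer and clustering settings are assumptions rather than the ones used earlier in this article. Whatever the data, each description ends up with exactly one cluster label.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions, invented for this example only.
descriptions = [
    "Facebook faces questions over user data privacy",
    "Congress questions Facebook about data privacy rules",
    "The team wins the championship final after extra time",
    "Star striker scores twice as the team wins the final",
]

# Turn each description into a TF-IDF vector.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(descriptions)

# Fit K-means with two clusters; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Every description receives a single integer label: a hard assignment.
for label, text in zip(labels, descriptions):
    print(label, text)
```

The `labels` array holds one integer per description, so a document cannot belong partly to one cluster and partly to another.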

However, a description may in reality be characterized by a “mixture” of topics. For example, take an article about Zuckerberg’s hearing before Congress: several topics will emerge from its keywords, such as privacy, technology, the Facebook app and data.
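A rough way to see what a "mixture" means is to count how many keywords of each topic appear in one description and normalize the counts. This is not LDA, only a hand-made illustration: the topic names, the keyword lists and the description below are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword lists, chosen only for illustration.
topic_keywords = {
    "privacy": {"privacy", "data", "consent"},
    "technology": {"app", "platform", "algorithm"},
    "politics": {"congress", "hearing", "senators"},
}

# A made-up description in the spirit of the example above.
description = (
    "Zuckerberg defends the Facebook app and its data privacy "
    "practices at a hearing before Congress"
)

# Lowercase the text and split it into alphabetic tokens.
tokens = re.findall(r"[a-z]+", description.lower())

# Count, for each topic, how many tokens match its keyword list.
hits = Counter()
for token in tokens:
    for topic, keywords in topic_keywords.items():
        if token in keywords:
            hits[topic] += 1

# Normalize the counts into proportions that sum to 1.
total = sum(hits.values())
mixture = {topic: hits[topic] / total for topic in topic_keywords}
print(mixture)
```

Here the single description gets a non-zero share of all three topics, which a hard cluster label cannot express.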

We’ll cover how to deal with this problem with the LDA algorithm.