Opinion mining on Dutch news articles

In this blog post, I will show you how to mine opinions about companies from news articles. I will share how I scraped thousands of news articles in a few minutes and how one could classify the opinion expressed in the titles of those articles. This information could be used, for example, to monitor a company's competitors or to predict global trends.

What components are needed?

The following components are needed:

  • A method for scraping news.
  • A method for scraping opinions (ratings, etcetera).
  • A sentiment model for answering the following question: “What words/phrases are related to what sentiment?”

First, the news scraper is explained. Then, the opinion scraper and the sentiment model are covered, and finally the whole pipeline is tested in the real world!

RSS webscraper for news headlines

I used Scrapy for scraping webpages and RSS feeds. The first thing to do is to set up the project with Scrapy, by executing the following command:

scrapy startproject collect

Then, move into the project folder (as explained in the message when setting up the project) and create a spider pointing at the domain to scrape the news from.
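For example, using Scrapy's genspider helper from inside the project folder (news.nl is the placeholder domain I use throughout this post):

cd collect
scrapy genspider news news.nl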

I ended up with the following RSS spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Selector
from scrapy.spiders import XMLFeedSpider

from collect.items import NewsItem


class NewsRSSSpider(XMLFeedSpider):
    name = 'news'
    allowed_domains = ['news.nl']
    start_urls = ['https://www.news.nl/sitemap_index.xml']
    namespaces = [
        ('ns', 'https://www.sitemaps.org/schemas/sitemap/0.9'),
        ('news', 'https://www.google.com/schemas/sitemap-news/0.9')
    ]

    def parse(self, response):
        self.logger.info('Parsing %s' % response.url)

        selector = Selector(response, type='xml')
        self._register_namespaces(selector)

        # Extract news items when found
        nodes = selector.xpath('//news:news')
        for node in nodes:
            fields = ['name', 'language', 'genres', 'publication_date', 'title', 'keywords']
            item = NewsItem()
            for field in fields:
                value = node.xpath('news:%s/text()' % field).extract_first()
                item[field] = value
            yield item

        # Follow URLs when found
        nodes = selector.xpath('//ns:loc/text()')
        urls = [node.extract() for node in nodes]
        urls = [url for url in urls if url.endswith('.xml')]
        urls = sorted(urls, reverse=True)
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

And the following item:

import scrapy


class NewsItem(scrapy.Item):
    name = scrapy.Field()
    language = scrapy.Field()
    genres = scrapy.Field()
    publication_date = scrapy.Field()
    title = scrapy.Field()
    keywords = scrapy.Field()

Then, I can execute the spider and export all headlines to a CSV file:

scrapy crawl news -o news.csv -t csv

HTML webscraper for Dutch opinions

Then, I created a webscraper to scrape Dutch opinions. This resulted in the following Python code:

# -*- coding: utf-8 -*-
import scrapy

from collect.items import ReviewItem


class OpinionSpider(scrapy.Spider):
    name = 'opinions'
    allowed_domains = ['opinions.nl']
    start_urls = ['https://opinions.nl/opinions?page=1']

    def parse(self, response):
        self.logger.info('Parsing %s' % response.url)

        # Extract the reviewer, the feedback text and the rating of each review
        for node in response.xpath('//li[contains(@class, "review-preview")]'):
            reviewer = node.xpath('div[contains(@class, "review-reviewer")]//h3[contains(@class, "name")]//text()').extract()
            reviewer = ''.join(reviewer).strip()
            feedback = node.xpath('div[contains(@class, "review-feedback")]//text()').extract()
            feedback = ''.join(feedback).strip()
            rating = node.xpath('div[contains(@class, "review-scores")]//span[contains(@class, "review-rating")]//text()').extract_first()
            if rating is None:
                continue
            rating = float(rating.strip().replace(',', '.'))
            yield ReviewItem(user=reviewer, rating=rating, feedback=feedback)

        # Follow pagination links
        for node in response.xpath('//a/@href'):
            url = node.extract()
            if url.startswith('/opinions?page='):
                yield scrapy.Request('https://opinions.nl/%s' % url[1:], callback=self.parse)

And the following item:

import scrapy


class ReviewItem(scrapy.Item):
    user = scrapy.Field()
    rating = scrapy.Field()
    feedback = scrapy.Field()

I then executed the spider and exported the results to a CSV file containing thousands of Dutch opinions (opinions.csv), as shown below.
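Analogous to the news spider, exporting the opinions could be done with a command like this (assuming the spider lives in the same Scrapy project):

scrapy crawl opinions -o opinions.csv -t csv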

Training a sentiment model

After scraping all the data, the sentiment model can be built. I used Chainer to create the neural network. First, opinions.csv is loaded:

import pandas as pd

df_review = pd.read_csv('opinions.csv')

After loading the CSV, preprocessing is done. The preprocessing cleans the raw review text and tokenizes it into words. The rating (scale 1 – 10) is normalized to the range 0.0 – 1.0 so that the model can work with it. This continuous variable is then also discretized into three classes: negative sentiment (< 0.25), neutral sentiment (between 0.25 and 0.75) and positive sentiment (> 0.75). Accents are removed as well, and reviews containing characters outside the specified alphabet are dropped. This results in the following preprocessing code:

import re
from unidecode import unidecode
import nltk

alphabet = list('abcdefghijklmnopqrstuvwxyz".,:;?!()\r\n&-+ ')
df_review.rating = pd.to_numeric(df_review.rating, errors='coerce')
df_review = df_review.dropna()
df_review = df_review.assign(sentiment=df_review.rating.apply(lambda x: (x - 1.0) / 9.0))
df_review.feedback = df_review.feedback.apply(unidecode)
df_review = df_review.assign(feedback_lowered=df_review.feedback.str.lower())
df_review = df_review.assign(nonalphabet_chars=df_review.feedback_lowered.apply(lambda x: re.sub('[%s]' % re.escape(''.join(alphabet)), '', x)))
df_review = df_review[df_review.nonalphabet_chars.apply(len) == 0]
df_review = df_review.assign(words=df_review.feedback_lowered.apply(nltk.word_tokenize))
df_review = df_review.assign(target=df_review.sentiment.apply(lambda x: 0.0 if x < 0.25 else 0.5 if x <= 0.75 else 1.0))

Then, the vocabulary is defined. An “UNK” marker is added for unknown words, i.e. words that do not occur in the scraped reviews (these will show up later when the model is applied to news headlines). Then, the words of each review are converted to identifiers:

import numpy as np

words = sorted(list(set(word for words in df_review.words.values.tolist() for word in words)))
words = ['UNK'] + words
vocab = {word: i for i, word in enumerate(words)}

def preprocess(text):
    text = text.lower()
    words = nltk.word_tokenize(text)
    return np.asarray([vocab.get(word, vocab.get('UNK')) for word in words], dtype='i')

df_review = df_review.assign(ids=df_review.feedback.apply(preprocess))
df_review
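As a quick sanity check (the example sentence is made up), preprocess should map words that occur in the scraped reviews to their own identifiers and anything unseen to the identifier of 'UNK':

# Hypothetical example: known tokens get their own id, unseen tokens
# fall back to the id of 'UNK' (0).
print(preprocess('de service was uitstekend'))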

Creating the dataset

The next step is to create a dataset and a converter for the data. The dataset is a DatasetMixin object in Chainer: its __len__() method returns the number of items in the dataset and its get_example(i) method fetches the i-th example. The converter is used to turn a list of examples drawn by the dataset iterator into a batch. This results in the following code:

from chainer import dataset

class DutchSentimentDataset(dataset.DatasetMixin):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def get_example(self, i):
        item = self.df.iloc[i]
        return {
            'X': item.ids,
            't': item.sentiment
        }

def converter(batch, device=None):
    return {
        'X': [item['X'] for item in batch],
        't': np.hstack([item['t'] for item in batch]).reshape((-1, 1)),
    }

ds = DutchSentimentDataset(df_review_balanced[['sentiment', 'ids']])
batch = [ds.get_example(_) for _ in range(5)]
converter(batch)

This gives the following output:

As you can see, this is a batch consisting of 5 items. All the items in this sample have negative sentiment (< 0.25). The word identifiers per opinion are listed under 'X'.
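Note that the dataset above is built from df_review_balanced, which is not defined earlier in this post; presumably the reviews were first balanced over the three sentiment classes. A minimal sketch of such a balancing step, assuming the target column from the preprocessing code, could look like this:

# Hypothetical balancing step (not shown in the original post): sample an equal
# number of reviews from each discretized sentiment class.
n_per_class = df_review.groupby('target').size().min()
df_review_balanced = (
    df_review.groupby('target', group_keys=False)
             .apply(lambda group: group.sample(n_per_class, random_state=0))
)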

The model code and training process

Now it is time to define the model and train it! This is done with the following Chainer code:

from chainer.iterators import SerialIterator
import numpy as np
import chainer
import chainer.links as L
import chainer.functions as F
from chainer import optimizers
from chainer import training
from chainer import report
from chainer.training import extensions
import json

def sequence_embed(embed, xs):
    # Embed a list of variable-length sequences in one call and split the
    # result back into per-sequence embeddings.
    x_len = [len(x) for x in xs]
    x_section = np.cumsum(x_len[:-1])
    ex = embed(F.concat(xs, axis=0))
    exs = F.split_axis(ex, x_section, 0)
    return exs

class RNNEncoder(chainer.Chain):

    def __init__(self, n_source_vocab, n_emb_units):
        super(RNNEncoder, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(n_source_vocab, n_emb_units)
            self.rnn = L.NStepBiGRU(1, n_emb_units, n_emb_units, 0.5)
            self.l_out_1 = L.Linear(32)
            self.l_out_2 = L.Linear(1)

    def get_hidden_layer(self, xs):
        exs = sequence_embed(self.embed, xs)
        hx, _ = self.rnn(None, exs)
        hx = F.transpose(hx, [1, 0, 2])
        out_1 = self.l_out_1(hx)
        out_1 = F.relu(out_1)
        return out_1

    def predict(self, xs):
        out_1 = self.get_hidden_layer(xs)
        out_2 = self.l_out_2(out_1)
        out_2 = F.sigmoid(out_2)
        return out_2

    def __call__(self, X, t):
        out = self.predict(X)

        # Binary cross-entropy between predicted and target sentiment,
        # averaged over the batch
        batch_size = len(X)
        loss = -F.matmul(F.log(out.T), t) + -F.matmul(F.log(1. - out.T), 1. - t)
        loss = loss / batch_size

        report({
            'loss': loss.data[0, 0]
        }, self)

        return loss

model = RNNEncoder(len(vocab), 200)
max_epoch = 10

train_iter = SerialIterator(ds, batch_size=50, repeat=False, shuffle=True)

with open('vocab.json', 'w') as fp:
    json.dump(vocab, fp)

optimizer = optimizers.Adam()
optimizer.setup(model)
updater = training.StandardUpdater(train_iter, optimizer, converter=converter)
trainer = training.Trainer(updater, (max_epoch, 'epoch'), out='out')
trainer.extend(extensions.LogReport(trigger=(1, 'iteration')))
trainer.extend(extensions.PrintReport(['epoch', 'iteration', 'main/loss', 'elapsed_time'])) #, 'validation/main/loss'
trainer.extend(extensions.PlotReport(['main/loss'], 'iteration', file_name='loss.png', trigger=(1, 'iteration')))
trainer.extend(extensions.snapshot_object(model, 'model_iter_{.updater.iteration}'), trigger=(50, 'iteration'))
trainer.run()

And now we can train it! The following loss plot was generated:

I ran it until there was no more unseen data available (so for 1 epoch). I would likely have gotten better results if I had collected more data, and the noise in the loss could be reduced by using larger batch sizes (see the sketch below).
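For reference, a variation of the iterator with larger batches and repeated passes over the data (so that the trainer actually runs for all max_epoch epochs) could look like this; this is my own tweak, not part of the original setup:

# Possible variation: larger batches and repeated passes over the dataset.
train_iter = SerialIterator(ds, batch_size=200, repeat=True, shuffle=True)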

Using the model

Now the model can compute the sentiment for a given sentence. It can also be used to predict the sentiment for a given news headline. The following sentiment is found using the model (on a sample of a few news headlines):

Translated into English:

As you can see, these results are correct (1.00 means maximum positive sentiment and 0.00 means maximum negative sentiment). However, there are some major drawbacks: the sentence-level sentiment does not tell you anything about the sentiment towards each entity mentioned in the headline. It would be interesting to classify the sentiment per entity.
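The predictions above are shown as screenshots; for reference, here is a minimal sketch of how such headline scores could be computed with the trained model. The usage is assumed from the training code, the Dutch example headlines are made up, and the snapshot file name is hypothetical.

import chainer
from chainer import serializers

# Optionally restore a snapshot written by the snapshot_object extension;
# 'out/model_iter_50' is a hypothetical file name.
# serializers.load_npz('out/model_iter_50', model)

# Made-up Dutch headlines; in practice these would come from news.csv.
headlines = [
    'Bedrijf boekt recordwinst en neemt honderden mensen aan',
    'Faillissement dreigt voor winkelketen',
]

with chainer.using_config('train', False), chainer.no_backprop_mode():
    xs = [preprocess(headline) for headline in headlines]
    scores = model.predict(xs)

for headline, score in zip(headlines, scores.data.reshape(-1)):
    print('%.2f  %s' % (score, headline))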

Conclusions (TL;DR)

With some effort, it is possible to detect sentiment in news articles (in any language). One improvement of the model would be to compute sentiment per entity, but that is left as future work.

".,:;?!()\r\n&-+ ') df_review.rating = pd.to_numeric(df_review.rating, errors='coerce') df_review = df_review.dropna() df_review = df_review.assign(sentiment=df_review.rating.apply(lambda x: (x - 1.0) / 9.0)) df_review.feedback = df_review.feedback.apply(unidecode) df_review = df_review.assign(feedback_lowered=df_review.feedback.str.lower()) df_review = df_review.assign(nonalphabet_chars=df_review.feedback_lowered.apply(lambda x: re.sub('[%s]' % re.escape(''.join(alphabet)), '', x))) df_review = df_review[df_review.nonalphabet_chars.apply(len) == 0] df_review = df_review.assign(words=df_review.feedback_lowered.apply(nltk.word_tokenize)) df_review = df_review.assign(target=df_review.sentiment.apply(lambda x: 0.0 if x < 0.25 else 0.5 if x <= 0.75 else 1.0))

Then, the vocabulary (the alphabet) is defined. An “UNK” marker is added for unknown characters (which will in fact never occur due to the preprocessing). Then, the characters are converted to identifiers:

1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np

words = sorted(list(set(word for words in df_review.words.values.tolist() for word in words)))
words = ['UNK'] + words
vocab = {word: i for i, word in enumerate(words)}

def preprocess(text):
 text = text.lower()
 words = nltk.word_tokenize(text)
 return np.asarray([vocab.get(word, vocab.get('UNK')) for word in words], dtype='i')

df_review = df_review.assign(ids=df_review.feedback.apply(preprocess))
df_review

Creating the datasetThe next step is to create an iterator and converter for the data. The iterator is a dataset object in Chainer. The dataset has a len() method which computes the number of items in the dataset and the get_example(i) method fetches the ith example. The converter is used to generate batches from the dataset iterator. This results into the following code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
from chainer import dataset

class DutchSentimentDataset(dataset.DatasetMixin):
 def __init__(self, df):
 self.df = df
 
 def __len__(self):
 return len(self.df)

 def get_example(self, i):
 item = self.df.iloc[i]
 return {
 'X': item.ids,
 't': item.sentiment
 }

def converter(batch, device=None):
 return {
 'X': [item['X'] for item in batch],
 't': np.hstack([item['t'] for item in batch]).reshape((-1, 1)),
 }
 
ds = DutchSentimentDataset(df_review_balanced[['sentiment', 'ids']])
batch = [ds.get_example(_) for _ in range(5)]
converter(batch)

This gives the following output:

As you can see, this is a batch consisting of 5 items. All the items have negative sentiment (< 0.25). The character identifiers per opinion are shown.

The model code and training process

Now it is time to define the model and train it! This is done with the following Chainer code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
from chainer.iterators import SerialIterator
import numpy as np
import chainer
import chainer.links as L
import chainer.functions as F
from chainer import optimizers
from chainer import training
from chainer import report
from chainer.training import extensions
import json

def sequence_embed(embed, xs):
 x_len = [len(x) for x in xs]
 x_section = np.cumsum(x_len[:-1])
 ex = embed(F.concat(xs, axis=0))
 exs = F.split_axis(ex, x_section, 0)
 return exs

class RNNEncoder(chainer.Chain):

 def __init__(self, n_source_vocab, n_emb_units):
 super(RNNEncoder, self).__init__()
 with self.init_scope():
 self.embed = L.EmbedID(n_source_vocab, n_emb_units)
 self.rnn = L.NStepBiGRU(1, n_emb_units, n_emb_units, 0.5)
 self.l_out_1 = L.Linear(32)
 self.l_out_2 = L.Linear(1)
 
 def get_hidden_layer(self, xs):
 exs = sequence_embed(self.embed, xs)
 hx, _ = self.rnn(None, exs)
 hx = F.transpose(hx, [1, 0, 2])
 out_1 = self.l_out_1(hx)
 out_1 = F.relu(out_1)
 return out_1
 
 def predict(self, xs):
 out_1 = self.get_hidden_layer(xs)
 out_2 = self.l_out_2(out_1)
 out_2 = F.sigmoid(out_2)
 return out_2
 
 def __call__(self, X, t):
 out = self.predict(X)
 
 batch_size = len(X)
 loss = -F.matmul(F.log(out.T), t) + -F.matmul(F.log(1. - out.T), 1. - t)
 loss = loss / batch_size
 
 report({
 'loss': loss.data[0, 0]
 }, self)
 
 return loss

model = RNNEncoder(len(vocab), 200)
max_epoch = 10

train_iter = SerialIterator(ds, batch_size=50, repeat=False, shuffle=True)

with open('vocab.json', 'w') as fp:
 json.dump(vocab, fp)

optimizer = optimizers.Adam()
optimizer.setup(model)
updater = training.StandardUpdater(train_iter, optimizer, converter=converter)
trainer = training.Trainer(updater, (max_epoch, 'epoch'), out='out')
trainer.extend(extensions.LogReport(trigger=(1, 'iteration')))
trainer.extend(extensions.PrintReport(['epoch', 'iteration', 'main/loss', 'elapsed_time'])) #, 'validation/main/loss'
trainer.extend(extensions.PlotReport(['main/loss'], 'iteration', file_name='loss.png', trigger=(1, 'iteration')))
trainer.extend(extensions.snapshot_object(model, 'model_iter_{.updater.iteration}'), trigger=(50, 'iteration'))
trainer.run()

And now we can train it! The following loss plot was generated:

I run it until there is no more unseen data available (so for 1 epoch). Sadly, I could get better results if I had collected more data. The loss noise can be reduced if I use larger batch sizes.

Using the model

Now the model can compute the sentiment for a given sentence. It can also be used to predict the sentiment for a given news headline. The following sentiment is found using the model (on a sample of a few news headlines):

Translated into English:

As you can see, these results are correct (1.00 means maximum positive sentiment and 0.00 means maximum negative sentiment). However, there are some major drawbacks. The sentence sentiment does not tell you anything about the sentiment per entity. It would be interesting to classify the sentiment per entity.

Conclusions (TL;DR)

With some effort, it is possible to detect sentiment in news articles (in any language). One improvement of the model is to compute sentiment per entity, but that is left as future work.