In this article, we will build a basic news search engine that can find news articles by keyword. Since this is a fairly complex system, we will first split it up into smaller modules. The first module retrieves news pages from the internet; it is called a scraper (or web scraper) and is written in Python. The search engine also maintains a file called the index, which stores a list of documents per keyword. For example, if several documents contain the term “music”, the index stores the term “music” together with references to all documents that contain the word “music”. But let us first start with the scraper.
Web scraper
For the web scraper, we will use a queue that contains the pages that still have to be scraped. Once a page has been handled, it is added to a list of visited pages, so that every page is handled only once. There are several difficulties you will encounter during web scraping. One of them is that URLs with a hash fragment are (statically) equivalent to URLs without one. For example, http://domain.com/#hashpart points to the same page as http://domain.com/. This hash part can be removed easily:
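A minimal helper for this could look as follows (the function name remove_hash is just one possible choice):

def remove_hash(url):
    # Everything from the first '#' onwards does not change the page
    # that the server returns, so it can simply be dropped.
    return url.split('#')[0]

print(remove_hash('http://domain.com/#hashpart'))  # http://domain.com/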
Another issue is that there are two types of links: relative and absolute. A relative link does not start with the full domain name. For example, http://domain.com/page is an absolute URL and /page is a relative URL. If the base URL is http://domain.com/, then the absolute URL and the relative URL point to the same page. If the base URL were http://test.com/, then the relative URL would resolve to the absolute URL http://test.com/page. Making relative links absolute can be done with the following method:
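A simple way to do this in Python is to use urljoin from the standard library; the wrapper function name make_absolute below is my own choice:

from urllib.parse import urljoin

def make_absolute(base_url, link):
    # urljoin leaves absolute links untouched and resolves
    # relative ones against the given base URL.
    return urljoin(base_url, link)

print(make_absolute('http://test.com/', '/page'))                    # http://test.com/page
print(make_absolute('http://test.com/', 'http://domain.com/page'))   # http://domain.com/page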
Suppose we want to scrape http://domain.com/ and http://www.test.com/. Then we do not want to follow links starting with http://external.com/. Here, http://domain.com/ and http://www.test.com/ are called base URLs. The following code goes through a given list of URLs and checks whether they are internal, i.e. whether they start with one of the URLs in a list called base_urls:
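A sketch of such a filter (the function name filter_internal is an assumption) could look like this:

def filter_internal(urls, base_urls):
    # Keep only the URLs that start with one of the base URLs.
    internal = []
    for url in urls:
        if any(url.startswith(base) for base in base_urls):
            internal.append(url)
    return internal

base_urls = ['http://domain.com/', 'http://www.test.com/']
print(filter_internal(['http://domain.com/page', 'http://external.com/'], base_urls))
# ['http://domain.com/page']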
The implementation of the queue is straightforward. All of this functionality is combined into one class called Scraper:
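The following is a sketch of how such a Scraper class could look, using only the Python standard library; the helper names (remove_hash, is_internal, add_to_index, find) and the use of HTMLParser for link and text extraction are my own assumptions, not necessarily the original implementation:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
import re

class LinkAndTextParser(HTMLParser):
    """Collects link targets (href attributes) and visible text from a page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)

class Scraper:
    """Scrapes pages breadth-first and fills a keyword -> URLs index."""
    def __init__(self, base_urls):
        self.base_urls = base_urls
        self.queue = deque(base_urls)  # pages that still have to be scraped
        self.visited = set()           # pages that have already been handled
        self.index = {}                # token -> set of URLs containing it

    def remove_hash(self, url):
        # The hash part does not change the page, so drop it.
        return url.split('#')[0]

    def is_internal(self, url):
        # Only URLs starting with one of the base URLs are followed.
        return any(url.startswith(base) for base in self.base_urls)

    def add_to_index(self, url, text):
        # Very simple tokenizer: every word becomes a key in the index.
        for token in re.findall(r'\w+', text):
            self.index.setdefault(token, set()).add(url)

    def find(self, keyword):
        # Return the URLs of all documents that contain the given keyword.
        return self.index.get(keyword, set())

    def scrape(self):
        while self.queue:
            url = self.remove_hash(self.queue.popleft())
            if url in self.visited:
                continue
            self.visited.add(url)
            print('Scraping', url)
            try:
                html = urlopen(url).read().decode('utf-8', errors='ignore')
            except OSError:
                continue  # skip pages that cannot be fetched
            parser = LinkAndTextParser()
            parser.feed(html)
            self.add_to_index(url, ' '.join(parser.text))
            for link in parser.links:
                link = self.remove_hash(urljoin(url, link))
                if self.is_internal(link) and link not in self.visited:
                    self.queue.append(link)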
Now it is easy to call the scraper:
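With the sketch above, a call could look like this (the start URLs are placeholders; use the news sites you want to index):

scraper = Scraper(['http://domain.com/', 'http://www.test.com/'])
scraper.scrape()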
Running this scrapes the given pages one by one and prints each URL as it is handled; the exact output depends on which pages are reachable from the start URLs.
And now everything works together! In order to find sport-related URLs, you just have to use the following piece of code:
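Assuming a find method like the one sketched above, the query is a single index lookup:

print(scraper.find('sport'))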
If everything works as expected, the result is the URL of the sport-related page. So our basic news search engine has indeed found the correct URL!
Improvements
What are the next steps? Consider the tokenizer. If you really want to do a good job, dive into a text mining book and learn about tokenization. A simple improvement (but this is discussion material) is to make all tokens lowercase. Then both “Cat” and “cat” are found when searching for “Cat”. The scraper also has some problems. A website could implement so-called spider traps, in which the crawler gets trapped following an infinite number of links. Luckily, there are many open source crawlers that implement features to avoid this.
If you like mathematics and are interested in word vectors, you can read more about them here.
Implementation
The full implementation of the news search engine can be found on GitHub.
Exercises
Try to implement a tokenizer that makes all tokens lowercase. Also try to implement a better “find” method that can find multiple words as well (hint: use the tokenizer again!).