In this article, we will build a basic news search engine that can find news articles by keyword. Since this is a fairly complex system, we will first split it up into smaller modules. The first module retrieves news pages from the internet; it is called a scraper (or web scraper) and is written in Python. The search engine also maintains a file called the index, which stores a list of documents per keyword. For example, if several documents contain the term “music”, the index stores the term “music” together with references to all documents that contain the word “music”. But let us first start with the scraper.
Web scraper
For the web scraper, we will use a queue that contains the pages that still have to be scraped. Once a page has been handled, it is added to a list of visited pages, so that every page is handled only once. There are several difficulties you will encounter during web scraping. One of them is that URLs with a hash fragment are (statically) equivalent to URLs without one. For example, http://domain.com/#hashpart points to the same page as http://domain.com/. This hash part can be removed easily:
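A minimal helper for this could look as follows (the function name remove_hash is just one possible choice):

def remove_hash(url):
    # Everything from the first '#' onwards does not change the page
    # that the server returns, so it can simply be dropped.
    return url.split('#')[0]

print(remove_hash('http://domain.com/#hashpart'))  # http://domain.com/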
Another issue is that there are two types of links: relative and absolute. A relative link does not start with the full domain name. For example, http://domain.com/page is an absolute URL and /page is a relative URL. If the base URL is http://domain.com/, then the absolute URL and the relative URL point to the same page. If the base URL were http://test.com/, then the relative URL would resolve to the absolute URL http://test.com/page. Making relative links absolute can be done with the following method:
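A simple way to do this in Python is to use urljoin from the standard library; the wrapper function name make_absolute below is my own choice:

from urllib.parse import urljoin

def make_absolute(base_url, link):
    # urljoin leaves absolute links untouched and resolves
    # relative ones against the given base URL.
    return urljoin(base_url, link)

print(make_absolute('http://test.com/', '/page'))                    # http://test.com/page
print(make_absolute('http://test.com/', 'http://domain.com/page'))   # http://domain.com/page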
Suppose we want to scrape http://domain.com/ and http://www.test.com/. Then we do not want to follow links starting with http://external.com/. Here, http://domain.com/ and http://www.test.com/ are called base URLs. The following code goes through a given list of URLs and checks whether they are internal, i.e. whether they start with one of the URLs in a list called base_urls:
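A sketch of such a filter (the function name filter_internal is an assumption) could look like this:

def filter_internal(urls, base_urls):
    # Keep only the URLs that start with one of the base URLs.
    internal = []
    for url in urls:
        if any(url.startswith(base) for base in base_urls):
            internal.append(url)
    return internal

base_urls = ['http://domain.com/', 'http://www.test.com/']
print(filter_internal(['http://domain.com/page', 'http://external.com/'], base_urls))
# ['http://domain.com/page']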
The implementation of the queue is straightforward. All of this functionality is combined into one class called Scraper:
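The following is a sketch of how such a Scraper class could look, using only the Python standard library; the helper names (remove_hash, is_internal, add_to_index, find) and the use of HTMLParser for link and text extraction are my own assumptions, not necessarily the original implementation:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
import re

class LinkAndTextParser(HTMLParser):
    """Collects link targets (href attributes) and visible text from a page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)

class Scraper:
    """Scrapes pages breadth-first and fills a keyword -> URLs index."""
    def __init__(self, base_urls):
        self.base_urls = base_urls
        self.queue = deque(base_urls)  # pages that still have to be scraped
        self.visited = set()           # pages that have already been handled
        self.index = {}                # token -> set of URLs containing it

    def remove_hash(self, url):
        # The hash part does not change the page, so drop it.
        return url.split('#')[0]

    def is_internal(self, url):
        # Only URLs starting with one of the base URLs are followed.
        return any(url.startswith(base) for base in self.base_urls)

    def add_to_index(self, url, text):
        # Very simple tokenizer: every word becomes a key in the index.
        for token in re.findall(r'\w+', text):
            self.index.setdefault(token, set()).add(url)

    def find(self, keyword):
        # Return the URLs of all documents that contain the given keyword.
        return self.index.get(keyword, set())

    def scrape(self):
        while self.queue:
            url = self.remove_hash(self.queue.popleft())
            if url in self.visited:
                continue
            self.visited.add(url)
            print('Scraping', url)
            try:
                html = urlopen(url).read().decode('utf-8', errors='ignore')
            except OSError:
                continue  # skip pages that cannot be fetched
            parser = LinkAndTextParser()
            parser.feed(html)
            self.add_to_index(url, ' '.join(parser.text))
            for link in parser.links:
                link = self.remove_hash(urljoin(url, link))
                if self.is_internal(link) and link not in self.visited:
                    self.queue.append(link)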
Now it is easy to call the scraper:
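With the sketch above, a call could look like this (the start URLs are placeholders; use the news sites you want to index):

scraper = Scraper(['http://domain.com/', 'http://www.test.com/'])
scraper.scrape()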
Running this scrapes the given pages one by one and prints each URL as it is handled; the exact output depends on which pages are reachable from the start URLs.
And now everything works together! In order to find sport-related URLs, you just have to use the following piece of code:
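Assuming a find method like the one sketched above, the query is a single index lookup:

print(scraper.find('sport'))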
If everything works as expected, the result is the URL of the sport-related page. So our basic news search engine has indeed found the correct URL!
Improvements
What are the next steps? Consider the tokenizer. If you really want to do a good job, dive into a text mining book and learn about tokenization. A simple improvement (but this is discussion material) is to make all tokens lowercase. Then both “Cat” and “cat” are found when searching for “Cat”. The scraper also has some problems. A website could implement so-called spider traps, in which the crawler gets trapped following an infinite number of links. Luckily, there are many open source crawlers that implement features to avoid this.
If you like mathematics and are interested in word vectors, you can read more about them here.
Implementation
The full implementation of the news search engine can be found on GitHub.
Exercises
Try to implement a tokenizer that makes all tokens lowercase. Also try to implement a better “find” method that can find multiple words as well (hint: use the tokenizer again!).