Scrape Tweets from Twitter using Python and Tweepy

This tutorial guides you through setting up a system for collecting Tweets, not with Apache Spark or Apache Flink, but with plain Python and Tweepy. In many use cases, a single computing node can collect enough Tweets to draw decent conclusions. In future blog posts, I will explain how to collect Tweets using a cluster (with either Apache Spark or Apache Flink). But for now, let's focus on a simple Pythonic harvester! If you are interested in scraping a website, you should definitely read this article.

Tweets are extremely useful for gathering the opinions of thousands of people on a particular topic over time. Sadly, Twitter has revoked access to old Tweets (however, this Python package can still retrieve them by making use of Twitter's search functionality). Therefore, many developers harvest Tweets through Twitter's Streaming API and store them on their own computing nodes. If you have enough computing nodes, you could consider collecting Tweets with a cluster and cluster software, such as Apache Spark or Apache Flink. But if you have a small-scale project, one Python script will be enough. In this tutorial, we will build a small Python script for retrieving and storing Tweets from the Streaming API.

Setting up an account

The first thing we need is an access token for the Twitter API. You can obtain one by visiting apps.twitter.com. Sadly, Twitter's policies and terms of service change frequently, so it is hard to keep a description of the sign-up process up to date.
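
Once you have the four credentials, you may prefer not to paste them directly into a script that could end up in version control. The following is a minimal sketch of one alternative, reading them from environment variables. The variable names (TWITTER_CONSUMER_KEY and so on) and the helper load_credentials are assumptions made for this illustration, not something Twitter or Tweepy prescribes.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
CREDENTIAL_VARS = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

The keys of the returned dict match the variable names used in the script in the next section, so creds['consumer_key'] can stand in for consumer_key there.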

Writing the code

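First, make sure you have installed the Python package tweepy. Note that the script uses tweepy.StreamListener, which exists in Tweepy 3.x; Tweepy 4.0 merged that class into tweepy.Stream. Now we can start writing our code! We will fetch Tweets containing "Python":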

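import json

import tweepy  # Written for Tweepy 3.x, which provides StreamListener

# Specify the account credentials in the following variables:
consumer_key = 'INSERT CONSUMER KEY HERE'
consumer_secret = 'INSERT CONSUMER SECRET HERE'
access_token = 'INSERT ACCESS TOKEN HERE'
access_token_secret = 'INSERT ACCESS TOKEN SECRET HERE'


# This listener will print out all Tweets it receives
class PrintListener(tweepy.StreamListener):
    def on_data(self, data):
        # Decode the JSON data
        tweet = json.loads(data)

        # The stream also sends non-Tweet messages (such as delete
        # notices) that have no user or text; skip those
        if 'user' not in tweet or 'text' not in tweet:
            return True

        # Drop non-ASCII characters, then decode back to str so that
        # Python 3 does not print a bytes literal (b'...')
        text = tweet['text'].encode('ascii', 'ignore').decode('ascii')

        # Print out the Tweet
        print('@%s: %s' % (tweet['user']['screen_name'], text))
        return True

    def on_error(self, status):
        # Print the HTTP status code of the error
        print(status)

        # 420 means we are being rate limited; returning False
        # disconnects the stream instead of retrying
        if status == 420:
            return False


if __name__ == '__main__':
    listener = PrintListener()

    # Show system message
    print('I will now print Tweets containing "Python"! ==>')

    # Authenticate
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    # Connect the stream to our listener
    stream = tweepy.Stream(auth, listener)
    stream.filter(track=['Python'])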

Collecting Tweets

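The comments in the code describe each step. In short, the listener decodes every incoming message from JSON and prints the author's screen name together with the Tweet text, while the main block authenticates with the four credentials and opens a filtered stream that tracks the keyword "Python".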
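
The script prints Tweets rather than keeping them. As a rough sketch of the storing step, the helper below appends each Tweet to a JSON Lines file (one JSON object per line). The function names store_tweet and load_tweets and the default file name tweets.jsonl are assumptions made for this example; a database or any other store would work just as well.

```python
import json


def store_tweet(raw_data, path='tweets.jsonl'):
    """Append one Tweet to a JSON Lines file; return True if it was stored."""
    message = json.loads(raw_data)

    # Skip stream messages that are not Tweets (no text or no user)
    if 'text' not in message or 'user' not in message:
        return False

    # One JSON object per line keeps the file easy to append to and to read back
    with open(path, 'a', encoding='utf-8') as handle:
        handle.write(json.dumps(message, ensure_ascii=False) + '\n')
    return True


def load_tweets(path='tweets.jsonl'):
    """Read all stored Tweets back as a list of dicts."""
    with open(path, encoding='utf-8') as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

To use it, call store_tweet(data) from the listener's on_data method, either instead of or in addition to the print statement.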

Conclusion

It was fairly easy to set up a Tweet harvester! You only need to sign up on Twitter and write a few lines of code using Python and Tweepy. If you have any questions or comments on this article, please leave a comment below!