Learn How To Implement a Simple E-mail Spam Detector in Python

Nowadays, many e-mail are being sent and many of them are spam e-mails. In this tutorial, I will guide you through the steps of building a small spam detection application (written in Python).

Many data scientist prefer to use this Python library: Scikit-learn. This library is capable of doing all kinds of Machine Learning magic. So, make sure that you install this library first.

Introduction

The spam detection problem is in fact a text classification problem. An e-mail (a text document) is either “spam” or “no spam”. In text mining, this is called single-label text classification, since there is only one label: “spam”. A classifier is an algorithm that is capable of telling whether a text document is either “spam” or “no spam”. In this article, we first setup a small dataset on which we will train a classifier. Then, we will create a test dataset and view the results. Before we can do all of this, we need to extract features from the texts.

Feature extraction

Take a look at the following e-mails:

$1,000 ALARMING!An appointment on July 2ndNew course content for Text ClassificationBUY VIAGRA!!

1
2
import numpy as np
from sklearn.linear_model import SGDClassifier

So, the burning question is which features should we use? It seems that e-mails with many “!”, “$” or capitals are probably spam e-mails. So, lets create 3 features and lets create our training and test set. The feature vector consists of three entries where the first entry is 1 if there is a “!” in the text (a 0 otherwise). The second entry is 1 if there is a “$” in the text (0 otherwise) and the last entry is the ratio of uppercase characters with respect to the sum of the number of uppercase characters and the number of lowercase characters.

1
2
3
4
5
6
7
# Define a training set
training_data = ["$1,000 ALARMING!", "An appointment on July 2nd", "New course content for Text Classification", "BUY VIAGRA!!"]
training_isspam = [True, False, False, True]

# Define the testing set
testing_data = ["New course content for Information Retrieval", "MAkE $$$!", "Grades available for Text Mining", "SELL HOUSE FOR $1,000,000!!"]
testing_isspam = [False, True, False, True]

For the feature extraction, we can write the following method:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
def extract_features(text):
 """
    Extract features from a given text.

    :param text: Text to extract features for.
    :return: A vector where:
                    - The 0th element is 1 if there is a "!" inside the text (0 otherwise).
                    - The 1th element is 1 if there is a "$" inside the text (0 otherwise).
                    - The 2nd element is the ratio of uppercase characters with respect to the sum of all uppercase and
                      lowercase characters.
    """
 features = np.zeros((3,))
 if "!" in text:
 features[0] = 1
 if "$" in text:
 features[1] = 1
 # A list consisting of lowercase characters
 lowercase = list('abcdefghijklmnopqrstuvwxyz')
 # A list consisting of uppercase characters
 uppercase = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
 # Set the counts of lowercase and uppercase characters to 0
 num_lowercase = 0
 num_uppercase = 0
 # And count the lowercase and uppercase characters
 for char in text:
 if char in lowercase:
 num_lowercase += 1
 elif char in uppercase:
 num_uppercase += 1
 # Define the third feature as the ratio of uppercase characters
 features[2] = num_uppercase / (num_lowercase + num_uppercase)
 return features
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Make an array of features where the ith row corresponds to the ith documents and the columns correspond to the features
training_features = np.vstack([extract_features(training_data[i]) for i in range(len(training_data))])

# Make an Stochastic Gradient Descent classifier
clf = SGDClassifier()
# And fit it to the training set
clf.fit(training_features, training_isspam)

# Predict the labels of the test set
for test_index in range(len(testing_data)):
 features = extract_features(testing_data[test_index])
 print("Test case:")
 print(20 * '=')
 print('Text:', 4 * "t", testing_data[test_index])
 print('Features:', 3 * "t", features)
 print('Predicted is spam:', "t", clf.predict(features))
 print('Is spam:', 3 * "t", testing_isspam[test_index])
 print('')

You might also be interested in text classification using Neural Networks. This article explains what neural networks are and this article shows an advanced implementation of a neural network in Python.

Results

After executing the code, we get the following results:

As you can see, the classifier does the right thing! Now we can classify little spammy e-mail messages.

Exercise

Try to implement more features, like the words used in text messages. For example, the word “VIAGRA” is often used in spam e-mails. Try to classify some more text messages. If you need any help, you can send me a (not so spammy) e-mail message.