Machine Learning Fraud Detection： A Simple Machine Learning Approach

In this machine learning fraud detection tutorial, I will elaborate how got I started on the Credit Card Fraud Detection competition on Kaggle. The goal of the task is to automatically identify fraudulent credit card transactions using Machine Learning. My Pythonic approach is explained step-by-step.

Introduction

The PwC global economic crime survey of 2016 suggests that approximately 36% of organizations experienced economic crime. Therefore, there is definitely a need to solve the problem of credit card fraud detection. The task of fraud detection often boils down to outlier detection, in which a dataset is scanned through to find potential anomalies in the data. In the past, this was done by employees which checked all transactions manually. With the rise of machine learning, artificial intelligence, deep learning and other relevant fields of information technology, it becomes feasible to automate this process and to save some of the intensive amount of labor that is put into detecting credit card fraud. In the following sections, my machine learning based Pythonic approach is explained.

Setup

In this article, I use Python 3. Execute the following line to install all the dependencies:

1	`pip install pandas==0.19.2 scikit-learn==0.18.1`

The following list gives an overview of what all the dependencies do:

Pandas is a library which allows you to perform common statistical operations on your data and quickly skim through your dataset.
The scikit-learn library has a lot of out-of-the-box Machine Learning algorithms. This is great for testing some simple models.

Gather and First Glance at the Data

The very first step is to gather the data. Download creditcard.csv into your Python project folder. Now we can take a quick look at the data using Pandas. Make a file named explore.py with the following contents:

import pandas as pd

# Read the CSV file
df = pd.read_csv('creditcard.csv')

# Show the contents
print(df)

For the Pandas newcomers, pd is a commonly used abbreviation for Pandas and df is a commonly used abbreviation for DataFrame. A DataFrame is one of the main data structures in the Pandas library.

If the code is executed, the following is shown:

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import pandas as pd

Model

The next step is to read the data:

1	`data = pd.read_csv('creditcard.csv')`

Now we have the data in place, we can select the features we would like to use during the training:

# Only use the 'Amount' and 'V1', ..., 'V28' features
features = ['Amount'] + ['V%d' % number for number in range(1, 29)]

# The target variable which we would like to predict, is the 'Class' variable
target = 'Class'

# Now create an X variable (containing the features) and an y variable (containing only the target variable)
X = data[features]
y = data[target]

Notice that some of the variables have a wide range of values (like the Amount variable). In order to get all variables in an equivalent range, we subtract the mean and divide by the standard deviation such that the distribution of the values is normalized:

def normalize(X):
 Make the distribution of the values of each variable similar by subtracting the mean and by dividing by the standard deviation.
 """
 for feature in X.columns:
 X[feature] -= X[feature].mean()
 X[feature] /= X[feature].std()
 return X

Okay, now it is time for some action! We will first define the model (the Logistic Regression model) and then loop through a train and test set which have approximately the same Class distribution. First, we train the model by the train set and then valide the results with the test set. The StratisfiedShuffleSplit makes sure that the Class variable has roughly the same distribution in both the train set and the test set. The random state specification makes sure that the result is deterministic: in other words, we will get the same results if we would run the analysis again. The normalization is done for both the train set and test set. If this was done before the split, some information of the test set would be used in the normalization of the train set and this is not fair since the test set is not completely unseen then. The following code does the job:

# Define the model
model = LogisticRegression()

# Define the splitter for splitting the data in a train set and a test set
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)

# Loop through the splits (only one)
for train_indices, test_indices in splitter.split(X, y):
 # Select the train and test data
 X_train, y_train = X.iloc[train_indices], y.iloc[train_indices]
 X_test, y_test = X.iloc[test_indices], y.iloc[test_indices]
 
 # Normalize the data
 X_train = normalize(X_train)
 X_test = normalize(X_test)
 
 # Fit and predict!
 model.fit(X_train, y_train)
 y_pred = model.predict(X_test)
 
 # And finally: show the results
 print(classification_report(y_test, y_pred))

This results into the following:

This is actually a great result! The 0 classes (transactions without fraud) are predicted with 100% precision and recall. It has some issues with detecting the 1 classes (transactions which are fraudulent). It can predict fraud with 88% precision. This means that 12% of the transactions which are fraudulent remain undetected by the system. But, 88% is still quite good!

Conclusion (TL;DR)

This machine learning fraud detection tutorial showed how to tackle the problem of credit card fraud detection using machine learning. It is fairly easy to come up with a simple model, implement it in Python and get great results for the Credit Card Fraud Detection task on Kaggle. If you have any questions or comments, feel free to ask in the comment section below! If you are interested in another solution to another Kaggle competition about house prices, definitely check out this article.