In this machine learning fraud detection tutorial, I will elaborate how got I started on the Credit Card Fraud Detection competition on Kaggle. The goal of the task is to automatically identify fraudulent credit card transactions using Machine Learning. My Pythonic approach is explained step-by-step.
Introduction
The PwC global economic crime survey of 2016 suggests that approximately 36% of organizations experienced economic crime. Therefore, there is definitely a need to solve the problem of credit card fraud detection. The task of fraud detection often boils down to outlier detection, in which a dataset is scanned through to find potential anomalies in the data. In the past, this was done by employees which checked all transactions manually. With the rise of machine learning, artificial intelligence, deep learning and other relevant fields of information technology, it becomes feasible to automate this process and to save some of the intensive amount of labor that is put into detecting credit card fraud. In the following sections, my machine learning based Pythonic approach is explained.
Setup
In this article, I use Python 3. Execute the following line to install all the dependencies:
1 |
|
The following list gives an overview of what all the dependencies do:
- Pandas is a library which allows you to perform common statistical operations on your data and quickly skim through your dataset.
- The scikit-learn library has a lot of out-of-the-box Machine Learning algorithms. This is great for testing some simple models.
Gather and First Glance at the Data
The very first step is to gather the data. Download creditcard.csv into your Python project folder. Now we can take a quick look at the data using Pandas. Make a file named explore.py with the following contents:
1 |
|
For the Pandas newcomers, pd is a commonly used abbreviation for Pandas and df is a commonly used abbreviation for DataFrame. A DataFrame is one of the main data structures in the Pandas library.
If the code is executed, the following is shown:
1 |
|
Model
The next step is to read the data:
1 |
|
Now we have the data in place, we can select the features we would like to use during the training:
1 |
|
Notice that some of the variables have a wide range of values (like the Amount variable). In order to get all variables in an equivalent range, we subtract the mean and divide by the standard deviation such that the distribution of the values is normalized:
1 |
|
Okay, now it is time for some action! We will first define the model (the Logistic Regression model) and then loop through a train and test set which have approximately the same Class distribution. First, we train the model by the train set and then valide the results with the test set. The StratisfiedShuffleSplit makes sure that the Class variable has roughly the same distribution in both the train set and the test set. The random state specification makes sure that the result is deterministic: in other words, we will get the same results if we would run the analysis again. The normalization is done for both the train set and test set. If this was done before the split, some information of the test set would be used in the normalization of the train set and this is not fair since the test set is not completely unseen then. The following code does the job:
1 |
|
This results into the following:
This is actually a great result! The 0 classes (transactions without fraud) are predicted with 100% precision and recall. It has some issues with detecting the 1 classes (transactions which are fraudulent). It can predict fraud with 88% precision. This means that 12% of the transactions which are fraudulent remain undetected by the system. But, 88% is still quite good!
Conclusion (TL;DR)
This machine learning fraud detection tutorial showed how to tackle the problem of credit card fraud detection using machine learning. It is fairly easy to come up with a simple model, implement it in Python and get great results for the Credit Card Fraud Detection task on Kaggle. If you have any questions or comments, feel free to ask in the comment section below! If you are interested in another solution to another Kaggle competition about house prices, definitely check out this article.