Motivation
There are dozens of machine learning algorithms out there. It is impossible to learn all their mechanics; however, many algorithms sprout from the most established ones, e.g. ordinary least squares, gradient boosting, support vector machines, tree-based algorithms and neural networks. At STATWORX we discuss algorithms daily to evaluate their usefulness for a specific project. In any case, understanding these core algorithms is the key to following most machine learning algorithms in the literature.
While I like reading machine learning research papers, the maths is sometimes hard to follow. That is why I like implementing the algorithms in R by myself. Of course, this means digging through the maths and the algorithms anyway, but you get to challenge your understanding of each algorithm directly.
In my two subsequent blog posts I will introduce two machine learning algorithms in 150 lines of R code. This blog post is about regression trees, which are the foundation of most tree-based algorithms. You can find the other blog post, about coding gradient boosted machines from scratch, on our blog. The implementations will cover all core mechanics while staying very generic. You can find all the code on my GitHub.
Gathering all puzzle pieces
Surely, there are tons of great articles out there which explain regression trees theoretically, accompanied by a hands-on example, but that is not the objective of this blog post. If you are interested in a hands-on tutorial with all the necessary theory, I strongly recommend this tutorial. The objective of this blog post is to establish the theory of the algorithm by writing simple R code. You do not need any prior knowledge of the algorithm to follow. The only thing you need to know is our objective: we want to estimate our real-valued target (y) with a set of real-valued features (X).
Most probably you are already familiar with decision trees, a machine learning algorithm for solving classification tasks. As the name states, regression trees instead solve regression problems, i.e. estimation with continuously scaled targets. These kinds of trees are the key part of every tree-based method, since the way you grow a tree is largely the same across methods. The implementations differ mostly in the splitting rule. In this tutorial we will program a very simple, but generic, implementation of a regression tree.
Fortunately, we do not have to cover much maths in this tutorial, because the algorithm itself is more of a technical than a mathematical challenge. With that said, the technical path I have chosen might not be the most efficient one, but I tried to trade off efficiency against simplicity.
Anyway, as most of you might know, decision or regression trees are rule-based approaches, meaning we try to split the data into partitions conditional on our feature space. This data partitioning is done with the help of a splitting criterion. There is no common ground on how to do these splits; rather, there are multiple splitting criteria with different pros and cons. We will focus on a rather simple one in this tutorial. Bear with me, here comes some maths.
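For a candidate split that divides the data in a node into two subsets $S_1$ and $S_2$, the criterion we will use is the sum of squared errors (SSE) over both subsets:

$$SSE = \sum_{i \in S_1} (y_i - \bar{y}_1)^2 + \sum_{i \in S_2} (y_i - \bar{y}_2)^2$$

where $\bar{y}_1$ and $\bar{y}_2$ are the means of the target within the respective subsets.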
Ok, so what does this formula state? It is the sum of squared errors, determined separately in the two subsets $S_1$ and $S_2$ that a candidate split creates. As the name suggests, that is something we want to minimize. In fact, it is the squared distance between the target values and their mean within each data subset. In every node of our regression tree we calculate the SSE for every potential split we could do in our data, for every feature we have, to figure out the best split we can achieve.
Let us have a look at the R code:
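A minimal sketch of what such a splitting function could look like follows below; the name sse_var is the one used later in this post, while the body is just one way of writing down the logic described afterwards.

```r
# for one numeric feature x and the target y, find the split with the lowest SSE
sse_var <- function(x, y) {
  splits <- sort(unique(x))
  sse    <- numeric(length(splits))

  # SSE of the two subsets induced by every candidate split point
  for (i in seq_along(splits)) {
    left   <- y[x <= splits[i]]
    right  <- y[x > splits[i]]
    sse[i] <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }

  # return the best (lowest) SSE and where to split
  c(sse = min(sse), split = splits[which.min(sse)])
}
```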
The function takes two inputs: our numeric feature x and our real-valued target y. We then go ahead and calculate the SSE for every unique value of x. This means we calculate the SSE for every possible data subset we could obtain conditional on the feature. Often we want to cover more than one feature in our problem, which means that we have to run this function for every feature. As a result, the best splitting rule has the lowest SSE among all possible splits of all features. Once we have determined the best splitting rule, we can split our data into two subsets according to this criterion, which is nothing else than feature x <= split_at and x > split_at. We call these two subsets children, and they can in turn be split into subsets again.
Let us spend some more words on the SSE though, because it reveals our estimator. In this implementation our estimator in the leaf is simply the average value of our target within this data subset. This is the simplest version of a regression tree. However, with some additional work you can apply more sophisticated models, e.g. an ordinary least squares fit.
The Algorithm
Enough with the talking, let's get to the juice. In the following we will break the algorithm down into easy-to-digest code chunks and walk through them one by one; the complete function is on my GitHub.
Ok, so what do we have here? At the heart of the algorithm are two loops: a while loop and a for loop. But let us start bit by bit and have a look at the first code chunk.
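Below is a sketch of what this first chunk could look like. The function name reg_tree and the minsize argument (the minimum number of observations per leaf) are naming choices of this sketch rather than fixed by the post; together with the loop chunk and the fitted-values chunk further below, the three pieces form one complete function.

```r
# sketch, part 1/3: input handling and the job list
reg_tree <- function(formula, data, minsize) {
  # coerce to a plain data.frame and extract the target y
  data     <- as.data.frame(data)
  response <- all.vars(formula)[1]
  y        <- data[[response]]

  # design matrix X holding all our features (the intercept column is dropped)
  X <- model.matrix(formula, data)[, -1, drop = FALSE]

  # condition for the while loop
  do_splits <- TRUE

  # tree_info is our job list: one row per node, FILTER stores the path to it
  tree_info <- data.frame(NODE = 1, NOBS = nrow(data), FILTER = NA,
                          TERMINAL = "SPLIT", stringsAsFactors = FALSE)

  # ... the while loop follows in the next chunk
```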
Essentially, this is everything before the while loop. Here, we see some input handling in the beginning and the extraction of our design matrix X and our target y. All our features are within this design matrix. do_splits is the condition for our while loop, which we will cover in a bit. The data frame tree_info is the key storage element in our algorithm, because it contains all the information we need about our tree. The essential piece of this object is the filter column. This column saves the paths (filters) we have to apply to our data to get to a leaf, i.e. a terminal node, of our regression tree. We initiate this data.frame with NODE = 1, since at the beginning we are in the root node of our tree with the whole data set at our disposal. Furthermore, there is a column called TERMINAL, which controls the state of the node. We have three different states: SPLIT, LEAF and PARENT. When we describe a node with the SPLIT state, we mark it for a potential split. The state PARENT indicates that we have already split this node. Lastly, the state LEAF marks the terminal nodes of our regression tree.
Reaching the treetop
When do we reach such a terminal node? In many implementations there is a minimum size parameter, which determines valid splits by the number of observations in the resulting children. If a child would hold fewer data points than the minimum size, the split is invalid and will not be carried out. Imagine not having this parameter: we could end up with leaves covering only a single observation. Another termination rule applies if the lowest SSE belongs to a split we have already invoked within this branch. A split at such a point would be nonsense, since we would end up with the same subset over and over again.
Basically, this is everything there is to the algorithm. The while and for loops just ensure that we estimate and create every node in our tree_info. Thus, you can think of tree_info as a sort of job list, since we create new jobs within this data frame. Let us walk through the actual R code.
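Continuing the sketch, the loop part (still inside the function started above) could look like this; the object names tmp_splitter, tmp_filter, split_here and do_splits are the ones referred to in the walkthrough below, while the remaining helper names are my own.

```r
  # sketch, part 2/3: growing the tree
  while (do_splits) {
    # all nodes we still have to estimate, i.e. nodes in the "SPLIT" state
    to_calculate <- which(tree_info$TERMINAL == "SPLIT")

    for (j in to_calculate) {
      # filter the data according to the path of the current node;
      # the root node has no filter, so we take the data as is
      if (!is.na(tree_info[j, "FILTER"])) {
        keep      <- eval(parse(text = tree_info[j, "FILTER"]), envir = data)
        this_data <- data[keep, , drop = FALSE]
        X         <- model.matrix(formula, this_data)[, -1, drop = FALSE]
      } else {
        this_data <- data
      }

      # SSE and best split point for every feature
      splitting <- apply(X, MARGIN = 2, FUN = sse_var, y = this_data[[response]])

      # feature with the overall lowest SSE and its split point
      tmp_splitter <- colnames(splitting)[which.min(splitting["sse", ])]
      split_at     <- splitting["split", tmp_splitter]

      # the two filter rules defining the children
      mn         <- max(tree_info$NODE)
      tmp_filter <- paste(tmp_splitter, c("<=", ">"), split_at)

      # a split is only valid if its rule has not been used in this branch yet
      branch <- tree_info[j, "FILTER"]
      if (is.na(branch)) {
        split_here <- rep(TRUE, 2)
      } else {
        split_here <- !sapply(tmp_filter, grepl, x = branch, fixed = TRUE)
      }

      # number of observations that would end up in each child
      tmp_nobs <- sapply(tmp_filter, function(rule) {
        sum(eval(parse(text = rule), envir = this_data))
      })

      # children below the minimum size invalidate the split
      if (any(tmp_nobs < minsize)) split_here <- rep(FALSE, 2)

      # prepend the path of the current node to the children's filters
      if (!is.na(branch)) tmp_filter <- paste(branch, tmp_filter, sep = " & ")

      # update the state of the current node and append its children
      children <- data.frame(NODE = c(mn + 1, mn + 2), NOBS = unname(tmp_nobs),
                             FILTER = tmp_filter, TERMINAL = "SPLIT",
                             stringsAsFactors = FALSE)[split_here, ]
      tree_info[j, "TERMINAL"] <- if (any(split_here)) "PARENT" else "LEAF"
      tree_info <- rbind(tree_info, children)
    }

    # the tree is fully grown once no "SPLIT" nodes are left
    do_splits <- any(tree_info$TERMINAL == "SPLIT")
  }
```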
The while loop covers our tree depth and the for loop all splits within this particular depth. Within the while loop we seek every row in our tree_info which we still have to estimate, i.e. all nodes in the "SPLIT" state. In the first iteration that is the first row, our root node. The for loop then iterates over all these potential splitting nodes. The first if condition ensures that we filter our data according to the tree depth. Of course, there is no filter in the root node, which is why we take the data as is. But imagine calculating possible splits for a parent in depth level two. The filter would look similar to this one: "feature_1 > 5.33 & feature_2 <= 3.22".
How do we apply splitting?
Afterwards, we seek the minimum SSE by applying the sse_var function to every feature. Note that in this version we can only handle numeric features. Once we have found the best splitting variable, here named tmp_splitter, we build the corresponding filter rules in the object tmp_filter.
We still have to check whether this is a valid split, i.e. whether we have not invoked this split for this branch yet and whether we have sufficient observations in our children. Our indicator split_here rules whether we split our node. Well, that is about it. In the last passage of the loop we prepare the output for this node: we update the state of the calculated node and add the children's information to our job list tree_info. After the for loop we have to check whether our tree is fully grown. The variable do_splits checks whether there are any nodes left we have to calculate. We terminate the calculation if there are no "SPLIT" nodes left in our tree_info.
Extracting our estimates
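The closing chunk of the sketch, which computes the fitted values and finishes the function started above, could look like this:

```r
  # sketch, part 3/3: fitted values per leaf
  leafs  <- tree_info[tree_info$TERMINAL == "LEAF", ]
  fitted <- numeric(nrow(data))

  for (i in seq_len(nrow(leafs))) {
    # rows of the data that fall into this leaf (an NA filter means the root)
    if (is.na(leafs[i, "FILTER"])) {
      ind <- rep(TRUE, nrow(data))
    } else {
      ind <- eval(parse(text = leafs[i, "FILTER"]), envir = data)
    }
    # our estimator: the mean of the target within the leaf
    fitted[ind] <- mean(y[ind])
  }

  # return the grown tree and its fit
  return(list(tree = tree_info, fit = fitted, formula = formula, data = data))
}
```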
At the end of our calculation we have a filter rule for every leaf in our tree. With the help of these filters we can easily calculate the fitted values by simply applying the filter to our data and taking the mean of our target within the resulting leaf. I am sure that by now you can think of a way to implement more sophisticated estimators, which I will leave up to you.
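To try the whole thing out, here is a small, made-up usage example; the simulated data frame df and its column names are invented for illustration and assume the sse_var and reg_tree sketches from above have been sourced.

```r
# hypothetical example data: a noisy, non-linear relationship
set.seed(42)
df   <- data.frame(x1 = runif(500, -3, 3), x2 = runif(500, -3, 3))
df$y <- sin(df$x1) + 0.5 * df$x2^2 + rnorm(500, sd = 0.2)

# grow the tree with at least 20 observations per leaf
mod <- reg_tree(formula = y ~ x1 + x2, data = df, minsize = 20)

mod$tree       # the tree_info job list, including the filter of every leaf
head(mod$fit)  # fitted values, i.e. the mean of y within each leaf
```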
Well, that's a regression tree with a minimum size restriction. I have created a little run-through with data from my simulation package on my GitHub, which you can check out to try everything on your own. Also make sure to check out my other blog post about coding gradient boosted machines from scratch.
About the author
André Bleier
The most exciting part of being a data scientist at STATWORX is finding that unique solution to a problem by fusing machine learning, statistics and business knowledge.
STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI.