In this post, the failure pressure of a pipeline containing a defect will be predicted based solely on burst test results and machine learning models. For this purpose, various machine learning models will be fitted to test data in R using the caret package, and their accuracy will be compared in order to identify the best-performing one(s).
Importing The Data
The data set has been extracted from an extensive database of burst test results compiled by GL in 2009, and contains the results of approximately 313 tests. The report, as well as the data used here, can be obtained using this link.
The data was extracted and inserted into a CSV file so that it can be loaded and manipulated easily.
We will start by loading all libraries needed to run the analysis and the data set.
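A minimal sketch of this step, assuming the CSV has been saved as `burst_tests.csv` (a hypothetical file name) and that caret and ggplot2 are installed. The helper `load_burst_data()` is an illustrative name, not part of any package:

```r
# Packages named in this post: caret for model fitting, ggplot2 for plots.
library(caret)
library(ggplot2)

# Hypothetical file name: point this at wherever the CSV was saved.
csv_path <- "burst_tests.csv"

# Illustrative helper: read the CSV, failing with a clear message if missing.
load_burst_data <- function(path) {
  if (!file.exists(path)) {
    stop("CSV file not found: ", path)
  }
  # Text columns (e.g. a failure mode label) are read as factors.
  read.csv(path, stringsAsFactors = TRUE)
}

if (file.exists(csv_path)) {
  burst <- load_burst_data(csv_path)
  str(burst)  # check column names and types before modelling
}
```

Reading text columns as factors is a convenience here, since a categorical column such as the failure mode can then be used directly for grouping in plots.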
We can view all of the selected parameters in a pairwise comparison plot, grouped by failure mode.
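As a rough illustration, a pairwise plot coloured by failure mode could be drawn with base R's `pairs()`; a ggplot2-based pairs plot would work equally well. The data below is synthetic, and the column names and the "Leak"/"Rupture" labels are assumptions made for this sketch, not the real CSV contents:

```r
set.seed(42)
n <- 60

# Synthetic stand-in data; column names and failure mode labels are assumed.
burst <- data.frame(
  NormLength  = runif(n, 0.2, 6),
  Depth       = runif(n, 0.1, 0.8),
  OD_WT       = runif(n, 40, 100),
  SMYS        = sample(c(290, 358, 448), n, replace = TRUE),
  FailureMode = factor(sample(c("Leak", "Rupture"), n, replace = TRUE))
)
# Arbitrary formula, only there to give the plot a plausible response column.
burst$BurstPressure <- 2 * burst$SMYS / burst$OD_WT * (1 - 0.6 * burst$Depth) +
  rnorm(n, sd = 0.5)

num_cols  <- c("NormLength", "Depth", "OD_WT", "SMYS", "BurstPressure")
# One colour per failure mode, indexed by the factor's integer codes.
mode_cols <- c("steelblue", "firebrick")[as.integer(burst$FailureMode)]

pairs(burst[, num_cols], col = mode_cols, pch = 19,
      main = "Selected parameters by failure mode")
```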
The next step is to partition our data into two sets: 90% for training and 10% for testing. These will be balanced splits based on the burst pressure values. The purpose of splitting the data is to let the algorithms learn the relationships between the parameters from the training set; the testing set is then used as an independent set to check that the trained model did not over-fit the data during the training step.
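With caret, a balanced 90/10 split of this kind is usually made with `createDataPartition()`. The sketch below uses a random stand-in vector in place of the measured burst pressures; the variable names are illustrative:

```r
library(caret)

set.seed(123)
# Stand-in response: 313 random values instead of the measured burst pressures.
burst_pressure <- runif(313, min = 5, max = 30)

# For a numeric outcome, createDataPartition() samples within quantile groups,
# so the training and testing sets cover a similar range of values.
train_idx <- createDataPartition(burst_pressure, p = 0.9, list = FALSE)

train_y <- burst_pressure[train_idx]
test_y  <- burst_pressure[-train_idx]

# Compare the sizes and distributions of the two sets.
length(train_y)
length(test_y)
summary(train_y)
summary(test_y)
```

With a data frame, the same index would be applied to the rows, for example `burst[train_idx, ]` and `burst[-train_idx, ]`.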
Train the Models
The caret package supports a large list of machine learning algorithms; for this analysis, I picked 6 regression models at random:
- Model Tree
- Support Vector Machines with Linear Kernel
- Random Forest
- k-Nearest Neighbors
- Generalized Linear Model
- Projection Pursuit Regression
We will use repeated cross-validation to evaluate the performance of each model. In k-fold cross-validation, the training set is randomly partitioned into k equal-size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data. This is done k times, so that each subsample serves as validation data exactly once. The whole procedure is then repeated with new random partitions; for our analysis we will use 10 folds and 10 repeats.
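In caret, this resampling scheme is typically declared once with `trainControl()` and then passed to every `train()` call. A minimal sketch, where the object name `cv_control` and the stand-in response `y` are illustrative assumptions:

```r
library(caret)

# 10-fold cross-validation, repeated 10 times with fresh random partitions.
cv_control <- trainControl(
  method  = "repeatedcv",
  number  = 10,  # k: number of folds
  repeats = 10   # number of times the whole k-fold procedure is repeated
)

# To see what this implies, build the resampling index sets explicitly
# for a stand-in response vector.
set.seed(1)
y     <- runif(280, min = 5, max = 30)
folds <- createMultiFolds(y, k = 10, times = 10)

length(folds)            # one training index set per fold per repeat
head(names(folds), 3)    # names identify the fold and the repeat
```

Each element of `folds` holds the row indices used for training in one resample; the rows left out form the validation subsample for that resample.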
Each of the 6 models will be built using the function ‘train()’, and the relationship between parameters will be expressed as follows:
Burst Pressure = f(Normalised Length, Depth, OD/WT, SMYS)
At the end, we will use the 10% testing data to predict the burst pressure with each trained model, using the function ‘predict()’.
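The workflow can be sketched as follows on synthetic data. The column names and the formula used to generate the response are assumptions of this example, not the real data set. Only three of the six models are fitted here because caret's `"glm"`, `"knn"` and `"ppr"` methods need no extra packages; the method codes for the other three are `"M5"` (needs RWeka), `"svmLinear"` (needs kernlab) and `"rf"` (needs randomForest):

```r
library(caret)

set.seed(2024)
n <- 150
# Synthetic stand-in data; column names are assumed for illustration.
burst <- data.frame(
  NormLength = runif(n, 0.2, 6),
  Depth      = runif(n, 0.1, 0.8),
  OD_WT      = runif(n, 40, 100),
  SMYS       = sample(c(290, 358, 448), n, replace = TRUE)
)
# Arbitrary relationship plus noise, only to give the models something to learn.
burst$BurstPressure <- 2 * burst$SMYS / burst$OD_WT * (1 - 0.6 * burst$Depth) +
  rnorm(n, sd = 0.5)

# 90/10 split balanced on the response.
idx      <- createDataPartition(burst$BurstPressure, p = 0.9, list = FALSE)
training <- burst[idx, ]
testing  <- burst[-idx, ]

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# caret method codes for three of the six models in the list above.
methods <- c(glm = "glm", knn = "knn", ppr = "ppr")

# Fit every model with the same formula and resampling scheme.
fits <- lapply(methods, function(m) {
  train(BurstPressure ~ NormLength + Depth + OD_WT + SMYS,
        data = training, method = m, trControl = ctrl)
})

# Predict the held-out 10% with each trained model.
preds <- lapply(fits, predict, newdata = testing)
str(preds)
```

Keeping the fitted models in a named list makes it easy to add or remove a method code without touching the rest of the code.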
Compare the Models
Now that we have trained and tested all 6 models, let’s have a look at how each one performed. First, a plot of the predicted burst pressure vs. the real burst pressure:
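A predicted-vs-actual plot of this kind could be drawn with ggplot2 roughly as follows. The values and the model labels "Model A" and "Model B" are made up purely to demonstrate the plotting code; they are not results from this analysis:

```r
library(ggplot2)

set.seed(7)
# Made-up values standing in for test-set observations and model predictions.
actual  <- runif(30, min = 5, max = 30)
results <- data.frame(
  Actual    = rep(actual, times = 2),
  Predicted = c(actual + rnorm(30, sd = 1), actual + rnorm(30, sd = 3)),
  Model     = rep(c("Model A", "Model B"), each = 30)
)

p <- ggplot(results, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.7) +
  # Dashed 1:1 line: points on it are perfect predictions.
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  facet_wrap(~ Model) +
  labs(x = "Actual burst pressure", y = "Predicted burst pressure")

print(p)
```

The closer the points sit to the dashed line, the better the model; points below the line are under-predictions.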
We can see that, overall, Random Forest appears to be the most accurate model. However, performance is not entirely clear from the plots alone; for a clearer comparison we can use some evaluation metrics, which can be viewed by calling the function summary().
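One common way to get such a summary in caret is to collect the cross-validation results of the fitted models with `resamples()` and call `summary()` on the result. The sketch below assumes that approach and uses synthetic data with two models only; all object and column names are illustrative:

```r
library(caret)

set.seed(11)
n <- 120
# Synthetic stand-in data; column names are assumed for illustration.
d <- data.frame(
  NormLength = runif(n, 0.2, 6),
  Depth      = runif(n, 0.1, 0.8),
  OD_WT      = runif(n, 40, 100),
  SMYS       = sample(c(290, 358, 448), n, replace = TRUE)
)
d$BurstPressure <- 2 * d$SMYS / d$OD_WT * (1 - 0.6 * d$Depth) + rnorm(n, sd = 0.5)

idx      <- createDataPartition(d$BurstPressure, p = 0.9, list = FALSE)
training <- d[idx, ]
testing  <- d[-idx, ]

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# Resetting the seed before each call gives both models identical folds,
# which keeps the resampled metrics directly comparable.
set.seed(1)
fit_glm <- train(BurstPressure ~ ., data = training, method = "glm", trControl = ctrl)
set.seed(1)
fit_knn <- train(BurstPressure ~ ., data = training, method = "knn", trControl = ctrl)

# Cross-validated RMSE, R-squared and MAE, summarised per model.
cv_results <- resamples(list(GLM = fit_glm, KNN = fit_knn))
summary(cv_results)

# The same metrics on the held-out testing set for one model.
holdout <- postResample(pred = predict(fit_glm, newdata = testing),
                        obs  = testing$BurstPressure)
holdout
```

Lower RMSE and MAE and higher R-squared indicate a better fit; comparing the cross-validated figures with the hold-out figures is a quick check for over-fitting.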
From the above, Random Forest appears to be the best-performing model in this analysis, so we can now use it in a normal assessment.
First, let’s look at how a normal defect assessment method performs when compared to the test data (in this case I have selected the Modified ASME B31G method). For that, we create a function to calculate the burst pressure of the defects used in the tests and add the results to the parameters we selected previously.
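For reference, here is a sketch of such a function based on the commonly published form of the Modified B31G (0.85 dL) equation. It is not necessarily the function used in the original analysis. The function name, the argument names and the choice of raw dimensions in mm with strengths in MPa are assumptions of this example:

```r
# Modified ASME B31G (0.85 dL) failure pressure, commonly published form.
# Assumed inputs: D = outside diameter (mm), t = wall thickness (mm),
#                 d = defect depth (mm), L = defect axial length (mm),
#                 smys = specified minimum yield strength (MPa).
# Returns the predicted failure pressure in MPa.
modified_b31g <- function(D, t, d, L, smys) {
  z <- L^2 / (D * t)  # normalised defect length squared

  # Folias bulging factor: polynomial form up to z = 50, linear form beyond.
  # pmin() keeps the square root argument valid when z is large.
  z_short <- pmin(z, 50)
  M <- ifelse(z <= 50,
              sqrt(1 + 0.6275 * z_short - 0.003375 * z_short^2),
              0.032 * z + 3.3)

  flow_stress <- smys + 69  # SMYS + 69 MPa (10 ksi)
  rel_depth   <- d / t

  (2 * t / D) * flow_stress *
    (1 - 0.85 * rel_depth) / (1 - 0.85 * rel_depth / M)
}

# Example: a 50% deep, 100 mm long defect in a 500 mm x 10 mm pipe.
modified_b31g(D = 500, t = 10, d = 5, L = 100, smys = 358)
```

The function is vectorised, so it can be applied to whole columns of a data frame to add a B31G prediction for every test in one call.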
Let’s have a look at how the Modified B31G burst pressure results compare to the test results.
We can see that, overall, Modified B31G tends to under-estimate the burst pressure. Now let’s run a practical test with a new, randomly generated data set.
In the above guide we limited the number of models to 6; however, we can include as many as we want for better selection. In the next post, I will show how to combine multiple models into an ensemble for better predictions.