IP address conversion
At work I recently had to match data on IP addresses, along with some fuzzy timestamp matching, which was a mess, to say the least. But before I could even tackle that problem, I hit another one: one dataset had the IPs stored as character strings (e.g. 10.0.0.0), while the other had the IP addresses converted to integers (e.g. 167772160).
Storing IPs as integers saves some space and makes calculations easier. This page goes into detail on how this conversion is made. You split the IP address into its four octets and shift each octet left by a multiple of 8 bits (24, 16, 8 and 0), which is the same as multiplying by the matching power of 256.
Using 10.0.0.0 as an example:
(10*256^3) + (0*256^2) + (0*256^1) + (0*256^0) = 167772160
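The arithmetic above can be sketched in a few lines of C++. This is only a minimal illustration using the standard library; ip_to_int is a made-up name and does not come from the linked page, and the input checking is deliberately basic.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the original post): convert a dotted-quad
// IPv4 string such as "10.0.0.0" to its 32-bit integer form.
std::uint32_t ip_to_int(const std::string& ip) {
    std::istringstream in(ip);
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        int octet = -1;
        char dot = '.';
        // Every octet after the first must be preceded by a '.'.
        if (i > 0 && (!(in >> dot) || dot != '.')) {
            throw std::invalid_argument("malformed IPv4 address: " + ip);
        }
        if (!(in >> octet) || octet < 0 || octet > 255) {
            throw std::invalid_argument("malformed IPv4 address: " + ip);
        }
        // Shifting left by 8 bits is the same as multiplying by 256.
        result = (result << 8) | static_cast<std::uint32_t>(octet);
    }
    // Anything left over after the fourth octet means the input was malformed.
    if (in.peek() != std::char_traits<char>::eof()) {
        throw std::invalid_argument("malformed IPv4 address: " + ip);
    }
    return result;
}
```

With this sketch, ip_to_int("10.0.0.0") gives 167772160, matching the worked example.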
Converting in R
Since it’s a simple mathematical conversion, it’s easy to write a function that converts an IP to an integer, and another that converts it back. Stack Overflow has an answer here.
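The Stack Overflow answer is written in R. To show the "and back" direction, here is a minimal C++ sketch of the same arithmetic (C++ because that is where this post ends up). int_to_ip is a made-up name and is not taken from that answer.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from the original post): convert a 32-bit
// integer back to its dotted-quad IPv4 string.
std::string int_to_ip(std::uint32_t value) {
    std::string out;
    for (int shift = 24; shift >= 0; shift -= 8) {
        // Pull out one octet at a time, most significant first.
        out += std::to_string((value >> shift) & 0xFFu);
        if (shift > 0) {
            out += '.';
        }
    }
    return out;
}
```

Here int_to_ip(167772160) returns "10.0.0.0".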
Converting in Rcpp
During my googling, I stumbled across this blog post, which solved the problem with some C++ code using the magic of Boost, a C++ library with lots of nice functions.
Since my data had several million rows, anything that speeds up the conversion is a good idea! I tried the code available at their site, which threw some errors on comment characters. After removing the comments, everything worked nicely (don’t forget to install the Boost libraries: sudo apt-get install libboost-dev).
Running the code in R is simple, and you’ll get the result without any problems.
Unfortunately, the result returned is a scalar. Running the command inside a mutate() only returns the first IP for all rows.
So, I took to vectorising the code. Time to grab the excellent Advanced R website/book by Hadley Wickham, specifically the Rcpp section. Checking the C++ code, I noticed that rinet_pton returns a scalar (unsigned long), even though a vector is used as the input (CharacterVector). Moreover, it always picks the first IP from the character vector input: from_string(ip[0]).
Going by the Rcpp documentation, I changed the inputs and return values to always be vectors, and wrote a quick C++ loop to vectorise the functions.
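To illustrate the difference between the scalar and the vectorised behaviour, here is a minimal sketch in plain C++. It is not the code from the gist: std::vector stands in for the Rcpp vector types, a small sscanf-based parser stands in for the Boost call, and the names parse_ipv4, first_ip_only and all_ips are made up for this example.

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the Boost parsing call: parse one dotted-quad
// IPv4 string into a 32-bit unsigned integer.
std::uint32_t parse_ipv4(const std::string& ip) {
    unsigned a = 0, b = 0, c = 0, d = 0;
    char extra = 0;
    // Expect exactly four numeric fields and nothing after them.
    if (std::sscanf(ip.c_str(), "%u.%u.%u.%u%c", &a, &b, &c, &d, &extra) != 4 ||
        a > 255 || b > 255 || c > 255 || d > 255) {
        throw std::invalid_argument("malformed IPv4 address: " + ip);
    }
    return (a << 24) | (b << 16) | (c << 8) | d;
}

// Scalar behaviour as described in the post: whatever the input length,
// only element 0 is converted and a single value comes back.
std::uint32_t first_ip_only(const std::vector<std::string>& ips) {
    return parse_ipv4(ips.at(0));
}

// Vectorised behaviour: loop over the input and return one integer per IP.
std::vector<std::uint32_t> all_ips(const std::vector<std::string>& ips) {
    std::vector<std::uint32_t> out;
    out.reserve(ips.size());
    for (const std::string& ip : ips) {
        out.push_back(parse_ipv4(ip));
    }
    return out;
}
```

In an Rcpp version, the input would be a CharacterVector and the result an Rcpp vector of the same length, which is what lets the function work on a whole column at once.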
With the functions now vectorised, it’s easy to pass vectors and run the function on a data frame column.
I should note that I know nothing of C++ programming, and this was fully hacked together by following the Rcpp examples. The rinet_ntop() function throws an error when passed negative numbers (it expects an unsigned long), so you can’t convert the 192.168.0.1 IP to an integer and back. This was not a problem for me, since all I needed was to match the IPs, and the integer IPs in my one dataset had been created via Boost in the first place.
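A likely reason negative numbers show up at all (an assumption on my part, since the post does not spell it out): R's integer type is a signed 32-bit value, so any address from 128.0.0.0 upward does not fit and comes out negative when the same bits are read as signed. The sketch below shows that wraparound in plain C++; to_signed_bits and to_unsigned_bits are made-up names, not part of the gist.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: view an unsigned 32-bit IP value the way a signed
// 32-bit integer (such as R's integer type) would hold the same bits.
std::int32_t to_signed_bits(std::uint32_t ip) {
    std::int32_t out;
    std::memcpy(&out, &ip, sizeof out);  // copy the bit pattern unchanged
    return out;
}

// Hypothetical helper: recover the unsigned value from the signed view.
std::uint32_t to_unsigned_bits(std::int32_t value) {
    // Signed-to-unsigned conversion is well-defined (modulo 2^32).
    return static_cast<std::uint32_t>(value);
}
```

For 192.168.0.1 the unsigned value is 3232235521, which reads as -1062731775 when signed; to_unsigned_bits maps it back. An address such as 10.0.0.0 (167772160) is below 2^31 and is unaffected.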
The code is available on GitHub as a gist.