In this short note I will write about the httr package and my need to detect whether or not an HTTP request had been redirected. It turns out this is quite easy. Along the way I will also show how to access information about an HTTP conversation other than the actual content to be retrieved.
I am the creator and maintainer of the robotstxt package, an R package that enables users to retrieve and parse robots.txt files and that is ultimately designed to do access permission checking for web resources.
Recently a discussion came up about how to interpret permissions in case of sub-domains and HTTP redirects. Long story short: for robots.txt files, redirects are suspicious, and users should at least be informed when one happens so they can take appropriate action.
So, I set out to find a way to check whether or not a robots.txt file requested via the httr package has gone through one or more redirects prior to its retrieval.
httr’s automatic handling of redirects is one of its many wonderful features and happens silently in the background. Despite the fact that httr hides this process, it turns out that the location negotiation process is transparently logged within the return value of httr’s HTTP functions (e.g. httr::GET()). Furthermore, it is easy to tap into that information for further processing.
Now let’s get our hands dirty with some real HTTP requests done via httr. For this we use httpbin.org, a service that allows testing numerous HTTP interaction scenarios, e.g. simple HTTP GET requests with redirection.
When executing an HTTP GET request against the httpbin.org/redirect/2 endpoint, it leads to two redirects before finally providing a resource. At first glance the result and status code look pretty normal: the status is 200 (everything OK) and we get some content back.
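A minimal sketch of such a request, assuming the httr package is installed and httpbin.org is reachable. The https scheme, the variable name resp and the tryCatch() wrapper are my own additions for illustration; the wrapper is only there so the snippet does not fail when offline.

```r
library(httr)

# Request an endpoint that redirects twice before answering.
# tryCatch() turns a connection error into NULL instead of stopping.
resp <- tryCatch(
  GET("https://httpbin.org/redirect/2"),
  error = function(e) NULL
)

if (!is.null(resp)) {
  # Status code of the final response in the chain
  print(status_code(resp))

  # Body of the final response, returned as text
  cat(content(resp, as = "text", encoding = "UTF-8"))
}
```

Nothing in this output hints at the redirects; httr followed them silently.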
So far, so good. If we look further into the response object we see that among its items there are two of particular interest: headers and all_headers. While the former only gives back the headers and response meta information of the last response, the latter is a list of headers and response information for all responses.
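The difference between the two items can be inspected roughly like this. Again this is only a sketch, assuming httr is installed and httpbin.org is reachable; the variable name resp and the tryCatch() wrapper are illustrative additions.

```r
library(httr)

# Same request as before; NULL if the network is unavailable.
resp <- tryCatch(
  GET("https://httpbin.org/redirect/2"),
  error = function(e) NULL
)

if (!is.null(resp)) {
  # headers: header fields of the last response only
  print(names(resp$headers))

  # all_headers: one entry per response in the redirect chain
  print(length(resp$all_headers))

  # Each entry holds a status code, the HTTP version and the header fields
  print(names(resp$all_headers[[1]]))

  # Status codes of every response, in the order they were received
  print(vapply(resp$all_headers, function(h) as.integer(h$status), integer(1)))
}
```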
The solution to the initial problem can now be written down as a neat little function which (1) extracts all status codes logged in all_headers and (2) checks if any of them equals some 3xx status code (3xx is mainly but not exclusively about redirection, but it can always be considered suspicious for the problem at hand).
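One way to sketch the two steps just described. The function name was_redirected() is hypothetical, and the mock objects are hand-built stand-ins that only mimic the all_headers part of an httr response so that the example runs without a network connection.

```r
# Hypothetical helper: TRUE if any logged response carried a 3xx status.
was_redirected <- function(response) {
  # (1) collect the status code of every response in the chain
  status <- vapply(
    response$all_headers,
    function(h) as.integer(h$status),
    integer(1)
  )
  # (2) check whether any of them is a 3xx code
  any(status >= 300L & status < 400L)
}

# Stand-ins for httr responses (assumed shape: list of status/headers entries)
mock_redirected <- list(all_headers = list(
  list(status = 302L, headers = list(location = "/relative-redirect/1")),
  list(status = 200L, headers = list())
))
mock_direct <- list(all_headers = list(
  list(status = 200L, headers = list())
))

was_redirected(mock_redirected)  # TRUE
was_redirected(mock_direct)      # FALSE
```

With a real request, the value returned by httr::GET() would be passed in place of the mock objects.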
A more specific question in regard to redirects is whether or not a redirect did not only change the path but also entailed a domain change. The robots.txt conventions clearly state that each domain and subdomain has to provide its own robots.txt file.
The following function makes use of the all_headers item from the response object again. In addition, it uses the domain() function provided by the urltools package to extract the domain part of a URL. If any location header shows a domain not equal to that of the originally requested URL, a domain change must have happened along the way.
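A sketch of how such a check might look, assuming the urltools package is installed. The function name redirect_changed_domain() and its original_url argument are hypothetical, and the mock objects are hand-built stand-ins for httr responses. Skipping relative location headers is my own assumption: they carry no domain and therefore cannot point to a different host.

```r
library(urltools)

# Hypothetical helper: TRUE if any location header names another domain
# than the URL that was originally requested.
redirect_changed_domain <- function(response, original_url) {
  # Collect all location headers logged along the redirect chain
  locations <- unlist(
    lapply(response$all_headers, function(h) h$headers$location)
  )

  # Keep absolute URLs only; relative locations stay on the same host
  absolute <- locations[grepl("^https?://", locations, ignore.case = TRUE)]
  if (length(absolute) == 0) {
    return(FALSE)
  }

  # Compare every redirect target's domain with the original domain
  any(tolower(domain(absolute)) != tolower(domain(original_url)))
}

# Stand-ins for httr responses (assumed shape: list of status/headers entries)
mock_other_domain <- list(all_headers = list(
  list(status = 301L,
       headers = list(location = "https://www.example.org/robots.txt")),
  list(status = 200L, headers = list())
))
mock_same_domain <- list(all_headers = list(
  list(status = 302L, headers = list(location = "/relative-redirect/1")),
  list(status = 200L, headers = list())
))

redirect_changed_domain(mock_other_domain, "https://example.com/robots.txt")
redirect_changed_domain(mock_same_domain, "https://example.com/robots.txt")
```

The first call involves a move from example.com to www.example.org and so reports a domain change, while the second only follows a relative redirect and does not. Note that a subdomain such as www counts as a different domain here, in line with the robots.txt convention mentioned above.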