How do I capture the HTTP error code from a download.file request?

Tags: http, curl, r, wget

This code attempts to download a page that does not exist:

url <- "https://en.wikipedia.org/asdfasdfasdf"
status_code <- download.file(url, destfile = "output.html", method = "libcurl")

This returns a 404 error:

trying URL 'https://en.wikipedia.org/asdfasdfasdf'
Error in download.file(url, destfile = "output.html", method = "libcurl") : 
  cannot open URL 'https://en.wikipedia.org/asdfasdfasdf'
In addition: Warning message:
In download.file(url, destfile = "output.html", method = "libcurl") :
  cannot open URL 'https://en.wikipedia.org/asdfasdfasdf': HTTP status was '404 Not Found'

but the status_code variable still contains a 0, even though the documentation for download.file states that the returned value is:

An (invisible) integer code, 0 for success and non-zero for failure. For the "wget" and "curl" methods this is the status code returned by the external program. The "internal" method can return 1, but will in most cases throw an error.

The results are the same if I use curl or wget as the download method. What am I missing here? Is the only option to call warnings() and parse the output?

I've seen other questions about using download.file, but none (that I can find) that actually retrieve the HTTP status code.

Asked by Michael A on Dec 17 '18 at 21:12


2 Answers

Probably the best option is to use the cURL library directly rather than through the download.file wrapper, which does not expose the full functionality of cURL. One way to do this is the RCurl package (other packages such as httr, or system calls, can achieve the same thing). Using cURL directly gives you access to the cURL info, including the response code. For example:

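library(RCurl)

# Keep a handle so the transfer details can be inspected afterwards
curl <- getCurlHandle()
x <- getURL("https://en.wikipedia.org/asdfasdfasdf", curl = curl)
write(x, "output.html")

# HTTP status code of the last request made with this handle
getCurlInfo(curl)$response.code
# [1] 404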

Although the first option above is much cleaner, if you really want to use download.file, one possible way is to capture the warning using withCallingHandlers:

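url <- "https://en.wikipedia.org/asdfasdfasdf"

try(withCallingHandlers(
  download.file(url, destfile = "output.html", method = "libcurl"),
  warning = function(w) {
    # Keep only the text that follows "HTTP status was "
    my.warning <<- sub(".+HTTP status was ", "", conditionMessage(w))
  }),
  silent = TRUE)

cat(my.warning)
# '404 Not Found'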
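
The warning-capturing approach above can be wrapped so that it returns a numeric status instead of a string. This is only a sketch: extract_http_status and download_with_status are hypothetical helper names, and it assumes the warning text has the form shown in the question ("HTTP status was '404 Not Found'"), which may differ across R versions and download methods. On a successful download no warning is raised, so the status stays NA.

```r
# Hypothetical helper: pull the 3-digit HTTP status out of a warning message
# such as "cannot open URL '...': HTTP status was '404 Not Found'".
extract_http_status <- function(msg) {
  m <- regmatches(msg, regexpr("HTTP status was '[0-9]{3}", msg))
  if (length(m) == 0) return(NA_integer_)
  as.integer(sub("HTTP status was '", "", m))
}

# Hypothetical wrapper around download.file that records the status code
# reported in a warning, if any, and swallows the resulting error.
download_with_status <- function(url, destfile, ...) {
  status <- NA_integer_
  result <- tryCatch(
    withCallingHandlers(
      download.file(url, destfile = destfile, ...),
      warning = function(w) {
        code <- extract_http_status(conditionMessage(w))
        if (!is.na(code)) status <<- code
        invokeRestart("muffleWarning")
      }
    ),
    error = function(e) NULL
  )
  list(ok = !is.null(result) && result == 0, http_status = status)
}

# Usage (needs network access):
# download_with_status("https://en.wikipedia.org/asdfasdfasdf",
#                      "output.html", method = "libcurl")
```

Parsing message text is fragile, which is one more reason to prefer the cURL-based option when the status code matters.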
Answered by dww on Sep 20 '22 at 05:09


If you don't mind using a different method, you can try GET from the httr package:

url_200 <- "https://en.wikipedia.org/wiki/R_(programming_language)"
url_404 <- "https://en.wikipedia.org/asdfasdfasdf"

# OK
raw_200 <- httr::GET(url_200)
raw_200$status_code
#> [1] 200

# Not found
raw_404 <- httr::GET(url_404)
raw_404$status_code
#> [1] 404
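
The calls above only fetch the page into memory. To also save it to a file, as download.file would, one possibility is httr's write_disk. The following is a minimal sketch; fetch_to_file is a hypothetical helper name, not part of httr, and the body of an error response (such as a 404 page) may still be written to destfile.

```r
# Sketch: save a URL to disk with httr and return the HTTP status code.
# fetch_to_file is a hypothetical helper, not an httr function.
fetch_to_file <- function(url, destfile) {
  resp <- httr::GET(url, httr::write_disk(destfile, overwrite = TRUE))
  httr::status_code(resp)
}

# Usage (needs network access and the httr package):
# fetch_to_file("https://en.wikipedia.org/asdfasdfasdf", "output.html")
```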

Created on 2019-01-02 by the reprex package (v0.2.1)

Answered by Birger on Sep 19 '22 at 05:09