This code attempts to download a page that does not exist:
url <- "https://en.wikipedia.org/asdfasdfasdf"
status_code <- download.file(url, destfile = "output.html", method = "libcurl")
This returns a 404 error:
trying URL 'https://en.wikipedia.org/asdfasdfasdf'
Error in download.file(url, destfile = "output.html", method = "libcurl") :
cannot open URL 'https://en.wikipedia.org/asdfasdfasdf'
In addition: Warning message:
In download.file(url, destfile = "output.html", method = "libcurl") :
cannot open URL 'https://en.wikipedia.org/asdfasdfasdf': HTTP status was '404 Not Found'
but the status_code variable still contains a 0, even though the documentation for download.file states that the returned value is:
An (invisible) integer code, 0 for success and non-zero for failure. For the "wget" and "curl" methods this is the status code returned by the external program. The "internal" method can return 1, but will in most cases throw an error.
The results are the same if I use curl or wget as the download method. What am I missing here? Is the only option to call warnings() and parse the output?
I've seen other questions about using download.file, but none (that I can find) that actually retrieve the HTTP status code.
Probably the best option is to use the cURL library directly rather than via the download.file wrapper, which does not expose the full functionality of cURL. We can do this, for example, using the RCurl package (although other packages such as httr, or system calls, can achieve the same thing). Using cURL directly gives you access to the cURL info for the request, including the response code. For example:
library(RCurl)
curl <- getCurlHandle()
x <- getURL("https://en.wikipedia.org/asdfasdfasdf", curl = curl)
write(x, "output.html")
getCurlInfo(curl)$response.code
# [1] 404
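If you prefer the newer curl package over RCurl, curl_fetch_memory() exposes the status code in much the same way. This is a sketch of mine, not part of the original answer, and assumes the curl package is installed:

```r
# Sketch using the 'curl' package instead of RCurl (my addition, not from
# the original answer). curl_fetch_memory() does not error on 4xx/5xx, so
# the status code can be inspected directly.
library(curl)

res <- curl_fetch_memory("https://en.wikipedia.org/asdfasdfasdf")
res$status_code                        # HTTP status, e.g. 404 here
writeBin(res$content, "output.html")   # save the body regardless of status
```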
Although the first option above is much cleaner, if you really want to use download.file instead, one possible way would be to capture the warning using withCallingHandlers:
try(withCallingHandlers(
  download.file(url, destfile = "output.html", method = "libcurl"),
  warning = function(w) {
    my.warning <<- sub(".+HTTP status was ", "", conditionMessage(w))
  }),
  silent = TRUE)
cat(my.warning)
# '404 Not Found'
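Building on the same mechanism, the pattern can be wrapped in a small helper. The function name here is hypothetical, my own sketch rather than part of the original answer:

```r
# Hypothetical wrapper (my sketch) around the withCallingHandlers pattern
# above: downloads a file and returns the HTTP status text pulled from any
# warning that download.file raises, or "OK" if no warning occurred.
download_with_status <- function(url, destfile) {
  status <- "OK"
  try(withCallingHandlers(
    download.file(url, destfile = destfile, method = "libcurl"),
    warning = function(w) {
      status <<- sub(".+HTTP status was ", "", conditionMessage(w))
    }),
    silent = TRUE)
  status
}

download_with_status("https://en.wikipedia.org/asdfasdfasdf", "output.html")
```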
If you don't mind using a different method, you can try GET from the httr package:
url_200 <- "https://en.wikipedia.org/wiki/R_(programming_language)"
url_404 <- "https://en.wikipedia.org/asdfasdfasdf"
# OK
raw_200 <- httr::GET(url_200)
raw_200$status_code
#> [1] 200
# Not found
raw_404 <- httr::GET(url_404)
raw_404$status_code
#> [1] 404
Created on 2019-01-02 by the reprex package (v0.2.1)
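Extending that answer (a sketch of mine, not part of the original), httr can also translate the code into a human-readable message, save the response body to disk, and raise an R error for failing statuses:

```r
# Sketch extending the httr answer (my addition; assumes httr is installed).
library(httr)

resp <- GET("https://en.wikipedia.org/asdfasdfasdf")
http_status(resp)$message                            # e.g. a "Client error" 404 message
writeBin(content(resp, as = "raw"), "output.html")   # save body either way
stop_for_status(resp)                                # error on 4xx/5xx responses
```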