Is there a reasonably straightforward way to determine the size of a remote file without downloading the entire file? Stack Overflow answers how to do this with PHP and curl, so I imagine it's possible in R as well. If possible, I would prefer to avoid RCurl, since I believe it requires an additional installation for non-Windows users.
On this survey analysis website, I write lots of scripts to automatically download large data files from government agencies (like the US Census Bureau and the CDC). I am trying to implement an additional component that skips any file that has already been downloaded, by creating a "download cache". I am concerned, however, that this cache might get corrupted if 1) the host website changes a file or 2) the user cancels a download midway through. Therefore, when deciding whether to download a file from the source HTTP or FTP site, I want to compare the local file size to the remote file size, and if they are not the same, download the file again.
Nowadays a straightforward approach might be:
response = httr::HEAD(url)
httr::headers(response)[["Content-Length"]]
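To connect this to the download cache described in the question, here is a minimal sketch of the size comparison. The helper names `remote_file_size` and `needs_download` are hypothetical, not part of httr, and the sketch assumes an HTTP(S) server that reports a `Content-Length` header; when no size is reported it falls back to downloading again.

```r
# Hypothetical helper: size in bytes reported by the server, or NA if the
# response carries no Content-Length header. Requires the httr package.
remote_file_size <- function(url) {
  response <- httr::HEAD(url)
  httr::stop_for_status(response)
  size <- httr::headers(response)[["Content-Length"]]
  if (is.null(size)) NA_real_ else as.numeric(size)
}

# Hypothetical helper: TRUE when the cached copy is missing, when the remote
# size is unknown, or when the local and remote sizes differ.
needs_download <- function(local_path, remote_size) {
  if (!file.exists(local_path)) return(TRUE)
  if (is.na(remote_size)) return(TRUE)
  file.size(local_path) != remote_size
}
```

A call might look like `if (needs_download(path, remote_file_size(url))) download.file(url, path, mode = "wb")`. Note that equal sizes do not prove the contents are identical; the check only catches truncated downloads and files whose size has changed.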
My original answer was: A more 'by hand' approach is to set the CURLOPT_NOBODY option (see man curl_easy_setopt on Linux; this was inspired by the answers to the linked question) and tell getURL and friends to include the response headers in the output:
library(RCurl)
url = "http://stackoverflow.com/questions/20921593/how-to-determine-the-file-size-of-a-remote-download-without-reading-the-entire-f"
xx = getURL(url, nobody=1L, header=1L)
strsplit(xx, "\r\n")
## [[1]]
## [1] "HTTP/1.1 200 OK"
## [2] "Cache-Control: public, max-age=60"
## [3] "Content-Length: 60848"
## [4] "Content-Type: text/html; charset=utf-8"
## [5] "Expires: Sat, 04 Jan 2014 14:09:58 GMT"
## [6] "Last-Modified: Sat, 04 Jan 2014 14:08:58 GMT"
## [7] "Vary: *"
## [8] "X-Frame-Options: SAMEORIGIN"
## [9] "Date: Sat, 04 Jan 2014 14:08:57 GMT"
## [10] ""
A peek at url.exists suggests parseHTTPHeader(xx) for parsing HTTP headers. getURL also works with FTP URLs:
url = "ftp://ftp2.census.gov/AHS/AHS_2004/AHS_2004_Metro_PUF_Flat.zip"
getURL(url, nobody=1L, header=1L)
## [1] "Content-Length: 21288307\r\nAccept-ranges: bytes\r\n"
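Since the FTP result above has no HTTP status line, one option is to pull the size out of the raw header text with base R rather than relying on parseHTTPHeader. The sketch below is only an illustration: `content_length` is a hypothetical helper, and it assumes the header text uses "\r\n" line endings as in the outputs shown above.

```r
# Hypothetical helper: extract Content-Length (in bytes) from raw header text
# such as the string returned by getURL(url, nobody=1L, header=1L).
# Returns NA when no Content-Length line is present.
content_length <- function(header_text) {
  lines <- strsplit(header_text, "\r\n", fixed = TRUE)[[1]]
  hit <- grep("^content-length:", lines, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_real_)
  # If several header blocks are present (e.g. after redirects), use the last.
  as.numeric(trimws(sub("^[^:]*:", "", hit[length(hit)])))
}

content_length("Content-Length: 21288307\r\nAccept-ranges: bytes\r\n")
## [1] 21288307
```

The returned number can then be compared with file.size() of the cached local copy.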