
How to determine the file size of a remote download without reading the entire file with R

Tags:

r

Is there a reasonably straightforward way to determine the file size of a remote file without downloading the entire file? Stack Overflow answers describe how to do this with PHP and curl, so I imagine it is possible in R as well. If possible, I would prefer to avoid RCurl, since (I believe) it requires an additional installation for non-Windows users.

On this survey analysis website, I write lots of scripts to automatically download large data files from government agencies (like the US Census Bureau and the CDC). I am trying to implement an additional component that will skip files that have already been downloaded, by creating a "download cache". However, I am concerned that this cache might become stale or corrupted if: 1) the host website changes a file, or 2) the user cancels a download midway through. Therefore, when deciding whether to download a file from the source HTTP or FTP site, I want to compare the local file size to the remote file size, and if they are not the same, download the file again.
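
A rough sketch of the check I have in mind, where get_remote_size() is a hypothetical helper standing in for whatever technique answers this question:

## get_remote_size() is hypothetical -- it stands in for whatever
## method this question is asking about (remote size in bytes)
cached_download <- function(url, destfile) {
    local_size <- if (file.exists(destfile)) file.info(destfile)$size else -1
    if (local_size != get_remote_size(url)) {
        download.file(url, destfile, mode = "wb")
    }
    invisible(destfile)
}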

Asked Jan 04 '14 by Anthony Damico


1 Answer

Nowadays a straightforward approach might be:

response = httr::HEAD(url)
httr::headers(response)[["Content-Length"]]
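
The value comes back as a character string, and some servers omit Content-Length entirely (for chunked or dynamically generated responses), so a small sketch of coercing it safely might look like:

cl <- httr::headers(response)[["Content-Length"]]
## guard against a missing header before coercing to a number
remote_size <- if (is.null(cl)) NA_real_ else as.numeric(cl)
remote_size
## e.g. 60848 for the Stack Overflow page used below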

My original answer was: a more 'by hand' approach is to set the CURLOPT_NOBODY option (see man curl_easy_setopt on Linux; the idea comes from the answers to the linked question) and to tell getURL and friends to return the header along with the request:

library(RCurl)
url = "http://stackoverflow.com/questions/20921593/how-to-determine-the-file-size-of-a-remote-download-without-reading-the-entire-f"
xx = getURL(url, nobody=1L, header=1L)
strsplit(xx, "\r\n")

## [[1]]
##  [1] "HTTP/1.1 200 OK"                             
##  [2] "Cache-Control: public, max-age=60"           
##  [3] "Content-Length: 60848"                       
##  [4] "Content-Type: text/html; charset=utf-8"      
##  [5] "Expires: Sat, 04 Jan 2014 14:09:58 GMT"      
##  [6] "Last-Modified: Sat, 04 Jan 2014 14:08:58 GMT"
##  [7] "Vary: *"                                     
##  [8] "X-Frame-Options: SAMEORIGIN"                 
##  [9] "Date: Sat, 04 Jan 2014 14:08:57 GMT"         
## [10] ""                                            

A peek at url.exists suggests parseHTTPHeader(xx) for parsing HTTP headers. getURL also works with ftp URLs.

url = "ftp://ftp2.census.gov/AHS/AHS_2004/AHS_2004_Metro_PUF_Flat.zip"
getURL(url, nobody=1L, header=1L)
## [1] "Content-Length: 21288307\r\nAccept-ranges: bytes\r\n"
Answered by Martin Morgan