
Undocumented RCurl "progressfunction" with URL redirection

Consider this simple RCurl function to report download progress:

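library(RCurl)
# Fetch a URL and print the vector RCurl passes to the progress callback
curlDown=function(url, follow=TRUE){
    x=getURL(url, followlocation=follow, noprogress = FALSE,
        progressfunction=function(down,up) cat(down, '\n'))
    invisible(x)  # return the page content without printing it
}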

Note that with followlocation=TRUE (the default in curlDown) we agree to follow any redirect location that the server sends as part of the HTTP header.

We get:

curlDown("http://www.example.com")
# 0 0 
# 1270 1079 
# 1270 1127 
# 1270 1270 
# 1270 1270 
# 1270 1270 

As you can see, the down variable passed to the callback by RCurl is a numeric vector, where the first element is the total download size in bytes and the second is the running download size. Due to space constraints I don't show it here, but on separate inspection I saw that the former matches the Content-Length field in the response header.

Not every server gives the Content-Length field in the response header:

curlDown("http://www.google.it")
# 0 0  
# 0 603
# ... blah blah
# 0 44848 
# 0 44848 

In this case RCurl sets the missing total value to zero (would NA have been better?).

Google's main domain, ".com", redirects to a country-specific domain, for example ".it" if you are querying from the country associated with that domain (Italy). If you are physically located in Italy, you get:

curlDown("http://www.google.com")
# 0 0 
# 274 274 
# 274 274 
# 274 274 
# 274 0 
# 274 0 
# 274 603
# ... blah blah
# 274 44896 
# 274 44896 

These results are strange. If you compare the running download values with the previous curlDown("http://www.google.it") call, you see that after the redirect the values are essentially the same, as expected, but the total is smaller than the running download!

To understand the problem we do not follow the redirect location:

curlDown("http://www.google.com", follow=FALSE)
# 0 0 
# 274 274 
# 274 274 
# 274 274 

The main domain server, .com, sends a Content-Length of 274 bytes, while the redirected server does not (see the zeros in curlDown("http://www.google.it")).

The problem is that, after the redirect, RCurl does not update the total download size (to zero, for an unknown size), so it remains stuck at the wrong value of 274 bytes.

Is this a bug, or am I missing something?

antonio asked Mar 12 '26 08:03



1 Answer

I think RCurl is faithfully forwarding the values from curl; e.g., as documented for curl_easy_setopt under CURLOPT_PROGRESSFUNCTION, unknown values are passed to the callback as 0. If there's a bug, then it's in curl. Here's a simple program (see here to get going):

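#include <stdio.h>
#include <curl/curl.h>

/* Progress callback: prints the totals and running counts libcurl passes in.
   It must return int; a non-zero return value aborts the transfer. */
static int progress(void *clientp, double dltotal, double dlnow,
                    double ultotal, double ulnow)
{
    (void) clientp;  /* unused */
    fprintf(stderr, "PROGRESS: %.0f %.0f %.0f %.0f\n",
            dltotal, dlnow, ultotal, ulnow);
    return 0;
}

int main(int argc, char **argv)
{
    CURL *curl;
    CURLcode res;

    if (argc < 2) {
        fprintf(stderr, "usage: %s URL\n", argv[0]);
        return 1;
    }

    curl = curl_easy_init();
    if (!curl)
        return 1;
    curl_easy_setopt(curl, CURLOPT_URL, argv[1]);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L);
    curl_easy_setopt(curl, CURLOPT_PROGRESSFUNCTION, progress);
    res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "curl_easy_perform: %s\n", curl_easy_strerror(res));
    curl_easy_cleanup(curl);

    return res == CURLE_OK ? 0 : 1;
}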

and its evaluation

$ clang curl.c -lcurl && ./a.out http://google.com > /dev/null
PROGRESS: 0 0 0 0
PROGRESS: 0 0 0 0
PROGRESS: 219 219 0 0
PROGRESS: 219 219 0 0
PROGRESS: 219 219 0 0
PROGRESS: 219 219 0 0
PROGRESS: 219 0 0 0
PROGRESS: 219 2097 0 0
PROGRESS: 219 6441 0 0
PROGRESS: 219 12233 0 0
PROGRESS: 219 20921 0 0
PROGRESS: 219 32505 0 0
PROGRESS: 219 45360 0 0
PROGRESS: 219 45360 0 0
PROGRESS: 219 45360 0 0
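
If a callback needs a total it can trust despite this behaviour, one possible workaround is to sanity-check the two values before using them. The sketch below is plain C with no libcurl dependency; effective_total and format_progress are hypothetical helper names, not libcurl functions. It assumes that a total of 0 means "unknown" (as the curl documentation describes) and, as a heuristic that is not taken from the curl documentation, that a total smaller than the running count is stale.

```c
#include <stdio.h>

/* Hypothetical helper (not part of libcurl): returns the download total
 * to trust, or 0 when it should be treated as unknown. A total of 0 is
 * how a missing value is reported; a total smaller than the bytes already
 * received is assumed here to be stale, e.g. left over from a redirect. */
static double effective_total(double dltotal, double dlnow)
{
    if (dltotal <= 0.0 || dlnow > dltotal)
        return 0.0;
    return dltotal;
}

/* Hypothetical helper: writes one human-readable progress line into buf
 * and returns the value returned by snprintf. */
static int format_progress(char *buf, size_t len,
                           double dltotal, double dlnow)
{
    double total = effective_total(dltotal, dlnow);

    if (total > 0.0)
        return snprintf(buf, len, "%.0f of %.0f bytes (%.0f%%)",
                        dlnow, total, 100.0 * dlnow / total);
    return snprintf(buf, len, "%.0f bytes (total unknown)", dlnow);
}
```

Called from a progress callback with the dltotal and dlnow arguments, format_progress would report the later lines of the run above as an unknown total rather than 219. The heuristic cannot flag the stale total until the running count exceeds it, so the first callbacks after a redirect may still show the old value.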
Martin Morgan answered Mar 13 '26 23:03
