
Create a C-level file handle in RCurl for writing downloaded files

Tags: curl, r, rcurl

RCurl defines a function and a class, both named CFILE, for working with C-level file handles. From the manual:

The intent is to be able to pass these to libcurl as options so that it can read or write from or to the file. We can also do this with R connections and specify callback functions that manipulate these connections. But using the C-level FILE handle is likely to be significantly faster for large files.

There are no examples related to downloads so I tried:

library(RCurl)
u = "http://cran.r-project.org/web/packages/RCurl/RCurl.pdf"
f = CFILE("RCurl.pdf", mode="wb")
ret= getURL(u,  write = getNativeSymbolInfo("R_curl_write_binary_data")$address,
                file  = f@ref)

I also tried replacing the file option with writedata = f@ref. The file is downloaded, but it is corrupted. Writing a custom callback for the write argument works only for non-binary data.

Is there a way to download a binary file straight to disk (without loading it into memory) with RCurl?

antonio asked Mar 17 '13 00:03


2 Answers

I think you want to use writedata, and remember to close the file:

library(RCurl)
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://cran.fhcrc.org/Rlogo.jpg"
curlPerform(url = url, writedata = f@ref)
close(f)

For more elaborate writing, I'm not sure if this is the best way, but Linux tells me, from

man curl_easy_setopt

that there's a curl option CURLOPT_WRITEFUNCTION that is a pointer to a C function with prototype

size_t function(void *ptr, size_t  size, size_t nmemb, void *stream);

and in R at the end of ?curlPerform there's an example of calling a C function as the 'writefunction' option. So I created a file curl_writer.c

#include <stdio.h>

size_t
writer(void *buffer, size_t size, size_t nmemb, void *stream)
{
    fprintf(stderr, "<writer> size = %d, nmemb = %d\n",
            (int) size, (int) nmemb);
    return size * nmemb;
}

Compiled it

R CMD SHLIB curl_writer.c

which on Linux produces a file curl_writer.so, and then in R

dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)

and get on stderr

<writer> size = 1, nmemb = 2653
<writer> size = 1, nmemb = 520
OK 

These two ideas can be integrated, i.e., writing to an arbitrary file using an arbitrary function, by modifying the C function to use the FILE * we pass in, as

#include <stdio.h>

size_t
writer(void *buffer, size_t size, size_t nmemb, void *stream)
{
    FILE *fout = (FILE *) stream;
    fprintf(fout, "<writer> size = %d, nmemb = %d\n",
            (int) size, (int) nmemb);
    fflush(fout);
    return size * nmemb;
}

and then back in R after compiling

dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
f <- CFILE(filename <- tempfile(), "wb")
curlPerform(URL=url, writedata=f@ref, writefunction=writer)
close(f)

getURL can be used here, too, provided both writedata=f@ref and write=writer are supplied. I think the problem in the original question is that R_curl_write_binary_data is really an internal function that writes to a buffer managed by RCurl, rather than to a file handle like the one created by CFILE. Likewise, specifying writedata without write (which, from the source code of getURL, seems to be an alias for writefunction) sends a pointer to a file to a function expecting a pointer to something else. For getURL, both writedata and write need to be provided.
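
The writer shown above only prints one diagnostic line per chunk into the file. A minimal sketch of a callback that stores the downloaded bytes themselves might look like the following. The name file_writer is hypothetical, and the sketch assumes the stream argument is the FILE * created by CFILE and passed through writedata.

```c
#include <stdio.h>

/* Write callback with the prototype libcurl expects.
 * Assumption: `stream` is the FILE * passed through writedata. */
size_t
file_writer(void *buffer, size_t size, size_t nmemb, void *stream)
{
    FILE *fout = (FILE *) stream;

    /* fwrite copies raw bytes, so embedded NUL bytes are preserved;
     * it returns the number of complete items written. */
    size_t written = fwrite(buffer, size, nmemb, fout);

    /* Report the number of bytes handled; a value different from
     * size * nmemb makes libcurl abort the transfer. */
    return written * size;
}
```

It would be compiled with R CMD SHLIB and loaded with dyn.load in the same way as curl_writer.c, then passed as writefunction together with writedata=f@ref.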

Martin Morgan answered Oct 05 '22 23:10


I am working on this problem as well and don't have an answer yet.

However, I did find this:

http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPTWRITEDATA

Are you working on R under Windows? I am.

The documentation for the writedata option indicates that on Windows (when libcurl is used as a DLL) you must use writefunction along with writedata.

Reading http://www.omegahat.org/RCurl/RCurlJSS.pdf, I found that RCurl expects the writefunction to be an R function, so we can implement that ourselves on Windows. It is going to be slower than using a C function to write the data, but I bet that the speed of the network link will be the bottleneck.

getURI(url="sftp://hostname/home/me/onegeebee", curl=con, write=function(x) writeChar(x, f, eos=NULL))
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : embedded nul in string: ' <`á\017_\021

(This is after creating a 1GB file on the server to test transfer speed)

I haven't yet found an answer that doesn't choke on NUL bytes in the data. It seems that somewhere in the bowels of the RCurl package, when it passes data up into R to execute the writefunction you supply, it tries to convert the data into a character string. Presumably it does not do that if you use a C function. Notably, using the recommended R_curl_write_binary_data callback along with CFILE kills rsession.exe on win32 every time for me.
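
The "embedded nul" error is consistent with each chunk being converted to a string: a NUL-terminated string cannot represent a 0 byte inside the data, while a write with an explicit length can. A small C sketch of the difference follows. The function names are hypothetical and are not part of RCurl or libcurl.

```c
#include <stdio.h>
#include <string.h>

/* Length of the chunk when it is treated as a NUL-terminated string:
 * counting stops at the first 0 byte. */
size_t
length_as_string(const char *chunk)
{
    return strlen(chunk);
}

/* Number of bytes stored when the chunk is written with an explicit
 * length, as a C-level write callback would do. */
size_t
length_as_bytes(const void *chunk, size_t len, FILE *out)
{
    return fwrite(chunk, 1, len, out);
}
```

For a 5-byte chunk whose third byte is 0, length_as_string returns 2 while length_as_bytes stores all 5 bytes, which is why binary data needs a length-aware path.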

Keith Twombley answered Oct 05 '22 23:10