 

Why does it take more time for data.table::fread to read a file when filename is specified differently?


I'm reading a file into R with fread in the two ways below:

fread("file:///C:/Users/Desktop/ads.csv")
fread("C:/Users/Desktop/ads.csv")   # Just omitted "file:///"

I've observed the runtime to be very different:

microbenchmark(
  fread("file:///C:/Users/Desktop/ads.csv"),
  fread("C:/Users/Desktop/ads.csv")
)

Unit: microseconds
                                      expr      min        lq      mean    median       uq       max neval cld
 fread("file:///C:/Users/Desktop/ads.csv") 5755.975 6027.4735 6696.7807 6235.3365 6506.652 41257.476   100   b
         fread("C:/Users/Desktop/ads.csv")  525.492  584.0215  673.7166  647.4745  727.703  1476.191   100   a

Why does the runtime vary so much? There isn't a noticeable difference between the two variants when I use read.csv(), though.

Ashrith Reddy asked Mar 19 '18 07:03




1 Answer

Update:

The following has been added to ?fread:

When input begins with http://, https://, ftp://, ftps://, or file://, fread detects this and downloads the target to a temporary file (at tempfile()) before proceeding to read the file as usual. Secure URLS (ftps:// and https://) are downloaded with curl::curl_download; ftp:// and http:// paths are downloaded with download.file and method set to getOption("download.file.method"), defaulting to "auto"; and file:// is downloaded with download.file with method="internal". NB: this implies that for file://, even files found on the current machine will be "downloaded" (i.e., hard-copied) to a temporary file. See ?download.file for more details.
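
The copy step described in this help text can be observed with base R alone. The following is a minimal sketch, not fread's own code: it writes a small throwaway CSV (the data and the temporary paths are made up for illustration), "downloads" it through a file:// URL with download.file, and checks that a second file with the same contents now exists. It relies on the default download method, which is an assumption about how your R session is configured.

```r
# Create a small throwaway CSV (illustrative data only)
src <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[1:3]), src, row.names = FALSE)

# Build a file:// URL for the local path.
# Unix paths already start with "/", Windows paths start with a drive letter.
src_path <- normalizePath(src, winslash = "/")
url <- if (startsWith(src_path, "/")) {
  paste0("file://", src_path)
} else {
  paste0("file:///", src_path)
}

# "Download" the local file: this copies it to a second temporary file
tmp <- tempfile()
download.file(url, tmp, mode = "wb", quiet = TRUE)

# The copy is a separate file with identical contents
same_contents <- identical(readLines(src), readLines(tmp))
print(same_contents)
```

Reading from tmp afterwards is what makes the file:// variant pay for one extra full copy of the file.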


From the source of fread:

if (str6 == "ftp://" || str7 == "http://" || str7 == "file://") {
  method = if (str7 == "file://") "auto"
           else getOption("download.file.method", default = "auto")
  download.file(input, tmpFile, method = method, mode = "wb", quiet = !showProgress)
}

That is, your file is being "downloaded" to a temporary file, which should consist of deep-copying the contents of the file to a temporary location. file:// is not really intended for use on local files, but on files in a network that need to be downloaded locally before being read (IIUC; FWIW, this is what fread's testing regime uses to imitate file download while testing on CRAN, where external file download is impossible).
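
If the paths arrive with a file:// prefix and the files are known to be local, one way to avoid the extra copy is to strip the prefix before calling fread. The helper below is a hypothetical sketch (strip_file_scheme is not part of data.table); it assumes simple URLs of the form file:///C:/... or file:///path and does not handle host names or percent-encoded characters.

```r
# Hypothetical helper: turn a simple file:// URL into a plain local path.
# Inputs without the file:// prefix are returned unchanged.
strip_file_scheme <- function(x) {
  if (!startsWith(x, "file://")) {
    return(x)
  }
  path <- sub("^file://", "", x)        # "file:///C:/a.csv" -> "/C:/a.csv"
  # On Windows-style paths, drop the slash in front of the drive letter
  sub("^/([A-Za-z]:)", "\\1", path)     # "/C:/a.csv" -> "C:/a.csv"
}

win_path  <- strip_file_scheme("file:///C:/Users/Desktop/ads.csv")
unix_path <- strip_file_scheme("file:///home/user/ads.csv")  # made-up path
plain     <- strip_file_scheme("C:/Users/Desktop/ads.csv")
```

The result can then be passed straight to fread, e.g. fread(strip_file_scheme(input)), so the file is read in place rather than copied first.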

I also notice that your timings are on the order of microseconds, which could explain the discrepancy vs. read.csv. Imagine read.csv takes 1 second to read the file, while fread takes .01 seconds; file copying takes .05 seconds. Then in both cases read.csv will look about the same (1 vs 1.05 seconds), while fread looks substantially slower for the file:// case (.01 vs. .06 seconds).
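
The arithmetic in that thought experiment can be written out as follows. The numbers are the imagined ones from the paragraph above, not measurements.

```r
# Imagined timings in seconds (from the thought experiment, not measured)
read_time <- c(read.csv = 1, fread = 0.01)
copy_cost <- 0.05   # fixed cost of copying the file to a temporary location

with_copy <- read_time + copy_cost   # total time for the file:// variant
slowdown  <- with_copy / read_time   # relative slowdown caused by the copy

print(round(slowdown, 2))
```

The same fixed copy cost is about a 5% overhead for read.csv but roughly a sixfold slowdown for fread, which is why the difference only stands out with the faster reader.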

MichaelChirico answered Sep 28 '22 07:09