
R fread data.table inconsistent speed

Tags:

r

data.table

I am observing inconsistent speed from data.table's fread function. I have two files of ~8 GB each. The contents of the files are (almost) the same, yet the times to read them are strangely different.

 control.major  <-  fread("control.major.gff")$V6
 Read 19.8% of 98100000 rows
 Read 98100000 rows and 10 (of 10) columns from 7.947 GB file in 02:06:58
 control.minor  <-  fread("control.minor.gff")$V6  
 Read 98100000 rows and 10 (of 10) columns from 7.947 GB file in 00:03:15

I only need to read the 6th column of each file, which is entirely numeric. Initially I found that fread was faster than

 scan(pipe("cut -f6  SNP.major.gff"),  sep="\n")

because the cut command was taking an awfully long time.

Why is fread behaving inconsistently? Is there a faster way to read one column?

vinash85 asked Jul 11 '14 12:07




1 Answer

Why did it take 2 hours to load?

Please run it again with verbose=TRUE and include the full output in the question. Maybe the operating system put it in the background while something else ran, or something like that. Did your laptop suspend or hibernate in that time? Please also include the output of sessionInfo().
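A minimal sketch of how those diagnostics could be collected. The small temporary file is a hypothetical stand-in for the real 8 GB file (substitute a path such as "control.major.gff"), and wrapping the call in system.time is an addition here, not something the answer asked for.

```r
library(data.table)

# Hypothetical small tab-separated file standing in for the real GFF file.
path <- tempfile(fileext = ".gff")
fwrite(data.table(V1 = 1:3, V2 = c(0.2, 0.4, 0.6)), path,
       sep = "\t", col.names = FALSE)

# verbose = TRUE makes fread report what it detects and how long it spends;
# system.time records the overall elapsed time of the call.
timing <- system.time(
  dt <- fread(path, sep = "\t", header = FALSE, verbose = TRUE)
)
print(timing)

# sessionInfo() records the R version, platform, and loaded package versions.
info <- sessionInfo()
print(info)

unlink(path)
```

Comparing the verbose output of the slow and the fast run should show whether the extra time was spent inside fread or whether the process was simply stalled.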

Is there a faster way to read one column?

Yes. You can pass a vector of column names or positions to the select argument. See ?fread. But I suspect the two issues are unrelated.
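As a rough sketch of the select argument, assuming a headerless tab-separated file like the ones in the question: the tiny temporary file and its contents are made up for illustration, and only fread, fwrite, and select come from data.table itself.

```r
library(data.table)

# Hypothetical 6-column tab-separated file standing in for the real GFF file.
path <- tempfile(fileext = ".gff")
demo <- data.table(V1 = "chr1", V2 = "src", V3 = "SNP",
                   V4 = 1:5, V5 = 1:5,
                   V6 = c(0.1, 0.5, 0.9, 0.3, 0.7))
fwrite(demo, path, sep = "\t", col.names = FALSE)

# select = 6 keeps only the 6th column, so the other columns are not
# materialised in the result. A column name such as "V6" works as well.
v6 <- fread(path, sep = "\t", header = FALSE, select = 6)[[1]]

unlink(path)
```

With the real data this would read `fread("control.major.gff", select = 6)[[1]]` in place of reading all 10 columns and then extracting `$V6`.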

Matt Dowle answered Sep 29 '22 04:09