I am seeing inconsistent read speeds from data.table's fread function. I have two files of ~8 GB each. The contents of the files are (almost) the same, yet the time taken to read them is strangely different.
control.major <- fread("control.major.gff")$V6
Read 19.8% of 98100000 rows
Read 98100000 rows and 10 (of 10) columns from 7.947 GB file in 02:06:58
control.minor <- fread("control.minor.gff")$V6
Read 98100000 rows and 10 (of 10) columns from 7.947 GB file in 00:03:15
I only need the 6th column of each file, which is entirely numeric. Initially I found that fread was faster than
scan(pipe("cut -f6 SNP.major.gff"), sep="\n")
because the cut command was taking an awful lot of time.
Why does fread behave so inconsistently? Is there a faster way to read one column?
read.csv(filename) without any other arguments is slow mainly because it first reads everything into memory as if it were character and then attempts to coerce that to integer or numeric as a second step.
Not only was fread() almost 2.5 times faster than readr's functionality in reading and binding the data, but perhaps even more importantly, the maximum used memory was only 15.25 GB, compared to readr's 27 GB.
You can use the fread() function from the data.table package in R to import files quickly and conveniently. For large files, this function has been shown to be significantly faster than functions like read.csv from base R.
There are a number of reasons why data.table is fast, but a key one is that unlike many other tools, it allows you to modify things in your table by reference, so it is changed in situ rather than requiring the object to be recreated with your modifications.
Why did it take 2 hours to load?
Please run it again with verbose=TRUE and include the full output in the question. Maybe the operating system put it in the background while something else ran, or something like that. Did your laptop suspend or hibernate in that time? Please also include the output of sessionInfo().
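As a rough sketch of how that diagnostic run might look: the helper name timed_fread and the tiny stand-in file are made up for illustration and are not part of data.table. Only fread(), its verbose argument, system.time() and sessionInfo() are real.

```r
library(data.table)

# Hypothetical helper (not part of data.table): read a file with fread,
# print fread's verbose diagnostics, and return the elapsed wall-clock time.
timed_fread <- function(path, ...) {
  timing <- system.time(dt <- fread(path, verbose = TRUE, ...))
  list(data = dt, elapsed = unname(timing["elapsed"]))
}

# Tiny stand-in file so the sketch runs anywhere; for the real test,
# pass "control.major.gff" instead of tmp.
tmp <- tempfile(fileext = ".gff")
fwrite(data.table(a = 1:3, b = c(0.5, 1.5, 2.5)), tmp,
       sep = "\t", col.names = FALSE)

res <- timed_fread(tmp, header = FALSE)
print(res$elapsed)     # seconds spent inside fread
print(sessionInfo())   # R version, platform and package versions
```

Running this once per file and comparing the verbose output should show which stage of the read the time is going to.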
Is there a faster way to read one column?
Yes. You can pass a vector of column names or positions to the select argument. See ?fread. But I suspect the two issues are unrelated.
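A minimal sketch of reading just the 6th column with select. The small 10-column file written here is invented so the example runs on its own; with the real data you would pass "control.major.gff" as the path.

```r
library(data.table)

# Invented 10-column, tab-separated stand-in for the 8 GB GFF file.
tmp <- tempfile(fileext = ".gff")
demo <- data.table(V1 = "chr1", V2 = "src", V3 = "SNP", V4 = 1:5, V5 = 1:5,
                   V6 = c(0.1, 0.2, 0.3, 0.4, 0.5),
                   V7 = "+", V8 = ".", V9 = "id", V10 = "x")
fwrite(demo, tmp, sep = "\t", col.names = FALSE)

# select = 6 asks fread to keep only the 6th column, so the other nine
# are never materialised in the result. [[1]] extracts it as a plain vector.
col6 <- fread(tmp, sep = "\t", header = FALSE, select = 6)[[1]]
print(col6)
```

Compared with fread(file)$V6, this avoids holding all ten columns in memory just to keep one.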