 

Most efficient way of exporting large (3.9 mill obs) data.frames to text file? [duplicate]

I have a fairly large data frame in R that I would like to export to SPSS. This file caused me hours of headaches when importing it into R in the first place, but I eventually succeeded using read.fwf() with the options comment.char="%" (a character not appearing in the file) and fill=TRUE (it was a fixed-width ASCII file in which some rows lacked all variables, causing error messages).

Anyway, my data frame currently consists of 3.9 million observations and 48 variables (all character). I can write it to file fairly quickly by splitting it into four sets of 1 million observations each with df2 <- df[1:1000000,] followed by write.table(df2), etc., but I can't write the entire file in one sweep without the computer locking up and needing a hard reset.
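The chunked workaround described here can be sketched as follows; the tiny data frame, file names, and chunk size are illustrative stand-ins for the real 3.9-million-row data:

```r
# Illustrative stand-in for the real 3.9-million-row, all-character data frame
df <- data.frame(a = as.character(1:10), b = letters[1:10],
                 stringsAsFactors = FALSE)

# Write the data in fixed-size chunks, one file per chunk, instead of one
# write.table() call over the whole data frame
chunk_size <- 4  # e.g. 1000000 for the real data
starts <- seq(1, nrow(df), by = chunk_size)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1, nrow(df))
  write.table(df[rows, , drop = FALSE],
              file = sprintf("chunk_%d.txt", i),
              quote = FALSE, row.names = FALSE)
}
```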

After hearing anecdotal stories for years about R being unsuited for large datasets, this is the first time I have actually encountered a problem of this kind. Are there other approaches (low-level "dumping" of the file directly to disk?), or is there some package unknown to me that can export large files of this type efficiently?

asked Mar 14 '12 by jans



2 Answers

1) If your data frame is all character strings, then writing it with write.table() is much faster if you first convert it to a matrix.

2) Also, write it out in chunks of, say, 1,000,000 rows, but always to the same file, using the argument append = TRUE.
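A minimal sketch combining both suggestions, using a small illustrative data frame in place of the real one (the output file name and chunk size are placeholders):

```r
# Small stand-in for the real all-character data frame
df <- data.frame(a = as.character(1:10), b = letters[1:10],
                 stringsAsFactors = FALSE)

# 1) convert to a character matrix first -- write.table() handles it faster
m <- as.matrix(df)

# 2) write in chunks, always appending to the same file
out <- "out.txt"
chunk_size <- 4  # e.g. 1000000 for the real data
starts <- seq(1, nrow(m), by = chunk_size)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1, nrow(m))
  write.table(m[rows, , drop = FALSE], file = out,
              append = (i > 1),      # append all chunks after the first
              col.names = (i == 1),  # write the header only once
              row.names = FALSE, quote = FALSE)
}
```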

answered Oct 18 '22 by tim riffe


Update

After extensive work by Matt Dowle parallelizing and adding other efficiency improvements, fwrite is now as much as 15x faster than write.csv. See the linked answer for more.


Now data.table has an fwrite function contributed by Otto Seiskari which seems to be about twice as fast as write.csv in general. See here for some benchmarks.

library(data.table) 
fwrite(DF, "output.csv")

Note that row names are excluded, since the data.table type makes no use of them.

answered Oct 18 '22 by MichaelChirico