I have a program that outputs lines of CSV data that I want to load into a data frame. I currently load the data like so:
tmpFilename <- "tmp_file"
system(paste(procName, ">", tmpFilename), wait=TRUE)
myData <- read.csv(tmpFilename) # (I also pass in colClasses and nrows for efficiency)
However, I thought redirecting the output to a file just to read from it was inefficient (the program spits out about 30MB, so I want to handle it with optimal performance). I thought textConnection would solve this, so I tried:
con <- textConnection(system(procName, intern=TRUE))
myData <- read.csv(con)
This runs a lot slower, though, and whereas the first solution's performance degrades linearly with input size, the textConnection solution's performance seems to degrade exponentially. The slowest part is creating the textConnection itself; read.csv here actually completes faster than in the first solution, since it reads from memory.
My question, then: is creating a file just to run read.csv on it my best option with respect to speed? Is there a way to speed up the creation of a textConnection? Bonus: why is creating a textConnection so slow?
The "fastest way" will probably involve using something other than read.csv. However, sticking with read.csv, using pipe may be the way to go:
myData <- read.csv(pipe(procName))
It avoids reading the full text output into an intermediate buffer (at least before read.csv gets hold of it).
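If you do look beyond read.csv, one common alternative for this pattern is data.table's fread, which can consume a command's standard output directly. A minimal sketch, assuming procName holds the same shell command as in the question (fread is not part of the original answer):
library(data.table)
myData <- fread(cmd = procName)  # runs procName and parses its stdout directly
# fread returns a data.table; wrap in as.data.frame() if you need a plain data frame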
Some timing comparisons:
> write.csv(data.frame(x=rnorm(1e5)), row.names=FALSE, file="norm.csv")
> system.time(d <- read.csv("norm.csv"))
   user  system elapsed
  0.398   0.004   0.402
> system.time(d <- read.csv(textConnection(system("cat norm.csv", intern=TRUE))))
   user  system elapsed
 56.159   0.106  56.095
> system.time(d <- read.csv(pipe("cat norm.csv")))
   user  system elapsed
  0.475   0.012   0.531
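For completeness, the pipe() connection composes with the same tuning arguments the question already passes; the colClasses values below are hypothetical placeholders, not taken from the actual data:
# Hypothetical column classes, shown only to illustrate that pipe() accepts
# the same efficiency arguments (colClasses, nrows) that a file path would.
myData <- read.csv(pipe(procName), colClasses=c("numeric", "character"), nrows=100000)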