I am working with RMongoDB and I need to fill an empty data.frame with the values of a query. The results are quite long, about 2 million documents (rows).
While running performance tests, I found that the time to write values into a row increases with the size of the data.frame. Maybe this is a well-known issue and I am the last one to notice it.
Some example code:
set.seed(20140430)
nreg <- 2e3
dfres <- as.data.frame(matrix(rep(NA,nreg*7),nrow=nreg,ncol=7))
system.time(dfres[1e3,] <- c(1:5,"a","b"))
summary(replicate(10,system.time(dfres[sample(1:nreg,1),] <- c(1:5,"a","b"))[3]))
nreg <- 2e6
dfres <- as.data.frame(matrix(rep(NA,nreg*7),nrow=nreg,ncol=7))
system.time(dfres[1e3,] <- c(1:5,"a","b"))
summary(replicate(10,system.time(dfres[sample(1:nreg,1),] <- c(1:5,"a","b"))[3]))
On my machine, the assignment into the 2 million-row data.frame takes about 0.4 seconds. That is a lot of time if I want to fill the whole dataset. Here is a second simulation to illustrate the issue.
nreg <- seq(2e1,2e7,length.out=10)
te <- NULL
for(i in nreg){
dfres <- as.data.frame(matrix(rep(NA,i*7),nrow=i,ncol=7))
te <- c(te,mean(replicate(10,{r <- sample(1:i,1); system.time(dfres[r,] <- c(1:5,"a","b"))[3]}) ) )
}
plot(nreg,te,xlab="Number of rows",ylab="Avg. time for 10 random assignments [sec]",type="o")
#rm(nreg,dfres,te)
Question: Why does this happen? Is there a quicker way to fill the data.frame in memory?
Let's start with "columns" first and see what goes on and then return to rows.
R versions < 3.1.0 (unnecessarily) copy the entire data.frame
when you operate on them. For example:
## R v3.0.3
df <- data.frame(x=1:5, y=6:10)
dplyr:::changes(df, transform(df, z=11:15)) ## requires dplyr to be available
# Changed variables:
# old new
# x 0x7ff9343fb4d0 0x7ff9326dfba8
# y 0x7ff9343fb488 0x7ff9326dfbf0
# z <added> 0x7ff9326dfc38
# Changed attributes:
# old new
# names 0x7ff934170c28 0x7ff934308808
# row.names 0x7ff934551b18 0x7ff934308970
# class 0x7ff9346c5278 0x7ff935d1d1f8
You can see that the addition of the "new" column has resulted in a copy of the "old" columns (the addresses are different). The attributes are copied as well. What bites most is that these copies are deep copies, as opposed to shallow copies.
Shallow copies only copy the vector of column pointers, not the entire data, whereas deep copies copy everything (which is unnecessary here).
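If you want to observe these copies yourself without dplyr, base R's tracemem() can be used as well. A minimal sketch (tracemem() needs an R build with memory profiling enabled, which the standard CRAN binaries have):

df <- data.frame(x=1:5, y=6:10)
tracemem(df)     # report every duplication of 'df' from now on
df$z <- 11:15    # triggers a copy of df (a deep copy on R < 3.1.0)
untracemem(df)   # stop tracing

Each "tracemem[...]" line printed marks one duplication of the object.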
However, R v3.1.0 brought a nice, welcome change: the "old" columns are no longer deep-copied. All credit to the R core dev team.
## R v3.1.0
df <- data.frame(x=1:5, y=6:10)
dplyr:::changes(df, transform(df, z=11:15)) ## requires dplyr to be available
# Changed variables:
# old new
# z <added> 0x7f85d328dda8
# Changed attributes:
# old new
# names 0x7f85d1459548 0x7f85d297bec8
# row.names 0x7f85d2c66cd8 0x7f85d2bfa928
# class 0x7f85d345cab8 0x7f85d2d6afb8
You can see that the columns x and y aren't changed at all (and are therefore not present in the output of the changes function call). This is a huge (and welcome) improvement!
So far, we have looked at the issue of adding columns in R < 3.1.0 and v3.1.0.
Now, coming to your question: what about the "rows"? Let's consider the older version of R first and then come back to R v3.1.0.
## R v3.0.3
df <- data.frame(x=1:5, y=6:10)
df.old <- df
df$y[1L] <- -6L
dplyr:::changes(df.old, df)
# Changed variables:
# old new
# x 0x7f968b423e50 0x7f968ac6ba40
# y 0x7f968b423e98 0x7f968ac6bad0
#
# Changed attributes:
# old new
# names 0x7f968ab88a28 0x7f968abca8e0
# row.names 0x7f968abb6438 0x7f968ab22bb0
# class 0x7f968ad73e08 0x7f968b580828
Once again we see that changing column y has resulted in copying column x as well in older versions of R.
## R v3.1.0
df <- data.frame(x=1:5, y=6:10)
df.old <- df
df$y[1L] <- -6L
dplyr:::changes(df.old, df)
# Changed variables:
# old new
# y 0x7f85d3544090 0x7f85d2c9bbb8
#
# Changed attributes:
# old new
# row.names 0x7f85d35a69a8 0x7f85d35a6690
We see the nice improvement in R v3.1.0, which has resulted in a copy of just column y. Once again, a great improvement in R v3.1.0! R's copy-on-modify has gotten wiser.
But still, using data.table's assignment-by-reference semantics, we can do one step better: not copy even the y column, as happens in R v3.1.0. The idea being: as long as the type of the object you assign to a column at certain indices doesn't change (here, column y is integer, so as long as you assign an integer back to y), we really can do it without having to copy, by modifying in place (by reference). Why? Because we don't have to allocate or re-allocate anything here. As an example, if you had assigned a double/numeric type, which requires 8 bytes of storage as opposed to 4 bytes for the integer column y, then we'd have to create a new column y and copy the values back.
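To see concretely why matching the type matters, here is a small base R sketch (my illustration, not part of the original answer):

y <- 1:5        # integer vector: 4 bytes per element
typeof(y)       # "integer"
y[1L] <- -6L    # integer into integer: same type, no reallocation needed
y[2L] <- 3.5    # double into integer: R must reallocate 'y' as a double vector
typeof(y)       # "double" -- the whole vector moved to new, wider storage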
That is, we can sub-assign by reference using data.table. We can use either := or set() to do this. I'll demonstrate using set() here.
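The demonstration itself isn't shown above, so here is a minimal sketch of sub-assignment by reference with set() on data shaped like yours; the column names V1..V7 and the typed preallocation are my choices, not from the original answer:

library(data.table)
nreg <- 2e6
# preallocate columns with their final types (5 integer, 2 character),
# matching the c(1:5,"a","b") rows being assigned
cols <- c(replicate(5, rep(NA_integer_, nreg), simplify=FALSE),
          replicate(2, rep(NA_character_, nreg), simplify=FALSE))
names(cols) <- paste0("V", 1:7)
dtres <- setDT(cols)                 # list -> data.table, by reference
r <- sample(nreg, 1L)
# one list element per column; types match, so no column is copied
set(dtres, i=r, j=1:7, value=c(as.list(1:5), list("a","b")))

Because set() skips the overhead of [.data.table, each call costs roughly the same no matter how many rows the table has.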
Now, here's a comparison of base R and data.table on your data, with 2,000 to 20,000,000 rows in multiples of 10, against R v3.0.3 and v3.1.0 separately. You can find the code here.
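Since that link isn't reproduced in this text, a sketch of one such timing comparison (reusing dfres from the question and dtres from the snippet above; not the benchmark's exact code) could look like:

# base R: every row assignment copies
mean(replicate(10, {r <- sample(nreg, 1L)
  system.time(dfres[r,] <- c(1:5,"a","b"))[3]}))
# data.table: sub-assignment by reference
mean(replicate(10, {r <- sample(nreg, 1L)
  system.time(set(dtres, i=r, j=1:7, value=c(as.list(1:5), list("a","b"))))[3]}))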
Plot for comparison against R v3.0.3:
Plot for comparison against R v3.1.0:
The min, median and max timings (in seconds) for R v3.0.3, R v3.1.0 and data.table on 20 million rows with 10 replications are:
type          min  median    max
base_3.0.3  10.05   10.70  18.51
base_3.1.0   1.67    1.97   5.20
data.table   0.04    0.04   0.05
Note: You can see the complete timings in this gist.
This clearly shows the improvement in R v3.1.0, but it also shows that the column being changed is still copied, and that copy still consumes some time, which is overcome through sub-assignment by reference in data.table.
HTH