I have a large data set (>100,000 rows) and would like to create a new column that sums all previous values of another column.
For a simulated data set test.data with 100,000 rows and 2 columns, I create the new vector that sums the contents of column 2 with:
sapply(1:100000, function(x) sum(test.data[1:x[1],2]))
I append this vector to test.data later with cbind(). This is too slow, however. Is there a faster way to accomplish this, or a way to reference the vector that sapply is building from inside sapply, so I can just update the cumulative sum instead of performing the whole calculation again?
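Per my comment above, it'll be faster if you do a direct assignment and use cumsum instead of sapply (cumsum was built specifically for what you want to do). The sapply call re-sums rows 1 through x for every x, so the work grows quadratically with the number of rows, while cumsum makes a single pass over the column.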
This should work:
test.data$sum <- cumsum(test.data[, 2])
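If you want to convince yourself the two approaches agree, here is a minimal sketch on a smaller simulated data set. The 1,000-row size, the column names id and value, and the seed are assumptions made for illustration, not the asker's actual data. It also shows Reduce with accumulate = TRUE, which is roughly the "update the running sum" idea from the question, though cumsum is still the simpler choice for plain addition.

```r
# Hypothetical stand-in for test.data: 1,000 rows so the slow version finishes quickly
set.seed(1)
test.data <- data.frame(id = 1:1000,
                        value = sample(1:10, 1000, replace = TRUE))

# Original approach: re-sums rows 1..x for every x (quadratic work)
slow <- sapply(seq_len(nrow(test.data)), function(x) sum(test.data[1:x, 2]))

# cumsum: one pass over column 2, assigned directly as a new column
test.data$sum <- cumsum(test.data[, 2])

# Reduce carries the running total forward instead of recomputing it
running <- Reduce(`+`, test.data[, 2], accumulate = TRUE)

# All three should give the same running totals
stopifnot(all(slow == test.data$sum), all(running == test.data$sum))
```

On the full 100,000 rows you would keep only the cumsum line; the sapply version is included here just for the comparison.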