I recently asked a question about counting the number of times an element had repeated itself (http://stackoverflow.com/questions/7669553/how-to-assign-number-of-repeats-to-dataframe-based-on-elements-of-an-identifying/7669607#7669607) in a large data frame. I received some very helpful advice, which worked on a small number of rows, but I now need to perform the operation on a much larger scale (over 255k rows, with around 100k "groups" being formed using ddply):
system.time( data <- ddply(data, "uid", function(x) {x$time <- 1:nrow(x); x}) )
Here uid is the grouping variable, for which I need to count the number of repeats, to get output like:
uid time
ny1 1
ny1 2
ny2 1
ny2 2
ny2 3
Trying to perform this operation on the larger data set results in R choking due to memory issues. Are there any obvious solutions to this? Thanks in advance (especially for patience as I'm a new "programmer").
For truly large problems like this, you might try using data.table rather than plyr:
library(data.table)
data <- data.table(data)
data[, transform(.SD, time = seq_len(NROW(.SD))), by = uid]
assuming the time column doesn't already exist.
I'm still in the process of learning data.table, so as I tinker with this it appears this may be simpler (and maybe faster):
data[, list(time = seq_len(.N)), by = uid]
.N appears to be an internal variable that holds the number of rows in each subgroup, so seq_len(.N) numbers the rows of each group from 1 to .N.
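Since memory is the concern here, it may also be worth adding the column by reference with data.table's := operator, which avoids copying the whole table for each group. The following is only a minimal sketch: the five-row table is made-up toy data standing in for the real data set, and I haven't timed it on 255k rows.

```r
library(data.table)

# Toy data standing in for the real data set (values are made up for illustration).
data <- data.table(uid = c("ny1", "ny1", "ny2", "ny2", "ny2"))

# Add the running count in place; seq_len(.N) gives 1, 2, ..., .N within each uid group.
data[, time := seq_len(.N), by = uid]

print(data)
```

Because := modifies data in place, there is no need to assign the result back to data.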
I posted a new answer to your original question, "How to assign number of repeats to dataframe based on elements of an identifying vector in R?".
That will hopefully help you there and here.