
Problems with ddply for splitting a large number of categories in R

I recently asked a question about counting the number of times an element had repeated itself (http://stackoverflow.com/questions/7669553/how-to-assign-number-of-repeats-to-dataframe-based-on-elements-of-an-identifying/7669607#7669607) in a large data frame. I received some very helpful advice, which worked on a small number of rows, but I now need to perform the operation on a much larger scale (over 255k rows, with around 100k "groups" being formed by ddply):

system.time( data <- ddply(data, "uid", function(x) {x$time <- 1:nrow(x); x}) )

Here uid is the grouping variable, for which I need to count the number of repeats, giving output like:

uid    time
ny1    1
ny1    2
ny2    1
ny2    2
ny2    3

Trying to perform this operation on the larger data set results in R choking due to memory issues. Are there any obvious solutions to this? Thanks in advance (especially for your patience, as I'm a new "programmer").

SMM asked Dec 18 '25 10:12


2 Answers

For truly large problems like this, you might try using data.table rather than plyr:

library(data.table)
data <- data.table(data)

data[,transform(.SD,time = NROW(.SD)), by = uid]

assuming the time column doesn't already exist.

I'm still in the process of learning data.table, so as I tinker with this it appears this may be simpler (and maybe faster):

data[,rep(.N, .N),by = uid]

.N appears to be an internal variable that represents the number of rows in each subgroup.
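
Note that NROW(.SD) and rep(.N, .N) put the group size on every row, whereas the ddply call in the question builds a running index (1, 2, 3, ...) within each uid. If the running index is what's needed, something along these lines might work. This is only a sketch on made-up toy data: the n_repeats column name is hypothetical, and it assumes a data.table version that supports := together with by.

```r
library(data.table)

# Toy data mirroring the example in the question (values are made up)
data <- data.table(uid = c("ny1", "ny1", "ny2", "ny2", "ny2"))

# Running index within each uid; := adds the column by reference,
# so the table is not copied once per group
data[, time := seq_len(.N), by = uid]

# Group size repeated on every row (the same values rep(.N, .N) gives);
# n_repeats is a hypothetical column name
data[, n_repeats := .N, by = uid]

print(data)
```

Because := modifies the table in place rather than building a new data frame for each of the ~100k groups, it is likely to be much lighter on memory than the ddply approach, though I haven't timed it on data of that size.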

joran answered Dec 20 '25 01:12


I posted a new answer to your original question, "How to assign number of repeats to dataframe based on elements of an identifying vector in R?".

That will hopefully help you there and here.

nzcoops answered Dec 20 '25 01:12


