How can I rank observations in-group faster?

Tags:

I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches and they've been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.

rank observations in group

I have long data (many rows per person, one row per person-observation) and I basically want a variable, that tells me how often the person has been observed already.

I have the first two columns and want the third one:

person  wave   obs
pers1   1999   1
pers1   2000   2
pers1   2003   3
pers2   1998   1
pers2   2001   2

Now I'm using two loop-approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries didn't really help me yet (hard to phrase the problem).

Thanks for any pointers!

# ordered dataset by persnr and year of observation
person.obs <- person.obs[order(person.obs$PERSNR,person.obs$wave) , ]

person.obs$n.obs = 0

# first approach: loop through people and assign range
unp = unique(person.obs$PERSNR)
unplength = length(unp)
for(i in 1:unplength) {
   print(unp[i])
   person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs = 
1:length(person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs)
    i=i+1
   gc()
}

# second approach: loop through rows and reset counter at new person
pnr = 0
for(i in 1:length(person.obs[,2])) {
  if(pnr!=person.obs[i,]$PERSNR) { pnr = person.obs[i,]$PERSNR
  e = 0
  }
  e=e+1
  person.obs[i,]$n.obs = e
  i=i+1
  gc()
}

723

asked May 28 '11 15:05

Ruben

2 Answers

The answer from Marek in this question has proven very useful in the past. I wrote it down and use it almost daily since it was fast and efficient. We'll use ave() and seq_along().

foo <-data.frame(person=c(rep("pers1",3),rep("pers2",2)),year=c(1999,2000,2003,1998,2011))

foo <- transform(foo, obs = ave(rep(NA, nrow(foo)), person, FUN = seq_along))
foo

  person year obs
1  pers1 1999   1
2  pers1 2000   2
3  pers1 2003   3
4  pers2 1998   1
5  pers2 2011   2

Another option using plyr

library(plyr)
ddply(foo, "person", transform, obs2 = seq_along(person))

  person year obs obs2
1  pers1 1999   1    1
2  pers1 2000   2    2
3  pers1 2003   3    3
4  pers2 1998   1    1
5  pers2 2011   2    2

131

answered Oct 24 '22 04:10

Chase

A few alternatives with the data.table and dplyr packages.

data.table:

library(data.table)
# setDT(foo) is needed to convert to a data.table

# option 1:
setDT(foo)[, rn := rowid(person)]   

# option 2:
setDT(foo)[, rn := 1:.N, by = person]

both give:

> foo
   person year rn
1:  pers1 1999  1
2:  pers1 2000  2
3:  pers1 2003  3
4:  pers2 1998  1
5:  pers2 2011  2

If you want a true rank, you should use the frank function:

setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]

dplyr:

library(dplyr)
# method 1
foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
# method 2
foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())

both giving a similar result:

> foo
Source: local data frame [5 x 3]
Groups: person [2]

  person  year    rn
  (fctr) (dbl) (int)
1  pers1  1999     1
2  pers1  2000     2
3  pers1  2003     3
4  pers2  1998     1
5  pers2  2011     2

answered Oct 24 '22 03:10

Jaap

Related questions
                            
                                Efficient computation of the high order bits of a 32 bit integer multiplication
                            
                                Why doesn't gcc remove this check of a non-volatile variable?
                            
                                Print Statement in SQL procedure should affect Performance?
                            
                                Is there a PL/SQL pragma similar to DETERMINISTIC, but for the scope of one single SQL SELECT?
                            
                                Multiple source file executable slower than single source file executable
                            
                                In C++, which is better i>-1 or i>=0 [duplicate]
                            
                                Scipy.optimize minimize is taking too long
                            
                                Microsoft SQL Server 2005/2008: XML vs text/varchar data type
                            
                                Fast Method for computing 3x3 symmetric matrix spectral decomposition
                            
                                Tail call optimization for fibonacci function in java
                            
                                Is it possible to set the python -O (optimize) flag within a script?
                            
                                Why does this code behave differently if optimizing (-O2, -O3) is used?
                            
                                GHC optimization
                            
                                java linkedlist slower than arraylist when adding elements?
                            
                                Why isn't g++ tail call optimizing while gcc is?
                            
                                MethodImpl(NoOptimization) on this method, what does it do? And is it really nessecary?
                            
                                Cython Numpy code not faster than pure python
                            
                                What's the most efficient way to calculate the warp id / lane id in a 1-D grid?
                            
                                Algorithms for Optimization with Fast Disk Storage (SSDs)?
                            
                                Resetting AUTO_INCREMENT is taking a long time in MySQL

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With