I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches and they've been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.
I have long data (many rows per person, one row per person-observation) and I basically want a variable, that tells me how often the person has been observed already.
I have the first two columns and want the third one:
person wave obs
pers1 1999 1
pers1 2000 2
pers1 2003 3
pers2 1998 1
pers2 2001 2
Now I'm using two loop-approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries didn't really help me yet (hard to phrase the problem).
Thanks for any pointers!
# ordered dataset by persnr and year of observation
person.obs <- person.obs[order(person.obs$PERSNR,person.obs$wave) , ]
person.obs$n.obs = 0
# first approach: loop through people and assign range
unp = unique(person.obs$PERSNR)
unplength = length(unp)
for(i in 1:unplength) {
print(unp[i])
person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs =
1:length(person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs)
i=i+1
gc()
}
# second approach: loop through rows and reset counter at new person
pnr = 0
for(i in 1:length(person.obs[,2])) {
if(pnr!=person.obs[i,]$PERSNR) { pnr = person.obs[i,]$PERSNR
e = 0
}
e=e+1
person.obs[i,]$n.obs = e
i=i+1
gc()
}
By default, ranks are assigned by ordering the data values in ascending order (smallest to largest), then labeling the smallest value as rank 1. Alternatively, Largest value orders the data in descending order (largest to smallest), and assigns the largest value the rank of 1.
The ranking of a variable in an R data frame can be done by using rank function. For example, if we have a data frame df that contains column x then rank of values in x can be found as rank(df$x).
To partition rows and rank them by their position within the partition, use the RANK() function with the PARTITION BY clause. SQL's RANK() function allows us to add a record's position within the result set or within each partition. In our example, we rank rows within a partition.
The answer from Marek in this question has proven very useful in the past. I wrote it down and use it almost daily since it was fast and efficient. We'll use ave()
and seq_along()
.
foo <-data.frame(person=c(rep("pers1",3),rep("pers2",2)),year=c(1999,2000,2003,1998,2011))
foo <- transform(foo, obs = ave(rep(NA, nrow(foo)), person, FUN = seq_along))
foo
person year obs
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
Another option using plyr
library(plyr)
ddply(foo, "person", transform, obs2 = seq_along(person))
person year obs obs2
1 pers1 1999 1 1
2 pers1 2000 2 2
3 pers1 2003 3 3
4 pers2 1998 1 1
5 pers2 2011 2 2
A few alternatives with the data.table and dplyr packages.
data.table:
library(data.table)
# setDT(foo) is needed to convert to a data.table
# option 1:
setDT(foo)[, rn := rowid(person)]
# option 2:
setDT(foo)[, rn := 1:.N, by = person]
both give:
> foo person year rn 1: pers1 1999 1 2: pers1 2000 2 3: pers1 2003 3 4: pers2 1998 1 5: pers2 2011 2
If you want a true rank, you should use the frank
function:
setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]
dplyr:
library(dplyr)
# method 1
foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
# method 2
foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())
both giving a similar result:
> foo Source: local data frame [5 x 3] Groups: person [2] person year rn (fctr) (dbl) (int) 1 pers1 1999 1 2 pers1 2000 2 3 pers1 2003 3 4 pers2 1998 1 5 pers2 2011 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With