I have the following dataset:
df<-structure(list(IDFAM = c("2010 7599 2996 1", "2010 7599 3071 1",
"2010 7599 3071 1", "2010 7599 3660 1", "2010 7599 4736 1", "2010 7599 6235 1",
"2010 7599 6299 1", "2010 7599 9903 1", "2010 7599 11013 1",
"2010 7599 11778 1", "2010 7599 11778 1", "2010 7599 12248 1",
"2010 7599 13127 1", "2010 7599 14261 1", "2010 7599 16280 1",
"2010 7599 16280 1", "2010 7599 16280 1", "2010 7599 16280 1",
"2010 7599 16280 1", "2010 7599 17382 1"), AGED = c(45L, 47L,
24L, 46L, 46L, 44L, 43L, 43L, 43L, 16L, 43L, 46L, 44L, 47L, 43L,
16L, 20L, 18L, 18L, 43L)), .Names = c("IDFAM", "AGED"), row.names = c("5614",
"5748", "5753", "6864", "8894", "11761", "11884", "18738", "20896",
"22351", "22353", "23267", "24939", "27072", "30946", "30947",
"30949", "30950", "30952", "33034"), class = "data.frame")
I would like to assign an ID to each observation having the same IDFAM value, ranging from 1 to n, where n is the number of observations with the same value of IDFAM. This would result in the following table:
IDFAM AGED ID
2010 7599 2996 1 45 1
2010 7599 3071 1 47 1
2010 7599 3071 1 24 2
2010 7599 3660 1 46 1
2010 7599 4736 1 46 1
2010 7599 6235 1 44 1
2010 7599 6299 1 43 1
2010 7599 9903 1 43 1
2010 7599 11013 1 43 1
2010 7599 11778 1 16 1
2010 7599 11778 1 43 2
2010 7599 12248 1 46 1
2010 7599 13127 1 44 1
2010 7599 14261 1 47 1
2010 7599 16280 1 43 1
2010 7599 16280 1 16 2
2010 7599 16280 1 20 3
2010 7599 16280 1 18 4
2010 7599 16280 1 18 5
2010 7599 17382 1 43 1
How can I do this? Thanks.
There are several ways.
In base R, use ave:
with(df, ave(rep(1, nrow(df)), IDFAM, FUN = seq_along))
# [1] 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 2 3 4 5 1
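To store the counter as a column rather than just printing it, the result of ave can be assigned back to the data frame. A minimal sketch on a small made-up data frame (the name toy and its values are illustrative, not the question's data):

```r
# Toy data: hypothetical stand-in for the question's df
toy <- data.frame(
  IDFAM = c("A", "B", "B", "C", "C", "C"),
  AGED  = c(45L, 47L, 24L, 43L, 16L, 20L),
  stringsAsFactors = FALSE
)

# seq_along() is applied to each IDFAM group; ave() puts the results
# back in the original row order, so the data need not be sorted
toy$ID <- ave(seq_along(toy$IDFAM), toy$IDFAM, FUN = seq_along)

toy$ID
# [1] 1 1 2 1 2 3
```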
With the "data.table" package, use sequence(.N):
library(data.table)
DT <- as.data.table(df)
DT[, ID := sequence(.N), by = IDFAM]
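Newer data.table releases also export rowid(), which should give the same within-group counter without a by clause. A sketch on made-up data (toy is a hypothetical name), assuming an installed data.table version that includes rowid:

```r
library(data.table)

# Toy data: hypothetical stand-in for the question's df
toy <- data.table(
  IDFAM = c("A", "B", "B", "C", "C", "C"),
  AGED  = c(45L, 47L, 24L, 43L, 16L, 20L)
)

# rowid() numbers the rows 1..n within each distinct IDFAM value
toy[, ID := rowid(IDFAM)]

toy$ID
# [1] 1 1 2 1 2 3
```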
With the "dplyr" package, try:
library(dplyr)
df %>% group_by(IDFAM) %>% mutate(count = sequence(n()))
or (as recommended by Hadley in the comments):
df %>% group_by(IDFAM) %>% mutate(count = row_number(IDFAM))
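row_number() can also be called with no argument, in which case it numbers the rows in their current order within each group. A sketch on made-up data (toy and the column name ID are illustrative), assuming a reasonably recent dplyr:

```r
library(dplyr)

# Toy data: hypothetical stand-in for the question's df
toy <- data.frame(
  IDFAM = c("A", "B", "B", "C", "C", "C"),
  AGED  = c(45L, 47L, 24L, 43L, 16L, 20L),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(IDFAM) %>%
  mutate(ID = row_number()) %>%  # restarts at 1 for each IDFAM group
  ungroup()

res$ID
# [1] 1 1 2 1 2 3
```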
Since this seems to be something that is asked for relatively frequently, this feature has been added as a function (getanID) in my "splitstackshape" package. It is based on the "data.table" approach above.
library(splitstackshape)
getanID(df, id.vars = "IDFAM")
# IDFAM AGED .id
# 1: 2010 7599 2996 1 45 1
# 2: 2010 7599 3071 1 47 1
# 3: 2010 7599 3071 1 24 2
# 4: 2010 7599 3660 1 46 1
# 5: 2010 7599 4736 1 46 1
# 6: 2010 7599 6235 1 44 1
# 7: 2010 7599 6299 1 43 1
# 8: 2010 7599 9903 1 43 1
# 9: 2010 7599 11013 1 43 1
# 10: 2010 7599 11778 1 16 1
# 11: 2010 7599 11778 1 43 2
# 12: 2010 7599 12248 1 46 1
# 13: 2010 7599 13127 1 44 1
# 14: 2010 7599 14261 1 47 1
# 15: 2010 7599 16280 1 43 1
# 16: 2010 7599 16280 1 16 2
# 17: 2010 7599 16280 1 20 3
# 18: 2010 7599 16280 1 18 4
# 19: 2010 7599 16280 1 18 5
# 20: 2010 7599 17382 1 43 1
With dplyr 0.5 you can use the group_indices function. Although it does not support mutate, the following approach is straightforward:
df$id <- df %>% group_indices(IDFAM)
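Note that a group index is one identifier per IDFAM value, shared by every row of that family, which is not the same thing as the 1-to-n counter within each family shown in the question's expected table. The difference can be seen in a base R sketch. Here match() against unique() is only an analogue of a group index (numbered by first appearance), not how dplyr computes it, and toy is made-up data:

```r
# Toy data: hypothetical stand-in for the question's df
toy <- data.frame(
  IDFAM = c("A", "B", "B", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# One identifier per distinct IDFAM value, repeated on every row of the group
toy$group_id <- match(toy$IDFAM, unique(toy$IDFAM))

# Counter that restarts at 1 within each IDFAM value
toy$within_id <- ave(seq_along(toy$IDFAM), toy$IDFAM, FUN = seq_along)

toy$group_id
# [1] 1 2 2 3 3 3
toy$within_id
# [1] 1 1 2 1 2 3
```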