
Efficient way to fill column with numbers that identify observations with same value in column [duplicate]

Tags:

r

I apologize for the wording of the question and for any errors. I am a newbie in OS and in R.

Problem: Find an efficient way to fill a column with numbers that uniquely identify observations sharing the same value in another column. The result would look like this:

   patient_number id
1              46  1
2              47  2
3              15  3
4              42  4
5              33  5
6              26  6
7              37  7
8               7  8
9              33  5
10             36  9

Sample data frame

set.seed(42)
df <- data.frame(
  patient_number = sample(seq(1, 50, 1), 100, replace = TRUE)
)

What I was able to come up with

df$id <- NA  ## initialise id with NA so unassigned rows are easy to detect
n_unique <- length(unique(df$patient_number))  ## number of distinct patient numbers

for (i in seq_len(nrow(df))) {
  ## indices of all rows sharing this row's patient_number
  index_identical <- which(df$patient_number == df$patient_number[i])

  if (any(is.na(df$id[index_identical]))) {
    ## this patient has no id yet: assign the smallest integer
    ## between 1 and n_unique that is not already in use
    df$id[index_identical] <- setdiff(seq_len(n_unique), df$id)[1]
  }
  ## otherwise the id was assigned on an earlier iteration; nothing to do
}

It does the job, but with thousands of rows, it takes time.

Thanks for bearing with me.

Pablo Rod asked Feb 20 '19 15:02


2 Answers

If you're open to other packages, you can use the group_indices function from the dplyr package:

library(dplyr)
df %>%
  mutate(id = group_indices(., patient_number))

    patient_number id
1               46 40
2               47 41
3               15 14
4               42 37
5               33 28
6               26 23
7               37 32
8                7  6
9               33 28
10              36 31
11              23 21
12              36 31
13              47 41
...
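
Note that the ids in the output above follow the sorted order of patient_number (46 maps to 40), not the order of first appearance shown in the question. Also, in dplyr 1.0.0 and later, calling group_indices() with grouping variables is deprecated in favour of group_by() followed by cur_group_id(). The following is a minimal sketch of both numberings, assuming dplyr >= 1.0.0 is installed; df_small, by_sorted, and by_first are illustrative names that do not appear in the original answer.

```r
library(dplyr)

# Small illustrative data frame, chosen so the ids can be checked by hand
df_small <- data.frame(patient_number = c(46, 47, 15, 46, 15))

# Ids in sorted order of patient_number (same numbering as group_indices())
by_sorted <- df_small %>%
  group_by(patient_number) %>%
  mutate(id = cur_group_id()) %>%
  ungroup()

# Ids in order of first appearance, as in the question's expected output
by_first <- df_small %>%
  mutate(id = match(patient_number, unique(patient_number)))

by_sorted$id  # 2 3 1 2 1
by_first$id   # 1 2 3 1 3
```

Either numbering identifies the groups uniquely; the choice only matters if the id values themselves need to reflect row order.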
zack answered Nov 17 '22 08:11


We can use .GRP from data.table, which holds the index of the current group:

library(data.table)
setDT(df)[, id := .GRP, patient_number]
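
As a small check of how .GRP numbers the groups, here is a sketch on a hand-verifiable vector, assuming data.table is installed; dt is an illustrative name, not part of the answer. With by, groups are numbered in order of first appearance, which matches the question's expected output.

```r
library(data.table)

# Small illustrative table, chosen so the ids can be checked by hand
dt <- data.table(patient_number = c(46, 47, 15, 46, 15))

# `by` keeps groups in order of first appearance, so .GRP counts 1, 2, 3
dt[, id := .GRP, by = patient_number]

dt$id  # 1 2 3 1 3
```

Using keyby instead of by would sort the groups first, so the ids would then follow the sorted order of patient_number.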

Alternatively, base R's match and factor are fast options as well (either line on its own is sufficient):

df$id <- with(df, match(patient_number, unique(patient_number)))
df$id <- with(df, as.integer(factor(patient_number,
                                    levels = unique(patient_number))))
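
To see that the two base R options agree, here is a minimal sketch applied to the ten patient numbers from the question's expected output; x, id_match, and id_factor are illustrative names.

```r
# The ten patient numbers shown in the question's expected result
x <- c(46, 47, 15, 42, 33, 26, 37, 7, 33, 36)

# Position of each value in the vector of distinct values
id_match <- match(x, unique(x))

# Integer codes of a factor whose levels are in order of first appearance
id_factor <- as.integer(factor(x, levels = unique(x)))

id_match                        # 1 2 3 4 5 6 7 8 5 9
identical(id_match, id_factor)  # TRUE
```

Both approaches are vectorised, so they avoid the repeated which() and setdiff() calls of the row-by-row loop.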
akrun answered Nov 17 '22 06:11