
R - Group by variable and then assign a unique ID [duplicate]

Tags:

r

dplyr

I am interested in de-identifying a sensitive data set with both time-fixed and time-variant values. I want to (a) group all cases by social security number, (b) assign those cases a unique ID and then (c) remove the social security number.

Here's an example data set:

    personal_id    gender  temperature
    111-11-1111      M        99.6
    999-999-999      F        98.2
    111-11-1111      M        97.8
    999-999-999      F        98.3
    888-88-8888      F        99.0
    111-11-1111      M        98.9

Any solutions would be very much appreciated.

B Victor asked Sep 22 '16 23:09




2 Answers

dplyr has a group_indices() function for creating unique group IDs:

    library(dplyr)

    data <- data.frame(personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
                       gender = c("M", "F", "M", "M"),
                       temperature = c(99.6, 98.2, 97.8, 95.5))

    data$group_id <- data %>% group_indices(personal_id)

    data <- data %>% select(-personal_id)

    data

      gender temperature group_id
    1      M        99.6        1
    2      F        98.2        3
    3      M        97.8        2
    4      M        95.5        1

Or within the same pipeline (https://github.com/tidyverse/dplyr/issues/2160):

    data %>%
        mutate(group_id = group_indices(., personal_id))
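For comparison, here is a minimal base R sketch of the same idea, for cases where dplyr is not available. It is not from the answer above. It assumes that converting the identifier to a factor and taking its integer codes is an acceptable way to number the groups, and it reuses the answer's sample data. With this data it yields the same IDs as the dplyr output shown above, since factor levels are assigned in sorted order.

```r
# Sample data copied from the answer above
data <- data.frame(personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
                   gender = c("M", "F", "M", "M"),
                   temperature = c(99.6, 98.2, 97.8, 95.5))

# Each distinct personal_id becomes a factor level; the integer code of the
# level serves as the group ID (levels are sorted, so IDs follow sort order)
data$group_id <- as.integer(factor(data$personal_id))

# Drop the sensitive column once the IDs are in place
data$personal_id <- NULL

data
```

Because the IDs follow the sorted order of the original identifiers, it may be worth shuffling them if even that ordering is considered sensitive.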
conor answered Sep 29 '22 15:09


Passing grouping variables directly to dplyr::group_indices() is deprecated as of dplyr 1.0.0. Group with group_by() first and use dplyr::cur_group_id() instead:

    df %>%
      group_by(personal_id) %>%
      mutate(group_id = cur_group_id())

      personal_id gender temperature group_id
      <chr>       <chr>        <dbl>    <int>
    1 111-11-1111 M             99.6        1
    2 999-999-999 F             98.2        3
    3 111-11-1111 M             97.8        1
    4 999-999-999 F             98.3        3
    5 888-88-8888 F             99          2
    6 111-11-1111 M             98.9        1
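The output above still contains personal_id, so step (c) of the question is not yet covered. A possible end-to-end sketch, assuming dplyr 1.0.0 or later and using the question's sample data, is shown here. The names df and deidentified are only illustrative.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  personal_id = c("111-11-1111", "999-999-999", "111-11-1111",
                  "999-999-999", "888-88-8888", "111-11-1111"),
  gender      = c("M", "F", "M", "F", "F", "M"),
  temperature = c(99.6, 98.2, 97.8, 98.3, 99.0, 98.9)
)

deidentified <- df %>%
  group_by(personal_id) %>%            # (a) group cases by identifier
  mutate(group_id = cur_group_id()) %>% # (b) one integer ID per group
  ungroup() %>%                        # drop grouping so the key can be removed
  select(-personal_id)                 # (c) remove the sensitive column

deidentified
```

The ungroup() call matters here: select() keeps grouping columns on a grouped data frame, so dropping personal_id without ungrouping first would leave it in the result.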
tmfmnk answered Sep 29 '22 15:09