
Return IDs of unique combinations

My data table has the following format

ID   Var1   Var2   Var3   ...
1_1  0      0      1      ...
1_2  1      1      0      ...
1_3  0      0      1      ...
...  ...    ...    ...    ...

I want to extract the IDs belonging to each unique combination of the Var columns. Getting the unique combinations themselves is not the problem (plyr::count(), aggregate(), etc.); what I need are the ID values that contribute to each unique combination.

The output should look somewhat like this

Var1   Var2   Var3   IDs
0      0      1      1_1, 1_3
1      1      0      1_2

where the IDs column is a vector/list of all the IDs contributing to a unique combination.

I tried an R package and dplyr pipelines, but nothing has worked so far.

Any suggestions, or even R packages, for handling this task?

Thank you!

tebiwankenebi asked Oct 29 '19 13:10


3 Answers

You can use group_by_at with a selector that matches your column names, followed by summarise, i.e.

df %>% 
 group_by_at(vars(contains('Var'))) %>% 
 summarise(IDs = toString(ID))

which gives,

# A tibble: 2 x 4
# Groups:   Var1, Var2 [2]
   Var1  Var2  Var3 IDs     
  <int> <int> <int> <chr>   
1     0     0     1 1_1, 1_3
2     1     1     0 1_2     
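For readers who want to run this end to end, here is a minimal sketch. It assumes dplyr 1.0.0 or later, where across() is the suggested replacement for the _at verbs. The df object below rebuilds only the three rows shown in the question, and the starts_with("Var") selector is an assumption based on that sample, not the asker's real data.

```r
library(dplyr)

# Sample data reconstructed from the three rows shown in the question
df <- data.frame(
  ID   = c("1_1", "1_2", "1_3"),
  Var1 = c(0L, 1L, 0L),
  Var2 = c(0L, 1L, 0L),
  Var3 = c(1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Group on every column whose name starts with "Var",
# then collapse the IDs of each group into one comma-separated string
result <- df %>%
  group_by(across(starts_with("Var"))) %>%
  summarise(IDs = toString(ID), .groups = "drop")

print(result)
```

Passing .groups = "drop" returns an ungrouped tibble, so the "Groups:" header seen in the output above does not appear.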
Sotos answered Sep 18 '22 10:09


df %>% group_by_at(.vars=-1) %>% summarize(IDs=list(ID))

Similar to Sotos' solution, but it simplifies the column selection by grouping on every column except the first (ID), which assumes that all other columns define the unique combinations. The IDs column will also be a list column rather than a string.

# A tibble: 2 x 4
# Groups:   Var1, Var2 [2]
   Var1  Var2  Var3 IDs      
  <int> <int> <int> <list>   
1     0     0     1 <chr [2]>
2     1     1     0 <chr [1]>

Just for fun, you can further simplify it using tidyr's nest function:

library(tidyr)
nest(df, IDs = ID)
# A tibble: 2 x 4
   Var1  Var2  Var3 IDs                
  <int> <int> <int> <S3: vctrs_list_of>
1     0     0     1 1_1, 1_3           
2     1     1     0 1_2   

This still leaves IDs as a list, which may or may not be useful for you, but displays it more clearly in the tibble. An extra benefit of keeping the column as a list rather than a string is that you can easily recreate the original table using unnest:

unnest(nest(df, IDs = ID), cols = IDs)
# A tibble: 3 x 4
   Var1  Var2  Var3 ID   
  <int> <int> <int> <chr>
1     0     0     1 1_1  
2     0     0     1 1_3  
3     1     1     0 1_2  
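To illustrate why a list column can be handy, here is a small sketch, assuming dplyr 1.0.0 or later and a df rebuilt from the three sample rows in the question. The names nested, n_ids, shared and IDs_text are made up for this example.

```r
library(dplyr)

# Sample data reconstructed from the three rows shown in the question
df <- data.frame(
  ID   = c("1_1", "1_2", "1_3"),
  Var1 = c(0L, 1L, 0L),
  Var2 = c(0L, 1L, 0L),
  Var3 = c(1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# One row per unique Var combination, with the contributing IDs kept as a list column
nested <- df %>%
  group_by(across(-ID)) %>%
  summarise(IDs = list(ID), .groups = "drop")

# Number of IDs behind each combination
nested$n_ids <- lengths(nested$IDs)

# Keep only the combinations shared by more than one ID
shared <- filter(nested, n_ids > 1)

# Collapse to a display string only when one is actually needed
nested$IDs_text <- vapply(nested$IDs, toString, character(1))
```

Each element of IDs stays an ordinary character vector, so counting or filtering does not require splitting a string again.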
iod answered Sep 19 '22 10:09


Using aggregate and unique in base R:

aggregate(dat$ID,list(dat$Var1,dat$Var2,dat$Var3),unique)
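A self-contained variant of the same idea, as a sketch only: it assumes dat holds the three sample rows from the question and swaps unique for toString so that the result is a plain character column. Passing data frame subsets instead of unnamed vectors keeps the original column names; with the unnamed list above, aggregate labels the columns Group.1, Group.2, Group.3 and x.

```r
# Sample data reconstructed from the three rows shown in the question
dat <- data.frame(
  ID   = c("1_1", "1_2", "1_3"),
  Var1 = c(0L, 1L, 0L),
  Var2 = c(0L, 1L, 0L),
  Var3 = c(1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Collapse the IDs of every unique Var1/Var2/Var3 combination into one string
out <- aggregate(dat["ID"], by = dat[c("Var1", "Var2", "Var3")], FUN = toString)

# Rename the collapsed column to match the layout asked for in the question
names(out)[names(out) == "ID"] <- "IDs"

print(out)
```

The row order of out follows aggregate's own sorting of the grouping columns, so it may differ from the order in the dplyr answers.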
user2974951 answered Sep 18 '22 10:09