
How to assign identical unique IDs to matching observations between two dataframes in R?


I have a practical question: given two (or more) data frames, I want to assign a unique ID to each distinct observation, so that matching observations get the same ID both within each data frame and across them, e.g.:

#1. Create dataframe df1:

a1 <- c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1)
b1 <- c(1, 5, 3, 2, 3, 4, 5, 1, 5, 2)
c1 <- c("white", "red", "black", "white", "red", 
        "white", "black", "silver", "red", "green")
df1 <- data.frame(a1, b1, c1)
df1

   a1 b1     c1
1   1  1  white
2   1  5    red
3   1  3  black
4   1  2  white
5   2  3    red
6   2  4  white
7   2  5  black
8   2  1 silver
9   1  5    red
10  1  2  green

#2. Create dataframe df2:

a2 <- c(2, 2, 1, 1, 2, 2, 2, 2, 2, 2)
b2 <- c(3, 1, 3, 2, 1, 3, 4, 5, 3, 5)
c2 <- c("black", "blue", "black", "white", "silver", 
        "green", "green", "red", "blue", "white")
df2 <- data.frame(a2, b2, c2)
df2

   a2 b2     c2
1   2  3  black
2   2  1   blue
3   1  3  black
4   1  2  white
5   2  1 silver
6   2  3  green
7   2  4  green
8   2  5    red
9   2  3   blue
10  2  5  white

#3. Assign unique IDs to each observation in df1:

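library(data.table)
# Key on all three columns; setting the key also sorts the rows by a1, b1, c1
df1.2 <- data.table(df1, key = c("a1", "b1", "c1"))
# .GRP is the group counter, so each distinct (a1, b1, c1) combination gets one id
df1.2[, id := .GRP, by = key(df1.2)]
df1.2 <- as.data.frame(df1.2)
df1.2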

   a1 b1     c1 id
1   1  1  white  1
2   1  2  green  2
3   1  2  white  3
4   1  3  black  4
5   1  5    red  5
6   1  5    red  5
7   2  1 silver  6
8   2  3    red  7
9   2  4  white  8
10  2  5  black  9

#4. The problematic part: give each observation of df2 that matches an observation in df1.2 the same ID it has in df1.2,
#and assign new unique IDs to all non-matching observations of df2.
#Name the resulting dataframe df2.2.
#My expected result would ideally look as follows:

df2.2

   a2 b2     c2 id
1   2  3  black 10 
2   2  1   blue 11
3   1  3  black  4
4   1  2  white  3
5   2  1 silver  6
6   2  3  green 12
7   2  4  green 13
8   2  5    red 14
9   2  3   blue 15
10  2  5  white 16

Any help on how to get to df2.2 will be very much appreciated. Thanks.

simtim asked May 13 '15 21:05


1 Answer

An easy way to approach this is to hash the values of each row, so that identical rows get identical IDs in both data frames:

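library(dplyr)
library(digest)

# Hash the three columns of each row; identical rows get identical MD5 ids
df1 %>%
  rowwise() %>%
  do(data.frame(., id = digest(paste(.$a1, .$b1, .$c1), algo = "md5"),
                stringsAsFactors = FALSE)) %>%
  ungroup()

# Same for df2, so rows that match df1 receive the same id
df2 %>%
  rowwise() %>%
  do(data.frame(., id = digest(paste(.$a2, .$b2, .$c2), algo = "md5"),
                stringsAsFactors = FALSE)) %>%
  ungroup()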

which would produce the following for df1:

   a1 b1     c1                               id
1   1  1  white b86fbb78b27f7db2ee50af2d68cce452
2   1  5    red 68d47f544832989834517630e4a2764c
3   1  3  black 724e37192140cb2009cf3d982f2be1e4
4   1  2  white f731b8b38255b8c312543283f8e1c634
5   2  3    red 2d50b86902056a51faad04d2c566faf2
6   2  4  white 9396667cd51d1e1b61b0b22a7767d3d9
7   2  5  black 9ba1f3e04c61c006d3c5382fcad098e6
8   2  1 silver 38dcd29d200c8b33cd38ac78ef9dd751
9   1  5    red 68d47f544832989834517630e4a2764c
10  1  2  green 7d9b1aadfd79de142b234b83d7867b9b

and the following for df2:

   a2 b2     c2                               id
1   2  3  black d285febc8ab08e99b11609b98f077e66
2   2  1   blue bfa0405276406ac4bc596daf957dfa11
3   1  3  black 724e37192140cb2009cf3d982f2be1e4
4   1  2  white f731b8b38255b8c312543283f8e1c634
5   2  1 silver 38dcd29d200c8b33cd38ac78ef9dd751
6   2  3  green 67eefe9ee2d82486ded30a268289296b
7   2  4  green d773f58cf144eab15ef459e326494a2f
8   2  5    red 0724318a9f59d3960edfe4e90f9c4eff
9   2  3   blue 6883420cc137ba45b773f642176e9ce6
10  2  5  white 5dea9e63b5fbfb31fb81260cb5a5f41c
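
The hashes give matching rows the same id, but they are not the sequential integers shown in the question's expected df2.2. If those are needed, one possible base R sketch is below. It is not part of the approach above: `row_key` and `lookup` are names made up for illustration, it assumes the character "|" never occurs in the data, and it numbers the non-matching rows of df2 in order of first appearance.

```r
# Example data from the question
df1 <- data.frame(
  a1 = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1),
  b1 = c(1, 5, 3, 2, 3, 4, 5, 1, 5, 2),
  c1 = c("white", "red", "black", "white", "red",
         "white", "black", "silver", "red", "green"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  a2 = c(2, 2, 1, 1, 2, 2, 2, 2, 2, 2),
  b2 = c(3, 1, 3, 2, 1, 3, 4, 5, 3, 5),
  c2 = c("black", "blue", "black", "white", "silver",
         "green", "green", "red", "blue", "white"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one string key per row (assumes "|" is not in the data)
row_key <- function(a, b, c) paste(a, b, c, sep = "|")

# Distinct df1 combinations, sorted by a1, b1, c1 and numbered 1..n
lookup <- unique(df1)
lookup <- lookup[order(lookup$a1, lookup$b1, lookup$c1), ]
lookup$id <- seq_len(nrow(lookup))
lookup_key <- row_key(lookup$a1, lookup$b1, lookup$c1)

# ids for df1 (rows stay in their original order, unlike the keyed data.table)
key1 <- row_key(df1$a1, df1$b1, df1$c1)
df1.2 <- cbind(df1, id = lookup$id[match(key1, lookup_key)])

# ids for df2: reuse the df1 id where the row matches, NA otherwise
key2 <- row_key(df2$a2, df2$b2, df2$c2)
id2 <- lookup$id[match(key2, lookup_key)]

# Non-matching rows continue the numbering after the last df1 id
is_new <- is.na(id2)
new_keys <- unique(key2[is_new])
id2[is_new] <- nrow(lookup) + match(key2[is_new], new_keys)

df2.2 <- cbind(df2, id = id2)
df2.2
```

Going through a lookup table keeps the df1 ids fixed, so further data frames could be numbered the same way by appending their new keys to the lookup.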
akhmed answered Sep 29 '22 09:09