I have a large data frame called df that contains some IDs.
I have another data frame (id_list) with a set of matching IDs and the features associated with each ID. The IDs are not sequentially ordered in either data frame.
Effectively, I would like to look up each ID from the larger data frame df in id_list and add two columns, Display and Type, to df.
There are numerous confusing examples out there. What would be the most effective way of doing this? I tried using match() and %in% and failed miserably.
Here is a reproducible example.
df <- data.frame(Feats = matrix(rnorm(20), nrow = 20, ncol = 5), ID = sample.int(10, 10))
id_list <- data.frame(ID = sample.int(10, 10),
                      Display = sample(c('clear', 'blur'), 10, replace = TRUE),
                      Type = sample(c('red', 'green', 'blue', 'indigo', 'yellow'), 10, replace = TRUE))
Feats.1 Feats.2 Feats.3 Feats.4 Feats.5 ID
1 3.14944573 -0.52285062 3.14944573 -0.52285062 3.14944573 2
2 -0.41096007 0.38256691 -0.41096007 0.38256691 -0.41096007 1
3 0.03629351 -0.02514005 0.03629351 -0.02514005 0.03629351 7
4 0.91257290 1.35590761 0.91257290 1.35590761 0.91257290 5
5 -0.26927311 -2.10213773 -0.26927311 -2.10213773 -0.26927311 3
6 3.14944573 -0.52285062 3.14944573 -0.52285062 3.14944573 4
7 -0.41096007 0.38256691 -0.41096007 0.38256691 -0.41096007 10
8 0.03629351 -0.02514005 0.03629351 -0.02514005 0.03629351 6
9 0.91257290 1.35590761 0.91257290 1.35590761 0.91257290 8
10 -0.26927311 -2.10213773 -0.26927311 -2.10213773 -0.26927311 9
ID Display Type
1 6 clear indigo
2 1 blur blue
3 7 clear red
4 4 clear red
5 3 blur red
6 10 clear yellow
7 2 clear blue
8 8 blur green
9 5 clear blue
10 9 clear green
The resulting df should have dimensions [20 x 8].
You can do this pretty easily with merge from base R or left_join from dplyr. (data.table also provides its own merge method, which maybe someone else can cover in an answer.) You probably want to make sure you don't lose any rows when an entry in your data frame has no corresponding ID in the lookup table. If you would rather drop unmatched rows, set all.x = FALSE in merge (that is its default), or switch from left_join to inner_join. To illustrate, I added a dummy row to the data with an ID that doesn't exist in the lookup table.
df <- data.frame(Feats = matrix(rnorm(10), nrow = 5, ncol = 5), ID = sample.int(10, 10))
dummy <- df[1, ]
dummy$ID <- 12
df <- rbind(dummy, df)
id_list <- data.frame(ID = sample.int(10, 10),
                      Display = sample(c('clear', 'blur'), 10, replace = TRUE),
                      Type = sample(c('red', 'green', 'blue', 'indigo', 'yellow'), 10, replace = TRUE))
With merge, pass by the name of the join column if it is the same in both data frames, or use by.x and by.y if the names differ. all.x = TRUE keeps every row of the first data frame even if it has no match in the second; Display and Type are then NA for that row, as in the last row of the output below.
merged1 <- merge(df, id_list, by = "ID", sort = FALSE, all.x = TRUE)
merged1
#> ID Feats.1 Feats.2 Feats.3 Feats.4 Feats.5 Display
#> 1 10 -1.44053344 1.0086988 -1.44053344 1.0086988 -1.44053344 clear
#> 2 5 0.99220217 -0.3125813 0.99220217 -0.3125813 0.99220217 clear
#> 3 2 1.03881289 1.1277627 1.03881289 1.1277627 1.03881289 clear
#> 4 7 -0.01678186 -0.1519029 -0.01678186 -0.1519029 -0.01678186 clear
#> 5 4 0.07130125 1.1715833 0.07130125 1.1715833 0.07130125 clear
#> 6 6 -1.44053344 1.0086988 -1.44053344 1.0086988 -1.44053344 clear
#> 7 8 0.99220217 -0.3125813 0.99220217 -0.3125813 0.99220217 blur
#> 8 3 1.03881289 1.1277627 1.03881289 1.1277627 1.03881289 clear
#> 9 1 -0.01678186 -0.1519029 -0.01678186 -0.1519029 -0.01678186 clear
#> 10 9 0.07130125 1.1715833 0.07130125 1.1715833 0.07130125 clear
#> 11 12 -1.44053344 1.0086988 -1.44053344 1.0086988 -1.44053344 <NA>
#> Type
#> 1 indigo
#> 2 yellow
#> 3 blue
#> 4 indigo
#> 5 yellow
#> 6 indigo
#> 7 green
#> 8 red
#> 9 red
#> 10 blue
#> 11 <NA>
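To make the effect of all.x and of by.x / by.y easier to see, here is a minimal sketch on a tiny hand-written data set. The data frames toy_df and toy_lookup and the column name Key are hypothetical, made up for illustration; they are not part of the question's data.

```r
# Hypothetical toy data: ID 12 has no entry in the lookup table.
toy_df <- data.frame(ID = c(2, 1, 12), Feat = c(0.5, -1.2, 3.3))
toy_lookup <- data.frame(Key = c(1, 2, 3),
                         Display = c("blur", "clear", "clear"),
                         Type = c("blue", "red", "green"))

# The key columns have different names here, so use by.x / by.y instead of by.
kept    <- merge(toy_df, toy_lookup, by.x = "ID", by.y = "Key", all.x = TRUE)
dropped <- merge(toy_df, toy_lookup, by.x = "ID", by.y = "Key", all.x = FALSE)

nrow(kept)     # 3: the unmatched row (ID 12) is kept, with NA in Display and Type
nrow(dropped)  # 2: the unmatched row is dropped
```

Comparing nrow() before and after the merge is a cheap way to check that no rows were lost or duplicated.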
dplyr::left_join keeps all observations from the first data frame and merges in any matching ones from the second.
joined <- dplyr::left_join(df, id_list, by = "ID")
head(joined)
#> Feats.1 Feats.2 Feats.3 Feats.4 Feats.5 ID Display
#> 1 -1.44053344 1.0086988 -1.44053344 1.0086988 -1.44053344 12 <NA>
#> 2 -1.44053344 1.0086988 -1.44053344 1.0086988 -1.44053344 10 clear
#> 3 0.99220217 -0.3125813 0.99220217 -0.3125813 0.99220217 5 clear
#> 4 1.03881289 1.1277627 1.03881289 1.1277627 1.03881289 2 clear
#> 5 -0.01678186 -0.1519029 -0.01678186 -0.1519029 -0.01678186 7 clear
#> 6 0.07130125 1.1715833 0.07130125 1.1715833 0.07130125 4 clear
#> Type
#> 1 <NA>
#> 2 indigo
#> 3 yellow
#> 4 blue
#> 5 indigo
#> 6 yellow
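For comparison, this is roughly what switching from left_join to inner_join looks like. It is only a sketch: it assumes dplyr is installed, and toy_df and toy_lookup are hypothetical hand-written data frames rather than the random data above.

```r
# Hypothetical toy data: ID 12 has no entry in the lookup table.
toy_df <- data.frame(ID = c(2, 1, 12), Feat = c(0.5, -1.2, 3.3))
toy_lookup <- data.frame(ID = c(1, 2, 3),
                         Display = c("blur", "clear", "clear"),
                         Type = c("blue", "red", "green"))

# left_join keeps every row of toy_df; unmatched rows get NA.
left <- dplyr::left_join(toy_df, toy_lookup, by = "ID")

# inner_join keeps only the rows whose ID appears in both data frames.
inner <- dplyr::inner_join(toy_df, toy_lookup, by = "ID")

nrow(left)   # 3
nrow(inner)  # 2
```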
Created on 2018-07-13 by the reprex package (v0.2.0).
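Since the question mentions trying match(), here is one way the same lookup could be sketched with it in base R. This is not part of the reprex above; toy_df and toy_lookup are hypothetical hand-written data frames, and the approach assumes each ID appears only once in the lookup table, because match() returns only the first hit.

```r
# Hypothetical toy data: ID 12 is missing from the lookup, ID 2 occurs twice in toy_df.
toy_df <- data.frame(ID = c(2, 1, 12, 2), Feat = c(0.5, -1.2, 3.3, 0.7))
toy_lookup <- data.frame(ID = c(1, 2, 3),
                         Display = c("blur", "clear", "clear"),
                         Type = c("blue", "red", "green"),
                         stringsAsFactors = FALSE)

# Guard the assumption that the lookup has no duplicate IDs.
stopifnot(!anyDuplicated(toy_lookup$ID))

# For each row of toy_df, find the row position of its ID in the lookup (NA if absent).
idx <- match(toy_df$ID, toy_lookup$ID)

# Index the lookup columns by that position; the row order of toy_df is unchanged.
toy_df$Display <- toy_lookup$Display[idx]
toy_df$Type    <- toy_lookup$Type[idx]
toy_df
```

Rows whose ID is not in the lookup end up with NA in both new columns, which matches the behaviour of merge with all.x = TRUE and of left_join.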