Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R find overlap between strings in columns of two dataframes

I have the following dataframes df1 and df2. These are just examples, my true dataframes have around 1000 rows & 10 columns.

df1 = data.frame(V1 = c(1,2), V2 = c("a, b, d", "c, g"))

> df1
  V1         V2
1  1    a, b, d
2  2       c, g

df2 = data.frame(V3 = c(1,2,3), V4 = c("a, c", "b, e", "a, g"))

> df2
  V3      V4
1  1    a, c
2  2    b, e
3  3    a, g

As you see, columns V2 and V4 contain values that are separated by ", ".

I now want to loop through my df1 row by row and check if any of the values present in V2 matches with one of the values present in V4. So I don't want to compare the complete column content, but value per value.

This is how my output df3 should look like

df3 = data.frame(V1 = c(1, 1, 1, 2, 2), V2 = c("a, b, d", "a, b, d", "a, b, d", "c, g", "c, g"), V5 = c("a","a","b","c","g"), V3 = c(1, 3, 2, 1, 3), V4 = c("a, c","a, g", "b, e", "a, c", "a, g"), V6 = c("a", "a","b", "c", "g"))

> df3
  V1      V2 V5 V3   V4 V6
1  1 a, b, d  a  1 a, c  a
2  1 a, b, d  a  3 a, g  a
3  1 a, b, d  b  2 b, e  b
4  2    c, g  c  1 a, c  c
5  2    c, g  g  3 a, g  g

Let's look at the first row of df1. In V2 there are values "a", "b" and "d". We start with "a" and look in df2 if "a" is present in V4. This is true for lines 1 and 3 of df2. We create two lines in the output df3. Line 1 contains the info of the first match: original columns V1, V2, V3 and V4 + 2 new columns V5 and V6 which the contain the matched value "a" from df1 and df2 respectively.

I hope this is clear.

I would start with a for loop & I thought it might necessary to split my columns V2 by ',', but I'm not sure this is needed.

for (row in 1:nrow(df1)) {
 // split col V2 of df1 by ','
 // if V2 %in% in df2$V4 ...
}
like image 774
user1987607 Avatar asked Oct 23 '25 12:10

user1987607


1 Answers

Use separate_rows to separate a collapsed column on df1 and df2, and join them together with inner_join.

library(dplyr)
library(tidyr)

df.a <- df1 %>%
  mutate(V5 = V2) %>% 
  separate_rows(V5)

df.b <- df2 %>%
  mutate(V6 = V4) %>% 
  separate_rows(V6)

inner_join(df.a, df.b, c("V5" = "V6"))

#   V1      V2 V5 V3   V4
# 1  1 a, b, d  a  1 a, c
# 2  1 a, b, d  a  3 a, g
# 3  1 a, b, d  b  2 b, e
# 4  2    c, g  c  1 a, c
# 5  2    c, g  g  3 a, g
like image 67
Darren Tsai Avatar answered Oct 25 '25 00:10

Darren Tsai



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!