R - difference between 2 sets in data frame

I have 2 factor columns and want to create a third column that tells me what the second one has that the first does not. It's very similar to this post, but I'm having trouble going from a data frame to using the setdiff() function.
For example:

library(dplyr)
y1 <- c("a.b.","a.","b.c.d.")
y2 <- c("a.b.c.","a.b.","b.c.d.")
df <- data.frame(y1,y2)

Column y1 has a.b. and column y2 has a.b.c., so I want a third column to return c. or just c:

> df
      y1     y2  col3
1   a.b.  a.b.c.  c.
2     a.    a.b.  b.
3 b.c.d.  b.c.d.  

I think it should be a combination of strsplit and setdiff, but I can't get it to work.

I've tried converting the factors to character and then applying strsplit() to the results, but the output seems a bit weird to me. It seems to have created a list within a list, which makes it difficult to pass to setdiff().

#convert factor to character
df <- df %>% mutate_if(is.factor, as.character)
lapply(df$y1,function(x)(strsplit(x,split = "[.]")))

> lapply(df$y1,function(x)(strsplit(x,split = "[.]")))
[[1]]
[[1]][[1]]
[1] "a" "b"


[[2]]
[[2]][[1]]
[1] "a"


[[3]]
[[3]][[1]]
[1] "b" "c" "d"

asked Apr 18 '18 01:04 by jmich738

2 Answers

Update

There was an issue when the difference contained more than one element: it created an additional row. To overcome that, we paste all the elements of each difference together. This also saves us the unlist step.

df$col3 <- mapply(function(x, y) paste0(setdiff(y, x), collapse = ""),
   strsplit(as.character(df$y1), "\\."), strsplit(as.character(df$y2), "\\."))
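
As a minimal, self-contained sketch of the updated approach, the snippet below rebuilds the question's data frame and runs the same mapply/setdiff/paste0 combination. The first y2 value is extended with e. purely for illustration, so that one row has a two-element difference; parts1 and parts2 are helper names introduced here, and fixed = TRUE is just an alternative to escaping the dot.

```r
# Question's data; stringsAsFactors = TRUE reproduces the factor columns
df <- data.frame(
  y1 = c("a.b.", "a.", "b.c.d."),
  y2 = c("a.b.c.e.", "a.b.", "b.c.d."),
  stringsAsFactors = TRUE
)

# strsplit() is vectorised: it returns one character vector of pieces per row
parts1 <- strsplit(as.character(df$y1), ".", fixed = TRUE)
parts2 <- strsplit(as.character(df$y2), ".", fixed = TRUE)

# For each row, keep the pieces of y2 that are missing from y1 and
# collapse them into a single string ("" when nothing is missing)
df$col3 <- mapply(
  function(x, y) paste0(setdiff(y, x), collapse = ""),
  parts1, parts2
)

print(df)
```

Because every row is collapsed to exactly one string, col3 comes back as a plain character vector rather than a list.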

Original Answer

We can use mapply and split both the columns on "." using strsplit and then take the difference between them using setdiff.

df$col3 <- mapply(function(x, y) setdiff(y, x),
       strsplit(as.character(df$y1), "\\."), strsplit(as.character(df$y2), "\\."))

df
#     y1     y2 col3
#1   a.b. a.b.c.    c
#2     a.   a.b.    b
#3 b.c.d. b.c.d.     
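
On the nested list seen in the question: strsplit() already works on a whole vector and returns a list with one entry per element, so wrapping it in lapply() adds a second level of nesting. The sketch below contrasts the two calls on the question's y1 values; flat and nested are names chosen here for illustration.

```r
y1 <- c("a.b.", "a.", "b.c.d.")

# Called directly on the vector: a flat list, one character vector per element
flat <- strsplit(y1, "\\.")

# Called once per element inside lapply(): each call returns its own
# length-1 list, so the overall result is a list of lists
nested <- lapply(y1, function(x) strsplit(x, "\\."))

str(flat)
str(nested)
```

The flat form is the one that mapply() can hand to setdiff() element by element.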

If we don't want col3 to be a list, we can unlist it. One issue, however, is that unlist drops the character(0) value. To retain it, we need to perform an additional check. Taken from here.

unlist(lapply(df$col3,function(x) if(identical(x,character(0))) ' ' else x))

#[1] "c" "b" " "

answered Nov 15 '22 09:11 by Ronak Shah


You can also use purrr::map2:

library(dplyr)
library(purrr)  # provides map2()

df %>%
    mutate_if(is.factor, as.character) %>%
    mutate(col3 = map2(strsplit(y2, "\\."), strsplit(y1, "\\."), setdiff))
#      y1     y2 col3
#1   a.b. a.b.c.    c
#2     a.   a.b.    b
#3 b.c.d. b.c.d.    

Explanation: Convert factors to character vectors, use setdiff on the "."-split columns y2 and y1. Note that col3 is a list.


Update

It appears that unnest drops the zero-length character entries from the list. So to convert col3 from a list to a character vector you can do:

df %>%
    mutate_if(is.factor, as.character) %>%
    mutate(col3 = map2(strsplit(y2, "\\."), strsplit(y1, "\\."), setdiff)) %>%
    rowwise() %>%
    mutate(col3 = paste(col3, collapse = "."))
## A tibble: 3 x 3
#  y1     y2     col3
#  <chr>  <chr>  <chr>
#1 a.b.   a.b.c. c
#2 a.     a.b.   b
#3 b.c.d. b.c.d. ""

The idea here is to string-concatenate col3 entries (if there are multiple); using rowwise() ensures row-wise paste.
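
A possible variation, sketched here rather than taken from the answer above: purrr's map2_chr() returns a character vector directly, so the paste can happen inside the mapped function and the rowwise() step is not needed. The example assumes purrr is installed and uses the updated sample values with plain character vectors.

```r
library(purrr)

y1 <- c("a.b.", "a.", "b.c.d.")
y2 <- c("a.b.c.e.", "a.b.", "b.c.d.")

# For each pair of split vectors, take what y2 has that y1 lacks and
# join the leftovers with "." into one string per row
col3 <- map2_chr(
  strsplit(y2, "\\."), strsplit(y1, "\\."),
  function(a, b) paste(setdiff(a, b), collapse = ".")
)

print(col3)
```

A row with no difference yields an empty string, which matches the "" shown for the third row in the tibble output.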

For the updated sample data from your comment:

y1 <- c("a.b.","a.","b.c.d.")
y2 <- c("a.b.c.e.","a.b.","b.c.d.")
df <- data.frame(y1,y2)
df %>%
    mutate_if(is.factor, as.character) %>%
    mutate(col3 = map2(strsplit(y2, "\\."), strsplit(y1, "\\."), setdiff)) %>%
    rowwise() %>%
    mutate(col3 = paste(col3, collapse = "."))
## A tibble: 3 x 3
#  y1     y2       col3
#  <chr>  <chr>    <chr>
#1 a.b.   a.b.c.e. c.e
#2 a.     a.b.     b
#3 b.c.d. b.c.d.   ""

answered Nov 15 '22 10:11 by Maurits Evers