I have a dataframe as follows:
Input
one<-c("Rainy and sunny;thundering;lightning","dismal and dreary;thundering")
two<-c("Overcast;lightning","Overcast;dismal and dreary")
df2<-data.frame(one,two)
I want to compare the semicolon-separated strings in the two columns row by row, and extract what is the same and what is different into new columns.
The output I am expecting is:
same<-c("lightning","dismal and dreary")
different_Incol1ButNot2<-c("Rainy and sunny;thundering","thundering")
different_Incol2ButNot1<-c("Overcast","Overcast")
df2<-data.frame(one,two,same,different_Incol1ButNot2,different_Incol2ButNot1,stringsAsFactors=F) 
which should output:
    one                                  two                        same               different_Incol1ButNot2  different_Incol2ButNot1
 Rainy and sunny;thundering;lightning   Overcast;lightning          lightning          Rainy and sunny;thundering      Overcast
 dismal and dreary;thundering           Overcast;dismal and dreary  dismal and dreary  thundering                      Overcast
My first thought was to split each string into a list:
df2$one<-strsplit(as.character(df2$one), ";")
df2$two<-strsplit(as.character(df2$two), ";")
However, I now do not know how to compare, row by row, the lists I have created within the dataframe. So the question is: how do I make these row-wise comparisons between lists of strings within a dataframe, or is there an easier way to do this?
Here is an idea via dplyr,
library(dplyr)
df2 %>% 
 # split every column on ';' (as.character() guards against factor columns)
 mutate_all(~ strsplit(as.character(.), ';')) %>% 
 rowwise() %>% 
 mutate(same = toString(intersect(one, two)), 
        differs_1 = toString(setdiff(one, two)), 
        differs_2 = toString(setdiff(two, one)))
which gives,
Source: local data frame [2 x 5]
Groups: <by row>

# A tibble: 2 x 5
        one       two              same                   differs_1 differs_2
     <list>    <list>             <chr>                       <chr>     <chr>
1 <chr [3]> <chr [2]>         lightning Rainy and sunny, thundering  Overcast
2 <chr [2]> <chr [2]> dismal and dreary                  thundering  Overcast
First, you should use character columns, not factors (before R 4.0.0, data.frame() defaults to stringsAsFactors=TRUE), i.e.:
one <- c("Rainy and sunny;thundering;lightning","dismal and dreary;thundering")
two <- c("Overcast;lightning","Overcast;dismal and dreary")
df2 <- data.frame(one,two, stringsAsFactors = FALSE)
You can use set operations here, namely intersect and setdiff. You can try them directly at the console, but wrapping them in a function is handy.
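To see what the set operations return on their own, here is a minimal sketch using the two strings from the first row of the question. The variable names (a, b, common, only_a, only_b) are just for illustration.

```r
# Split the first row's two strings on ";" to get character vectors
a <- strsplit("Rainy and sunny;thundering;lightning", ";", fixed = TRUE)[[1]]
b <- strsplit("Overcast;lightning", ";", fixed = TRUE)[[1]]

common <- intersect(a, b)  # items present in both vectors
only_a <- setdiff(a, b)    # items in a but not in b
only_b <- setdiff(b, a)    # items in b but not in a

print(common)   # "lightning"
print(only_a)   # "Rainy and sunny" "thundering"
print(only_b)   # "Overcast"
```

Note that setdiff is not symmetric, which is why it is called twice with the arguments swapped.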
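compare_strings <- function(x){
  # x is a single row of the data.frame; split each column on ";"
  l <- sapply(x, strsplit, ";")
  list(one=x$one,
       two=x$two,
       # collapse each result to one string so every component has length 1
       same=paste(intersect(l[[1]], l[[2]]), collapse=";"),
       different_Incol1ButNot2=paste(setdiff(l[[1]], l[[2]]), collapse=";"),
       different_Incol2ButNot1=paste(setdiff(l[[2]], l[[1]]), collapse=";")
  )
}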
Applied to a single row of your df2, it returns a named list with all the components you want.
> compare_strings(df2[1, ])
$one
[1] "Rainy and sunny;thundering;lightning"
$two
[1] "Overcast;lightning"
$same
[1] "lightning"
$different_Incol1ButNot2
[1] "Rainy and sunny;thundering"
$different_Incol2ButNot1
[1] "Overcast"
If we apply this to every row of your data.frame and rbind the resulting list of lists, we get a matrix with one row per input row and all the columns you want:
do.call("rbind", lapply(seq_len(nrow(df2)), function(i) compare_strings(df2[i, ])))
     one                                    two                         
[1,] "Rainy and sunny;thundering;lightning" "Overcast;lightning"        
[2,] "dismal and dreary;thundering"         "Overcast;dismal and dreary"
     same                different_Incol1ButNot2      different_Incol2ButNot1
[1,] "lightning"         "Rainy and sunny;thundering" "Overcast"             
[2,] "dismal and dreary" "thundering"                 "Overcast"             
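If you would rather end up with plain character columns in df2 itself, one possible alternative is to skip the row-by-row rbind and use mapply over the split strings. This is only a sketch: rowwise_set is a hypothetical helper name, not part of the answer above, and it assumes both columns are character vectors.

```r
one <- c("Rainy and sunny;thundering;lightning", "dismal and dreary;thundering")
two <- c("Overcast;lightning", "Overcast;dismal and dreary")
df2 <- data.frame(one, two, stringsAsFactors = FALSE)

# Hypothetical helper: apply a set operation f to each pair of split rows
# and glue the result back together with ";"
rowwise_set <- function(f, x, y) {
  mapply(function(a, b) paste(f(a, b), collapse = ";"),
         strsplit(x, ";", fixed = TRUE),
         strsplit(y, ";", fixed = TRUE),
         USE.NAMES = FALSE)
}

df2$same                    <- rowwise_set(intersect, df2$one, df2$two)
df2$different_Incol1ButNot2 <- rowwise_set(setdiff, df2$one, df2$two)
df2$different_Incol2ButNot1 <- rowwise_set(setdiff, df2$two, df2$one)

print(df2)
```

Rows with nothing in common get an empty string in the corresponding column, since paste on an empty vector with collapse returns "".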
Does this solve your problem?