
How to compare lists in a dataframe

Tags:

r

I have a dataframe as follows:

Input

one<-c("Rainy and sunny;thundering;lightning","dismal and dreary;thundering")
two<-c("Overcast;lightning","Overcast;dismal and dreary")
df2<-data.frame(one,two)

I want to compare the strings in the lists row by row and extract, into new columns, what is the same and what is different.

The output I am expecting is:

same<-c("lightning","dismal and dreary")
different_Incol1ButNot2<-c("Rainy and sunny;thundering","thundering")
different_Incol2ButNot1<-c("Overcast","Overcast")

df2<-data.frame(one,two,same,different_Incol1ButNot2,different_Incol2ButNot1,stringsAsFactors=F) 

which should output:

    one                                  two                        same               different_Incol1ButNot2  different_Incol2ButNot1
 Rainy and sunny;thundering;lightning   Overcast;lightning          lightning          Rainy and sunny;thundering      Overcast
 dismal and dreary;thundering           Overcast;dismal and dreary  dismal and dreary  thundering                      Overcast

So my first thought was to split each string into a list:

df2$one<-strsplit(as.character(df2$one), ";")
df2$two<-strsplit(as.character(df2$two), ";")

However, I now do not know how to compare, row by row, the lists I have created within the dataframe. So I guess the question is: how do I make these row-wise comparisons between lists of strings within a dataframe, or is there an easier way to do this?

Sebastian Zeki asked Oct 02 '17 11:10


2 Answers

Here is an idea via dplyr,

library(dplyr)

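df2 %>% 
 # split every column on ';' so each cell holds a character vector
 mutate_all(funs(strsplit(as.character(.), ';'))) %>% 
 rowwise() %>%   # evaluate the set operations one row at a time
 mutate(same = toString(intersect(one, two)), 
        differs_1 = toString(setdiff(one, two)), 
        differs_2 = toString(setdiff(two, one)))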

which gives,

Source: local data frame [2 x 5]
Groups: <by row>

# A tibble: 2 x 5
        one       two              same                   differs_1 differs_2
     <list>    <list>             <chr>                       <chr>     <chr>
1 <chr [3]> <chr [2]>         lightning Rainy and sunny, thundering  Overcast
2 <chr [2]> <chr [2]> dismal and dreary                  thundering  Overcast
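Since funs() has been deprecated in later dplyr releases, the same pipeline might be written with across() instead. This is only a sketch, assuming dplyr 1.0.0 or later; the name res is illustrative and not part of the original answer.

```r
library(dplyr)

one <- c("Rainy and sunny;thundering;lightning", "dismal and dreary;thundering")
two <- c("Overcast;lightning", "Overcast;dismal and dreary")
df2 <- data.frame(one, two, stringsAsFactors = FALSE)

res <- df2 %>%
  # split every column on ";" so each cell holds a character vector
  mutate(across(everything(), ~ strsplit(as.character(.x), ";"))) %>%
  # evaluate the set operations one row at a time
  rowwise() %>%
  mutate(same      = toString(intersect(one, two)),
         differs_1 = toString(setdiff(one, two)),
         differs_2 = toString(setdiff(two, one))) %>%
  # drop the row-wise grouping again
  ungroup()
```

toString() joins the items with ", "; use paste(..., collapse = ";") instead if the result should keep the original separator.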
Sotos answered Sep 25 '22 01:09


First, you should use character columns, not factors (before R 4.0.0, data.frame() defaults to stringsAsFactors=TRUE), i.e.:

one <- c("Rainy and sunny;thundering;lightning","dismal and dreary;thundering")
two <- c("Overcast;lightning","Overcast;dismal and dreary")
df2 <- data.frame(one,two, stringsAsFactors = FALSE)

You can use set operations here, namely intersect and setdiff. You can try them interactively, but wrapping them in a function is handy.
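Trying the set operations by hand on the first row might look like the sketch below. The names a and b are illustrative only.

```r
one <- c("Rainy and sunny;thundering;lightning", "dismal and dreary;thundering")
two <- c("Overcast;lightning", "Overcast;dismal and dreary")

# strsplit() returns a list with one element per input string,
# so [[1]] extracts the character vector for the first row
a <- strsplit(one[1], ";")[[1]]
b <- strsplit(two[1], ";")[[1]]

intersect(a, b)  # items present in both columns
setdiff(a, b)    # items in `one` but not in `two`
setdiff(b, a)    # items in `two` but not in `one`
```

Wrapping the same steps in a function gives: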

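compare_strings <- function(x){
  # x is a one-row data.frame: split each column on ";" into a character vector
  l <- lapply(x, function(col) strsplit(col, ";")[[1]])
  list(one=x$one,
       two=x$two,
       # collapse every result so that each component is a single string
       same=paste(intersect(l[[1]], l[[2]]), collapse=";"),
       different_Incol1ButNot2=paste(setdiff(l[[1]], l[[2]]), collapse=";"),
       different_Incol2ButNot1=paste(setdiff(l[[2]], l[[1]]), collapse=";")
  )
}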

Applied to a single row of your df2, it returns a named list with all the components you want.

> compare_strings(df2[1, ])
$one
[1] "Rainy and sunny;thundering;lightning"

$two
[1] "Overcast;lightning"

$same
[1] "lightning"

$different_Incol1ButNot2
[1] "Rainy and sunny;thundering"

$different_Incol2ButNot1
[1] "Overcast"

If we apply this to every row of your data.frame and rbind the resulting list of lists, we get a matrix holding all the columns you want:

do.call("rbind", lapply(seq_len(nrow(df2)), function(i) compare_strings(df2[i, ])))
     one                                    two                         
[1,] "Rainy and sunny;thundering;lightning" "Overcast;lightning"        
[2,] "dismal and dreary;thundering"         "Overcast;dismal and dreary"
     same                different_Incol1ButNot2      different_Incol2ButNot1
[1,] "lightning"         "Rainy and sunny;thundering" "Overcast"             
[2,] "dismal and dreary" "thundering"                 "Overcast"             
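If an actual data.frame with plain character columns is preferred over the matrix, one possible variant is to add the columns directly with mapply. This is a sketch rather than part of the approach above; collapse_setop, s1 and s2 are hypothetical names introduced only for illustration.

```r
one <- c("Rainy and sunny;thundering;lightning", "dismal and dreary;thundering")
two <- c("Overcast;lightning", "Overcast;dismal and dreary")
df2 <- data.frame(one, two, stringsAsFactors = FALSE)

# Hypothetical helper: apply a set operation pairwise to two lists of
# character vectors and collapse each result into one ";"-separated string
collapse_setop <- function(f, x, y) {
  mapply(function(a, b) paste(f(a, b), collapse = ";"), x, y, USE.NAMES = FALSE)
}

# one character vector per row, for each column
s1 <- strsplit(df2$one, ";")
s2 <- strsplit(df2$two, ";")

df2$same                    <- collapse_setop(intersect, s1, s2)
df2$different_Incol1ButNot2 <- collapse_setop(setdiff, s1, s2)
df2$different_Incol2ButNot1 <- collapse_setop(setdiff, s2, s1)
```

Rows with nothing in common get an empty string in the corresponding column, because paste() with collapse returns "" for an empty vector.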

Does this solve your problem?

Vincent Bonhomme answered Sep 25 '22 01:09