I have a dataframe as follows:
Input
one<-c("Rainy and sunny;thundering;lightning","dismal and dreary;thundering")
two<-c("Overcast;lightning","Overcast;dismal and dreary")
df2<-data.frame(one,two)
I want to compare the strings in the two columns row by row and extract, into new columns, what is the same and what is different.
The output I am expecting is:
same<-c("lightning","dismal and dreary")
different_Incol1ButNot2<-c("Rainy and sunny;thundering","thundering")
different_Incol2ButNot1<-c("Overcast","Overcast")
df2<-data.frame(one,two,same,different_Incol1ButNot2,different_Incol2ButNot1,stringsAsFactors=F)
which should output:
one                                   two                         same               different_Incol1ButNot2     different_Incol2ButNot1
Rainy and sunny;thundering;lightning  Overcast;lightning          lightning          Rainy and sunny;thundering  Overcast
dismal and dreary;thundering          Overcast;dismal and dreary  dismal and dreary  thundering                  Overcast
So my first thought was to split each string into a list:
df2$one <- strsplit(as.character(df2$one), ";")
df2$two <- strsplit(as.character(df2$two), ";")
However, I now do not know how to compare, row by row, the lists I have created within the dataframe. So I guess the question is: how do I make these row-wise comparisons between lists of strings within a dataframe, or is there an easier way to do this?
Here is an idea via dplyr:
library(dplyr)
df2 %>%
  # split every column on ";" (funs() is deprecated, so a formula is used instead)
  mutate_all(~ strsplit(as.character(.), ";")) %>%
  rowwise() %>%
  mutate(same = toString(intersect(one, two)),
         differs_1 = toString(setdiff(one, two)),
         differs_2 = toString(setdiff(two, one)))
which gives,
Source: local data frame [2 x 5]
Groups: <by row>

# A tibble: 2 x 5
        one       two              same                   differs_1 differs_2
     <list>    <list>             <chr>                       <chr>     <chr>
1 <chr [3]> <chr [2]>         lightning Rainy and sunny, thundering  Overcast
2 <chr [2]> <chr [2]> dismal and dreary                  thundering  Overcast
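Note that toString() joins items with ", ", whereas the expected output in the question joins them with ";". A minimal sketch of the difference, using a made-up vector (parts is a hypothetical name standing in for one row's setdiff() result):

```r
# Hypothetical example vector, standing in for one row's setdiff() result
parts <- c("Rainy and sunny", "thundering")

comma_joined <- toString(parts)               # joins with ", "
semi_joined  <- paste(parts, collapse = ";")  # joins with ";"

print(comma_joined)
print(semi_joined)
```

If the ";" separator matters, swapping toString(...) for paste(..., collapse = ";") in the pipeline should be enough.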
First, you should use character columns, not factors (stringsAsFactors=TRUE was the default before R 4.0.0), i.e.:
one <- c("Rainy and sunny;thundering;lightning","dismal and dreary;thundering")
two <- c("Overcast;lightning","Overcast;dismal and dreary")
df2 <- data.frame(one,two, stringsAsFactors = FALSE)
You can use set operations here, namely intersect and setdiff. You can try them outside a function first, but wrapping them in one is handy.
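To try the set operations on their own first, here is a small sketch on the first row of the example data (the variable names a, b, common, only_a and only_b are just for illustration):

```r
# First row of the example data, split on ";"
a <- strsplit("Rainy and sunny;thundering;lightning", ";")[[1]]
b <- strsplit("Overcast;lightning", ";")[[1]]

common <- intersect(a, b)  # elements present in both vectors
only_a <- setdiff(a, b)    # elements of a that are not in b
only_b <- setdiff(b, a)    # elements of b that are not in a

print(common)
print(only_a)
print(only_b)
```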
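compare_strings <- function(x){
  # x is a one-row data.frame; split each column on ";"
  l <- sapply(x, strsplit, ";")
  list(one=x$one,
       two=x$two,
       # collapse every result so each component is a single string
       same=paste(intersect(l[[1]], l[[2]]), collapse=";"),
       different_Incol1ButNot2=paste(setdiff(l[[1]], l[[2]]), collapse=";"),
       different_Incol2ButNot1=paste(setdiff(l[[2]], l[[1]]), collapse=";")
  )
}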
Applied to a single row of your df2, it returns a named list with all the components you want.
> compare_strings(df2[1, ])
$one
[1] "Rainy and sunny;thundering;lightning"
$two
[1] "Overcast;lightning"
$same
[1] "lightning"
$different_Incol1ButNot2
[1] "Rainy and sunny;thundering"
$different_Incol2ButNot1
[1] "Overcast"
If we apply this to every row of your data.frame and rbind the resulting list of lists, we get the final table you want (strictly speaking a matrix rather than a data.frame):
do.call("rbind", lapply(seq_len(nrow(df2)), function(i) compare_strings(df2[i, ])))
one two
[1,] "Rainy and sunny;thundering;lightning" "Overcast;lightning"
[2,] "dismal and dreary;thundering" "Overcast;dismal and dreary"
same different_Incol1ButNot2 different_Incol2ButNot1
[1,] "lightning" "Rainy and sunny;thundering" "Overcast"
[2,] "dismal and dreary" "thundering" "Overcast"
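If you would rather end up with an actual data.frame, one possible alternative is to split both columns once and apply the set operations pairwise with mapply. This is only a sketch: join_op is a hypothetical helper name, not part of base R, and it assumes the columns are character rather than factor.

```r
one <- c("Rainy and sunny;thundering;lightning", "dismal and dreary;thundering")
two <- c("Overcast;lightning", "Overcast;dismal and dreary")
df2 <- data.frame(one, two, stringsAsFactors = FALSE)

# Split each column once; each result is a list with one character vector per row
s1 <- strsplit(df2$one, ";")
s2 <- strsplit(df2$two, ";")

# Hypothetical helper: apply a set operation pairwise and re-join with ";"
join_op <- function(op, x, y) {
  mapply(function(a, b) paste(op(a, b), collapse = ";"), x, y, USE.NAMES = FALSE)
}

df2$same <- join_op(intersect, s1, s2)
df2$different_Incol1ButNot2 <- join_op(setdiff, s1, s2)
df2$different_Incol2ButNot1 <- join_op(setdiff, s2, s1)

print(df2)
```

A row with nothing in common gets an empty string in the same column, because paste() on an empty vector with collapse returns "".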
Does this solve your problem?