I have two strings:
a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"
I am looking to get a count of common words between these strings.
The answer should be 3, with "Roy", "travels" and "Africa" being the common words.
This is what I tried:
stra <- as.data.frame(t(read.table(textConnection(a), sep = " ")))
strb <- as.data.frame(t(read.table(textConnection(b), sep = " ")))
Taking unique to avoid repeat counting
stra_unique <-as.data.frame(unique(stra$V1))
strb_unique <- as.data.frame(unique(strb$V1))
colnames(stra_unique) <- c("V1")
colnames(strb_unique) <- c("V1")
common_words <-length(merge(stra_unique,strb_unique, by = "V1")$V1)
I need to do this for two data sets with over 2000 and 1200 strings, so the comparison has to be evaluated 2000 x 1200 times. Is there a quick way to do this without using loops?
Approach: Count the frequencies of all the characters from both strings. Now, for every character if the frequency of this character in string s1 is freq1 and in string s2 is freq2 then total valid pairs with this character will be min(freq1, freq2). The sum of this value for all the characters is the required answer.
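As a rough illustration of the frequency-based approach just described, here is a minimal base R sketch. The function name count_common_chars and the sample inputs are made up for this example and do not come from the original post.

```r
# Hypothetical helper: sum of min(freq1, freq2) over characters present in both strings
count_common_chars <- function(s1, s2) {
  # Character frequencies for each string
  f1 <- table(strsplit(s1, "")[[1]])
  f2 <- table(strsplit(s2, "")[[1]])
  # Only characters that occur in both strings can contribute
  shared <- intersect(names(f1), names(f2))
  sum(pmin(as.vector(f1[shared]), as.vector(f2[shared])))
}

n_common <- count_common_chars("abcc", "bccd")
n_none <- count_common_chars("abc", "xyz")
```

Note that this counts matching characters, not words, so it answers a different question from the word intersection asked about above.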
Approach: First, split the string on spaces into a list of words. Then set a variable count = 0, loop over the words from index 0 to the end of the list, and increment count by 1 each time the current word equals the word being searched for.
You can use strsplit and intersect from the base library:
> a <- "Roy lives in Japan and travels to Africa"
> b <- "Roy travels Africa with this wife"
> a_split <- unlist(strsplit(a, split = " "))
> b_split <- unlist(strsplit(b, split = " "))
> length(intersect(a_split, b_split))
[1] 3
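For the 2000 x 1200 case from the question, the same strsplit and intersect idea can be applied to two character vectors at once. The sketch below is only one possible way to do it; set_a, set_b and their contents are hypothetical stand-ins for the real data.

```r
# Hypothetical inputs; the real data would hold ~2000 and ~1200 strings
set_a <- c("Roy lives in Japan and travels to Africa", "Roy has a wife")
set_b <- c("Roy travels Africa with this wife", "Japan is far", "nothing shared here")

# Split every string once and keep only the distinct words per string
words_a <- lapply(strsplit(set_a, " ", fixed = TRUE), unique)
words_b <- lapply(strsplit(set_b, " ", fixed = TRUE), unique)

# counts[i, j] = number of distinct words shared by set_a[i] and set_b[j]
counts <- sapply(words_b, function(wb) {
  vapply(words_a, function(wa) length(intersect(wa, wb)), integer(1))
})
```

The sapply/vapply calls still iterate internally, but each string is split only once. If set_a has a single element, sapply returns a plain vector rather than a matrix.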
Perhaps using intersect and str_extract_all from stringr. For multiple strings, you can put them either in a list or in a vector:
library(stringr)
vec1 <- c(a, b)
Reduce(`intersect`, str_extract_all(vec1, "\\w+"))
#[1] "Roy" "travels" "Africa"
For faster options, consider stringi:
library(stringi)
Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+"))
#[1] "Roy" "travels" "Africa"
For counting:
length(Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+")))
#[1] 3
Or using base R
Reduce(`intersect`,regmatches(vec1,gregexpr("\\w+", vec1)))
#[1] "Roy" "travels" "Africa"
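If all pairwise counts between two sets of strings are needed, one option that avoids calling intersect for every pair is to build a word incidence matrix for each set and multiply them. This is a sketch under assumptions: set_a, set_b, vocab and the incidence helper are hypothetical names, and words are split on single spaces only.

```r
# Hypothetical inputs standing in for the two large sets of strings
set_a <- c("Roy lives in Japan and travels to Africa", "Roy has a wife")
set_b <- c("Roy travels Africa with this wife", "Japan is far", "nothing shared here")

words_a <- strsplit(set_a, " ", fixed = TRUE)
words_b <- strsplit(set_b, " ", fixed = TRUE)

# Every distinct word seen in either set
vocab <- unique(unlist(c(words_a, words_b)))

# One row per vocabulary word, one column per string; 1 if the word occurs in it
incidence <- function(words) {
  vapply(words, function(w) as.numeric(vocab %in% w), numeric(length(vocab)))
}

# crossprod(A, B) is t(A) %*% B, so entry [i, j] counts words shared by
# set_a[i] and set_b[j]
counts_fast <- crossprod(incidence(words_a), incidence(words_b))
```

Because the incidence matrices only record presence, repeated words within a string are counted once, which matches the behaviour of intersect.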