Count common words in two strings

Tags: string, r, data-analysis, text-mining

I have two strings:

a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"

I am looking to get a count of common words between these strings.

The answer should be 3, the common words being:

"Roy"
"travels"
"Africa"

This is what I tried:

stra <- as.data.frame(t(read.table(textConnection(a), sep = " ")))
strb <- as.data.frame(t(read.table(textConnection(b), sep = " ")))

Taking unique to avoid repeat counting

stra_unique <- as.data.frame(unique(stra$V1))
strb_unique <- as.data.frame(unique(strb$V1))
colnames(stra_unique) <- c("V1")
colnames(strb_unique) <- c("V1")

common_words <- length(merge(stra_unique, strb_unique, by = "V1")$V1)

I need to do this for two data sets, one with over 2000 strings and one with over 1200 strings, so the comparison has to be evaluated 2000 x 1200 times. Is there a quick way to do this without using loops?

asked Sep 19 '14 09:09

Jaimik Jain

2 Answers

You can use strsplit and intersect from the base library:

> a <- "Roy lives in Japan and travels to Africa"
> b <- "Roy travels Africa with this wife"
> a_split <- unlist(strsplit(a, split = " "))
> b_split <- unlist(strsplit(b, split = " "))
> length(intersect(a_split, b_split))
[1] 3
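For the 2000 x 1200 case in the question, the same idea could be applied to every pair of strings. The following is only a minimal sketch in base R: the vectors `x` and `y` and the result name `counts` are hypothetical stand-ins for the real data, and `vapply` still iterates internally, so it hides the loop rather than removing it. Each string is split only once, which avoids repeating the split for every pair.

```r
# Hypothetical inputs standing in for the two real data sets
x <- c("Roy lives in Japan and travels to Africa",
       "Anna travels to Japan")
y <- c("Roy travels Africa with this wife",
       "Japan is far from Africa",
       "Nothing shared here")

# Split each string once and drop repeated words within a string
x_words <- lapply(strsplit(x, " ", fixed = TRUE), unique)
y_words <- lapply(strsplit(y, " ", fixed = TRUE), unique)

# counts[i, j] is the number of words shared by x[i] and y[j]
counts <- vapply(
  y_words,
  function(yw) vapply(x_words, function(xw) sum(xw %in% yw), integer(1)),
  integer(length(x_words))
)
counts
```

With these toy inputs, `counts[1, 1]` is 3, matching the single-pair result above. Note that when `x` has length 1, `vapply` returns a plain vector instead of a matrix.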

answered Sep 25 '22 14:09

Alex Reynolds

Perhaps use intersect together with str_extract_all from the stringr package. For multiple strings, you can put them in either a list or a vector:

 library(stringr)
 vec1 <- c(a,b)
 Reduce(`intersect`,str_extract_all(vec1, "\\w+"))
 #[1] "Roy"     "travels" "Africa"

For faster options, consider stringi

 library(stringi)
 Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+"))
 #[1] "Roy"     "travels" "Africa"

For counting:

 length(Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+")))
 #[1] 3

Or using base R

  Reduce(`intersect`,regmatches(vec1,gregexpr("\\w+", vec1)))
  #[1] "Roy"     "travels" "Africa"
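One difference from splitting on spaces is worth illustrating: the `\\w+` pattern drops punctuation attached to a word, and matching is still case-sensitive unless the tokens are lowercased first. The helper below is a hedged sketch, not part of the original answer; the function name `count_common` and its `ignore_case` argument are made up for illustration.

```r
a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"

# Hypothetical helper: count distinct words shared by two strings
count_common <- function(a, b, ignore_case = FALSE) {
  s <- c(a, b)
  # Extract runs of word characters, so "Africa." yields "Africa"
  tokens <- regmatches(s, gregexpr("\\w+", s))
  # Optionally fold case so "Africa" and "africa" count as the same word
  if (ignore_case) tokens <- lapply(tokens, tolower)
  # intersect() already discards duplicates within each vector
  length(intersect(tokens[[1]], tokens[[2]]))
}

count_common(a, b)
#[1] 3
count_common("Roy travels to Africa.", "africa, with Roy")
#[1] 1
count_common("Roy travels to Africa.", "africa, with Roy", ignore_case = TRUE)
#[1] 2
```

Keep in mind that `Reduce` with `intersect` over a longer vector returns the words common to all of its strings, which is not the same as a pairwise comparison.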

answered Sep 23 '22 14:09

akrun
