Simple Comparing of two texts in R

Tags: comparison, r

I want to compare two texts for similarity, so I need a simple function that lists, clearly and in order of occurrence, the words and phrases that occur in both texts. These words/sentences should be highlighted or underlined for better visualization.

Building on @Joris Meys' ideas, I added a vector of delimiters to split the text into sentences and subordinate clauses.

This is how it looks:

textparts <- function (text){
  textparts <- c("\\,", "\\.")
  i <- 1
  while(i<=length(textparts)){
    text <- unlist(strsplit(text, textparts[i]))
    i <- i+1
  }
  return (text)
}

textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")

commonWords <- intersect(textparts1, textparts2)
commonWords <- paste("\\<(",commonWords,")\\>",sep="")

for(x in commonWords){
  textparts1 <- gsub(x, "\\1*", textparts1,ignore.case=TRUE)
  textparts2 <- gsub(x, "\\1*", textparts2,ignore.case=TRUE)
}
return(list(textparts1,textparts2))

However, sometimes it works, sometimes it doesn't.

I would like to get results like these:

>   return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence"         " whereas this is a dependent clause*" " This thing works*"                  

[[2]]
[1] "This could be a sentence"            " whereas this is a dependent clause*" " Plagiarism is not cool"             " This thing works*"

whereas I get no results at all.

asked May 26 '11 09:05

digitalaxp

1 Answer

There are some problems with @Chase's answer:

Differences in capitalization are not taken into account.
Punctuation can mess up the results.
If more than one word is shared, you get a lot of warnings from the gsub call.
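
The first two points can be illustrated with a small sketch. @Chase's code is not reproduced here, so this only assumes a naive approach that splits on single spaces and calls intersect() directly; the variable names are made up for the illustration.

```r
text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine."

# Assumed naive approach: split on single spaces only
w1 <- unlist(strsplit(text1, " "))
w2 <- unlist(strsplit(text2, " "))
naive <- intersect(w1, w2)
# "Weather" vs "weather" and "fine" vs "fine." are not recognised as shared
print(naive)

# Split on runs of non-word characters and lowercase everything first
n1 <- tolower(unlist(strsplit(text1, "\\W+")))
n2 <- tolower(unlist(strsplit(text2, "\\W+")))
normalized <- intersect(n1, n2)
print(normalized)
```

With the normalized vectors, "weather" and "fine" show up as shared words, which the naive split misses.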

Based on his idea, the following solution makes use of tolower() and some handy features of regular expressions:

compareSentences <- function(sentence1, sentence2) {
  # split on non-word characters and convert everything to lowercase
  x1 <- tolower(unlist(strsplit(sentence1, "\\W")))
  x2 <- tolower(unlist(strsplit(sentence2, "\\W")))

  commonWords <- intersect(x1, x2)
  # wrap each word in word-boundary anchors and put it between ()
  # so that gsub can refer back to the match as \\1
  commonWords <- paste("\\<(", commonWords, ")\\>", sep = "")

  for (x in commonWords) {
    # replace each match by itself with a star appended
    sentence1 <- gsub(x, "\\1*", sentence1, ignore.case = TRUE)
    sentence2 <- gsub(x, "\\1*", sentence2, ignore.case = TRUE)
  }
  return(list(sentence1, sentence2))
}
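
To see what the pattern does in isolation, here is a small sketch for the single word "is". The test string is made up for the illustration; the pattern is built the same way as inside the function.

```r
# Pattern for one word: word start, the word as a capture group, word end
pattern <- paste("\\<(", "is", ")\\>", sep = "")

# "\\1" re-inserts the matched text in its original case, followed by a star.
# "This" and "island" are left alone because "is" is not a whole word there.
marked <- gsub(pattern, "\\1*", "This IS an island, is it?", ignore.case = TRUE)
print(marked)
# "This IS* an island, is* it?"
```

The word-boundary anchors are what keep partial matches such as the "is" in "This" from being starred.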

Calling compareSentences() gives the following result:

text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "

compareSentences(text1,text2)
[[1]]
[1] "This* is* a* test*. Weather* is* fine*"

[[2]]
[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "
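
For the sentence-level comparison in the question, one likely reason for the missing stars is that the fragments produced by strsplit keep their leading space, and a pattern such as "\\<( This thing works)\\>" cannot match because "\\<" needs a word character right after it. A minimal sketch that avoids this, assuming fragments should be compared after trimming whitespace and ignoring case (splitParts and markCommonParts are hypothetical names, not part of the code above):

```r
# Hypothetical helper: split a text into fragments at commas and periods,
# trim surrounding whitespace and drop empty fragments
splitParts <- function(text) {
  parts <- trimws(unlist(strsplit(text, "[,.]")))
  parts[nzchar(parts)]
}

# Hypothetical helper: append "*" to every fragment that occurs in both texts
markCommonParts <- function(text1, text2) {
  parts1 <- splitParts(text1)
  parts2 <- splitParts(text2)
  common <- intersect(tolower(parts1), tolower(parts2))
  mark <- function(parts) {
    shared <- tolower(parts) %in% common
    parts[shared] <- paste0(parts[shared], "*")
    parts
  }
  list(mark(parts1), mark(parts2))
}

res <- markCommonParts(
  "This is a complete sentence, whereas this is a dependent clause. This thing works.",
  "This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works."
)
print(res)
```

Because whole fragments are compared with %in% rather than through a regular expression, no word-boundary anchors or escaping are needed. The fragments come back without their leading spaces, which differs slightly from the output shown in the question.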

answered Oct 02 '22 04:10

Joris Meys
