I want to compare two texts to similarity, therefore i need a simple function to list clearly and chronologically the words and phrases occurring in both texts. these words/sentences should be highlighted or underlined for better visualization)
on the base of @joris Meys ideas, i added an array to divide text into sentences and subordinate sentences.
this is how it looks like:
textparts <- function (text){
textparts <- c("\\,", "\\.")
i <- 1
while(i<=length(textparts)){
text <- unlist(strsplit(text, textparts[i]))
i <- i+1
}
return (text)
}
textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")
commonWords <- intersect(textparts1, textparts2)
commonWords <- paste("\\<(",commonWords,")\\>",sep="")
for(x in commonWords){
textparts1 <- gsub(x, "\\1*", textparts1,ignore.case=TRUE)
textparts2 <- gsub(x, "\\1*", textparts2,ignore.case=TRUE)
}
return(list(textparts1,textparts2))
However, sometimes it works, sometimes it doesn't.
I WOULD like to have results like these:
> return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence" " whereas this is a dependent clause*" " This thing works*"
[[2]]
[1] "This could be a sentence" " whereas this is a dependent clause*" " Plagiarism is not cool" " This thing works*"
whereas i get none results.
When comparing texts, consider both what they have in common and what is different about them. If they have the same purpose: Do they use similar techniques? For example, two newspaper articles could use exaggeration to present completely different viewpoints of the same topic.
There are some problems with the answer of @Chase :
Based on his idea, there is the following solution that makes use of tolower()
and some nice functionalities of regular expressions :
compareSentences <- function(sentence1, sentence2) {
# split everything on "not a word" and put all to lowercase
x1 <- tolower(unlist(strsplit(sentence1, "\\W")))
x2 <- tolower(unlist(strsplit(sentence2, "\\W")))
commonWords <- intersect(x1, x2)
#add word beginning and ending and put words between ()
# to allow for match referencing in gsub
commonWords <- paste("\\<(",commonWords,")\\>",sep="")
for(x in commonWords){
# replace the match by the match with star added
sentence1 <- gsub(x, "\\1*", sentence1,ignore.case=TRUE)
sentence2 <- gsub(x, "\\1*", sentence2,ignore.case=TRUE)
}
return(list(sentence1,sentence2))
}
This gives following result :
text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "
compareSentences(text1,text2)
[[1]]
[1] "This* is* a* test*. Weather* is* fine*"
[[2]]
[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With