
Identify differences in text paragraphs with R

I would like to use R to compare written text and extract sections which differ between the elements.

Consider two text paragraphs, a and b, where one is a modified version of the other:

a <- "This part is the same. This part is old."
b <- "This string is updated. This part is the same."

I want to compare the two strings and receive the part of the string which is unique to either of the two as output, preferably separate for both input strings.

Expected output:

stringdiff <- list(a = " This part is old.", b = "This string is updated. ")

> stringdiff
$a
[1] " This part is old."

$b
[1] "This string is updated. "

I've tried a solution from "Extract characters that differ between two strings", but this only compares unique characters. The answer in "Simple Comparing of two texts in R" comes closer, but still only compares unique words.
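
To illustrate the limitation, here is a rough sketch of a word-level comparison. It is an approximation written for this example, not the exact code from the linked answers, and the names `words_a`, `words_b`, `only_a` and `only_b` are illustrative:

```r
a <- "This part is the same. This part is old."
b <- "This string is updated. This part is the same."

# Split each paragraph into words at runs of whitespace
words_a <- strsplit(a, "\\s+")[[1]]
words_b <- strsplit(b, "\\s+")[[1]]

# Words that occur in one paragraph but not in the other
only_a <- setdiff(words_a, words_b)  # "old."
only_b <- setdiff(words_b, words_a)  # "string" "updated."
```

Shared words such as "This", "part" and "is" drop out, so the differing sentences cannot be recovered from the result.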

Is there any way to get the expected output without too much of a hassle?

LAP asked Nov 29 '17 09:11


1 Answer

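We combine both strings into a named vector and split each at the whitespace that follows a '.' to get a list of sentences ('lst'), then collect the unique sentences across both strings by unlisting 'lst' ('un1'). Calling setdiff with 'un1' as its first argument returns, for each string, the sentences in 'un1' that the string does not contain. With two inputs, this means the element named a holds the sentence found only in b, and vice versa.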

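# Split each string into sentences at the whitespace following a period
lst <- strsplit(c(a = a, b = b), "(?<=[.])\\s", perl = TRUE)
# All distinct sentences across both strings
un1 <- unique(unlist(lst))
# For each string, the sentences in un1 that it does not contain
lapply(lst, setdiff, x = un1)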
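
If the goal is the sentences unique to each input, as in the expected output of the question, one option is to drop the sentences shared by all inputs instead. This is a minimal sketch, assuming sentence-level granularity is acceptable and that the separating spaces need not be preserved; `common` and `stringdiff` are names chosen for this illustration:

```r
a <- "This part is the same. This part is old."
b <- "This string is updated. This part is the same."

# Split each string into sentences at the whitespace following a period
lst <- strsplit(c(a = a, b = b), "(?<=[.])\\s", perl = TRUE)

# Sentences that occur in every input
common <- Reduce(intersect, lst)

# For each input, keep only the sentences that are not shared
stringdiff <- lapply(lst, setdiff, y = common)
stringdiff
# $a
# [1] "This part is old."
#
# $b
# [1] "This string is updated."
```

With more than two inputs, `common` holds only the sentences present in all of them, so a sentence shared by some but not all inputs would still be reported for each input that contains it.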
akrun answered Nov 08 '22 11:11