Identify differences in text paragraphs with R

Question

I would like to use R to compare written text and extract sections which differ between the elements.

Consider two text paragraphs, a and b. One is a modified version of the other:

a <- "This part is the same. This part is old."
b <- "This string is updated. This part is the same."

I want to compare the two strings and receive the part of the string which is unique to either of the two as output, preferably separate for both input strings.

Expected output:

stringdiff <- list(a = " This part is old.", b = "This string is updated. ")

> stringdiff
$a
[1] " This part is old."

$b
[1] "This string is updated. "

I've tried a solution from "Extract characters that differ between two strings", but it only compares unique characters. The answer in "Simple Comparing of two texts in R" comes closer, but still only compares unique words.

Is there any way to get the expected output without too much of a hassle?

akrun · Accepted Answer

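We combine both strings into a named vector and split each one at the whitespace that follows a . to create a list of sentences ('lst'). Unlisting 'lst' and taking the unique elements gives every distinct sentence across both strings ('un1'). Using setdiff with x = un1, we then get, for each string, the sentences in 'un1' that do not occur in that string.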

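# Split each string into sentences at the whitespace following a period
lst <- strsplit(c(a = a, b = b), "(?<=[.])\\s", perl = TRUE)
un1 <- unique(unlist(lst))
# For each string, the sentences in 'un1' that it does not contain
lapply(lst, setdiff, x = un1)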
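
Note that the call above passes un1 as x, so each list element holds the sentences missing from that string, not the ones unique to it. If the layout in the question's expected output is what is wanted, a minimal sketch could compare the two sentence vectors directly. This is not part of the original answer; it reuses the same sentence split and assumes every sentence ends in a period followed by whitespace:

```r
a <- "This part is the same. This part is old."
b <- "This string is updated. This part is the same."

# Split each paragraph into sentences at whitespace that follows a period.
lst <- strsplit(c(a = a, b = b), "(?<=[.])\\s", perl = TRUE)

# For each string, keep the sentences that do not occur in the other string.
stringdiff <- list(
  a = setdiff(lst$a, lst$b),
  b = setdiff(lst$b, lst$a)
)

stringdiff
# $a
# [1] "This part is old."
#
# $b
# [1] "This string is updated."
```

The separating space is consumed by the split, so the results lack the leading and trailing spaces shown in the expected output. Sentences that end in other punctuation, such as ? or !, would need a wider character class in the pattern.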

Tags: string, comparison, r
