
Is it possible to remove duplicate sentences / blocks of text in R?

I was wondering whether it is possible to remove duplicate sentences, or even duplicated blocks of text (a repeated set of sentences), from a data frame in R. In my specific case, imagine I have saved the posts of a forum but did not mark where a person quoted an earlier post, and I now want to remove all quotes from the cells containing the posts. Thanks for any tips or hints.

An example could look something like this, where duplicateposts is the data I have and postsnoduplicates is the result I want:

    names <- c("Richard", "Mortimer", "Elizabeth", "Jeremiah")
    posts <- c("I'm trying to find a solution for a problem with my neighbour, she keeps mowing the lawn on sundays when I'm trying to sleep in from my night shift", "Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out.", "Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out. That sounds quite aggressive. How about just talking to them in a friendly way, first?", "That sounds quite aggressive. How about just talking to them in a friendly way, first? Didn't mean to sound aggressive, rather meant just being straightforward, if that makes any sense")

    duplicateposts <- data.frame(names, posts)

    posts2 <- c("I'm trying to find a solution for a problem with my neighbour, she keeps mowing the lawn on sundays when I'm trying to sleep in from my night shift", "Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out.", "That sounds quite aggressive. How about just talking to them in a friendly way, first?", "Didn't mean to sound aggressive, rather meant just being straightforward, if that makes any sense")

    postsnoduplicates <- data.frame(names, posts2)
psyph asked Oct 23 '25 08:10



1 Answer

I think you need to strsplit at the sentence ends, drop the sentences that have already appeared, then paste what is left back together. Something like:

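    # Split each post after ., ? or ! whenever more text follows
    spl <- strsplit(as.character(duplicateposts$posts), "(?<=[.?!])(?=.)", perl=TRUE)
    # Strip the whitespace left at the start of each sentence
    spl <- lapply(spl, trimws)
    # One row per sentence (values), labelled with its poster (ind)
    spl <- stack(setNames(spl, duplicateposts$names))
    # Keep the first occurrence of each sentence, then rejoin per poster
    aggregate(values ~ ind, data=spl[!duplicated(spl$values),], FUN=paste, collapse=" ")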

Resulting in:

#        ind                                                                                                                                              values
#1   Richard I'm trying to find a solution for a problem with my neighbour, she keeps mowing the lawn on sundays when I'm trying to sleep in from my night shift
#2  Mortimer Personally, I like to deal with annoying neighbours by just straight up confronting them. Don't shy away. There are always ways to work things out.
#3 Elizabeth                                                              That sounds quite aggressive. How about just talking to them in a friendly way, first?
#4  Jeremiah                                                   Didn't mean to sound aggressive, rather meant just being straightforward, if that makes any sense
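
If the cleaned text should go back into the original data frame row by row, the same split / de-duplicate / paste idea can be wrapped in a small function. The following is only a sketch under assumptions: the helper name drop_quoted_sentences and the toy data frame are made up for illustration and are not part of the question or the answer above. It also keeps a row (as an empty string) when a post consists entirely of quoted sentences, whereas such a poster would not appear in the aggregate() result.

```r
# Toy data (hypothetical), shaped like the question's duplicateposts.
toy <- data.frame(
  names = c("A", "B", "C"),
  posts = c("First point. Second point!",
            "First point. Second point! Is that so?",
            "Is that so?"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop every sentence already seen in an earlier row.
drop_quoted_sentences <- function(posts) {
  # Split after ., ? or ! when more text follows (same pattern as above).
  sentences <- strsplit(as.character(posts), "(?<=[.?!])(?=.)", perl = TRUE)
  sentences <- lapply(sentences, trimws)

  # Flatten to one vector, remembering which row each sentence came from.
  row_id <- rep(seq_along(sentences), lengths(sentences))
  flat <- unlist(sentences)

  # TRUE for the first occurrence of each sentence, FALSE for repeats.
  keep <- !duplicated(flat)

  # Paste the surviving sentences back together, one string per original row.
  out <- rep("", length(sentences))
  kept <- tapply(flat[keep], row_id[keep], paste, collapse = " ")
  out[as.integer(names(kept))] <- kept
  out
}

toy$posts_clean <- drop_quoted_sentences(toy$posts)
print(toy[, c("names", "posts_clean")])
```

Like the answer's code, this treats any sentence that appeared earlier as a quote, so a short sentence that two people happen to write independently would also be removed from the later post.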
thelatemail answered Oct 24 '25 22:10
