Removing duplicate words in a string in R

Question

I'm posting this to help someone who voluntarily removed their own question after being asked to show the code they had tried, among other comments. Let's assume they tried something like this:

str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')

and wanted to learn a better way. So what is the best way to remove duplicate words from a string?

cdeterman · Accepted Answer

If you are still interested in alternative solutions, you can use unique, which slightly simplifies your code:

paste(unique(d), collapse = ' ')
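One further reason to prefer unique over the -which(duplicated(d)) form: when the vector has no duplicates, which() returns an empty index, and negating an empty index selects nothing instead of everything. The sketch below illustrates this with two made-up vectors (d and e here are example data, not taken from the question):

```r
# Example vectors (hypothetical data for illustration)
d <- c("try", "and", "try", "again")   # contains a duplicate
e <- c("no", "repeats", "here")        # contains no duplicates

# unique() and logical indexing with !duplicated() give the same result
res_unique  <- unique(d)
res_logical <- d[!duplicated(d)]

# Pitfall: which(duplicated(e)) is integer(0), and e[-integer(0)] is the
# same as e[integer(0)], so every element is dropped
res_pitfall <- e[-which(duplicated(e))]

# unique() leaves a duplicate-free vector untouched
res_safe <- unique(e)

print(res_unique)
print(res_pitfall)
print(res_safe)
```

If you do want to index explicitly, d[!duplicated(d)] avoids the problem because a logical index of all TRUE keeps every element.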

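As per the comment by Thomas, you probably also want to remove punctuation. R's gsub supports predefined POSIX character classes such as [[:punct:]], which you can use instead of spelling out every character in a regex. Of course, you can always list specific characters if you want finer control. Apply this to d before calling unique, so that tokens such as "code" and "code?" count as the same word.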

d <- gsub("[[:punct:]]", "", d)
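Putting the pieces together, here is a minimal sketch of the whole pipeline as one function. The name dedupe_words is hypothetical, and splitting on runs of whitespace (rather than a single space) is my own assumption to make it tolerant of double spaces:

```r
# Hypothetical helper: strip punctuation, split into words, drop repeats
dedupe_words <- function(x) {
  # Remove punctuation first so "code" and "code?" compare equal
  x <- gsub("[[:punct:]]", "", x)
  # Split on one or more whitespace characters; trimws() avoids an
  # empty first token when the string starts with a space
  words <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  # Keep the first occurrence of each word, in order
  paste(unique(words), collapse = " ")
}

sentence <- "How do I best try and try and try and find a way to to improve this code?"
result <- dedupe_words(sentence)
print(result)
```

Note that unique drops every later repeat of a word, not only immediately adjacent ones, so a word that legitimately appears twice in different parts of a sentence will also be removed.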

Ashutosh Singh · Answer

To remove duplicate words while leaving any special characters in place, use this function:

rem_dup_word <- function(x) {
  x <- tolower(x)
  paste(unique(trimws(unlist(strsplit(x, split = " ", fixed = FALSE, perl = TRUE)))), collapse = " ")
}

Input data:

duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"

rem_dup_word(duptest)

output: samsung wa80e5lec top loading with diamond drum, 6 kg (silver)

Because the input is lowercased first, it treats "Samsung" and "SAMSUNG" as duplicates, and the result is returned in lowercase.
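If you would rather keep the original capitalisation in the result, one option is to compare the words in lowercase but return them as written. This is only a sketch of a variant, not part of the function above; the name rem_dup_word_keep_case is made up for illustration:

```r
# Hypothetical variant: case-insensitive comparison, original casing kept
rem_dup_word_keep_case <- function(x) {
  # Split on runs of whitespace after trimming the ends
  words <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  # duplicated() runs on the lowercased copy, but we index the original
  paste(words[!duplicated(tolower(words))], collapse = " ")
}

duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"
kept <- rem_dup_word_keep_case(duptest)
print(kept)
```

The first spelling encountered wins, so "Samsung" is kept and the later "samsung" is dropped.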


Tags: r, duplicates

andrekos

