
Removing duplicate words in a string in R

Tags: r, duplicates

This is to help someone who voluntarily removed their question after being asked, among other comments, to show the code they had tried. Let's assume they tried something like this:

str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')

and wanted to learn a better way. So what is the best way to remove duplicate words from a string?

asked Nov 29 '13 10:11 by andrekos


2 Answers

If you are still interested in alternative solutions, you can use unique, which simplifies your code slightly.

paste(unique(d), collapse = ' ')
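There is also a correctness reason to prefer unique (or logical indexing) over the -which(duplicated(d)) form from the question: when nothing is duplicated, which() returns an empty index and the negative subscript then selects nothing at all. A minimal sketch, using made-up word vectors (with_dups and no_dups are illustrative names, not from the question):

```r
# Hypothetical inputs for illustration only
with_dups <- c("try", "and", "try", "and", "find")
no_dups   <- c("find", "a", "way")

# Negative indexing via which() silently drops every element
# when there are no duplicates, because -integer(0) is integer(0)
res_which <- no_dups[-which(duplicated(no_dups))]

# Logical indexing and unique() behave correctly in both cases
res_logical <- no_dups[!duplicated(no_dups)]
res_unique  <- unique(with_dups)

print(res_which)    # character(0)
print(res_logical)  # "find" "a" "way"
print(res_unique)   # "try" "and" "find"
```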

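As per the comment by Thomas, you probably also want to remove punctuation. R's regular expressions support POSIX character classes such as [[:punct:]], so you do not have to list every punctuation character yourself. You can always write a more specific pattern if you need finer control. Strip the punctuation before calling unique, so that "code?" and "code" count as the same word.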

d <- gsub("[[:punct:]]", "", d)
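Putting the pieces together, the whole pipeline might look like the sketch below. The variable names str_in, words and result are only illustrative (str_in avoids masking base R's str function), and the nzchar filter is an added assumption to drop tokens that consisted solely of punctuation.

```r
str_in <- "How do I best try and try and try and find a way to to improve this code?"

# Split on single spaces, then strip punctuation from each token
words <- unlist(strsplit(str_in, split = " "))
words <- gsub("[[:punct:]]", "", words)

# Drop tokens that became empty after punctuation removal (assumption)
words <- words[nzchar(words)]

# Keep the first occurrence of each word and rebuild the sentence
result <- paste(unique(words), collapse = " ")
print(result)
# [1] "How do I best try and find a way to improve this code"
```

Note that this removes every repeated word, not only immediately adjacent repeats, so a word that legitimately appears twice in a sentence is also reduced to one occurrence.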
answered Oct 15 '22 18:10 by cdeterman


To remove duplicate words while leaving any special characters in place, use this function:

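rem_dup_word <- function(x) {
  # Lowercase first so the comparison is case-insensitive
  x <- tolower(x)
  # Split on spaces, trim stray whitespace, keep the first occurrence of each word
  words <- trimws(unlist(strsplit(x, split = " ", fixed = FALSE, perl = TRUE)))
  paste(unique(words), collapse = " ")
}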

Input data:

duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"

rem_dup_word(duptest)

output: samsung wa80e5lec top loading with diamond drum, 6 kg (silver)

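It will treat "Samsung", "samsung" and "SAMSUNG" as duplicates, because the input is converted to lowercase first. As a result, the output is entirely lowercase.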
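If the original capitalisation should survive, one option is to compare on the lowercased words but index into the original ones. This is a sketch of a variant, not part of the answer above; the name rem_dup_word_keep_case is hypothetical, and it assumes words are separated by single spaces.

```r
# Hypothetical variant: case-insensitive de-duplication that keeps
# the capitalisation of the first occurrence of each word
rem_dup_word_keep_case <- function(x) {
  words <- unlist(strsplit(x, split = " ", fixed = TRUE))
  words <- words[nzchar(words)]            # drop empty tokens from repeated spaces
  keep  <- !duplicated(tolower(words))     # compare in lowercase only
  paste(words[keep], collapse = " ")
}

duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"
kept <- rem_dup_word_keep_case(duptest)
print(kept)
# [1] "Samsung WA80E5LEC Top Loading with Diamond Drum, 6 kg (Silver)"
```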

answered Oct 15 '22 18:10 by Ashutosh Singh