
Removing stopwords from a character string using R

Tags: r, gsub

Consider the following string:

str_input <- c("Mellanox,Asia, China, India, JAVA, United States, APIs")

I used the following gsub call to remove my specific stopwords:

gsub(paste0("\\b(",paste(location_sw, collapse="|"),")\\b"), "", str_input)

where location_sw is my vector of stopwords:

location_sw <- c('Rose', 'Java', 'JAVA', 'Mellanox', 'Microsoft', '144GiB', 'West',
                 'Amazon', 'Channel Asia', 'jClarity', 'APIs')

With the gsub call above, I get the following output:

",Asia, China, India, , United States, "

However, I would like the following result:

"Asia, China, India, United States"

I would like to remove the commas left behind after the stopwords are removed. Any input would be really helpful.

asked Jan 26 '23 22:01 by JBH


1 Answer

Another approach is to split the string into a character vector with strsplit and then take the setdiff with respect to location_sw:

out <- setdiff(strsplit(str_input, split = ",\\s*")[[1]], location_sw)
out
#> [1] "Asia"          "China"         "India"         "United States"
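
One caveat: setdiff() treats its arguments as sets, so it also drops repeated elements from the result. If duplicates in the input should be kept, a minimal sketch using %in% might look like the following. The inputs are restated so the snippet runs on its own, and the names parts and kept are only illustrative:

```r
str_input <- c("Mellanox,Asia, China, India, JAVA, United States, APIs")
location_sw <- c('Rose', 'Java', 'JAVA', 'Mellanox', 'Microsoft', '144GiB', 'West',
                 'Amazon', 'Channel Asia', 'jClarity', 'APIs')

# Split on a comma followed by optional whitespace
parts <- strsplit(str_input, split = ",\\s*")[[1]]

# Keep every element that is not a stopword; unlike setdiff(),
# repeated values in parts are preserved
kept <- parts[!(parts %in% location_sw)]
kept
#> [1] "Asia"          "China"         "India"         "United States"
```

Both variants compare whole comma-separated elements exactly, so matching is case-sensitive and a stopword such as 'Channel Asia' does not remove a plain "Asia" entry.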

If necessary, we can paste it back into a single string:

paste(out, collapse = ", ")
#> [1] "Asia, China, India, United States"
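
For comparison, the gsub approach from the question can also be made to work by cleaning up the leftover separators in a second pass. This is only a sketch under the assumption that none of the stopwords contain regex metacharacters (true for the list in the question, which is restated here); the names pattern, removed, collapsed and cleaned are illustrative:

```r
str_input <- c("Mellanox,Asia, China, India, JAVA, United States, APIs")
location_sw <- c('Rose', 'Java', 'JAVA', 'Mellanox', 'Microsoft', '144GiB', 'West',
                 'Amazon', 'Channel Asia', 'jClarity', 'APIs')

# Same pattern as in the question: any stopword as a whole word
pattern <- paste0("\\b(", paste(location_sw, collapse = "|"), ")\\b")
removed <- gsub(pattern, "", str_input)
removed
#> [1] ",Asia, China, India, , United States, "

# Collapse each run of commas and whitespace into a single ", " separator
collapsed <- gsub("\\s*,[\\s,]*", ", ", removed, perl = TRUE)

# Trim separators left at the start or end of the string
cleaned <- gsub("^[,\\s]+|[,\\s]+$", "", collapsed, perl = TRUE)
cleaned
#> [1] "Asia, China, India, United States"
```

Unlike the split-based approach, this keeps everything in one string, but it removes a stopword wherever it appears as a whole word, including inside a longer entry.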
answered Feb 03 '23 15:02 by Joris C.