
R remove multiple text strings in data frame

Tags: r, keyword, gsub

New to R. I am looking to remove certain words from a data frame. Since there are multiple words, I would like to define the list of words as a character vector and use gsub to remove them, then convert the result back to a data frame with the same structure.

wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

a
id                text time      username          
 1     "ai and x"        10     "me"          
 2     "and computing"   5      "you"         
 3     "nothing"         15     "everyone"     
 4     "ibm privacy"     0      "know"        

I was thinking something like:

a2 <- apply(a, 1, gsub(wordstoremove, "", a))

but clearly this doesn't work. I would then convert the result back to a data frame.

lmcshane asked Jul 09 '14 04:07


2 Answers

wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

(dat <- read.table(header = TRUE, text = 'id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"'))

#   id          text time username
# 1  1      ai and x   10       me
# 2  2 and computing    5      you
# 3  3       nothing   15 everyone
# 4  4   ibm privacy    0     know

# Join the words into one regex ("ai|computing|...") and remove matches from every column
(dat1 <- as.data.frame(sapply(dat, function(x) 
  gsub(paste(wordstoremove, collapse = '|'), '', x))))

#   id    text time username
# 1  1   and x   10       me
# 2  2    and     5      you
# 3  3 nothing   15 everyone
# 4  4            0     know
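
One caveat with the sapply() version above: sapply() simplifies its result to a character matrix, so id and time come back as text rather than numbers. A minimal sketch that applies gsub() only to the character columns, assuming text and username were read as character (the read.table() default since R 4.0.0; the data frame is rebuilt here so the block runs on its own). The names pattern, is_chr and dat2 are illustrative, not from the answer:

```r
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

# Same data as above, rebuilt so this block is self-contained
dat <- data.frame(
  id = 1:4,
  text = c("ai and x", "and computing", "nothing", "ibm privacy"),
  time = c(10, 5, 15, 0),
  username = c("me", "you", "everyone", "know"),
  stringsAsFactors = FALSE
)

pattern <- paste(wordstoremove, collapse = "|")

# Flag the character columns, then run gsub() on those columns only
dat2 <- dat
is_chr <- vapply(dat2, is.character, logical(1))
dat2[is_chr] <- lapply(dat2[is_chr], function(x) gsub(pattern, "", x))

str(dat2)  # id and time keep their numeric types
```

The removed words still leave their surrounding spaces behind, exactly as in the output above.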
rawr answered Nov 14 '22 23:11


Another option using dplyr::mutate() and stringr::str_remove_all():

library(dplyr)
library(stringr)

dat <- dat %>%
  mutate(text = str_remove_all(
    text,
    regex(str_c("\\b", wordstoremove, "\\b", collapse = "|"), ignore_case = TRUE)
  ))

Because lowercase 'ai' could easily be part of a longer word, each word to remove is wrapped in \\b (a word boundary) so that it only matches as a whole word and is not removed from the beginning, middle, or end of other words.

The search pattern is also wrapped in regex(pattern, ignore_case = TRUE) in case some words are capitalized in the text string.
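
To see what the \\b boundaries and the case-insensitive flag each change, here is a small base R comparison. It uses gsub() rather than stringr so it runs without extra packages; the test strings in x and the variable names are made up for illustration:

```r
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
x <- c("ai and x", "rain in Spain", "IBM Privacy")  # hypothetical test strings

plain   <- paste(wordstoremove, collapse = "|")
bounded <- paste0("\\b", wordstoremove, "\\b", collapse = "|")

# No boundaries: "ai" is also stripped out of "rain" and "Spain"
res_plain <- gsub(plain, "", x)
# " and x"  "rn in Spn"  "IBM Privacy"

# Whole words only, still case sensitive: "IBM Privacy" is untouched
res_bounded <- gsub(bounded, "", x)
# " and x"  "rain in Spain"  "IBM Privacy"

# Whole words only, ignoring case: "IBM" and "Privacy" are removed too
res_nocase <- gsub(bounded, "", x, ignore.case = TRUE)
# " and x"  "rain in Spain"  " "
```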

str_replace_all() could be used if you wanted to replace the words with something other than just removing them. str_remove_all() is just an alias for str_replace_all(string, pattern, '').
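
As a rough illustration of that point, the sketch below swaps each matched word for a placeholder and checks that str_remove_all() gives the same result as str_replace_all() with an empty replacement. It assumes an installed stringr version that provides str_remove_all(); the "[removed]" placeholder and the names txt, pattern, masked, removed_a and removed_b are made up:

```r
library(stringr)

wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
txt <- c("ai and x", "and computing", "nothing", "ibm privacy")

# Whole-word, case-insensitive pattern, built the same way as in the answer
pattern <- regex(str_c("\\b", wordstoremove, "\\b", collapse = "|"),
                 ignore_case = TRUE)

# Replace each matched word with a visible placeholder instead of deleting it
masked <- str_replace_all(txt, pattern, "[removed]")

# Removing is the same as replacing with the empty string
removed_a <- str_remove_all(txt, pattern)
removed_b <- str_replace_all(txt, pattern, "")

print(masked)
print(identical(removed_a, removed_b))
```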

rawr's answer could be updated to:

dat1 <- as.data.frame(sapply(dat, function(x) 
  gsub(paste0('\\b', wordstoremove, '\\b', collapse = '|'), '', x, ignore.case = TRUE)))
sbha answered Nov 14 '22 21:11