Remove certain words in string from column in dataframe in R

Tags:

r

I have a dataset in R that lists company names, and I want to remove words like "Inc", "Company", "LLC", etc. as part of a clean-up effort. I have the following sample data:

sampleData

  Location             Company
1 New York, NY         XYZ Company
2 Chicago, IL          Consulting Firm LLC
3 Miami, FL            Smith & Co.

Words I do not want to include in my output:

stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")

I built the following function to break out each word, remove the stopwords, and then bring the words back together, but it is not iterating through each row of the dataset.

removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}

removeWords(sampleData$Company,stopwords)

The output for the above function looks like this:

[1] "XYZ Company Consulting Firm Smith"

The output should be:

  Location             Company
1 New York, NY         XYZ Company
2 Chicago, IL          Consulting Firm
3 Miami, FL            Smith

Any help would be appreciated.

Shannon asked Dec 01 '16 01:12


1 Answer

We can use the 'tm' package. Its removeWords() function is vectorized, so it processes each company name separately:

library(tm)

stopwords <- readLines('stopwords.txt')   # Your stop words file, one word per line
x <- as.character(df$company)             # Company column data (coerced in case it is a factor)
x <- removeWords(x, stopwords)            # Remove stopwords from each element

df$company_new <- x                       # Add the result as a new column and check
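
If you would rather stay in base R, the function from the question can be adapted. It collapses everything into one string because strsplit() returns a list with one element per row, and unlist() flattens that list before paste(..., collapse = " ") joins it. A minimal sketch, assuming the Company column is character and using a hypothetical helper name removeWordsByRow, is to filter and paste each list element on its own:

```r
# Sample data reconstructed from the question (assumed to be character columns)
sampleData <- data.frame(
  Location = c("New York, NY", "Chicago, IL", "Miami, FL"),
  Company  = c("XYZ Company", "Consulting Firm LLC", "Smith & Co."),
  stringsAsFactors = FALSE
)

stopwords <- c("Inc", "inc", "co", "Co", "Inc.", "Co.", "LLC",
               "Corporation", "Corp", "&")

# Hypothetical helper (not from the original post): split each element
# separately, drop exact stop-word matches, and paste the rest back per row.
removeWordsByRow <- function(str, stopwords) {
  words <- strsplit(as.character(str), " ", fixed = TRUE)
  vapply(words,
         function(w) paste(w[!w %in% stopwords], collapse = " "),
         character(1))
}

sampleData$Company <- removeWordsByRow(sampleData$Company, stopwords)
sampleData
```

This matches whole space-separated tokens exactly, so "&" and "Co." are dropped as listed. As far as I know, tm's removeWords() works through a word-boundary regular expression instead, so punctuation-only entries such as "&" may not be removed and stray spaces can be left behind; check the result on your own data.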
Hardik Gupta answered Nov 10 '22 04:11