I have a dataset in R that lists company names, and I want to remove words like "Inc", "Company", "LLC", etc. as part of a clean-up effort. I have the following sample data:
sampleData
      Location             Company
1 New York, NY         XYZ Company
2  Chicago, IL Consulting Firm LLC
3    Miami, FL         Smith & Co.
Words I do not want to include in my output:
stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")
I built the following function to break out each word, remove the stopwords, and then bring the words back together, but it is not iterating through each row of the dataset.
removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}
removeWords(sampleData$Company,stopwords)
The output for the above function looks like this:
[1] "XYZ Company Consulting Firm Smith"
The output should be:
      Location         Company
1 New York, NY     XYZ Company
2  Chicago, IL Consulting Firm
3    Miami, FL           Smith
Any help would be appreciated.
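We can use the 'tm' package. Its removeWords() function is vectorized, so it processes each element of the column separately instead of collapsing everything into one string:
library(tm)
stopwords = readLines('stopwords.txt') # Your stop words file, one word per line
x = df$company # Company column data (a character vector)
x = removeWords(x, stopwords) # Remove stop words from each element
df$company_new <- x # Add the cleaned vector as a new column and check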
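If you would rather stay in base R, the function from the question can be adapted to work row by row. The original collapses everything because unlist() flattens the list returned by strsplit() into a single vector of words. The following is only a minimal sketch: the name removeWordsByRow is made up for illustration (it also avoids clashing with tm's removeWords), and sampleData is rebuilt here from the sample shown in the question.

```r
# Rebuild the sample data from the question (assumed to be character columns).
sampleData <- data.frame(
  Location = c("New York, NY", "Chicago, IL", "Miami, FL"),
  Company  = c("XYZ Company", "Consulting Firm LLC", "Smith & Co."),
  stringsAsFactors = FALSE
)

stopwords <- c("Inc", "inc", "co", "Co", "Inc.", "Co.", "LLC",
               "Corporation", "Corp", "&")

# Hypothetical helper: like the question's function, but it keeps the list
# from strsplit() intact and rebuilds one string per input element.
removeWordsByRow <- function(str, stopwords) {
  words <- strsplit(str, " ", fixed = TRUE)  # one character vector per row
  vapply(
    words,
    function(w) paste(w[!w %in% stopwords], collapse = " "),
    character(1)
  )
}

sampleData$Company <- removeWordsByRow(sampleData$Company, stopwords)
sampleData
```

Because the match is on whole space-separated tokens, "Co." and "&" are dropped exactly as listed in stopwords, while "Company" stays since it is not in the list.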