
Removing irrelevant characters from a sentence

Tags:

r

I have the following sentence:

**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**

I would like to extract only those words that are defined as relevant: I, WANT, ONLY, THESE, WORDS, NEXT, STEP. All other characters (numeric, alpha, special) should be removed from the sentence.

In this case, the resulting sentence would be:

I WANT ONLY THESE.

I have thousands of lines like these and each has its own set of characters between the useful words. Is there an efficient way to get rid of these in R?

Ravi asked Nov 21 '25 17:11


2 Answers

string <- "**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**"
regmatches(string, gregexpr("I|WANT|ONLY|THESE|WORDS|NEXT|STEP", 
                            string))

[[1]]
[1] "I"     "WANT"  "ONLY"  "THESE"

EDIT: If you then want to convert back to a sentence, say I store the matches in an object called matched:

sentencify <- function(sentence){
  paste0(paste(sentence, collapse=" "), ".")
}

lapply(matched, sentencify)

[[1]]
[1] "I WANT ONLY THESE."
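
Since the question mentions thousands of lines, it may help to note that gregexpr and regmatches are vectorised, so the same idea can be applied to every line at once. The following is only a sketch under some assumptions: the lines are held in a character vector (the name lines and its second element are made up for illustration), and a relevant word should not be picked up when it sits inside a longer run of letters. The \\b word boundaries are an addition for that last point and are not part of the answer above.

```r
# Hypothetical input: one element per line of junk-padded text
lines <- c("**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**",
           "%%NEXT%%AIB##STEP$$")

# Relevant words, as listed in the question
words <- c("I", "WANT", "ONLY", "THESE", "WORDS", "NEXT", "STEP")

# Build the alternation; \\b stops "I" from matching inside junk such as "AIB"
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# gregexpr/regmatches handle the whole vector, giving one match set per line
matched <- regmatches(lines, gregexpr(pattern, lines, perl = TRUE))

# Collapse each match set back into a sentence
sentences <- vapply(matched,
                    function(m) paste0(paste(m, collapse = " "), "."),
                    character(1))
```

Here sentences should hold "I WANT ONLY THESE." and "NEXT STEP.". Note that \\b treats digits and underscores as word characters, so a word glued to a digit (for example "1WANT") would be skipped; whether that is desirable depends on the data.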
sebastian-c answered Nov 23 '25 08:11


Here is one approach, assuming you have a list to check against:

> mystring2 <- "**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**"
> mystring2
[1] "**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**"
> temp <- strsplit(mystring2, "[^a-zA-Z]")[[1]]
> myWords <- c("I", "WANT", "ONLY", "THESE", "WORDS", "NEXT", "STEP")
> temp[temp %in% myWords]
[1] "I"     "WANT"  "ONLY"  "THESE"
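
The same split-and-filter idea can be extended to many lines, because strsplit accepts a whole character vector and returns one set of pieces per element. This is a rough sketch, assuming the input sits in a character vector; the name lines and its second element are hypothetical and only there for illustration.

```r
# Hypothetical input: one element per line of junk-padded text
lines <- c("**I**%%AABB%&&**WANT**%%AO%**ONLY**%RA%$**THESE**",
           "%%NEXT%%AIB##STEP$$")

# Word list to check against, as in the question
myWords <- c("I", "WANT", "ONLY", "THESE", "WORDS", "NEXT", "STEP")

# Split every line on runs of non-letters; the result is a list,
# one character vector of pieces per input line
pieces <- strsplit(lines, "[^a-zA-Z]+")

# Keep only the pieces that appear in the word list
# (empty strings and junk such as "AABB" drop out here)
kept <- lapply(pieces, function(p) p[p %in% myWords])
```

Because each piece is a complete run of letters, a relevant word embedded in a longer run (the "I" in "AIB") is not kept, which may or may not be what the data calls for.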
A5C1D2H2I1M1N2O1R2T1 answered Nov 23 '25 07:11


