I am trying to remove all one- or two-letter words from a passage of text. This was my first attempt:
gsub(" [a-zA-Z]{1,2} ", " ", "a ab abc B BB BBB")
[1] "a abc BB BBB"
I can see why the "a" is not replaced: it is not preceded by a space. I can also see why the "BB" is not replaced: the space before it has already been consumed by the " B " match.
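To make the overlap concrete, here is a small sketch that lists what the pattern actually matches. The variable names `x` and `matches` are only for illustration; `gregexpr` and `regmatches` are base R functions.

```r
x <- "a ab abc B BB BBB"

# List every non-overlapping match of the original pattern.
matches <- regmatches(x, gregexpr(" [a-zA-Z]{1,2} ", x))[[1]]
print(matches)
# [1] " ab " " B "

# Each match includes its trailing space, so the next word ("abc", then "BB")
# no longer has a leading space left to match against.
result <- gsub(" [a-zA-Z]{1,2} ", " ", x)
print(result)
# [1] "a abc BB BBB"
```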
You can make use of a \b word boundary and the [[:alpha:]] bracket expression with a {1,2} limiting quantifier, and then trim the leading/trailing spaces and shrink multiple spaces into one:
tr <- "a ab abc B BB BBB f"
tr <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", tr) # Remove 1-2 letter words
gsub("^ +| +$|( ) +", "\\1", tr) # Remove excessive spacing
Result:
[1] "abc BBB"
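If this is needed in more than one place, the two steps can be wrapped in a small helper. This is only a sketch: the function name `remove_short_words` and its `max_len` parameter are made up for illustration, and it assumes words are plain alphabetic runs, so text with apostrophes or other punctuation may not behave as you expect.

```r
# Hypothetical helper: drop words of up to `max_len` letters, then tidy spacing.
remove_short_words <- function(x, max_len = 2) {
  # Build the same pattern as above, with the upper bound made configurable.
  pattern <- sprintf(" *\\b[[:alpha:]]{1,%d}\\b *", as.integer(max_len))
  x <- gsub(pattern, " ", x)        # Remove short words
  gsub("^ +| +$|( ) +", "\\1", x)   # Trim ends and collapse repeated spaces
}

remove_short_words("a ab abc B BB BBB f")
# [1] "abc BBB"
remove_short_words("a ab abc B BB BBB f", max_len = 1)
# [1] "ab abc BB BBB"
```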
Alternatively, use the Perl-compatible regex below, which relies on whitespace lookarounds instead of word boundaries:
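x <- gsub("\\s*(?<!\\S)[a-zA-Z]{1,2}(?!\\S)", "", "a ab abc B BB BBB", perl = TRUE) # Remove 1-2 letter words and any whitespace before them
gsub("^\\s+", "", x) # Trim the leading whitespace left behind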
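To see why the final trimming step is there, it may help to look at the intermediate value. The sketch below reuses the same pattern; the variable names `input`, `x`, and `cleaned` are only for illustration, and `trimws()` (base R, available since R 3.2.0) is swapped in as an alternative to the second `gsub` call.

```r
input <- "a ab abc B BB BBB"

# Step 1: drop each 1-2 letter word together with the whitespace before it.
x <- gsub("\\s*(?<!\\S)[a-zA-Z]{1,2}(?!\\S)", "", input, perl = TRUE)
print(x)
# [1] " abc BBB"

# Step 2: the space that preceded "abc" is still there, so trim it.
cleaned <- trimws(x)
print(cleaned)
# [1] "abc BBB"
```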