
R: Find and remove all one to two letter words

Tags: regex, r

I am attempting to clean away any one or two letter words from a text passage. This was my first thought

gsub(" [a-zA-Z]{1,2} ", " ", "a ab abc B BB BBB")
[1] "a abc BB BBB"

I can see why the "a" is not replaced: it is not preceded by a space. I can also see why the "BB" is not replaced: the space before it has already been consumed by the preceding match, " B ".
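To make the consumed spaces visible, the matches can be listed with base R's gregexpr() and regmatches(). This is only an illustrative sketch; the variable names txt, pat and hits are made up for the example.

```r
# Illustrative only: list the substrings the original pattern matches.
txt <- "a ab abc B BB BBB"
pat <- " [a-zA-Z]{1,2} "

# gregexpr() locates all non-overlapping matches; regmatches() extracts them.
hits <- regmatches(txt, gregexpr(pat, txt))[[1]]
print(hits)
# [1] " ab " " B "
```

Each match includes both of its surrounding spaces, so the space after "B" is no longer available as the leading space of "BB".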

Francis Smart asked Jul 03 '15 09:07



2 Answers

You can use the \b word boundary with the [[:alpha:]] bracket expression and the {1,2} limiting quantifier, then trim the leading/trailing spaces and collapse runs of spaces into one:

tr <- "a ab abc B BB BBB f"
tr <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", tr) # Remove 1-2 letter words
gsub("^ +| +$|( ) +", "\\1", tr) # Remove excessive spacing

Result:

[1] "abc BBB"
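If the same cleanup is needed in several places, the two gsub() steps above could be wrapped in a small function. The sketch below is an assumption, not part of the answer: the name remove_short_words and the max_len parameter are hypothetical.

```r
# Hypothetical helper (not from the answer): wraps the two gsub() steps above.
remove_short_words <- function(x, max_len = 2) {
  # Build the pattern for words of 1..max_len letters.
  pat <- sprintf(" *\\b[[:alpha:]]{1,%d}\\b *", max_len)
  x <- gsub(pat, " ", x)
  # Trim leading/trailing spaces and collapse runs of spaces into one.
  gsub("^ +| +$|( ) +", "\\1", x)
}

remove_short_words("a ab abc B BB BBB f")
# [1] "abc BBB"
```

Because gsub() is vectorized, the helper should also accept a character vector of passages.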

See IDEONE demo

Wiktor Stribiżew answered Nov 10 '22 13:11



Use the Perl-style regex below.

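# Remove each whitespace-delimited 1-2 letter word together with the spaces before it
x <- gsub("\\s*(?<!\\S)[a-zA-Z]{1,2}(?!\\S)", "", "a ab abc B BB BBB", perl=TRUE)
# Strip any leading whitespace left behind
gsub("^\\s+", "", x)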
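The lookarounds (?<!\\S) and (?!\\S) define a word by the whitespace around it, whereas \b defines it by word characters, and the two can disagree on punctuation. The sketch below compares both patterns on a made-up input containing an apostrophe. Running the \b pattern with perl = TRUE is an assumption made here so that both patterns go through the same engine.

```r
# Sketch: compare the two patterns on a string containing an apostrophe.
# Both patterns are run with perl = TRUE so the same regex engine handles each.
txt <- "don't go"

# Word-boundary pattern: the apostrophe is a non-word character,
# so the "t" after it counts as a one-letter word.
b <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", txt, perl = TRUE)
b <- gsub("^ +| +$|( ) +", "\\1", b, perl = TRUE)

# Whitespace-lookaround pattern: a word is whatever sits between spaces.
s <- gsub("\\s*(?<!\\S)[a-zA-Z]{1,2}(?!\\S)", "", txt, perl = TRUE)
s <- gsub("^\\s+", "", s)

print(b)  # [1] "don'"
print(s)  # [1] "don't"
```

Which behavior is preferable depends on whether contractions and hyphenated words in the passage should be treated as single words.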
Avinash Raj answered Nov 10 '22 12:11
