I'm trying to learn regex in R more deeply. I gave myself what I thought was an easy task, but I can't figure it out. I want to extract all four-letter words, where apostrophes are ignored (not counted) when measuring length. I can do this without regex but want a regex solution. Here's an MWE and what I've tried:
    text.var <- "This Jon's dogs' 'bout there in Mike's re'y word."
    pattern <- "\\b[A-Za-z]{4}\\b(?!')"
    pattern <- "\\b[A-Za-z]{4}\\b|\\b[A-Za-z']{5}\\b"
    regmatches(text.var, gregexpr(pattern, text.var, perl = TRUE))
**Desired output:**
    [[1]]
    [1] "This" "Jon's" "dogs'" "'bout" "word"
I thought the second pattern would work but it grabs words containing 5 characters as well.
To run a “whole words only” search using a regular expression, simply place the word between two word boundaries, as we did with ‹ \bcat\b ›. The first ‹ \b › requires the ‹ c › to occur at the very start of the string, or after a nonword character.
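A quick way to see word boundaries at work in R is sketched below. The test strings are made up for illustration and are not from the original post. Note that an apostrophe counts as a nonword character, so `\b` also matches right next to it, which matters for the question above.

```r
# Illustrative inputs (hypothetical, not from the original post).
candidates <- c("cat", "my cat sat", "concatenate", "cats")

# \b has to be written as \\b inside an R string literal.
whole_word <- grepl("\\bcat\\b", candidates, perl = TRUE)
print(whole_word)

# An apostrophe is a nonword character, so there is a boundary between
# "e" and "'" in "Mike's". A plain \b[A-Za-z]{4}\b therefore picks out "Mike".
partial <- regmatches("Mike's", regexpr("\\b[A-Za-z]{4}\\b", "Mike's", perl = TRUE))
print(partial)
```

Only the first two strings contain "cat" as a whole word; "concatenate" and "cats" are rejected because a word character sits next to the match.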
In regular expressions, the hyphen ("-") has special meaning inside a character class: it indicates a range, such as [0-9], which matches any digit from 0 to 9. To match a literal hyphen inside a class, escape it with a backslash ("\"). Outside a character class the hyphen is an ordinary character, so escaping the hyphens in a social security number is allowed but not required.
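The following sketch shows where the hyphen is special and where it is not. The sample numbers and the pattern `ssn_re` are invented for illustration; this is not a full social security number validator.

```r
# Made-up sample values for illustration only.
ssn <- c("123-45-6789", "123456789")

# Outside a character class the hyphen is an ordinary character,
# so it can be written without a backslash.
ssn_re <- "^\\d{3}-\\d{2}-\\d{4}$"
is_ssn <- grepl(ssn_re, ssn, perl = TRUE)
print(is_ssn)

# Inside a character class the hyphen denotes a range unless it is escaped.
range_hit   <- grepl("[0-9]", "-", perl = TRUE)    # range 0..9, "-" is not in it
literal_hit <- grepl("[0\\-9]", "-", perl = TRUE)  # the characters "0", "-", "9"
print(c(range_hit, literal_hit))
```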
This is a good challenging question and here is a tricky answer.
    > x <- "This Jon's dogs' 'bout there in Mike's re'y word."
    > re <- "(?i)('?[a-z]){5,}(*SKIP)(?!)|('?[a-z]){4}'?"
    > regmatches(x, gregexpr(re, x, perl=T))[[1]]
    ## [1] "This" "Jon's" "dogs'" "'bout" "word"
Explanation:
The idea is to skip any word patterns that consist of 5 or more letter characters and an optional apostrophe.
On the left side of the alternation operator we match the subpattern we do not want, then make it fail and force the regular expression engine not to retry that substring, using backtracking control verbs as explained below:
    (*SKIP)   # advances to the position in the string where (*SKIP) was
              # encountered, signifying that what was matched leading up
              # to it cannot be part of the match
    (?!)      # equivalent to (*FAIL), causes matching failure,
              # forcing backtracking to occur
The right side of the alternation operator matches what we want...
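To see what `(*SKIP)` contributes, it may help to compare the pattern with and without it. The variable names below (`with_skip`, `without_skip`, `kept`, `leaking`) are only for this sketch. Without `(*SKIP)`, the failed left branch simply falls through to the right branch at the same starting position, so fragments of longer words leak into the result.

```r
x <- "This Jon's dogs' 'bout there in Mike's re'y word."

# Pattern from the answer, and the same pattern with (*SKIP) removed.
with_skip    <- "(?i)('?[a-z]){5,}(*SKIP)(?!)|('?[a-z]){4}'?"
without_skip <- "(?i)('?[a-z]){5,}(?!)|('?[a-z]){4}'?"

kept    <- regmatches(x, gregexpr(with_skip, x, perl = TRUE))[[1]]
leaking <- regmatches(x, gregexpr(without_skip, x, perl = TRUE))[[1]]

print(kept)     # only the four-letter words
print(leaking)  # also contains pieces of longer words, such as "ther"
```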
Essentially, in simple terms you are using the discard technique.
(?:'?[a-z]){5,}|((?:'?[a-z]){4}'?)
You use the alternation operator, placing what you want to exclude on the left (saying "throw this away, it's garbage") and placing what you want to match in a capturing group on the right side. Only matches where the capturing group participated are kept.
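In R, one possible way to apply the discard technique is to read the capture positions that `gregexpr(..., perl = TRUE)` attaches as the `capture.start` and `capture.length` attributes, and keep only matches where group 1 is non-empty. This is a sketch rather than the answerer's own code; the `(?i)` flag is added here so the pattern handles the capitalised words in the example string.

```r
x <- "This Jon's dogs' 'bout there in Mike's re'y word."

# Left branch: words to discard. Right branch: captured four-letter words.
re <- "(?i)(?:'?[a-z]){5,}|((?:'?[a-z]){4}'?)"

m <- gregexpr(re, x, perl = TRUE)[[1]]

# Start and length of capture group 1 for every overall match.
starts <- attr(m, "capture.start")[, 1]
lens   <- attr(m, "capture.length")[, 1]

# Group 1 has a positive length only when the right branch matched.
keep  <- lens > 0
words <- substring(x, starts[keep], starts[keep] + lens[keep] - 1)
print(words)
```

The matches for "there" and "Mike's" are still found by the left branch, but they are dropped because group 1 did not participate in them.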
You can use this pattern:
(?i)(?<![a-z'])(?:'?[a-z]){4}'?(?![a-z'])
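A short check of this lookaround pattern against the string from the question might look like the following. The variable names are chosen for this sketch, and backslashes are not needed here because the pattern contains none.

```r
x <- "This Jon's dogs' 'bout there in Mike's re'y word."

# The lookbehind and lookahead reject a match that touches another
# letter or apostrophe on either side.
re <- "(?i)(?<![a-z'])(?:'?[a-z]){4}'?(?![a-z'])"

four <- regmatches(x, gregexpr(re, x, perl = TRUE))[[1]]
print(four)
```

Because the lookarounds do the bounding, this version needs neither backtracking control verbs nor post-filtering of capture groups.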