
grab n letter words don't count apostrophes regex

Tags:

regex

r

I'm trying to learn regex in R more deeply. I gave myself what I thought was an easy task, but I can't figure it out. I want to extract all four-letter words, where apostrophes are ignored (not counted) toward the four letters. I can do this without regex but want a regex solution. Here's a MWE and what I've tried:

    text.var <- "This Jon's dogs' 'bout there in Mike's re'y word."
    pattern <- "\\b[A-Za-z]{4}\\b(?!')"
    pattern <- "\\b[A-Za-z]{4}\\b|\\b[A-Za-z']{5}\\b"

    regmatches(text.var, gregexpr(pattern, text.var, perl = TRUE))

**Desired output:**

    [[1]]
    [1] "This"  "Jon's"  "dogs'"  "'bout"  "word"

I thought the second pattern would work but it grabs words containing 5 characters as well.
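
To see why, the second pattern can be run on its own. This is a small check in base R, not part of the original post; `attempt` is a name introduced here for illustration. The first alternative ends at the `\b` between a letter and an apostrophe, and the second alternative accepts any five characters drawn from letters and apostrophes, which includes five plain letters.

```r
text.var <- "This Jon's dogs' 'bout there in Mike's re'y word."
pattern <- "\\b[A-Za-z]{4}\\b|\\b[A-Za-z']{5}\\b"

# 'attempt' is an illustrative name, not part of the original code
attempt <- regmatches(text.var, gregexpr(pattern, text.var, perl = TRUE))[[1]]
print(attempt)

# "there" slips through the second alternative (five letters, no apostrophe)
"there" %in% attempt   # TRUE
# "dogs'" is not returned whole: \b ends the match before the apostrophe
"dogs'" %in% attempt   # FALSE
```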

Tyler Rinker asked Aug 11 '14 00:08



2 Answers

This is a good challenging question and here is a tricky answer.

    > x  <- "This Jon's dogs' 'bout there in Mike's re'y word."
    > re <- "(?i)('?[a-z]){5,}(*SKIP)(?!)|('?[a-z]){4}'?"
    > regmatches(x, gregexpr(re, x, perl=T))[[1]]
    ## [1] "This"  "Jon's" "dogs'" "'bout" "word"

Explanation:

The idea is to skip any word that consists of five or more letters, each optionally preceded by an apostrophe.

On the left side of the alternation operator we match the subpattern we do not want, then make it fail and use a backtracking control verb to force the regular expression engine not to retry that substring, as explained below:

    (*SKIP) # advances to the position in the string where (*SKIP) was
            # encountered, signifying that what was matched leading up
            # to it cannot be part of the match
    (?!)    # equivalent to (*FAIL), causes matching failure,
            # forcing backtracking to occur

The right side of the alternation operator matches what we want...
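
The effect of `(*SKIP)(?!)` is easiest to see by comparing the full pattern against the right-hand alternative used alone. A minimal sketch in base R, reusing the sentence from the question; `re_plain`, `re_skip`, `plain`, and `skipped` are names introduced here for illustration. Since `(*SKIP)` is a PCRE backtracking control verb, this relies on `perl = TRUE`.

```r
x <- "This Jon's dogs' 'bout there in Mike's re'y word."

# Right-hand side only: nothing stops a match from taking part of a longer word
re_plain <- "(?i)('?[a-z]){4}'?"
# Full pattern: longer words are matched first, then skipped as a whole
re_skip <- "(?i)('?[a-z]){5,}(*SKIP)(?!)|('?[a-z]){4}'?"

plain   <- regmatches(x, gregexpr(re_plain, x, perl = TRUE))[[1]]
skipped <- regmatches(x, gregexpr(re_skip, x, perl = TRUE))[[1]]

"ther" %in% plain     # TRUE: a fragment of "there" is returned
"ther" %in% skipped   # FALSE: "there" was consumed and skipped
```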

Additional Explanation:

  • Essentially, in simple terms you are using the discard technique.

    (?:'?[a-z]){5,}|((?:'?[a-z]){4}'?) 

    You use the alternation operator, placing what you want to exclude on the left (in effect saying "throw this away, it's garbage") and what you want to match in a capturing group on the right.
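
With the discard pattern, `regmatches()` alone would also return the discarded long words, because it extracts whole matches rather than groups. One way to keep only group 1 in base R is to read the `capture.start` and `capture.length` attributes that `gregexpr()` attaches when `perl = TRUE`. This is a sketch under that assumption, not the answer's own code; `starts`, `lens`, `keep`, and `wanted` are names introduced here.

```r
x  <- "This Jon's dogs' 'bout there in Mike's re'y word."
re <- "(?i)(?:'?[a-z]){5,}|((?:'?[a-z]){4}'?)"

m <- gregexpr(re, x, perl = TRUE)[[1]]

# One row per match, one column per capturing group (only group 1 here)
starts <- attr(m, "capture.start")[, 1]
lens   <- attr(m, "capture.length")[, 1]

# Matches from the left-hand alternative leave group 1 unset (no positive length)
keep <- lens > 0

wanted <- substring(x, starts[keep], starts[keep] + lens[keep] - 1)
print(wanted)
# [1] "This"  "Jon's" "dogs'" "'bout" "word"
```

The `(*SKIP)(?!)` version avoids this extra step, since the unwanted words never become matches in the first place.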

14 revs, 3 users 82% answered Sep 22 '22 06:09


You can use this pattern:

    (?i)(?<![a-z'])(?:'?[a-z]){4}'?(?![a-z'])
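
A short sketch of this pattern applied to the example sentence from the question, in base R. Lookbehind and lookahead need `perl = TRUE`; `re` and `res` are names introduced here for illustration.

```r
x <- "This Jon's dogs' 'bout there in Mike's re'y word."

# The lookarounds reject any match that touches another letter or apostrophe
re <- "(?i)(?<![a-z'])(?:'?[a-z]){4}'?(?![a-z'])"

res <- regmatches(x, gregexpr(re, x, perl = TRUE))[[1]]
print(res)
# [1] "This"  "Jon's" "dogs'" "'bout" "word"
```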

Casimir et Hippolyte answered Sep 22 '22 06:09