Lets say our pattern is a regex for capital letters (but we could have a more complex pattern than searching for capitals)
To find at least n consecutive patterns (in this case, the pattern we are looking for is simply a capital letter), we can do this:
(Using Ruby)
somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"
at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ"]
at_least_3_capitals = somestring.scan(/[A-Z]{3}[A-Z]*/)
=> ["ABC", "XYZ"]
However, how do I search for at most n consecutive patterns, for example, at most one consecutive capital letter:
matches = somestring.scan(/ ??? /)
=> [" deFgHij kLmN pQrS ", " abcdEf"]
I read that I need to negate the "at least" regex, by turning it into a DFA, negating the accept states, (then converting it back to NFA, though we can leave it as it is) so to write it as a regex. If we think of encountering our pattern as receiving a '1' and not receiving the pattern as receiving a '0', we can draw a simple DFA diagram (where n=1, we want at most one of our pattern):
Specifically, I was wondering how this becomes a regex. Generally, I hope to find how to find "at most" with regex, as my regex skills feel stunted with "at least" alone.
Note that this question is not a dupicate of this post, as using the accepted methodology there would give:
somestring.scan(/[A-Z]{2}[A-Z]*(.*)[A-Z]{2}[A-Z]*/)
=> [[" deFgHij kLmN pQrS X"]]
Which is not what the DFA shows, not just because it misses the second sought match - more importantly that it includes the 'X', which it should not, as 'X' is followed by another capital, and from the DFA we see that a capital which is followed by another capital is not an accept state.
You could suggest
somestring.split(/[A-Z]{2}[A-Z]*/)
=> ["", " deFgHij kLmN pQrS ", " abcdEf"]
(Thanks to Rubber Duck)
but I still want to know how to find at most n occurrences using regex alone. (For knowledge!)
There are a few problems with your current attempt.
X
is part of the match is that .*
is greedy and consumes as much as possible - hence, leaving only the required two capital letters to be matched by the trailing bit. This could be fixed with a non-greedy quantifier.Fixing all of these with the current approach is hard (I tried and failed - check the edit history of this post if you want to see my attempt until I decided to scrap this approach altogether). So let's try something else!
What is it that we want to match? Disregarding the edge cases, where the match starts at the beginning of the string or ends at the end of the string, we want to match:
(non-caps) 1 cap (non-caps) 1 cap (non-caps) ....
This is ideal for Jeffrey Friedl's unrolling-the-loop. Which looks like
[^A-Z]+(?:[A-Z][^A-Z]+)*
Now what about the edge cases? We can phrase them like this:
To add these to our pattern, we simply group a capital letter with the appropriate anchor and mark both together as optional:
(?:^[A-Z])?[^A-Z]+(?:[A-Z][^A-Z]+)*(?:[A-Z]$)?
Now it's really working. And even better, we don't need capturing any more!
This solution is easily generalized to the case of "at most n consecutive capital letters", by changing each [A-Z]
to [A-Z]{1,n}
and thereby allowing up to n
capital letters where there is only one allowed so far.
See the demo for n = 2
.
To match words containing at most N
PATTERN
s, use the regex
/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/
For example, to match words containing at most 1 capital letter,
/\b(?:\w(?:(?<![A-Z])|(?!(?:[A-Z]){1,})))+\b/
This works for multi-character patterns too.
I'm afraid your examples may cause confusion. Let's add a few words:
somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf mixedCaps mixeDCaps mIxedCaps mIxeDCaps T TT t tt"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Now, rerunning your at-least-2-capitals regex returns
at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ", "DC", "DC", "TT"]
Note how complete words are not captured! Are you sure this is what you wanted? I ask, of course, because in your latter examples, your at-most-1-capital regex returns complete words, instead of just the capital letters being captured.
Here's the solution either way.
First, for matching just patterns (and not entire words, as consistent with your initial examples), here's a regex for at-most-N
-PATTERN
s:
/(?<!PATTERN)(?!(?:PATTERN){N+1,})(?:PATTERN)+/
For example, the at-most-1-capitals regex would be
/(?<![A-Z])(?!(?:[A-Z]){2,})(?:[A-Z])+/
and returns
=> ["F", "H", "L", "N", "Q", "S", "E", "C", "DC", "I", "C", "I", "DC", "T", "TT"]
To further exemplify, the at-most-2-capitals regex returns
=>
Finally, if you wanted to match entire words that contained at most a certain number of consecutive patterns, then here's a whole different approach:
/\b(?:\w(?:(?<![A-Z])|(?![A-Z]{1,})))+\b/
This returns
["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mIxedCaps", "T", "t", "tt"]
The general form is
/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/
You can see all these examples at http://ideone.com/hImmZr.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With