Note: This is a question about possibilities of modern regex flavors. It's not about the best way to solve this using other methods. It's inspired by an earlier question, but that one is not restricted to regex.
In an ASCII "image"/art/map/string like:
....X....... ..X..X...X.... X.X...X..X..... X....XXXXXX..... X..XXX........... .....X.......... ..............X ..X...........X.... ..X...........X....X... ....X.....
I'd like to find a simple vertical line formation of three X
s:
X X X
The number of lines is variable in the image, and the width of each line is variable too.
With regex (PCRE/PHP, Perl, .NET or similar) is it possible to:
Basically (0+1)* mathes any sequence of ones and zeroes. So, in your example (0+1)*1(0+1)* should match any sequence that has 1. It would not match 000 , but it would match 010 , 1 , 111 etc. (0+1) means 0 OR 1.
To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).
The vertical bar symbol acts as an OR operator and matches the values to the left and right of the vertical bar. For example, the regular expression Jack|Jill matches "Jack" and "Jill". \ (backslash) The backslash symbol acts as an escape sequence. Use it when you want search for a regular expression symbol.
\\. matches the literal character . . the first backslash is interpreted as an escape character by the Emacs string reader, which combined with the second backslash, inserts a literal backslash character into the string being read. the regular expression engine receives the string \.
To answer the first question one could use:
(?xm) # ignore comments and whitespace, ^ matches beginning of line ^ # beginning of line (?: . # any character except \n (?= # lookahead .*+\n # go to next line ( \1?+ . ) # add a character to the 1st capturing group .*+\n # next line ( \2?+ . ) # add a character to the 2nd capturing group ) )*? # repeat as few times as needed X .*+\n # X on the first line and advance to next line \1?+ # if 1st capturing group is defined, use it, consuming exactly the same number of characters as on the first line X .*+\n # X on the 2nd line and advance to next line \2?+ # if 2st capturing group is defined, use it, consuming exactly the same number of characters as on the first line X # X on the 3rd line
Online demo
This expression works in Perl, PCRE, Java and should work in .NET.
The expression uses lookaheads with self referencing capturing groups to add a character for every repetition of the lookahead (this is used to "count").
\1?+
means if \1
matches (or is defined) consume it, and don't give it back (don't backtrack). In this case it's equivalent to (?(1) \1 )
. Which means match \1
if \1
is defined.
polygenelubricants explains this kinds of lookaheads with backreferences very nicely in his answer for How can we match a^n b^n with Java regex?. (He has also written about other impressive tricks for Java regex involving backreferences and lookarounds.)
When just using matching and requiring the answer (count) in the number of matches, then the question 2 answer would be:
It can not be directly solved in regex flavors that have a limited lookbehind. While other flavors like Java and .NET could (as for example in m.buettner's .NET solution).
Thus plain regex matches in Perl and PCRE (PHP, etc) cannot directly answer this question in this case.
Assume that no variable length lookbehinds are available.
You have to in some way count the number of characters on a line before an X
.
Only way to do that is to match them, and since no variable length lookbehinds are available you have to start the match (at least) at the beginning of the line.
If you start the match at the beginning of a line you can only get at most one match per line.
Since there can be multiple occurrences per line, this would not count them all and would not give a correct answer.
On the other hand if we accept the answer as the length of a match or substitution result, then the 2nd question can be answered in PCRE and Perl (and other flavors).
This solution is based on/inspired by m.buettner's nice "partial PCRE solution".
One could simply replace all matches of the following expression with $3
, getting the answer to question two (the number of patterns of interests) as the length of the resulting string.
^ (?: (?: # match .+? characters . (?= # counting the same number on the following two lines .*+\n ( \1?+ . ) .*+\n ( \2?+ . ) ) )+? (?<= X ) # till the above consumes an X (?= # that matches the following conditions .*+\n \1?+ (?<= X ) .*+\n \2?+ (?<= X ) ) (?= # count the number of matches .*+\n ( \3?+ . ) # the number of matches = length of $3 ) )* # repeat as long as there are matches on this line .*\n? # remove the rest of the line
Which in Perl could be written as:
$in =~ s/regex/$3/gmx; $count = length $in;
Online demo
This expression is similar to the solution to question 1 above, with some modifications to include X
in the characters matched in the first lookahead, wrapped with a quantifier and counting number of matches of the quantifier.
Except for direct matches this is as close as it gets (extra code wise besides regex), and could be an acceptable answer to question 2.
Some test cases and results for the above solution. Result showing the numerical answer (length of the resulting string) and in parenthesis the resulting string after the substitution(s).
Test #0: -------------------- X X X result: 1 (X) Test #1: -------------------- ..X.... ..X.... ..X.... result: 1 (.) Test #2: -------------------- ..X.X.. ..X.X.. ....X.. result: 1 (.) Test #3: -------------------- ..X.... ..X.... ...X... result: 0 () Test #4: -------------------- ..X.... ...X... ..X.... result: 0 () Test #5: -------------------- ....X.. .X..X.. .X..... result: 0 () Test #6: -------------------- .X..X.. .X.X... .X.X... result: 1 (.) Test #7: -------------------- .X..X.. .X..X.. .X..X.. result: 2 (.X) Test #8: -------------------- XXX XXX XXX result: 3 (XXX) Test #9: -------------------- X.X.X XXXXX XXXXX .X.X. result: 5 (XXXXX) Test #10: -------------------- 1....X....... 2..X..X...X.... 3X.X...X..X..... 4X....XXXXXX..... 5X..XXX........... 6.....X.......... 7.........X....X 8..X......X....X.... 9..X......X....X....X... A....X..... B.X..X.. C..... XXX XXX XXX . result: 8 (3458.XXX)
The following solutions have two grave problems:
XXX
sequences starting on the same line, as the pos
advances too much.X
are above each other. There don't neccessarily have to be three in a row.Therefore, all upvotes (and the bounty) should go to either of m.buettner's comprehensive .NET answer or the fascinating PCRE answer from Qtax himself.
This is an answer using embedding of Perl code into regexes. Because a Perl regex can use code to assert arbitrary conditions inside regexes or emit partial regexes, they are not limited to matching regular languages or context-free languages, but can match some parts of languages higher up in the Chomsky hierarchy.
The language you want to match can be described in regex terms as
^ .{n} X .*\n .{n} X .*\n .{n} X
where n
is a number. This is about as complex as matching the anbncn language which is the canonical example for a context-sensitive language.
We can match the first line easily, and use some Perl code to emit the regex for the other lines:
/^ (.*?) X (?: .*\n (??{"." x length($1)}) X){2} /mx
That was short! What does it do?
^ (.*?) X
anchores at the start of a line, matches as few non-newline characters as possible and then the X
. We remember the line up to the X
as capture group $1
.
We repeat a group two times which matches the rest of the line, a newline, and then injects a regex that matches a string of the same length as $1
. After that, there must be an X
.
This regex will now match every string that has three X
on top of each other.
If we want to extract all such sequences, we'll have to be nifty. Because sequences may overlap, e.g.
.X XX XX X.
the position where the next match starts must not proceed past the first X
. We can do this via a lookbehind and lookahead. Perl only supports constant-length lookbehind, but has the \K
escape which provides similar semantics. Thus
/^ (.*?) \K X (?=( (?: .*\n (??{"."x length($1)}) X ){2} )) /gmx
will match every sequence of three vertical X
es. Testing time:
$ perl -E'my$_=join"",<>; say "===\n$1X$2" while /^(.*?)\KX(?=((?:.*\n(??{"."x length($1)})X){2}))/gmx' <<'END' ....X....... ..X..X...X.... X.X...X..X..... X....XXXXXX..... X..XXX........... .....X.......... ..............X ..X...........X.... ..X...........X....X... ....X..... END === ..X..X...X.... X.X...X..X..... X....XXXXX === X.X...X..X..... X....XXXXXX..... X === X....XXXXXX..... X..XXX........... .....X === ..............X ..X...........X.... ..X...........X
Note: this relies on experimental regex features that are available from at least Perl 5, v10 onward. The code was tested with a v16 perl.
Let us look at the lines
...X...\n ...X..\n
We want to assert that the leading ...
part of each line is of same length. We can do so by recursion with base case X.*\n
:
(X.*\n|.(?-1).)X
If we anchor that at the start of a line, we can match two vertical X
es. To match more than two lines, we have to do the recursion in a lookahead and then advance the match position to the next line, and repeat. For this, we simply match .*\n
.
This results in the following regex which can match a string with three vertical X
es:
/ ^ (?: (?=( X.*\n | .(?-1). ) X) .*\n # go to next line ){2} /mx
But this isn't good enough, as we want to match all such sequences. To do this, we essentially put the whole regex into a lookahead. The regex engine makes sure to advance the position every time to produce a new match.
/ ^ (?= ( (?: (?= (X.*\n | .(?-1). ) X) .*\n # go to next line ){2} .* # include next line in $1 ) ) /mx
Testing time:
$ perl -E'my$_=join"",<>; say "===\n$1" while /^(?=((?:(?=(X.*\n|.(?-1).)X).*\n){2}.*))/gmx' <<'END' ....X....... ..X..X...X.... X.X...X..X..... X....XXXXXX..... X..XXX........... .....X.......... ..............X ..X...........X.... ..X...........X....X... ....X..... END === ..X..X...X.... X.X...X..X..... X....XXXXXX..... === X.X...X..X..... X....XXXXXX..... X..XXX........... === X....XXXXXX..... X..XXX........... .....X.......... === ..............X ..X...........X.... ..X...........X....X...
So this works as well as the solution with embedded code, that is, it matches each group of lines with vertical X
es, not each group of X
es. (Actually, this solution seems more fragile to me than embedded code)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With