Update/Note:
I think what I'm probably looking for is to get the captures of a group in PHP.
Referenced: PCRE regular expressions using named pattern subroutines.
(Read carefully:)
I have a string that contains a variable number of segments (simplified):
$subject = 'AA BB DD '; // could be 'AA BB DD CC EE ' as well
I would like now to match the segments and return them via the matches array:
$pattern = '/^(([a-z]+) )+$/i';
$result = preg_match_all($pattern, $subject, $matches);
This will only return the last match for capture group 2: DD.
Is there a way that I can retrieve all subpattern captures (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?
Both the $subject and $pattern are simplified. Naturally, with input this simple, the list of AA, BB, .. is much easier to extract with other functions (e.g. explode) or with a variation of the $pattern.
But I'm specifically asking how to return all of the subgroup matches with the preg_...-family of functions.
For a real-life case, imagine you have multiple (nested) levels, each with a varying number of subpattern matches.
This is an example in pseudo code to describe a bit of the background. Imagine the following:
Regular definitions of tokens:
CHARS := [a-z]+
PUNCT := [.,!?]
WS := [ ]
$subject gets tokenized based on these. The tokenization is stored inside an array of tokens (type, offset, ...).
That array is then transformed into a string, containing one character per token:
CHARS -> "c"
PUNCT -> "p"
WS -> "s"
So that it's now possible to run regular expressions based on tokens (and not character classes etc.) on the token stream string index. E.g.
regex: (cs)?cp
to express one group of chars, optionally preceded by another group of chars and whitespace, followed by a punctuation mark.
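The token-string idea above could be sketched roughly as follows. The helper name tokenString is hypothetical, the i flag is an assumption carried over from the simplified $pattern, and characters that match none of the three token definitions are silently skipped here, which a real tokenizer would probably treat as an error.

```php
<?php
// Hypothetical helper: maps each token to one character (c, p or s).
function tokenString(string $subject): string
{
    // One named group per token type, following the definitions above.
    $lexer = '/(?<c>[a-z]+)|(?<p>[.,!?])|(?<s>[ ])/i';
    preg_match_all($lexer, $subject, $tokens, PREG_SET_ORDER);

    $stream = '';
    foreach ($tokens as $token) {
        foreach (['c', 'p', 's'] as $type) {
            // Groups that did not take part in a match are '' or absent.
            if (isset($token[$type]) && $token[$type] !== '') {
                $stream .= $type;
                break;
            }
        }
    }
    return $stream;
}

$stream = tokenString('Hello world!');
var_dump($stream);                          // string(4) "cscp"
var_dump(preg_match('/(cs)?cp/', $stream)); // int(1)
```

The sketch leaves out the token array (type, offset, ...) that would be needed to map a match on the token string back to the original text.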
As I can now express self-defined tokens as regex, the next step was to build the grammar. This is only an example, in a sort of ABNF style:
words = word | (word space)+ word
word = CHARS+
space = WS
punctuation = PUNCT
If I now compile the grammar for words into a (token) regex, I would naturally like to have all subgroup matches of each word.
words = (CHARS+) | ( (CHARS+) WS )+ (CHARS+) # words resolved to tokens
words = (c+)|((c+)s)+c+ # words resolved to regex
I could code up to this point. Then I ran into the problem that the subgroup matches only contained their last match.
So I have the option to either create an automaton for the grammar on my own (which I would like to avoid, to keep the grammar expressions generic) or to somehow make preg_match work for me so I can spare myself that.
That's basically all. It's probably understandable now why I simplified the question.
Related:
Similar thread: Get repeated matches with preg_match_all()
Check the chosen answer there; mine might be useful as well, so I will duplicate it:
From http://www.php.net/manual/en/regexp.reference.repetition.php :
When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration.
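The manual's statement can be checked against the $subject and $pattern from the question. A minimal sketch:

```php
<?php
$subject = 'AA BB DD ';
$pattern = '/^(([a-z]+) )+$/i';

// The anchored pattern matches the whole subject exactly once.
$count = preg_match_all($pattern, $subject, $matches);

// Each group holds only what its final iteration matched.
var_dump($count);         // int(1)
var_dump($matches[0][0]); // string(9) "AA BB DD "
var_dump($matches[1][0]); // string(3) "DD "
var_dump($matches[2][0]); // string(2) "DD"
```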
I personally gave up and am going to do this in two steps.
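One possible shape of such a two-step approach, sketched for the simplified subject from the question. The function name captureSegments is hypothetical, and the two patterns are just one way to split the work: the first validates the overall structure, the second collects the segments.

```php
<?php
// Hypothetical helper name; not part of PHP or of the original question.
function captureSegments(string $subject): ?array
{
    // Step 1: validate the overall structure with the anchored pattern.
    if (preg_match('/^(?:[a-z]+ )+$/i', $subject) !== 1) {
        return null;
    }
    // Step 2: collect every segment with a second, unanchored pattern.
    preg_match_all('/([a-z]+) /i', $subject, $matches);
    return $matches[1];
}

print_r(captureSegments('AA BB DD '));
```

For 'AA BB DD ' this yields AA, BB and DD; a subject that does not consist purely of such segments yields null.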
EDIT:
I see in that other thread that someone claimed a lookbehind method is able to do it.
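I have not verified the lookbehind variant. A different technique that comes close for the simplified case is the \G assertion, which ties each match to the end of the previous one. The sketch below assumes the segments are contiguous from the start of the subject, and it needs an extra length check because \G alone does not prove that the whole subject was consumed.

```php
<?php
$subject = 'AA BB DD ';

// \G anchors each match at the position where the previous one ended
// (or at the start of the subject for the first match).
$count = preg_match_all('/\G([a-z]+) /i', $subject, $matches);

// The matches are contiguous, so the subject is fully covered only if
// their combined length equals the subject length.
$covered = strlen(implode('', $matches[0])) === strlen($subject);

print_r($matches[1]);
var_dump($covered);
```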
Try this:
preg_match_all("'[^ ]+'i",$text,$n);
$n[0] will contain an array of all non-space character groups in the text.
Edit: with subgroups:
preg_match_all("'([^ ]+)'i",$text,$n);
Now $n[1] will contain the subgroup matches, which are exactly the same as $n[0]. This is pointless actually.
Edit2: nested subgroups example:
$test = "Hello I'm Joe! Hi I'm Jane!";
preg_match_all("/(H(ello|i)) I'm (.*?)!/i",$test,$n);
And the result:
Array
(
    [0] => Array
        (
            [0] => Hello I'm Joe!
            [1] => Hi I'm Jane!
        )

    [1] => Array
        (
            [0] => Hello
            [1] => Hi
        )

    [2] => Array
        (
            [0] => ello
            [1] => i
        )

    [3] => Array
        (
            [0] => Joe
            [1] => Jane
        )

)
Is there a way that I can retrieve all matches (AA, BB, DD) with one regex execution? Isn't preg_match_all not suitable for this?
Your current regex seems to be for a preg_match() call. Try this instead:
$pattern = '/[a-z]+/i';
$result = preg_match_all($pattern, $subject, $matches);
Per comments, the ruby regex I mentioned:
sentence = %r{
(?<subject> cat | dog ){0}
(?<verb> eats | drinks ){0}
(?<object> water | bones ){0}
(?<adjective> big | smelly ){0}
(?<obj_adj> (\g<adjective>\s)? ){0}
The\s\g<obj_adj>\g<subject>\s\g<verb>\s\g<obj_adj>\g<object>
}x
md = sentence.match("The cat drinks water");
md = sentence.match("The big dog eats smelly bones");
But I think you'll need a lexer/parser/tokenizer to do the same kind of thing in PHP. :-|
You can't extract the subpatterns because the way you wrote your regex returns only one match (using ^ and $ at the same time, and + on the main pattern).
If you write it this way, you'll see that your subgroups are correctly there:
$pattern = '/(([a-z]+) )/i';
(this still has an unnecessary set of parentheses, I just left it there for illustration)
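A quick check of this unanchored variant against the subject from the question, as a minimal sketch:

```php
<?php
$subject = 'AA BB DD ';
$pattern = '/(([a-z]+) )/i';

// Without ^, $ and the outer +, every segment is a separate match.
$count = preg_match_all($pattern, $subject, $matches);

var_dump($count);     // int(3)
print_r($matches[2]); // AA, BB, DD
```

Note that this pattern no longer checks that the whole subject consists of such segments; it simply collects every segment it finds.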