
PHP regex to detect text inside brackets, ignoring nested brackets

I'm trying to write a PHP regex that extracts the text inside brackets while ignoring any nested brackets:

Let's say I want

Lorem ipsum [1. dolor sit amet, [consectetuer adipiscing] elit.]. Aenean commodo ligula eget dolor.[2. Dolor, [consectetuer adipiscing] elit.] Aenean massa[3. Lorem ipsum] dolor.

to return

[1] => "dolor sit amet, [consectetuer adipiscing] elit."
[2] => "Dolor, [consectetuer adipiscing] elit."
[3] => "Lorem ipsum"

So far I have

'/\[([0-9]+)\.\s([^\]]+)\]/gi'

but it breaks when nested brackets occur.

How can I exclude the inner brackets from detection? Thanks in advance!

hm711 asked Sep 30 '15 08:09


2 Answers

You can use recursive references to previous groups:

(?<no_brackets>[^\[\]]*){0}(?<balanced_brackets>\[\g<no_brackets>\]|\[(?:\g<no_brackets>\g<balanced_brackets>\g<no_brackets>)*\])


The idea is to define a match as one of two things: a run of non-bracket characters enclosed in [], or a [] pair that contains a sequence of non-bracket runs and balanced bracket groups. The balanced groups inside are matched recursively by the same rule.
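As a rough sketch, the pattern above could be applied to the sample text from the question like this. The `~` delimiters, the variable names, and the second `preg_match` that splits each block into number and text are assumptions added for illustration; they are not part of the answer's pattern.

```php
<?php
// Sample text from the question.
$text = 'Lorem ipsum [1. dolor sit amet, [consectetuer adipiscing] elit.]. '
      . 'Aenean commodo ligula eget dolor.[2. Dolor, [consectetuer adipiscing] elit.] '
      . 'Aenean massa[3. Lorem ipsum] dolor.';

// The pattern from this answer, wrapped in "~" delimiters (an assumption).
$balanced = '~(?<no_brackets>[^\[\]]*){0}'
          . '(?<balanced_brackets>\[\g<no_brackets>\]'
          . '|\[(?:\g<no_brackets>\g<balanced_brackets>\g<no_brackets>)*\])~';

$items = [];
if (preg_match_all($balanced, $text, $m)) {
    // $m[0] holds the outermost balanced blocks, brackets included.
    foreach ($m[0] as $block) {
        // Illustrative post-processing: drop the outer brackets and
        // split "N. text" into the item number and the text.
        if (preg_match('~^\[(\d+)\.\s(.*)\]$~s', $block, $parts)) {
            $items[(int) $parts[1]] = $parts[2];
        }
    }
}

print_r($items);
```

Because the pattern itself only finds balanced blocks, the item number has to be separated in a second step, as above.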

ndnenkov answered Oct 24 '22 18:10


You can use this pattern that captures the item number and the following text in two different groups. If you are sure all item numbers are unique, you can build the associative array described in your question with a simple array_combine:

$pattern = '~\[ (?:(\d+)\.\s)? ( [^][]*+ (?:(?R) [^][]*)*+ ) ]~x';

if (preg_match_all($pattern, $text, $matches)) {
    // keys: item numbers (group 1), values: bracket contents (group 2)
    $result = array_combine($matches[1], $matches[2]);
}
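For reference, a self-contained run against the sample text from the question might look like the sketch below. Only `$text` and the empty default for `$result` are added here; the pattern and the `array_combine` call are the ones shown above.

```php
<?php
// Sample text from the question.
$text = 'Lorem ipsum [1. dolor sit amet, [consectetuer adipiscing] elit.]. '
      . 'Aenean commodo ligula eget dolor.[2. Dolor, [consectetuer adipiscing] elit.] '
      . 'Aenean massa[3. Lorem ipsum] dolor.';

$pattern = '~\[ (?:(\d+)\.\s)? ( [^][]*+ (?:(?R) [^][]*)*+ ) ]~x';

$result = [];
if (preg_match_all($pattern, $text, $matches)) {
    // Group 1 holds the item numbers, group 2 the text between the
    // outermost brackets (nested brackets are kept as part of the text).
    $result = array_combine($matches[1], $matches[2]);
}

print_r($result);
```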

Pattern details:

~     # pattern delimiter
\[    # literal opening square bracket
(?:(\d+)\.\s)? # optional item number (*) 
(              # capture group 2
   [^][]*+         # all that is not a square bracket (possessive quantifier)
   (?:             # non-capturing group:
       (?R)        # recursion: (?R) is an alias for the whole pattern
       [^][]*      # all that is not a square bracket
   )*+             # repeat zero or more times (possessive quantifier)
)
]                  # literal closing square bracket
~x  # free spacing mode

(*) Note that the item number must be optional if you want to recurse with (?R), because nested brackets such as [consectetuer adipiscing] have no item number. This becomes a problem if you want to skip top-level square brackets that lack an item number. In that case you can build a more robust pattern by replacing the optional group (?:(\d+)\.\s)? with a conditional statement: (?(R)|(\d+)\.\s)

Conditional statement:

(?(R)        # IF you are in a recursion
             # THEN match this (nothing in our case)
  |          # ELSE
  (\d+)\.\s  #   match the item number (capture group 1)
)

In this way the item number becomes mandatory.
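A minimal sketch of the conditional variant follows. The input string is made up for illustration and includes one top-level bracket without an item number to show that it is skipped.

```php
<?php
// Hypothetical input: "[not an item]" has no item number.
$text = 'Aenean [not an item] massa [1. Lorem [ipsum] dolor] sit [2. amet].';

// Same pattern as above, with the optional group replaced by the
// conditional: the number is required only outside a recursion.
$pattern = '~\[ (?(R)|(\d+)\.\s) ( [^][]*+ (?:(?R) [^][]*)*+ ) ]~x';

$result = [];
if (preg_match_all($pattern, $text, $matches)) {
    $result = array_combine($matches[1], $matches[2]);
}

print_r($result);
```

With the optional group instead, "[not an item]" would also match, with an empty item number, which would end up as an empty key in the array.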

Casimir et Hippolyte answered Oct 24 '22 18:10