
Look Behind Regex Assertion to Match Nested HTML Table

I'm looking to match a specific table nested inside another table. Here's the sample HTML and a summary of my failed attempts so far:

<table id="parent">
    <table class="possible_target">
            <tr><td>We're targeting this table</td></tr>
    </table>
</table>
<table class="possible_target">
    <tr><td>We're not targeting this table</td></tr>
</table>

Here was my initial attempt. But even if it worked, it would likely match the second, unnested table as well:

~(?=<table.*?)<table class="possible_target".*?</table>~si

Here is my pseudo-expression for what I'm trying to accomplish. It would assert the presence of an opening table tag, as well as the absence of a closing table tag, before making the match:

~(?=<table.*?)(?!</table>)<table class="possible_target".*?</table>~si
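
A quick way to check both attempts is to run them against the sample and count the matches. The sketch below assumes the sample HTML above (typos fixed) and uses illustrative variable names that are not part of the original post. Both lookaheads are evaluated at the very position where `<table class="possible_target"` has to start, so the positive one only re-tests what the pattern matches anyway, and the negative one can never fail there.

```php
<?php
// Sample HTML from the question.
$html = <<<'HTML'
<table id="parent">
    <table class="possible_target">
            <tr><td>We're targeting this table</td></tr>
    </table>
</table>
<table class="possible_target">
    <tr><td>We're not targeting this table</td></tr>
</table>
HTML;

// The two patterns from the question, unchanged.
$patterns = [
    'lookahead only'          => '~(?=<table.*?)<table class="possible_target".*?</table>~si',
    'with negative lookahead' => '~(?=<table.*?)(?!</table>)<table class="possible_target".*?</table>~si',
];

// preg_match_all() returns the number of full matches.
$counts = [];
foreach ($patterns as $label => $pattern) {
    $counts[$label] = preg_match_all($pattern, $html, $matches);
    echo $label, ': ', $counts[$label], " match(es)\n";
}
```

Both patterns report two matches on this sample, so neither assertion distinguishes the nested table from the standalone one.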

Brandon Buster asked Dec 10 '25 19:12

1 Answer

I found this interesting, since working with regex and nested HTML tags is challenging.

My attempt does (should do) the following:

1.) Number the tables by nesting depth using a callback function. The outermost depth is 1.

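// HTML to process
$source = "your input";

// tag name to match
$rx_tag = "table";

// nesting depth to match (outermost = 1)
$rx_depth = 2;

// ----------------------------

// current nesting depth, updated by the callback
$tag_depth = 0;

// append the nesting depth to every opening and closing tag name
$source = preg_replace_callback('~<(/)?'.$rx_tag.'\b~i','set_tag_depth',$source);

function set_tag_depth($out)
{
  global $tag_depth;

  // closing tag: mark it with the depth it closes, then step back out
  // ($out[1] is absent for opening tags, hence the ?? fallback)
  if(($out[1] ?? '')=="/") {
    $tag_depth--; return $out[0].($tag_depth+1);
  }

  // opening tag: step in and mark it with the new depth
  $tag_depth++; return $out[0].$tag_depth;
}

// uncomment to inspect the marked-up source
#echo nl2br(htmlspecialchars($source));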
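
To make the effect of the callback concrete, here is a minimal, self-contained sketch that applies the same marking step to the sample HTML from the question. It swaps the global for a closure with a by-reference counter; the variable names `$depth` and `$marked` are assumptions for illustration, not part of the original answer.

```php
<?php
// Sample HTML from the question.
$source = <<<'HTML'
<table id="parent">
    <table class="possible_target">
        <tr><td>We're targeting this table</td></tr>
    </table>
</table>
<table class="possible_target">
    <tr><td>We're not targeting this table</td></tr>
</table>
HTML;

// Current nesting depth, shared with the closure by reference.
$depth = 0;

$marked = preg_replace_callback(
    '~<(/)?table\b~i',
    function (array $m) use (&$depth): string {
        if (($m[1] ?? '') === '/') {
            // Closing tag: label it with the depth it closes, then step out.
            return $m[0] . $depth--;
        }
        // Opening tag: step in, then label it with the new depth.
        return $m[0] . ++$depth;
    },
    $source
);

echo $marked, PHP_EOL;
```

On this sample the outer table and the standalone table both come out as `<table1 ...> ... </table1>`, while the nested one becomes `<table2 class="possible_target"> ... </table2>`.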

2.) The tables are now renamed by depth: tables nested inside <table1 become <table2 ... </table2>, tables inside <table2 become <table3 ... </table3>, and so on. That makes it easy to match the tables at the desired depth. Afterwards, strip the numbering in case you need the original source again.

// get the specified tags at the desired depth
// (U = ungreedy, s = dot matches newlines; \b keeps depth 2 from matching depth 20)
$pattern = '~<'.$rx_tag.$rx_depth.'\b.*</'.$rx_tag.$rx_depth.'>~Uis';
preg_match_all($pattern,$source,$out);

// strip markers from each match
if(!empty($out[0]))
{
  foreach($out[0] as $v)
  {
    $v = preg_replace('~(</?'.$rx_tag.')\d+~i','\1',$v);

    // test output
    echo nl2br(htmlspecialchars($v))."<br>------------------------------<br>";
  }
}

// strip markers from the full source to restore the original
$source = preg_replace('~(</?'.$rx_tag.')\d+~i','\1',$source);

By design, a matched tag keeps any nested tags of the same kind: a <table2>...</table2> match still contains its <table3>...</table3> ... <table3>...</table3> children. Set $rx_depth = 3; to match those on their own.
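
The mark, match, and strip steps can also be bundled into one function, which avoids the global counter. The following is a sketch under stated assumptions: `match_tags_at_depth()` is a hypothetical helper name, and the code assumes the tags are balanced and that no `<table` text appears inside comments or scripts, since either would throw the depth count off.

```php
<?php
/**
 * Hypothetical helper: return every <$tag>...</$tag> block found at
 * nesting depth $depth (outermost = 1), with depth markers removed.
 */
function match_tags_at_depth(string $html, string $tag, int $depth): array
{
    $t = preg_quote($tag, '~');
    $level = 0;

    // Step 1: append the nesting depth to each opening and closing tag name.
    $marked = preg_replace_callback(
        '~<(/)?' . $t . '\b~i',
        function (array $m) use (&$level): string {
            return ($m[1] ?? '') === '/' ? $m[0] . $level-- : $m[0] . ++$level;
        },
        $html
    );

    // Step 2: lazily match the numbered tags at the requested depth.
    preg_match_all('~<' . $t . $depth . '\b.*?</' . $t . $depth . '>~is', $marked, $found);

    // Step 3: strip the depth markers from each matched block.
    return array_map(
        function (string $block) use ($t): string {
            return preg_replace('~(</?' . $t . ')\d+~i', '$1', $block);
        },
        $found[0]
    );
}

// Usage with the sample HTML from the question.
$sample = <<<'HTML'
<table id="parent">
    <table class="possible_target">
        <tr><td>We're targeting this table</td></tr>
    </table>
</table>
<table class="possible_target">
    <tr><td>We're not targeting this table</td></tr>
</table>
HTML;

$nested = match_tags_at_depth($sample, 'table', 2);
foreach ($nested as $block) {
    echo $block, PHP_EOL;
}
```

With the sample input, depth 2 yields only the nested `possible_target` table, and depth 1 yields the parent table (nested table included) plus the standalone one.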

I hope it works as it should; I was already very tired :-) It's designed to work with any kind of tag, but I didn't test it much. At least it's an idea.

Jonny 5 answered Dec 14 '25 07:12