
Regex / DOMDocument - match and replace text not in a link

I need to find and replace all matches of a text in a case-insensitive way, unless the text is within an anchor tag. For example:

<p>Match this text and replace it</p>
<p>Don't <a href="/">match this text</a></p>
<p>We still need to match this text and replace it</p>

Searching for 'match this text' should replace only the first and last instances.

[Edit] As per Gordon's comment, it may be preferable to use DOMDocument in this instance. I'm not at all familiar with the DOMDocument extension, and would really appreciate some basic examples for this functionality.

BrynJ asked Oct 28 '10 16:10



1 Answer

Here is a UTF-8 safe solution that works not only with well-formed documents, but also with document fragments.

The mb_convert_encoding() call is needed because loadHTML() seems to have a bug with UTF-8 encoding (see references 3 and 4 below).

The mb_substr() call trims the body tag from the output, so you get back your original content without any additional markup.
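The trimming step can be tried in isolation. This is a minimal sketch, assuming the serialized body element carries no attributes, so the opening tag is exactly the 6 characters of <body> and the closing tag is the 7 characters of </body>. The $serialized string is a made-up stand-in for what saveXML() would return for the body node.

```php
<?php
// Hypothetical serialized output for the <body> node (illustrative value only).
$serialized = '<body><p>MATCH and replace itŐŰ</p></body>';

// Skip the 6 characters of "<body>" and drop the last 7 characters of "</body>".
// mb_substr() counts characters rather than bytes, so multibyte text stays intact.
$inner = mb_substr($serialized, 6, -7, 'UTF-8');

echo $inner, PHP_EOL;
```

If the body tag had attributes, the fixed offset of 6 would no longer line up, which is worth keeping in mind when reusing this trick.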

<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is <a href="#">a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';

$dom = new DOMDocument();
// loadXML() requires well-formed documents, so loadHTML() is the better choice here,
// but it needs a workaround to handle UTF-8 input correctly.
// Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2.
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

$xpath = new DOMXPath($dom);

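// Visit every text node that is not inside an <a> element, at any depth
foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
    // Case-insensitive replacement on the text content
    $replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
    // appendXML() parses its argument, so the result must be well-formed XML
    $newNode  = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}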

// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
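
For reuse, the same idea can be wrapped in a function. The following is only a sketch under a few assumptions, not the answer's original method: the name replaceOutsideLinks() is hypothetical, mb_encode_numericentity() is swapped in for the 'HTML-ENTITIES' conversion, and the replacement is written to the text node's data property instead of going through appendXML(). That last change means the replacement is treated as plain text, so it cannot insert markup, but text containing & or < no longer has to be well-formed XML.

```php
<?php
// Hypothetical helper; the name and signature are illustrative only.
function replaceOutsideLinks(string $html, string $search, string $replace): string
{
    if ($html === '') {
        return ''; // loadHTML() rejects an empty string
    }

    // Assumption: numeric entities for all non-ASCII characters are an
    // acceptable stand-in for the 'HTML-ENTITIES' conversion.
    $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom->loadHTML($encoded);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//body//text()[not(ancestor::a)]') as $node) {
        // Plain-text replacement: no XML parsing of the result is involved.
        $node->data = str_ireplace($search, $replace, $node->data);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize the children of <body> one by one instead of trimming by offset.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveXML($child);
    }
    return $out;
}

$html = '<p>Match this text and replace it</p>'
      . '<p>Don\'t <a href="/">match this text</a></p>';

echo replaceOutsideLinks($html, 'match this text', 'MATCH'), PHP_EOL;
```

As in the code above, saveXML() produces XML-style serialization, so void elements such as br come back self-closed.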

References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?

I read dozens of answers on the subject, so I apologize if I left anyone out (please leave a comment and I will add yours as well).

Thanks to Gordon and stillstanding for commenting on my other answer.

István Ujj-Mészáros answered Oct 13 '22 11:10
