Search HTML for 2 phrases (ignoring all tags) and strip everything else

Tags: html, dom, regex, php

I have HTML stored in a string, for example:

$html = '
        <html>
        <body>
        <p>Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.</p>
        </body>
        </html>
        ';

Then I have two sentences stored in variables:

$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';

I want to search $html for these two sentences, and strip everything before and after them. So $html will become:

$html = 'Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.';

How can I achieve this? Note that the $begin and $end variables do not have html tags but the sentences in $html very likely do have tags as shown above.

Maybe a regex approach?

What I've tried so far

  • A strpos() approach. The problem is that $html contains tags in the sentences, making the $begin and $end sentences not match. I can strip_tags($html) before running strpos(), but then I will obviously end up with $html without the tags.

  • Searching for only part of a sentence, such as Hello, but that is never safe and can produce many matches.
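
A minimal reproduction of the first attempt, using a shortened version of the sample above. The variable names $rawPos, $plain and $plainPos are only illustrative:

```php
<?php
$html  = '<p>Hello <em>進撃の巨人</em>!</p>';
$begin = 'Hello 進撃の巨人!';

// Searching the raw markup fails: <em>...</em> interrupts the sentence.
$rawPos = strpos($html, $begin);

// Searching the tag-free text succeeds, but the offset refers to the
// stripped string, so it cannot be used to cut the original markup.
$plain    = strip_tags($html);
$plainPos = strpos($plain, $begin);

var_dump($rawPos, $plainPos);
```

The second offset is valid only inside the stripped copy, which is why neither attempt is enough on its own.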

Henrik Petterson asked Apr 18 '16 09:04


3 Answers

Here is a short and, I believe, working solution based on a lazy dot-matching regex. It could be improved with a longer, unrolled regex, but it should be enough unless you are processing really large chunks of text.

$html = "<html>\n<body>\n<p><p>H<div>ello</div><script></script> <em>進&nbsp;&nbsp;&nbsp;撃の巨人</em>!</p>\nrandom code\nrandom code\n<p>Lorem <span>ipsum<span>.</p>\n</body>\n </html>";
$begin = 'Hello     進撃の巨人!';
$end = 'Lorem ipsum.';
$begin = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $begin);
$end = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $end);
$begin_arr = preg_split('~(?=\X)~u', $begin, -1, PREG_SPLIT_NO_EMPTY);
$end_arr = preg_split('~(?=\X)~u', $end, -1, PREG_SPLIT_NO_EMPTY);
$reg = "(?s)(?:<[^<>]+>)?(?:&#?\\w+;)*\\s*" .  implode("", array_map(function($x, $k) use ($begin_arr) { return ($k < count($begin_arr) - 1 ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\\w+;))*" : preg_quote($x, "~"));}, $begin_arr, array_keys($begin_arr)))
        . "(.*?)" . 
        implode("", array_map(function($x, $k) use ($end_arr) { return ($k < count($end_arr) - 1 ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\\w+;))*" : preg_quote($x, "~"));}, $end_arr, array_keys($end_arr))); 
echo $reg .PHP_EOL;
// Print the matched span only if the pattern actually matched.
if (preg_match('~' . $reg . '~u', $html, $m)) {
    print_r($m[0]);
}

See the IDEONE demo

Algorithm:

  • Build the regex dynamically: split each delimiter string into single graphemes (they may be Unicode characters, so I suggest preg_split('~(?<!^)(?=\X)~u', $end)), then join the pieces back together with an optional tag-matching pattern such as (?:<[^<>]+>)? between them.
  • Then, (?s) enables DOTALL mode, in which . matches any character including a newline, and the lazy .*? matches 0+ characters between the leading and the trailing delimiter.
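
As a rough sketch of the first step for a single phrase: the helper name tolerantPattern is hypothetical, and this simplified version leaves out the leading optional tag and the whitespace handling of the full code above.

```php
<?php
// Hypothetical helper: builds a pattern that matches $phrase even when
// tags or entities sit between its characters.
function tolerantPattern(string $phrase): string
{
    // Optional run of tags/entities, each optionally preceded by whitespace.
    $gap = '(?:\s*(?:<[^<>]+>|&#?\w+;))*';
    // Split before every grapheme except at the very start of the string.
    $chars = preg_split('~(?<!^)(?=\X)~u', $phrase);
    $chars = array_map(function ($c) { return preg_quote($c, '~'); }, $chars);
    return implode($gap, $chars);
}

$pattern = '~' . tolerantPattern('Hello 進撃の巨人!') . '~u';
preg_match($pattern, '<p>Hello <em>進撃の巨人</em>!</p>', $m);
echo $m[0], PHP_EOL;
```

The match keeps the <em> tags because the gap pattern consumes them between 進 and the preceding space, and again between 人 and the exclamation mark.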

Regex details:

  • ~(?<!^)(?=\X)~u matches every position before a grapheme, except at the start of the string
  • (sample final regex) (?s)(?:<[^<>]+>)?(?:&#?\w+;)*\s*H(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*進(?:\s*(?:<[^<>]+>|&#?\w+;))*撃(?:\s*(?:<[^<>]+>|&#?\w+;))*の(?:\s*(?:<[^<>]+>|&#?\w+;))*巨(?:\s*(?:<[^<>]+>|&#?\w+;))*人(?:\s*(?:<[^<>]+>|&#?\w+;))*\!(?:\s*(?:<[^<>]+>|&#?\w+;))* + (.*?) + L(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))*r(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*i(?:\s*(?:<[^<>]+>|&#?\w+;))*p(?:\s*(?:<[^<>]+>|&#?\w+;))*s(?:\s*(?:<[^<>]+>|&#?\w+;))*u(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))*\. - the leading and trailing delimiters with optional subpatterns for tag matching and a (.*?) (capturing might not be necessary) inside.
  • ~u modifier is necessary since Unicode strings are to be processed.
  • UPDATE: To account for 1+ spaces, any whitespace in the begin and end patterns can be replaced with \s+ subpattern to match any kind of 1+ whitespace characters in the input string.
  • UPDATE 2: The auxiliary $begin = preg_replace('~\s+~u', ' ', $begin); and $end = preg_replace('~\s+~u', ' ', $end); collapse each run of 1+ whitespace in the delimiter strings into a single space (the code above uses a preg_replace_callback variant that also drops trailing whitespace).
  • To account for HTML entities, add another subpattern to the optional parts: &#?\\w+; which matches entities such as &nbsp; and &#123;. The optional part is also prefixed with \s* to allow optional whitespace, and quantified with * (zero or more repetitions).
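
The whitespace idea from the first UPDATE can be sketched in isolation like this. It is an assumed illustration, not the code above, and it ignores tags and entities entirely:

```php
<?php
// Collapse whitespace runs in the phrase and drop leading/trailing whitespace.
$begin = "Hello     進撃の巨人!  ";
$begin = trim(preg_replace('~\s+~u', ' ', $begin));

// Quote each word separately, then let every remaining space match
// one or more whitespace characters of any kind in the input.
$parts   = array_map(function ($p) { return preg_quote($p, '~'); }, explode(' ', $begin));
$pattern = '~' . implode('\s+', $parts) . '~u';

$found = preg_match($pattern, "Hello \n\t 進撃の巨人!", $m);
var_dump($found);
```

Quoting the words before joining them keeps the \s+ separators out of preg_quote's reach.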

Wiktor Stribiżew answered Oct 19 '22 11:10


I really wanted to write a regex solution, but others got there first with some nice and complex ones. So here is a non-regex solution.

Short explanation: The main problem is keeping the HTML tags. Searching the text would be easy if the tags were stripped, so strip them. We can then search the stripped content and work out the offsets of the substring we want, and finally cut that substring out of the original HTML while keeping its tags.

Advantages:

  • Searching is easy and independent of the HTML; you can also search with a regex if you need to
  • Requirements are scalable: you can easily add full multibyte support, entity support, whitespace collapsing, and so on
  • Relatively fast (though a direct regex may be faster)
  • Does not touch the original HTML, and can be adapted to other markup languages

A static utility class for this scenario:

class HtmlExtractUtil
{

    const FAKE_MARKUP = '<>';
    const MARKUP_PATTERN = '#<[^>]+>#u';

    static public function extractBetween($html, $startTextToFind, $endTextToFind)
    {
        $strippedHtml = preg_replace(self::MARKUP_PATTERN, '', $html);
        $startPos = strpos($strippedHtml, $startTextToFind);
        $lastPos = strrpos($strippedHtml, $endTextToFind);

        if ($startPos === false || $lastPos === false) {
            return "";
        }

        $endPos = $lastPos + strlen($endTextToFind);
        if ($endPos <= $startPos) {
            return "";
        }

        return self::extractSubstring($html, $startPos, $endPos);
    }

    static public function extractSubstring($html, $startPos, $endPos)
    {
        preg_match_all(self::MARKUP_PATTERN, $html, $matches, PREG_OFFSET_CAPTURE);
        $start = -1;
        $end = -1;
        $previousEnd = 0;
        $stripPos = 0;
        $matchArray = $matches[0];
        $matchArray[] = [self::FAKE_MARKUP, strlen($html)];
        foreach ($matchArray as $match) {
            $diff = $previousEnd - $stripPos;
            $textLength = $match[1] - $previousEnd;
            if ($start == (-1)) {
                if ($startPos >= $stripPos && $startPos < $stripPos + $textLength) {
                    $start = $startPos + $diff;
                }
            }
            if ($end == (-1)) {
                if ($endPos > $stripPos && $endPos <= $stripPos + $textLength) {
                    $end = $endPos + $diff;
                    break;
                }
            }
            $tagLength = strlen($match[0]);
            $previousEnd = $match[1] + $tagLength;
            $stripPos += $textLength;
        }

        if ($start == (-1)) {
            return "";
        } elseif ($end == (-1)) {
            return substr($html, $start);
        } else {
            return substr($html, $start, $end - $start);
        }
    }

}

Usage:

$html = '
<html>
<body>
<p>Any string before</p>
<p>Hello <em>進撃の巨人</em>!</p>
random code
random code
<p>Lorem <span>ipsum<span>.</p>
<p>Any string after</p>
</body>
</html>
';
$startTextToFind = 'Hello 進撃の巨人!';
$endTextToFind = 'Lorem ipsum.';

$extractedText = HtmlExtractUtil::extractBetween($html, $startTextToFind, $endTextToFind);

header("Content-type: text/plain; charset=utf-8");
echo $extractedText . "\n";

Dávid Horváth answered Oct 19 '22 11:10


Regular expressions have their limitations when it comes to parsing HTML. As many have done before me, I will refer to this famous answer.

Potential Problems when relying on Regular Expressions

For instance, imagine this tag appears in the HTML before the part that must be extracted:

<p attr="Hello 進撃の巨人!">This comes before the match</p>

Many regexp solutions will stumble over this, and return a string that starts in the middle of this opening p tag.

Or consider a comment inside the HTML section that has to be matched:

<!-- Next paragraph will display "Lorem ipsum." -->

Or, some loose less-than and greater-than signs appear (let's say in a comment, or attribute value):

<!-- Next paragraph will display >-> << Lorem ipsum. >> -->
<p data-attr="->->->" class="myclass">

What will those regexes do with that?

These are just examples... there are countless other situations that pose problems to regular expression based solutions.

There are more reliable ways to parse HTML.
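
To illustrate the attribute case, here is a small comparison of a naive tag-stripping pattern with a real parser. It assumes PHP's DOM extension is available; the variable names are only illustrative.

```php
<?php
$snippet = '<p data-attr="->->->" class="myclass">text</p>';

// A naive tag pattern stops at the first ">" inside the attribute value,
// so part of the attribute leaks into the "text".
$viaRegex = preg_replace('#<[^>]+>#u', '', $snippet);

// An HTML parser treats the quoted attribute value as a single token.
$dom = new DOMDocument();
$dom->loadHTML($snippet);
$viaDom = $dom->documentElement->textContent;

var_dump($viaRegex, $viaDom);
```

Only the parser-based result contains just the paragraph text; the regex result still carries the tail of the attribute and the class declaration.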

Load the HTML into a DOM

I will suggest here a solution based on the DOMDocument interface, using this algorithm:

  1. Get the text content of the HTML document and find the two offsets where the begin and end substrings are located.

  2. Then go through the DOM text nodes, keeping track of the offsets these nodes cover. In each node where one of the two bounding offsets falls, insert a predefined delimiter (|). That delimiter must not already occur in the HTML string, so it is doubled (||, ||||, ...) until that condition is met.

  3. Finally, split the HTML representation by this delimiter and take the middle part as the result.
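
Two pieces of step 2 can be sketched separately: choosing a delimiter that does not occur in the input, and walking the text nodes while tracking byte offsets. The helper name pickDelimiter is hypothetical, and the sketch assumes PHP's DOM extension is available.

```php
<?php
// Hypothetical helper: double the delimiter until it is absent from $html.
function pickDelimiter(string $html): string
{
    $delim = '|';
    while (strpos($html, $delim) !== false) {
        $delim .= $delim;   // '|' -> '||' -> '||||' ...
    }
    return $delim;
}

// Walk all text nodes in document order and print the byte range each one
// covers within the concatenated text content.
$dom = new DOMDocument();
$dom->loadHTML('<p>Hello <em>world</em>!</p>');
$xpath = new DOMXPath($dom);

$pos = 0;
foreach ($xpath->evaluate('//text()') as $node) {
    $len = strlen($node->nodeValue);
    printf("%d..%d %s\n", $pos, $pos + $len, json_encode($node->nodeValue));
    $pos += $len;
}
```

Because the ranges are contiguous, an offset found in the plain text content identifies exactly one text node and a position inside it, which is where the delimiter is injected.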

Here is the code:

function extractBetween($html, $begin, $end) {
    $dom = new DOMDocument();
    // Load HTML in DOM, making sure it supports UTF-8; double HTML tags are no problem
    $dom->loadHTML('<html><head>
            <meta http-equiv="content-type" content="text/html; charset=utf-8">
        </head></html>' . $html);
    // Get complete text content
    $text = $dom->textContent;
    // Get positions of the beginning/ending text; exit if not found.
    if (($from = strpos($text, $begin)) === false) return false;
    if (($to = strpos($text, $end, $from + strlen($begin))) === false) return false;
    $to += strlen($end);
    // Define a non-occurring delimiter by repeating `|` enough times:
    for ($delim = '|'; strpos($html, $delim) !== false; $delim .= $delim);
    // Use XPath to traverse the DOM
    $xpath = new DOMXPath($dom);
    // Go through the text nodes keeping track of total text length.
    // When exceeding one of the two offsets, inject a delimiter at that position.
    $pos = 0;
    foreach($xpath->evaluate("//text()") as $node) {
        // Add length of node's text content to total length
        $newpos = $pos + strlen($node->nodeValue);
        // Bytes already injected into this node; a second insertion in the
        // same node must be shifted by this amount.
        $shift = 0;
        while ($newpos > $from || ($from === $to && $newpos === $from)) {
            // The beginning/ending text starts/ends somewhere in this text node.
            // Inject the delimiter at that position:
            $node->nodeValue = substr_replace($node->nodeValue, $delim, $from - $pos + $shift, 0);
            $shift += strlen($delim);
            // If a delimiter was inserted at both beginning and ending texts,
            // then get the HTML and return the part between the delimiters
            if ($from === $to) return explode($delim, $dom->saveHTML())[1];
            // Delimiter was inserted at beginning text. Now search for ending text
            $from = $to;
        }
        $pos = $newpos;
    }
}

You would call it like this:

// Sample input data
$html = '
        <html>
        <body>
        <p>This comes before the match</p>
        <p>Hey! Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>. la la la</p>
        <p>This comes after the match</p>
        </body>
        </html>
        ';

$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';

// Call
$html = extractBetween($html, $begin, $end);

// Output result
echo $html;

Output:

Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.

You'll find this code is also easier to maintain than regex alternatives.

See it run on eval.in.

trincot answered Oct 19 '22 12:10