I have HTML code stored in a string, for example:
$html = '
<html>
<body>
<p>Hello <em>進撃の巨人</em>!</p>
random code
random code
<p>Lorem <span>ipsum<span>.</p>
</body>
</html>
';
Then I have two sentences stored in variables:
$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';
I want to search $html for these two sentences and strip everything before and after them, so that $html becomes:
$html = 'Hello <em>進撃の巨人</em>!</p>
random code
random code
<p>Lorem <span>ipsum<span>.';
How can I achieve this? Note that the $begin and $end variables do not have HTML tags, but the sentences in $html very likely do, as shown above.
Maybe a regex approach?
A strpos() approach. The problem is that $html contains tags inside the sentences, so the $begin and $end sentences do not match. I can run strip_tags($html) before strpos(), but then I will obviously end up with $html without the tags.
Searching for only part of a variable, like Hello, but that's never safe and will give many matches.
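The strpos() problem described above can be reproduced in a few lines. This is only an illustrative sketch; the shortened $html and the $plain variable are made up for the demonstration.

```php
<?php
$html  = '<p>Hello <em>進撃の巨人</em>!</p>';
$begin = 'Hello 進撃の巨人!';

// Searching the raw HTML fails: the <em> tags interrupt the sentence.
var_dump(strpos($html, $begin));   // bool(false)

// Searching the stripped text succeeds, but the offset refers to
// the stripped string, and the tags are gone.
$plain = strip_tags($html);
var_dump(strpos($plain, $begin));  // int(0)
```

Note that strpos() returns int(0) for a match at the very start, so the result has to be compared with === false.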
Here is a short, yet - I believe - working solution based on a lazy dot matching regex (that can be improved by creating a longer, unrolled regex, but should be enough unless you have really large chunks of text).
$html = "<html>\n<body>\n<p><p>H<div>ello</div><script></script> <em>進 撃の巨人</em>!</p>\nrandom code\nrandom code\n<p>Lorem <span>ipsum<span>.</p>\n</body>\n </html>";
$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';
$begin = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $begin);
$end = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $end);
$begin_arr = preg_split('~(?=\X)~u', $begin, -1, PREG_SPLIT_NO_EMPTY);
$end_arr = preg_split('~(?=\X)~u', $end, -1, PREG_SPLIT_NO_EMPTY);
$reg = "(?s)(?:<[^<>]+>)?(?:&#?\\w+;)*\\s*" . implode("", array_map(function($x, $k) use ($begin_arr) { return ($k < count($begin_arr) - 1 ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\\w+;))*" : preg_quote($x, "~"));}, $begin_arr, array_keys($begin_arr)))
. "(.*?)" .
implode("", array_map(function($x, $k) use ($end_arr) { return ($k < count($end_arr) - 1 ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\\w+;))*" : preg_quote($x, "~"));}, $end_arr, array_keys($end_arr)));
echo $reg .PHP_EOL;
preg_match('~' . $reg . '~u', $html, $m);
print_r($m[0]);
See the IDEONE demo
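The two long implode/array_map expressions in the code above build the same kind of pattern twice. As a sketch only, the idea can be factored into a helper; tagTolerantPattern() and $gap are hypothetical names introduced here, the whitespace normalization step is left out, and the shortened $html is made up for the example.

```php
<?php
// Hypothetical helper: turns plain text into a regex fragment that allows
// tags and entities (optionally preceded by whitespace) between graphemes.
function tagTolerantPattern(string $text): string
{
    $gap = '(?:\s*(?:<[^<>]+>|&#?\w+;))*';
    // Split before every grapheme; the empty leading piece is dropped.
    $graphemes = preg_split('~(?=\X)~u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $quoted = array_map(function ($g) {
        return preg_quote($g, '~');
    }, $graphemes);
    return implode($gap, $quoted);
}

$html = "<p>Hello <em>進撃の巨人</em>!</p>\nrandom code\n<p>Lorem <span>ipsum<span>.</p>";

// s: dot matches newlines, u: treat pattern and subject as UTF-8
$reg = '~' . tagTolerantPattern('Hello 進撃の巨人!')
     . '(.*?)'
     . tagTolerantPattern('Lorem ipsum.') . '~su';

$result = preg_match($reg, $html, $m) ? $m[0] : '';
echo $result, PHP_EOL;
```

Using implode() with the gap pattern as glue puts the optional tag/entity group between graphemes only, which is what the `$k < count(...) - 1` check in the original code achieves.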
Algorithm:
Create dynamic regex patterns from the delimiter strings by splitting them into graphemes (preg_split('~(?<!^)(?=\X)~u', $end)) and imploding them back with an optional tag matching pattern, (?:<[^<>]+>)?, in between.
(?s) enables DOTALL mode, in which . matches any character including a newline, and .*? matches 0+ characters between the leading and the trailing delimiter.
Regex details:
'~(?<!^)(?=\X)~u' matches every location before a grapheme, other than the start of the string.
(?s)(?:<[^<>]+>)?(?:&#?\w+;)*\s*H(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*進(?:\s*(?:<[^<>]+>|&#?\w+;))*撃(?:\s*(?:<[^<>]+>|&#?\w+;))*の(?:\s*(?:<[^<>]+>|&#?\w+;))*巨(?:\s*(?:<[^<>]+>|&#?\w+;))*人(?:\s*(?:<[^<>]+>|&#?\w+;))*\!(?:\s*(?:<[^<>]+>|&#?\w+;))*
+ (.*?)
+ L(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))*r(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*i(?:\s*(?:<[^<>]+>|&#?\w+;))*p(?:\s*(?:<[^<>]+>|&#?\w+;))*s(?:\s*(?:<[^<>]+>|&#?\w+;))*u(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))*\.
- the leading and trailing delimiters with optional subpatterns for tag matching, and a (.*?) (capturing might not be necessary) in between.
The ~u modifier is necessary since Unicode strings are to be processed.
Whitespace in the begin and end patterns can be replaced with a \s+ subpattern to match any kind of 1+ whitespace characters in the input string.
$begin = preg_replace('~\s+~u', ' ', $begin); and $end = preg_replace('~\s+~u', ' ', $end); are necessary to account for 1+ whitespace in the input string.
To account for HTML entities, the optional part also contains &#?\\w+;, which matches entities like &nbsp; and &#123;. It is also prepended with \s* to match optional whitespace, and quantified with * (zero or more occurrences).

I really wanted to write a regex solution, but others got there first with some nice and complex ones. So here is a non-regex solution.
Short explanation: The major problem is keeping HTML tags. We could easily search text, if HTML tags were stripped. So: strip these! We can easily search in the stripped content, and produce a substring we want to cut. Then, try to cut this substring from the HTML while keeping the tags.
Advantages:
A static utility class for this scenario:
class HtmlExtractUtil
{
    // Sentinel "tag" appended after the last real tag so trailing text is processed too
    const FAKE_MARKUP = '<>';
    // Matches any markup tag
    const MARKUP_PATTERN = '#<[^>]+>#u';

    public static function extractBetween($html, $startTextToFind, $endTextToFind)
    {
        // Search in the tag-free text; all positions are byte offsets
        $strippedHtml = preg_replace(self::MARKUP_PATTERN, '', $html);
        $startPos = strpos($strippedHtml, $startTextToFind);
        $lastPos = strrpos($strippedHtml, $endTextToFind);
        if ($startPos === false || $lastPos === false) {
            return "";
        }
        $endPos = $lastPos + strlen($endTextToFind);
        if ($endPos <= $startPos) {
            return "";
        }
        return self::extractSubstring($html, $startPos, $endPos);
    }

    public static function extractSubstring($html, $startPos, $endPos)
    {
        preg_match_all(self::MARKUP_PATTERN, $html, $matches, PREG_OFFSET_CAPTURE);
        $start = -1;       // start offset in $html, -1 while unknown
        $end = -1;         // end offset in $html, -1 while unknown
        $previousEnd = 0;  // offset in $html just after the previous tag
        $stripPos = 0;     // offset reached so far in the stripped text
        $matchArray = $matches[0];
        $matchArray[] = [self::FAKE_MARKUP, strlen($html)];
        foreach ($matchArray as $match) {
            // Number of markup bytes seen so far
            $diff = $previousEnd - $stripPos;
            // Length of the text chunk between the previous tag and this one
            $textLength = $match[1] - $previousEnd;
            if ($start == (-1)) {
                if ($startPos >= $stripPos && $startPos < $stripPos + $textLength) {
                    $start = $startPos + $diff;
                }
            }
            if ($end == (-1)) {
                if ($endPos > $stripPos && $endPos <= $stripPos + $textLength) {
                    $end = $endPos + $diff;
                    break;
                }
            }
            $tagLength = strlen($match[0]);
            $previousEnd = $match[1] + $tagLength;
            $stripPos += $textLength;
        }
        if ($start == (-1)) {
            return "";
        } elseif ($end == (-1)) {
            return substr($html, $start);
        } else {
            return substr($html, $start, $end - $start);
        }
    }
}
Usage:
$html = '
<html>
<body>
<p>Any string before</p>
<p>Hello <em>進撃の巨人</em>!</p>
random code
random code
<p>Lorem <span>ipsum<span>.</p>
<p>Any string after</p>
</body>
</html>
';
$startTextToFind = 'Hello 進撃の巨人!';
$endTextToFind = 'Lorem ipsum.';
$extractedText = HtmlExtractUtil::extractBetween($html, $startTextToFind, $endTextToFind);
header("Content-type: text/plain; charset=utf-8");
echo $extractedText . "\n";
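The core of the class above is translating an offset in the stripped text back to an offset in the original HTML. A minimal sketch of that mapping for a single start position follows; strippedToHtmlOffset() is a hypothetical helper written for this illustration, not part of the class, and like the class it works with byte offsets.

```php
<?php
// Hypothetical helper: map a byte offset in the tag-free text
// to the corresponding byte offset in the original HTML.
function strippedToHtmlOffset(string $html, int $strippedPos): int
{
    preg_match_all('#<[^>]+>#u', $html, $matches, PREG_OFFSET_CAPTURE);
    $shift = 0; // total bytes of markup located before the target position
    foreach ($matches[0] as [$tag, $tagPos]) {
        // $tagPos - $shift is where this tag sits in the stripped text.
        if ($tagPos - $shift > $strippedPos) {
            break; // this tag and all later ones come after the target
        }
        $shift += strlen($tag);
    }
    return $strippedPos + $shift;
}

$html     = '<p>Hi <b>there</b></p>';
$stripped = preg_replace('#<[^>]+>#u', '', $html);   // "Hi there"
$pos      = strpos($stripped, 'there');              // 3
$htmlPos  = strippedToHtmlOffset($html, $pos);       // 9
echo substr($html, $htmlPos, 5), PHP_EOL;            // there
```

Tags that sit exactly at the target position are counted as coming before it, so the returned offset points at the first text character, which matches how the class picks its start offset.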
Regular expressions have their limitations when it comes to parsing HTML. Like many have done before me, I will refer to this famous answer.
For instance, imagine this tag appears in the HTML before the part that must be extracted:
<p attr="Hello 進撃の巨人!">This comes before the match</p>
Many regexp solutions will stumble over this and return a string that starts in the middle of this opening p tag.
Or consider a comment inside the HTML section that has to be matched:
<!-- Next paragraph will display "Lorem ipsum." -->
Or, some loose less-than and greater-than signs appear (let's say in a comment, or attribute value):
<!-- Next paragraph will display >-> << Lorem ipsum. >> -->
<p data-attr="->->->" class="myclass">
What will those regexes do with that?
These are just examples... there are countless other situations that pose problems to regular expression based solutions.
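To make the attribute pitfall concrete, here is a small sketch. The $naive pattern is a simplified stand-in made up for this illustration; it is not one of the patterns from the other answers.

```php
<?php
$html = '<p attr="Hello 進撃の巨人!">This comes before the match</p>' . "\n"
      . '<p>Hello <em>進撃の巨人</em>!</p>';

// Simplified pattern that tolerates tags between the two words
$naive = '~Hello\s*(?:<[^<>]+>)*進撃の巨人~u';
preg_match($naive, $html, $m, PREG_OFFSET_CAPTURE);
$offset = $m[0][1]; // byte offset of the first hit

// The hit is inside a tag if the last "<" before it has not been closed yet.
$before    = substr($html, 0, $offset);
$lastOpen  = strrpos($before, '<');
$lastClose = strrpos($before, '>');
$insideTag = $lastOpen !== false && ($lastClose === false || $lastOpen > $lastClose);

var_dump($offset);    // int(9)
var_dump($insideTag); // bool(true)
```

The first match lands in the attribute value of the first p tag, not in the paragraph text that was intended.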
There are more reliable ways to parse HTML.
I will suggest here a solution based on the DOMDocument interface, using this algorithm:
Get the text content of the HTML document and identify the two offsets where the two substrings (begin/end) are located.
Then go through the DOM text nodes, keeping track of the offsets where these nodes fit in. In the nodes where either of the two bounding offsets is crossed, a predefined delimiter (|) is inserted. That delimiter should not be present in the HTML string, so it is doubled (||, ||||, ...) until that condition is met.
Finally, split the HTML representation by this delimiter and extract the middle part as the result.
Here is the code:
function extractBetween($html, $begin, $end) {
    $dom = new DOMDocument();
    // Load HTML in DOM, making sure it supports UTF-8; double HTML tags are no problem
    $dom->loadHTML('<html><head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head></html>' . $html);
    // Get complete text content
    $text = $dom->textContent;
    // Get positions of the beginning/ending text; exit if not found.
    if (($from = strpos($text, $begin)) === false) return false;
    if (($to = strpos($text, $end, $from + strlen($begin))) === false) return false;
    $to += strlen($end);
    // Define a non-occurring delimiter by repeating `|` enough times:
    for ($delim = '|'; strpos($html, $delim) !== false; $delim .= $delim);
    // Use XPath to traverse the DOM
    $xpath = new DOMXPath($dom);
    // Go through the text nodes keeping track of total text length.
    // When exceeding one of the two offsets, inject a delimiter at that position.
    $pos = 0;
    foreach ($xpath->evaluate("//text()") as $node) {
        // Add length of node's text content to total length
        $newpos = $pos + strlen($node->nodeValue);
        while ($newpos > $from || ($from === $to && $newpos === $from)) {
            // The beginning/ending text starts/ends somewhere in this text node.
            // Inject the delimiter at that position:
            $node->nodeValue = substr_replace($node->nodeValue, $delim, $from - $pos, 0);
            // If a delimiter was inserted at both beginning and ending texts,
            // then get the HTML and return the part between the delimiters
            if ($from === $to) return explode($delim, $dom->saveHTML())[1];
            // Delimiter was inserted at beginning text. Now search for ending text
            $from = $to;
        }
        $pos = $newpos;
    }
}
You would call it like this:
// Sample input data
$html = '
<html>
<body>
<p>This comes before the match</p>
<p>Hey! Hello <em>進撃の巨人</em>!</p>
random code
random code
<p>Lorem <span>ipsum<span>. la la la</p>
<p>This comes after the match</p>
</body>
</html>
';
$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';
// Call
$html = extractBetween($html, $begin, $end);
// Output result
echo $html;
Output:
Hello <em>進撃の巨人</em>!</p>
random code
random code
<p>Lorem <span>ipsum<span>.
You'll find this code is also easier to maintain than regex alternatives.
See it run on eval.in.
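The delimiter selection used in extractBetween() can be tried in isolation. In this sketch the loop is wrapped in a hypothetical helper, pickDelimiter(), which is a name introduced only for this example.

```php
<?php
// Hypothetical wrapper around the delimiter loop: keep doubling "|"
// until the candidate no longer occurs anywhere in the input.
function pickDelimiter(string $html): string
{
    for ($delim = '|'; strpos($html, $delim) !== false; $delim .= $delim);
    return $delim;
}

echo pickDelimiter('<p>ab</p>'), PHP_EOL;    // |
echo pickDelimiter('<p>a|b</p>'), PHP_EOL;   // ||
echo pickDelimiter('<p>a||b</p>'), PHP_EOL;  // ||||
```

Because the candidate is appended to itself, its length doubles on every step (1, 2, 4, ...), so the loop ends as soon as the candidate is longer than the longest run of | characters in the input.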