I'm trying to split an HTML string by a token in order to create a blog preview without displaying the full post. It's a little harder than I first thought. Here are the problems:
The token read_more() can be placed anywhere in the string, including being nested within a paragraph tag. Examples of possible uses:
<p>Some text here. read_more()</p>
<p>Some text read_more() here.</p>
<p>read_more()</p>
<p> read_more()</p>
read_more()
So far, I've tried just splitting the string on the token, but it leaves invalid HTML. Regex is perhaps another option. What strategy would you use to solve this and make it as bulletproof as possible? Any code snippets or hints would also be appreciated (I'm using PHP).
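function stripmore($in)
{
    // Split once at the token; return the input unchanged if it is absent.
    $parts = explode("read_more()", $in, 2);
    if (count($parts) < 2) return $in;
    list($p1, $p2) = $parts;
    // Pass 1: empty out text that sits between two tags after the token.
    $pass1 = preg_replace("~>[^<>]+<~", "><", $p2);
    // Pass 2: drop text directly after the token, before the first tag.
    $pass2 = preg_replace("~^[^<>]+~", "", $pass1);
    // Pass 3: repeatedly remove empty pairs such as <b></b> until stable.
    $pass3 = $pass2;
    do {
        $pass2 = $pass3;
        $pass3 = preg_replace("~<([^<>]+)></\\1>~", "", $pass2);
    } while ($pass3 !== $pass2);
    return $p1 . "read_more()" . $pass3;
}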
This strips any non-HTML text after the read_more() mark, then reduces what is left to the minimum by removing tag pairs that both open and close after the mark, while keeping the closing tags of any element that starts before the mark and ends after it:
<p>Some text here. read_more()</p>
==> <p>Some text here. read_more()</p>
<p>Some <b>text</b> read_more() <b>here</b>.</p>
==> <p>Some <b>text</b> read_more()</p>
<p>Some <b>text read_more() here</b>.</p>
==> <p>Some <b>text read_more()</b></p>
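To see what each pass contributes, here is a rough sketch that runs the same three passes on the second example and keeps every intermediate string. stripmore_trace() and its $token parameter are hypothetical names made up for this illustration; they are not part of the function above.

```php
<?php
// Hypothetical helper (not from the original answer): applies the same
// passes as stripmore(), but returns each intermediate string.
function stripmore_trace(string $in, string $token = "read_more()"): array
{
    $parts = explode($token, $in, 2);
    if (count($parts) < 2) {
        return ['result' => $in];   // no token found: nothing to strip
    }
    [$before, $after] = $parts;

    // Pass 1: empty out text that sits between two tags.
    $pass1 = preg_replace("~>[^<>]+<~", "><", $after);
    // Pass 2: drop text directly after the token, before the first tag.
    $pass2 = preg_replace("~^[^<>]+~", "", $pass1);
    // Pass 3: repeatedly delete empty pairs such as <b></b> until stable.
    $pass3 = $pass2;
    do {
        $prev  = $pass3;
        $pass3 = preg_replace("~<([^<>]+)></\\1>~", "", $prev);
    } while ($pass3 !== $prev);

    return [
        'pass1'  => $pass1,
        'pass2'  => $pass2,
        'pass3'  => $pass3,
        'result' => $before . $token . $pass3,
    ];
}

$trace = stripmore_trace('<p>Some <b>text</b> read_more() <b>here</b>.</p>');
foreach ($trace as $stage => $value) {
    echo str_pad($stage, 7), '| ', $value, PHP_EOL;
}
```

For that input, pass 1 leaves " <b></b></p>", pass 2 removes the leading space, and pass 3 collapses the empty <b></b> pair so that only "</p>" is left after the mark. Note that the pair-removal regex compares the whole tag content, so an emptied tag that carries attributes (for example <a href="x"></a>) would not be removed by it.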