
Puzzle: Splitting An HTML String Correctly

I'm trying to split an HTML string by a token in order to create a blog preview without displaying the full post. It's a little harder than I first thought. Here are the problems:

  • A user will be creating the HTML through a WYSIWYG editor (CKEditor). The markup isn't guaranteed to be pretty or consistent.
  • The token, read_more(), can be placed anywhere in the string, including being nested within a paragraph tag.
  • The resulting first split string needs to be valid HTML for all reasonable uses of the token.

Examples of possible uses:

<p>Some text here. read_more()</p>

<p>Some text read_more() here.</p>

<p>read_more()</p>

<p>  read_more()</p>

read_more()

So far, I've tried just splitting the string on the token, but it leaves invalid HTML. Regex is perhaps another option. What strategy would you use to solve this and make it as bulletproof as possible? Any code snippets or hints would also be appreciated (I'm using PHP).

VirtuosiMedia asked Aug 01 '10 01:08



1 Answer

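function stripmore($in)
{
    // Split once on the token; if it is absent, return the input unchanged.
    $parts = explode('read_more()', $in, 2);
    if (count($parts) < 2) return $in;
    list($p1, $p2) = $parts;

    // Drop text between tags, then text before the first and after the last tag.
    $pass1 = preg_replace('~>[^<>]+<~', '><', $p2);
    $pass2 = preg_replace('~^[^<>]+|[^<>]+$~', '', $pass1);

    // Repeatedly remove empty open/close pairs until nothing changes.
    do {
        $pass3 = $pass2;
        $pass2 = preg_replace('~<([^<>]+)></\1>~', '', $pass3);
    } while ($pass2 !== $pass3);

    // Only closing tags for elements opened before the token remain.
    return $p1.'read_more()'.$pass3;
}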

This strips any non-HTML text after the read_more() mark and reduces what is left to the minimum by removing matching pairs of empty tags, while keeping the closing tags of any element that starts before the mark and ends after it:

<p>Some text here. read_more()</p>
      ==> <p>Some text here. read_more()</p>

<p>Some <b>text</b> read_more() <b>here</b>.</p>
      ==> <p>Some <b>text</b> read_more()</p>

<p>Some <b>text read_more() here</b>.</p>
      ==> <p>Some <b>text read_more()</b></p>
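
To turn this into the preview the question asks for, the token itself has to go as well, and a paragraph that contained nothing but the token should disappear. Below is a minimal sketch of that, using the same regex passes. The name make_preview() is hypothetical and not part of the answer above, and the sketch assumes the editor's markup is well nested. It also differs from the function above in one respect: the empty-pair pattern tolerates attributes on the opening tag, so <p class="x"></p> is removed too.

```php
<?php
// Hypothetical helper, not from the answer above: returns the part of the
// post before the token, plus the closing tags needed to keep it balanced.
function make_preview(string $html, string $token = 'read_more()'): string
{
    $parts = explode($token, $html, 2);
    if (count($parts) < 2) {
        return $html; // No token: treat the whole post as the preview.
    }
    [$before, $after] = $parts;

    // Keep only the tags that follow the token: drop text between tags,
    // then text before the first tag and after the last one.
    $tail = preg_replace('~>[^<>]+<~', '><', $after);
    $tail = preg_replace('~^[^<>]+|[^<>]+$~', '', $tail);

    // Join the two halves without the token.
    $preview = rtrim($before) . $tail;

    // Remove empty open/close pairs until the string stops changing.
    // Assumption: attributes are allowed on the opening tag.
    do {
        $previous = $preview;
        $preview = preg_replace('~<(\w+)\b[^<>]*></\1>~', '', $previous);
    } while ($preview !== $previous);

    return $preview;
}

echo make_preview('<p>Some <b>text read_more() here</b>.</p>'), "\n";
```

Because the cleanup runs over the whole joined string, it also removes empty elements that were already present before the token. Void elements such as <br> that appear after the token are kept, since they have no closing tag to pair with. For markup that cannot be trusted to be well nested, a DOM parser is probably a safer basis than regular expressions.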
mvds answered Oct 24 '22 02:10
