DOM xpath to find #text nodes and wrap in paragraph tag

I would like to find all root-level #text nodes (or those with div parents) that should be wrapped inside a <p> tag. In the following text there should be three (or even just two) final root <p> tags.

<div>
    This text should be wrapped in a p tag.
</div>

This also should be wrapped.

<b>And</b> this.

The idea is to format the text more nicely so that text blocks are grouped into paragraphs for HTML display. However, the following XPath I have been working on seems to fail to select the text nodes.

<?php

$html = '<div>
    This text should be wrapped in a p tag.
</div>

This also should be wrapped.

<b>And</b> this.';

libxml_use_internal_errors(TRUE);

// loadHTML() is an instance method; PHP 8 no longer allows calling it statically.
$dom = new DOMDocument();
$dom->loadHTML($html);

$xp = new DOMXPath($dom);

$xpath = '//text()[not(parent::p) and normalize-space()]';

foreach($xp->query($xpath) as $node) {
    $element = $dom->createElement('p');
    $node->parentNode->replaceChild($element, $node);
    $element->appendChild($node);
}

print $dom->saveHTML();

asked Mar 21 '13 16:03

Xeoncross

3 Answers

OK, so let me rephrase my comment as an answer. If you want to match all text nodes, you should simply remove the //div part from your XPath expression. So it becomes:

//text()[not(parent::p) and normalize-space()]
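
If only root-level text nodes and those directly inside a div are wanted, the predicate can be narrowed. The following is a minimal sketch and not part of the original answer. It assumes the fragment is loaded with loadHTML(), so that root-level text ends up as a child of the implied <body>; the variable names are illustrative. It wraps each text node on its own, so <b>And</b> stays outside the paragraph that wraps " this.".

```php
<?php
// Sketch only: wraps text nodes whose parent is <body> or <div>.
$html = "<div>\n    This text should be wrapped in a p tag.\n</div>\n\n"
      . "This also should be wrapped.\n\n<b>And</b> this.";

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Assumption: root-level text of the fragment becomes a child of the implied <body>.
$query = '//text()[(parent::body or parent::div) and normalize-space()]';

// Copy the result into an array before changing the tree.
$nodes = iterator_to_array($xp->query($query));

foreach ($nodes as $node) {
    $p = $dom->createElement('p');
    $node->parentNode->replaceChild($p, $node); // put <p> where the text node was
    $p->appendChild($node);                     // move the text node into <p>
}

echo $dom->saveHTML();
```

Grouping inline elements such as <b> together with their neighbouring text needs more than a single XPath expression, which is what the next answer addresses.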

answered Oct 06 '22 11:10

nwellnhof

Your scenario has many edge cases, and the word "should" adds more on top. I assume you want the classic "a double line break starts a new paragraph" behavior, but this time inside a parent <div> (or presumably other block elements) as well.

I would let the HTML parser do most of the work, but I would still use text search and replace (alongside XPath). So what follows is a bit hackish, but I think it is pretty stable:

First of all, I would select all text nodes that are either top-level or children of the said div.

(.|./div)/text()

This XPath is relative to an anchor element, which is the <body> tag, as it represents the root tag of your HTML fragment when loaded into DOMDocument.
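
To make the anchor concrete, here is a minimal sketch of how it could be obtained and passed as the context node. It is not taken from the answer's full demo: fetching <body> with getElementsByTagName() is an assumption about how $anchor is set up, and the printing loop is only there to show which nodes match.

```php
<?php
$html = "<div>\n    This text should be wrapped in a p tag.\n</div>\n\n"
      . "This also should be wrapped.\n\n<b>And</b> this.";

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Assumption: the implied <body> serves as the context node ("anchor").
$anchor = $dom->getElementsByTagName('body')->item(0);

// The second argument makes the expression relative to $anchor.
$result = $xp->query('(.|./div)/text()', $anchor);

foreach ($result as $node) {
    // json_encode() makes the whitespace inside each text node visible.
    printf("%s: %s\n", $node->parentNode->nodeName, json_encode($node->data));
}
```

The text inside <b> is not matched, because its parent is neither the anchor nor a div directly below it.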

If the text node is a child of a div, I would insert the starting paragraph mark at the very beginning.

Then, in any case, I would insert a break-mark (here in the form of a comment) at each occurrence of the sequence that starts a new paragraph. That should be "\n\n" because of whitespace normalization; I might be wrong, and if it doesn't apply, you would need to do the whitespace normalization up front for this to work transparently.

/* @var $result DOMText[] */
$result = $xp->query('(.|./div)/text()', $anchor);

foreach ($result as $i => $node)
{
    if ($node->parentNode->tagName == 'div')
    {
        $insertBreakMarkBefore($node, true);
    }

    while (FALSE !== $pos = strpos($node->data, $paragraphSequence))
    {
        $node = $node->splitText($pos + $paragraphSequenceLength);
        $insertBreakMarkBefore($node);
    }
}
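
The loop above relies on definitions that are skipped here. The following sketch fills them in under stated assumptions so the loop can be run: the value "\n\n" for $paragraphSequence, the comment text 'break', and the whole body of $insertBreakMarkBefore are guesses, chosen so that each mark is surrounded by the paragraph sequence. The original helper may well differ.

```php
<?php
$html = "<div>\n    This text should be wrapped in a p tag.\n</div>\n\n"
      . "This also should be wrapped.\n\n<b>And</b> this.";

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$anchor = $dom->getElementsByTagName('body')->item(0);

// Assumed values, not confirmed by the answer.
$paragraphSequence       = "\n\n";
$paragraphSequenceLength = strlen($paragraphSequence);
$paragraphComment        = 'break';

// Hypothetical helper: puts <!--break--> followed by the sequence before $node.
// With $atStart = true the sequence is also added in front of the comment.
$insertBreakMarkBefore = function (DOMText $node, $atStart = false)
    use ($dom, $paragraphSequence, $paragraphComment) {
    $parent = $node->parentNode;
    if ($atStart) {
        $parent->insertBefore($dom->createTextNode($paragraphSequence), $node);
    }
    $parent->insertBefore($dom->createComment($paragraphComment), $node);
    $parent->insertBefore($dom->createTextNode($paragraphSequence), $node);
};

foreach ($xp->query('(.|./div)/text()', $anchor) as $node) {
    if ($node->parentNode->nodeName === 'div') {
        $insertBreakMarkBefore($node, true);
    }
    // Split after every paragraph sequence and mark the start of the remainder.
    while (false !== $pos = strpos($node->data, $paragraphSequence)) {
        $node = $node->splitText($pos + $paragraphSequenceLength);
        $insertBreakMarkBefore($node);
    }
}

printf("break-marks inserted: %d\n", $xp->query('//comment()')->length);
```

Because splitText() returns the remainder of the text node, the while loop always continues searching behind the mark it has just inserted.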

These inserted break-marks are just there to be replaced with an HTML <p> tag. An HTML parser will turn those into proper <p>...</p> pairs, so I can spare myself writing that algorithm (even though it might be interesting). This basically works like something I once outlined in another answer, but I can't find the link any longer:

1. After modifying the DOM tree, get the inner HTML of the <body> again.
2. Replace the set marks with "<p>" (here I add a class as well to make this visible).
3. Load the HTML fragment into the parser again to re-create the DOM with proper <p>...</p> pairs.
4. Obtain the HTML again from the DOMDocument parser, which is now final.

These outlined steps in code (skipping some of the function definitions for a moment):

$needle  = sprintf('%1$s<!--%2$s-->%1$s', $paragraphSequence, $paragraphComment);
$replace = sprintf("\n<p class=\"%s\">\n", $paragraphComment);
$html    = strtr($innerHTML($anchor), array($needle . $needle => $replace, $needle => $replace));

echo "HTML afterwards:\n", $innerHTML($loadHTMLFragment($html));

As this shows, double sequences are replaced with a single one. The one at the end probably needs to be deleted as well (if applicable, you could also trim whitespace here).

The final HTML output:

<div>
<p class="break">

    This text should be wrapped in a p tag.
</p>
</div>
<p class="break">
This also should be wrapped.
</p>
<p class="break">
<b>And</b> this.</p>

Some more post-processing for nicer output formatting can be useful, too. Actually, I think it's worth doing, as it will help you tweak the algorithm (Full Demo; I just noticed that whitespace normalization probably does not apply there, so use with care).
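
The $innerHTML and $loadHTMLFragment helpers used above are not defined in this answer. A minimal sketch of what they might look like follows; both bodies are assumptions, and the $marked input is made-up sample data standing in for the output of the replacement step. It illustrates the point of the approach: the parser closes the unclosed <p> tags on its own.

```php
<?php
// Hypothetical helper: serializes all children of $node as HTML.
$innerHTML = function (DOMNode $node) {
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
};

// Hypothetical helper: parses a fragment and returns the implied <body>.
$loadHTMLFragment = function ($html) {
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return $dom->getElementsByTagName('body')->item(0);
};

// Sample input with opening <p> tags only, as the replacement step produces.
$marked = "<div>\n<p class=\"break\">\nFirst block.\n</div>\n"
        . "<p class=\"break\">\nSecond block.";

$body = $loadHTMLFragment($marked);

// The re-serialized markup now contains the closing tags added by the parser.
echo $innerHTML($body);
```

Letting the parser balance the tags avoids tracking by hand where each paragraph has to end, for example before a closing </div>.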

answered Oct 06 '22 11:10

hakre

You can do it with pure JavaScript if you wish:

var content = document.evaluate(
  '//text()',
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null
);

for (var i = 0; i < content.snapshotLength; i++) {
  console.log(content.snapshotItem(i).textContent);
}

answered Oct 06 '22 10:10

CodeWizard

Tags:

html

dom

php

xpath