DOM xpath to find #text nodes and wrap in paragraph tag

Tags: html, dom, php, xpath
I would like to find all root-level #text nodes (or those with a <div> parent) and wrap each of them in a <p> tag. For the following text, the result should contain three (or even just two) <p> tags.

<div>
    This text should be wrapped in a p tag.
</div>

This also should be wrapped.

<b>And</b> this.

The idea is to format the text more cleanly, so that text blocks are grouped into paragraphs for HTML display. However, the XPath expression I have been working on fails to select the text nodes.

<?php

$html = '<div>
    This text should be wrapped in a p tag.
</div>

This also should be wrapped.

<b>And</b> this.';

libxml_use_internal_errors(TRUE);

// loadHTML() is an instance method; calling it statically fails in PHP 8.
$dom = new DOMDocument();
$dom->loadHTML($html);

$xp = new DOMXPath($dom);

$xpath = '//text()[not(parent::p) and normalize-space()]';

foreach($xp->query($xpath) as $node) {
    $element = $dom->createElement('p');
    $node->parentNode->replaceChild($element, $node);
    $element->appendChild($node);
}

print $dom->saveHTML();
Xeoncross asked Mar 21 '13 16:03



3 Answers

OK, so let me rephrase my comment as an answer. If you want to match all text nodes, you should simply remove the //div part from your XPath expression. So it becomes:

//text()[not(parent::p) and normalize-space()]
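
If only root-level text and text directly inside a <div> should be wrapped, as the question describes, the predicate can be narrowed. The following is a minimal sketch under that assumption; the helper name wrapLooseText() and the parent::body test are illustrative and not part of the original answer. It relies on DOMDocument::loadHTML() adding the <html> and <body> wrappers around the fragment.

```php
<?php
// Hypothetical helper (not from the original answer): wraps every non-empty
// text node whose parent is <body> or <div> in a new <p> element.
function wrapLooseText(string $html): DOMDocument
{
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html); // the parser supplies <html><body> around the fragment

    $xp    = new DOMXPath($dom);
    $query = '//text()[(parent::body or parent::div) and normalize-space()]';

    // Copy the matches into an array first so the loop never iterates
    // over a list while the tree underneath it is being modified.
    foreach (iterator_to_array($xp->query($query)) as $node) {
        $p = $dom->createElement('p');
        $node->parentNode->replaceChild($p, $node); // put <p> where the text was
        $p->appendChild($node);                     // move the text into the <p>
    }

    return $dom;
}

$html = "<div>\n    This text should be wrapped in a p tag.\n</div>\n\n"
      . "This also should be wrapped.\n\n<b>And</b> this.";

$dom = wrapLooseText($html);
echo $dom->saveHTML();
```

Note that this wraps each text node on its own, so in "<b>And</b> this." only " this." ends up inside a paragraph. Grouping inline elements with their neighbouring text needs extra logic.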
nwellnhof answered Oct 06 '22 11:10


Your scenario has many edge cases, and the word "should" adds to them. I assume you want the classic "a double line break starts a new paragraph" behavior, but this time inside a parent <div> (or other block elements) as well.

I would let the HTML parser do most of the work, but I would still use text search and replace alongside XPath. So what follows is a bit hackish, but I think it is pretty stable:

First of all, I would select all text nodes that are at the top level or are children of said div.

(.|./div)/text()

This XPath expression is relative to an anchor element, which is the <body> tag, since it is the root of your HTML fragment once it is loaded into DOMDocument.
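
To make the role of the anchor concrete, here is a small sketch of how the context node might be obtained and passed to DOMXPath::query(). The variable name $anchor matches the code further down; the sample markup is made up for illustration.

```php
<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML("<div>\n    Inside the div.\n</div>\n\nAt the top level.");

$xp = new DOMXPath($dom);

// loadHTML() wraps the fragment in <html><body>, so <body> serves as the anchor.
$anchor = $dom->getElementsByTagName('body')->item(0);

// The second argument is the context node the relative expression starts from.
$result = $xp->query('(.|./div)/text()', $anchor);

foreach ($result as $node) {
    // Show which element each selected text node belongs to.
    printf("%s: %s\n", $node->parentNode->nodeName, var_export($node->data, true));
}
```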

If the text node is a child of a div, I would insert the starting paragraph mark at the very beginning.

Then, in any case, I would insert a break mark (here in the form of a comment) at each occurrence of the sequence that starts a new paragraph. That should be "\n\n" because of whitespace normalization; I might be wrong, and if it does not apply, you would need to normalize the whitespace up front for this to work transparently.

/* @var $result DOMText[] */
$result = $xp->query('(.|./div)/text()', $anchor);

foreach ($result as $i => $node)
{
    if ($node->parentNode->tagName == 'div')
    {
        $insertBreakMarkBefore($node, true);
    }

    while (FALSE !== $pos = strpos($node->data, $paragraphSequence))
    {
        $node = $node->splitText($pos + $paragraphSequenceLength);
        $insertBreakMarkBefore($node);
    }
}
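
The loop above relies on a few definitions that the answer does not show. One way they might look is sketched below; the values of $paragraphSequence and $paragraphComment and the body of $insertBreakMarkBefore are assumptions, chosen so that they fit the search-and-replace step and the final output shown later in this answer.

```php
<?php
// Assumed definitions; the original answer does not show them.
$paragraphSequence       = "\n\n";   // text sequence that starts a new paragraph
$paragraphSequenceLength = strlen($paragraphSequence);
$paragraphComment        = 'break';  // content of the marker comment

// Inserts "<!--break-->" followed by the paragraph sequence before $node.
// With $leading = true the sequence is also inserted in front of the comment.
$insertBreakMarkBefore = function (DOMNode $node, bool $leading = false) use ($paragraphSequence, $paragraphComment): void {
    $doc    = $node->ownerDocument;
    $parent = $node->parentNode;
    if ($leading) {
        $parent->insertBefore($doc->createTextNode($paragraphSequence), $node);
    }
    $parent->insertBefore($doc->createComment($paragraphComment), $node);
    $parent->insertBefore($doc->createTextNode($paragraphSequence), $node);
};

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML("<div>\n    Inside the div.\n</div>\n\nFirst block.\n\nSecond block.");
$xp     = new DOMXPath($dom);
$anchor = $dom->getElementsByTagName('body')->item(0);

// Same loop as above, run over a copy of the result list.
foreach (iterator_to_array($xp->query('(.|./div)/text()', $anchor)) as $node) {
    if ($node->parentNode->nodeName === 'div') {
        $insertBreakMarkBefore($node, true);
    }
    // Note: strpos() counts bytes while splitText() counts characters,
    // so this is only safe for single-byte (ASCII) text.
    while (false !== $pos = strpos($node->data, $paragraphSequence)) {
        $node = $node->splitText($pos + $paragraphSequenceLength);
        $insertBreakMarkBefore($node);
    }
}

echo $dom->saveHTML($anchor);
```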

These inserted break marks are only there to be replaced with an HTML <p> tag. An HTML parser will turn those into proper <p>...</p> pairs, so I can spare myself writing that algorithm (even though it might be interesting). This basically works as I once outlined in another answer, but I can no longer find the link:

  1. After modifying the DOM tree, get the inner HTML of the <body> again.
  2. Replace the marks with "<p>" (here I add a class as well to make this visible).
  3. Load the HTML fragment into the parser again to re-create the DOM with the proper <p>...</p> pairs.
  4. Obtain the HTML again from the DOMDocument parser, which is now final.

These outlined steps in code (skipping some of the function definitions for a moment):

$needle  = sprintf('%1$s<!--%2$s-->%1$s', $paragraphSequence, $paragraphComment);
$replace = sprintf("\n<p class=\"%s\">\n", $paragraphComment);
$html    = strtr($innerHTML($anchor), array($needle . $needle => $replace, $needle => $replace));

echo "HTML afterwards:\n", $innerHTML($loadHTMLFragment($html));
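
The helpers $innerHTML and $loadHTMLFragment are among the definitions skipped above. A possible shape for them, applied to a hand-written string that already contains break marks, is sketched here; both closures are assumptions rather than the answer's actual code.

```php
<?php
$paragraphSequence = "\n\n";   // assumed, as in the needle above
$paragraphComment  = 'break';  // assumed marker name

// Assumed helper: serialises all children of $node, i.e. its inner HTML.
$innerHTML = function (DOMNode $node): string {
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
};

// Assumed helper: parses a fragment and returns the <body> the parser created.
$loadHTMLFragment = function (string $html): DOMNode {
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return $dom->getElementsByTagName('body')->item(0);
};

// Hand-written input that already carries the comment marks.
$marked = "<div>\n\n<!--break-->\n\n\n    Inside the div.\n</div>"
        . "\n\n<!--break-->\n\nFirst block.\n\n<!--break-->\n\nSecond block.";

$needle  = sprintf('%1$s<!--%2$s-->%1$s', $paragraphSequence, $paragraphComment);
$replace = sprintf("\n<p class=\"%s\">\n", $paragraphComment);

// strtr() tries the longest key first, so a doubled mark yields a single <p>.
$html = strtr($marked, array($needle . $needle => $replace, $needle => $replace));

// Re-parsing lets the HTML parser supply the matching </p> tags.
$result = $innerHTML($loadHTMLFragment($html));
echo $result;
```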

As this shows, double sequences are replaced with a single one. A trailing one at the end probably needs to be deleted as well (if applicable; you could also trim whitespace here).

The final HTML output:

<div>
<p class="break">

    This text should be wrapped in a p tag.
</p>
</div>
<p class="break">
This also should be wrapped.
</p>
<p class="break">
<b>And</b> this.</p>

Some more post-processing for nicer output formatting can be useful, too. I actually think it is worth doing, as it will help you tweak the algorithm (Full Demo; I just noticed that whitespace normalization probably does not apply there, so use with care).

hakre answered Oct 06 '22 11:10


You can do it with pure JavaScript if you wish:

var content = document.evaluate(
  '//text()',
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null
);

for (var i = 0; i < content.snapshotLength; i++) {
  console.log(content.snapshotItem(i).textContent);
}
CodeWizard answered Oct 06 '22 10:10