I would like to find all root-level #text nodes (or those with div parents) which should be wrapped inside a <p>
tag. In the following text there should be three (or even just two) final root <p>
tags.
<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.
The idea is to format the text nicer so that text blocks are grouped into paragraphs for HTML display. However, the following xpath I have been working out seems to fail to select the text nodes.
<?php
$html = '<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.';
libxml_use_internal_errors(TRUE);
$dom = DOMDocument::loadHTML($html);
$xp = new DOMXPath($dom);
$xpath = '//text()[not(parent::p) and normalize-space()]';
foreach($xp->query($xpath) as $node) {
$element = $dom->createElement('p');
$node->parentNode->replaceChild($element, $node);
$element->appendChild($node);
}
print $dom->saveHTML();
The DOM model uses Element nodes to represent Element Information Items. These nodes of a document are directly used to represent the elements of an XPath result.
The intent is to locate the fields using XPath. Go to the First name tab and right click >> Inspect. On inspecting the web element, it will show an input tag and attributes like class and id. Use the id and these attributes to construct XPath which, in turn, will locate the first name field.
The XML Document Object Model (DOM) contains methods that allow you to use XML Path Language (XPath) navigation to query information in the DOM. You can use XPath to find a single, specific node or to find all nodes that match some criteria.
OK, so let me rephrase my comment as an answer. If you want to match all text nodes, you should simply remove the //div
part from your XPath expression. So it becomes:
//text()[not(parent::p) and normalize-space()]
Your scenario has many edge-cases and the word should is adding on top. I assume you want to do the classic a double break starts a new paragraph thingy, however this time within parent <div>
(or certainly other block elements) as well.
I would let do the HTML parser most of the work but I still would work with text search and replace (next to xpath). So what you will see coming is a bit hackish but I think pretty stable:
First of all I would select all text-nodes that are of top-level or child of the said div.
(.|./div)/text()
This xpath is relative to an anchor element which is the <body>
tag as it represents the root-tag of your HTML fragment when loaded into DOMDocument
.
If child of a div then I would insert the starting paragraph at the very beginning.
Then in any case I would insert a break-mark (here in form of a comment) at each occurrence of the sequence that starts a new paragraph (that should be "\n\n"
because of whitespace normalization, I might be wrong and if it doesn't apply, you would need to do the whitespace-normalization upfront to have this working transparently).
/* @var $result DOMText[] */
$result = $xp->query('(.|./div)/text()', $anchor);
foreach ($result as $i => $node)
{
if ($node->parentNode->tagName == 'div')
{
$insertBreakMarkBefore($node, true);
}
while (FALSE !== $pos = strpos($node->data, $paragraphSequence))
{
$node = $node->splitText($pos + $paragraphSequenceLength);
$insertBreakMarkBefore($node);
}
}
These inserted break-marks are just there to be replaced with a HTML <p>
tag. A HTML parser will turn those into adequate <p>...</p>
pairs so I can spare myself writing that algorithm (even though, this might be interesting). This basically work like I once outlined in some other answer but I just don't find the link any longer:
<body>
again."<p>"
(here I mark the class as well to make this visible)<p>...</p>
pairs.DOMDocument
parser, which now is finally.These outlined steps in code (skipping some of the function definitions for a moment):
$needle = sprintf('%1$s<!--%2$s-->%1$s', $paragraphSequence, $paragraphComment);
$replace = sprintf("\n<p class=\"%s\">\n", $paragraphComment);
$html = strtr($innerHTML($anchor), array($needle . $needle => $replace, $needle => $replace));
echo "HTML afterwards:\n", $innerHTML($loadHTMLFragment($html));
As this shows, double sequences are replaced with a single one. Probably one at the end need to be deleted as well (if applicale, you could also trim whitespace here).
The final HTML output:
<div>
<p class="break">
This text should be wrapped in a p tag.
</p>
</div>
<p class="break">
This also should be wrapped.
</p>
<p class="break">
<b>And</b> this.</p>
Some more post-production for nice output formatting can be useful, too. Actually I think it's worth to do as it will help you tweak the algorithm (Full Demo - just seeing, whitespace normalization probably does not apply there. so use with care).
you can do it with pure JavaScript if you wish:
var content = document.evaluate(
'//text()',
document,
null,
XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
null );
for ( var i=0 ; i < content .snapshotLength; i++ ){
console.log( content .snapshotItem(i).textContent );
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With