I'm using DOMDocument to parse some HTML, and want to replace line breaks (<br>) with \n. However, I'm having trouble identifying where a break actually occurs within the document.
Given the following snippet of HTML, taken from a much larger file that I'm reading using $dom->loadHTMLFile($pFilename):
<p>Multiple-line paragraph<br />that has a close tag</p>
and my code:
foreach ($dom->getElementsByTagName('*') as $domElement) {
    switch (strtolower($domElement->nodeName)) {
        case 'p' :
            $str = (string) $domElement->nodeValue;
            echo 'PARAGRAPH: ',$str,PHP_EOL;
            break;
        case 'br' :
            echo 'BREAK: ',PHP_EOL;
            break;
    }
}
I get:
PARAGRAPH: Multiple-line paragraphthat has a close tag
BREAK:
How can I identify the position of that break within the paragraph, and replace it with a \n?
Or is there a better alternative to DOMDocument for parsing HTML that may or may not be well-formed?
You can't get the position of an element using getElementsByTagName. You should go through the childNodes of each element and process text nodes and elements separately.
In the general case you'll need recursion, like this:
function processElement(DOMNode $element){
    // Walk the children in document order, so text and <br> keep their relative positions
    foreach($element->childNodes as $child){
        if($child instanceof DOMText){
            echo $child->nodeValue,PHP_EOL;
        }elseif($child instanceof DOMElement){
            switch($child->nodeName){
                case 'br':
                    echo 'BREAK: ',PHP_EOL;
                    break;
                case 'p':
                    echo 'PARAGRAPH: ',PHP_EOL;
                    processElement($child);
                    echo 'END OF PARAGRAPH;',PHP_EOL;
                    break;
                // etc.
                // other cases:
                default:
                    // Any other element (including the html/body wrappers added by loadHTML): just descend
                    processElement($child);
            }
        }
    }
}
$D = new DOMDocument;
$D->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
processElement($D);
This will output:
PARAGRAPH:
Multiple-line paragraph
BREAK:
that has a close tag
END OF PARAGRAPH;
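To get the \n replacement the question asks for, the same traversal can build a string instead of echoing each piece. The following is only a minimal sketch under that assumption: textWithBreaks is a hypothetical helper name, not part of DOMDocument, and it treats every <br> as a single newline while descending into any other element.

```php
<?php
// Hypothetical helper (not a DOM API): returns the text content of $node,
// with every <br> element rendered as "\n".
function textWithBreaks(DOMNode $node): string
{
    $text = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Plain text: append as-is
            $text .= $child->nodeValue;
        } elseif ($child instanceof DOMElement) {
            // <br> becomes a newline; any other element is flattened recursively
            $text .= strtolower($child->nodeName) === 'br'
                ? "\n"
                : textWithBreaks($child);
        }
    }
    return $text;
}

$dom = new DOMDocument;
$dom->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
foreach ($dom->getElementsByTagName('p') as $p) {
    echo 'PARAGRAPH: ', textWithBreaks($p), PHP_EOL;
}
```

Here getElementsByTagName is only used to find the paragraphs; the position of each break comes from walking childNodes in document order inside the helper.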