domDocument

Question

I'm using domDocument to parse some HTML, and want to replace breaks with a newline. However, I'm having problems identifying where a break actually occurs within the document.

Given the following snippet of HTML - from a much larger file that I'm reading using $dom->loadHTMLFile($pFilename):

<p>Multiple-line paragraph<br />that has a close tag</p>

and my code:

foreach ($dom->getElementsByTagName('*') as $domElement) {
    switch (strtolower($domElement->nodeName)) {
        case 'p' :
            $str = (string) $domElement->nodeValue;
            echo 'PARAGRAPH: ',$str,PHP_EOL;
            break;
        case 'br' :
            echo 'BREAK: ',PHP_EOL;
            break;
    }
}

I get:

PARAGRAPH: Multiple-line paragraphthat has a close tag
BREAK:

How can I identify the position of that break within the paragraph, and replace it with a newline?

Or is there a better alternative than using domDocument for parsing HTML that may or may not be well-formed?

Hrant Khachatrian · Accepted Answer

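You can't get the position of an element using getElementsByTagName: it returns the matching elements, but not where they sit relative to the surrounding text, and nodeValue on the p just concatenates the text of all its descendants. Instead, go through the childNodes of each element and process text nodes and elements separately.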

In the general case you'll need recursion, like this:

// Walk the children in document order, so each text node and element
// is reported at its actual position.
function processElement(DOMNode $element){
    foreach($element->childNodes as $child){
        if($child instanceof DOMText){
            // Text between tags
            echo $child->nodeValue,PHP_EOL;
        }elseif($child instanceof DOMElement){
            switch($child->nodeName){
            case 'br':
                echo 'BREAK: ',PHP_EOL;
                break;
            case 'p':
                echo 'PARAGRAPH: ',PHP_EOL;
                processElement($child);
                echo 'END OF PARAGRAPH;',PHP_EOL;
                break;
            // add cases for other tags here
            default:
                // Unhandled element: descend into its children
                processElement($child);
            }
        }
    }
}

$D = new DOMDocument;
$D->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
processElement($D);

This will output:

PARAGRAPH: 
Multiple-line paragraph
BREAK:
that has a close tag
END OF PARAGRAPH;
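
If the goal is to actually replace each break rather than just report it, one option is to swap the br elements for text nodes directly in the DOM. The following is only a minimal sketch under a few assumptions: replaceBreaks is a hypothetical helper name (not part of DOMDocument), the replacement is assumed to be "\n", and libxml_use_internal_errors(true) is included on the assumption that the real input may be malformed and you'd rather collect parse warnings than have them printed.

```php
<?php
// Hypothetical helper (not from the original answer): replaces every <br>
// in the document with a text node containing $replacement.
function replaceBreaks(DOMDocument $dom, string $replacement = "\n"): void
{
    // getElementsByTagName() returns a live list, so copy it into an array
    // before modifying the tree.
    $breaks = iterator_to_array($dom->getElementsByTagName('br'));
    foreach ($breaks as $br) {
        $br->parentNode->replaceChild($dom->createTextNode($replacement), $br);
    }
}

// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
replaceBreaks($dom);

// The paragraph text now contains the replacement where the <br /> was.
$paragraph = $dom->getElementsByTagName('p')->item(0);
echo 'PARAGRAPH: ', $paragraph->textContent, PHP_EOL;
```

With the sample paragraph, the text echoed after "PARAGRAPH: " should have a line break where the br used to be. Copying the node list first matters because the list is live: replacing nodes while iterating over it directly can skip elements.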

domDocument - Identifying position of a <br />

Tags:

php

Mark Baker
