
domDocument - Identifying position of a <br />

I'm using DOMDocument to parse some HTML and want to replace each <br /> with \n. However, I can't work out where a break actually occurs within the document.

Given the following snippet of HTML - from a much larger file that I'm reading using $dom->loadHTMLFile($pFilename):

<p>Multiple-line paragraph<br />that has a close tag</p>

and my code:

foreach ($dom->getElementsByTagName('*') as $domElement) {
    switch (strtolower($domElement->nodeName)) {
        case 'p' :
            $str = (string) $domElement->nodeValue;
            echo 'PARAGRAPH: ',$str,PHP_EOL;
            break;
        case 'br' :
            echo 'BREAK: ',PHP_EOL;
            break;
    }
}

I get:

PARAGRAPH: Multiple-line paragraphthat has a close tag
BREAK:

How can I identify the position of that break within the paragraph and replace it with a \n?

Or is there a better alternative to DOMDocument for parsing HTML that may or may not be well-formed?

Mark Baker asked Dec 22 '22 04:12


1 Answer

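getElementsByTagName won't tell you where an element sits: it returns a flat list of matching elements, and nodeValue on the <p> concatenates all of its descendant text, so the position of the <br /> is lost. Instead, iterate over each element's childNodes and handle text nodes and elements separately.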

In the general case you'll need recursion, like this:

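// Walk the tree depth-first so text nodes and elements are visited in document order.
function processElement(DOMNode $element) {
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Text node: the text before or after a <br />.
            echo $child->nodeValue, PHP_EOL;
        } elseif ($child instanceof DOMElement) {
            switch ($child->nodeName) {
                case 'br':
                    echo 'BREAK: ', PHP_EOL;
                    break;
                case 'p':
                    echo 'PARAGRAPH: ', PHP_EOL;
                    processElement($child);
                    echo 'END OF PARAGRAPH;', PHP_EOL;
                    break;
                // Add other cases here as needed.
                default:
                    // Unhandled elements (html, body, ...): just descend into them.
                    processElement($child);
            }
        }
    }
}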

$D = new DOMDocument;
$D->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
processElement($D);

This will output:

PARAGRAPH: 
Multiple-line paragraph
BREAK:
that has a close tag
END OF PARAGRAPH;
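
If the goal is only to turn each <br /> into a newline, rather than to walk the whole tree, one possible shortcut is to swap every br element for a text node containing "\n" and then read the paragraph's textContent. This is a minimal sketch under that assumption, not the only way to do it; the variable names are illustrative, and the node list is copied with iterator_to_array() first because getElementsByTagName returns a live list that shrinks as nodes are replaced.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');

// Copy the live node list into a plain array before modifying the tree,
// so that replacing nodes inside the loop does not skip any elements.
$breaks = iterator_to_array($dom->getElementsByTagName('br'));

foreach ($breaks as $br) {
    // Replace the <br> element with a text node holding a newline.
    $br->parentNode->replaceChild($dom->createTextNode("\n"), $br);
}

// textContent now includes the newline at the position of the former <br>.
$p = $dom->getElementsByTagName('p')->item(0);
$text = $p->textContent;

echo $text, PHP_EOL;
```

For the sample paragraph this prints the two halves of the text on separate lines. The recursive version above is still the better fit when different elements need different handling.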

Hrant Khachatrian answered Dec 24 '22 03:12