
Parse HTML and Get All h3's After an h2 Before the Next h2 Using PHP

I am looking to find the first h2 in the article. Once found, look for all h3's until the next h2 is found. Rinse and repeat until all headings and subheadings have been located.

Before you flag or close this as a duplicate parsing question, please note the question title: this isn't about basic node retrieval. I've got that part down.

I am using DOMDocument (DOMDocument::loadHTML(), DOMDocument::getElementsByTagName() and DOMDocument::saveHTML()) to parse the HTML and retrieve the important headings of an article.

My code is as follows:

$matches = array();
$dom = new DOMDocument;
$dom->loadHTML($content);
foreach($dom->getElementsByTagName('h2') as $node) {
    $matches['heading-two'][] = $dom->saveHtml($node);
}
foreach($dom->getElementsByTagName('h3') as $node) {
    $matches['heading-three'][] = $dom->saveHtml($node);
}
if($matches){
    $this->key_points = $matches;
}

Which gives me an output of something like:

array(
    'heading-two' => array(
        '<h2>Here is the first heading two</h2>',
        '<h2>Here is the SECOND heading two</h2>'
    ),
    'heading-three' => array(
        '<h3>Here is the first h3</h3>',
        '<h3>Here is the second h3</h3>',
        '<h3>Here is the third h3</h3>',
        '<h3>Here is the fourth h3</h3>',
    )
);

I'm looking to have something more like:

array(
    '<h2>Here is the first heading two</h2>' => array(
        '<h3>Here is an h3 under the first h2</h3>',
        '<h3>Here is another h3 found under first h2, but after the first h3</h3>'
    ),
    '<h2>Here is the SECOND heading two</h2>' => array(
        '<h3>Here is an h3 under the SECOND h2</h3>',
        '<h3>Here is another h3 found under SECOND h2, but after the first h3</h3>'
    )
);

I'm not exactly looking for complete code (if you feel it would better help others to provide it -- go ahead), but rather for guidance or advice in the right direction to build a nested array like the one directly above.

Michael Ecklund asked Aug 09 '13 21:08


1 Answer

I assume that all headings are on the same level in the DOM, so every h3 is a sibling of an h2. With that assumption, you can iterate over the siblings following each h2 until the next h2 is encountered:

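foreach($dom->getElementsByTagName('h2') as $node) {
    // Use the serialized h2 as the key for its group of h3's.
    $key = $dom->saveHTML($node);
    $matches[$key] = array();
    // Walk the following siblings until the next h2 (or the end of the parent).
    while(($node = $node->nextSibling) && $node->nodeName !== 'h2') {
        if($node->nodeName === 'h3') {
            $matches[$key][] = $dom->saveHTML($node);
        }
    }
}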
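
If the sibling assumption does not hold (for example, headings wrapped in separate div or section elements), the loop above will miss h3's. A possible alternative, shown here only as a minimal sketch, is to select every h2 and h3 in document order with DOMXPath and attach each h3 to the most recent h2. The function name groupHeadings is hypothetical and not part of the question's code. The sketch also assumes that h2 markup is unique within the article, since identical h2's would share one array key, and it skips any h3 that appears before the first h2.

```php
<?php
// Hypothetical helper (not from the question or the answer above):
// groups every h3 under the closest preceding h2 in document order,
// regardless of how deeply the headings are nested.
function groupHeadings(string $content): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings (e.g. for HTML5 tags) instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $matches = [];
    $current = null; // key of the most recent h2; null until one is seen

    // A single location path, so nodes come back in document order.
    foreach ($xpath->query('//*[self::h2 or self::h3]') as $node) {
        $markup = $dom->saveHTML($node);
        if ($node->nodeName === 'h2') {
            $current = $markup;
            $matches[$current] = [];
        } elseif ($current !== null) {
            // An h3 that follows an h2: file it under that h2.
            $matches[$current][] = $markup;
        }
    }

    return $matches;
}
```

Calling groupHeadings($content) would return the nested array directly, so the result could be assigned to $this->key_points as in the question. The trade-off is that nesting is ignored entirely: an h3 is grouped by position in the document, not by which container it lives in.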
dev-null-dweller answered Nov 03 '22 02:11