 

PHP Simple DOMDocument scraping exclude td class

I'm trying to get the data from all the <td> elements inside the <tr> elements of a table. Getting the table data is simple enough, as the code below shows. My problem is that, because of the structure of the table I'm scraping, I need to exclude every cell that has a colspan attribute, i.e. <td colspan="12">.

<?php

$html = file_get_contents('http://www.superxv.com/fixtures/'); //get the html returned from the following url

$game_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)) { //if any html is actually returned
    $game_doc->loadHTML($html);
    libxml_clear_errors(); //remove error
    $xpath = new DOMXPath($game_doc);

    // Modify the XPath query to match the content
    foreach ($xpath->query('//table')->item(0)->getElementsByTagName('tr') as $rows) {
        $cells = $rows->getElementsByTagName('td');
        //$cells2 = $rows->getElementsByTagName('th');
        echo '<pre>';
        //Get scraped columns
        echo $dayDateBye[] = $cells->item(0)->textContent;
        echo $homeTeam[] = $cells->item(1)->textContent;
        echo $awayTeam[] = $cells->item(2)->textContent;
        echo $venue[] = $cells->item(3)->textContent;
        echo $timeGMT[] = $cells->item(5)->textContent;
        echo $timeZA[] = $cells->item(10)->textContent;
        echo '</pre>';
    }
}

The screenshots below show the table structure: it has five or so rows of fixtures and then changes structure when a new week starts. The only thing I can use to skip over this change of structure is the <td colspan="12"> cell. That makes it tricky, since the <td> elements have no class name, only the attribute, to identify them by.

[Two screenshots of the fixtures table structure]

Any input appreciated.

Timothy Coetzee asked Feb 18 '26 10:02



2 Answers

You can skip those rows by checking how many <td> cells each row contains: a row made up of a single colspan cell has a length of 1.

<?php

$html = file_get_contents('http://www.superxv.com/fixtures/'); //get the html returned from the following url

$game_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)) { //if any html is actually returned
    $game_doc->loadHTML($html);
    libxml_clear_errors(); //remove error
    $xpath = new DOMXPath($game_doc);

    // Modify the XPath query to match the content
    foreach ($xpath->query('//table')->item(0)->getElementsByTagName('tr') as $rows) {
        $cells = $rows->getElementsByTagName('td');
        if( $cells->length > 1 ){
            //$cells2 = $rows->getElementsByTagName('th');
            echo '<pre>';
            //Get scraped columns
            echo $dayDateBye[] = $cells->item(0)->textContent;
            echo $homeTeam[] = $cells->item(1)->textContent;
            echo $awayTeam[] = $cells->item(2)->textContent;
            echo $venue[] = $cells->item(3)->textContent;
            echo $timeGMT[] = $cells->item(5)->textContent;
            echo $timeZA[] = $cells->item(10)->textContent;
            echo '</pre>';
        }
    }
}

?>
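
One caveat with the length check: `$cells->length > 1` only rules out the single-cell rows, while the loop still reads indexes up to 10, and `item()` returns null for an index the row does not have. The sketch below is one possible way to guard against that. The `cellText()` helper and the inline sample markup are hypothetical and only stand in for the real fixtures page.

```php
<?php
// Hypothetical helper: trimmed text of cell $index, or null when the row is too short.
function cellText(DOMNodeList $cells, int $index): ?string
{
    $node = $cells->item($index); // item() returns null for an out-of-range index
    return $node === null ? null : trim($node->textContent);
}

// Made-up sample markup, not the real page.
$sample = '<table>
  <tr><td colspan="12">Week 1</td></tr>
  <tr><td>Fri</td><td>Team A</td><td>Team B</td><td>Stadium X</td></tr>
</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($sample);
libxml_clear_errors();

$rows = [];
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length <= 1) {
        continue; // single-cell separator row, e.g. the week header
    }
    $rows[] = [
        'day'   => cellText($cells, 0),
        'home'  => cellText($cells, 1),
        'away'  => cellText($cells, 2),
        'venue' => cellText($cells, 3),
        'time'  => cellText($cells, 10), // null here: the sample row has only 4 cells
    ];
}

var_dump($rows);
```

With this shape a short row yields null values instead of a "property of non-object" warning, so the caller can decide whether to drop or keep it.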
Dhayal Ram answered Feb 20 '26 00:02



Use XPath to exclude the elements that have a colspan attribute.

So instead of:

$cells = $rows->getElementsByTagName('td');

Use:

$cells = $xpath->query('td[not(@colspan)]', $rows);
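
Because `$rows` is passed as the second argument, the query is evaluated relative to the current row, so `td` matches only that row's own cells. The following is a minimal, self-contained sketch of the idea. The sample markup and the `$fixtures` array are made up for illustration and are not taken from the real page.

```php
<?php
// Made-up sample markup standing in for the real fixtures page.
$sample = <<<HTML
<table>
  <tr><td colspan="12">Week 1</td></tr>
  <tr><td>Fri</td><td>Team A</td><td>Team B</td></tr>
  <tr><td colspan="12">Week 2</td></tr>
  <tr><td>Sat</td><td>Team C</td><td>Team D</td></tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($sample);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$fixtures = [];
foreach ($xpath->query('(//table)[1]//tr') as $row) {
    // Relative query: <td> children of this row that have no colspan attribute.
    $cells = $xpath->query('td[not(@colspan)]', $row);
    if ($cells->length === 0) {
        continue; // every cell was filtered out, so this is a week-header row
    }
    $values = [];
    foreach ($cells as $cell) {
        $values[] = trim($cell->textContent);
    }
    $fixtures[] = $values;
}

print_r($fixtures);
```

Note that a row whose only cell carries a colspan yields an empty node list, so the loop still needs a length check before reading cells by index.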
Jeff Puckett answered Feb 20 '26 01:02



