
<tbody> glitch in PHP Simple HTML DOM parser

I'm using PHP Simple HTML DOM Parser to scrape some data from a webshop (running XAMPP 1.7.2 with PHP 5.3.0), and I'm running into problems with the <tbody> tag. The structure of the table is, essentially (details aren't really that important):

<table>
  <thead>
    <!--text here-->
  </thead>
  <tbody>
    <!--text here-->
  </tbody>
</table>

Now, I'm trying to get to the <tbody> section using this code:

$element = $html->find('tbody',0)->innertext;

It doesn't throw any errors; it just prints nothing when I try to echo it. I've tested the code on other elements: <thead>, <table>, even something like <span class="price">, and they all work fine (of course, removing ",0" breaks the code, since find() then returns an array instead of a single element). They all return their correct sections. Same with outertext. But it all fails on <tbody>.

Now, I've skimmed through the parser's source, but I'm not sure I can figure it out. I've noticed that <thead> isn't even mentioned in it, yet it works fine. *shrug*

I guess I could try and do child navigation, but that seems to glitch as well. I've just tried running:

$el = $html->find('table',0);
$el2 = $el->children(2);
echo $el2->outertext;

and no dice. Tried replacing children with first_child and 2 with 1, and still no dice. Funny, though, if I try ->find instead of children, it works perfectly.

I'm pretty confident I could find a workaround for the whole thing, but this behaviour seems odd enough to post here. My curious mind is happy for all the help it can get.

thevoiddancer asked Feb 26 '10 10:02



2 Answers

In the simple_html_dom.php file, comment out or remove line #396:

// if ($m[1]==='tbody') continue;
user492589 answered Sep 23 '22 18:09



There is a bug report for this issue here: http://sourceforge.net/p/simplehtmldom/bugs/79/

It is still open at the time of this writing. If you do not wish to modify the source code, there is an alternative fix. Take, for example, a loop that finds the <tr>s:

<?php
  // The *BROKEN* way to find the <tr>'s 
  // below the <tbody> below the <table id="foo">
  foreach($dom->find('table#foo tbody tr') as $tr) {
    /* you will get nothing */
  }

You can instead selectively check the parent tag name while iterating all <tr>'s like so:

<?php
  // A workaround to find the <tr>'s 
  // below the <tbody> below the <table id="foo">
  foreach($dom->find('table#foo tr') as $tr) { // note the lack of tbody selector
    /* you will get all trs, but let's only work with ones with the parent
       of a tbody! */
    if($tr->parent->tag == 'tbody') { // our workaround
      /* this part will work as you would expect the above broken code to work */
    }
  }
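
If neither patching the library nor filtering by parent appeals, one more option is to sidestep Simple HTML DOM for this one query. The sketch below is not the parser discussed above: it swaps in PHP's built-in DOM extension (DOMDocument and DOMXPath) and runs on made-up sample markup, so the table contents and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$source = '<table id="foo">'
        . '<thead><tr><th>Item</th></tr></thead>'
        . '<tbody><tr><td>Apple</td></tr><tr><td>Pear</td></tr></tbody>'
        . '</table>';

// Collect libxml warnings internally instead of printing them;
// real-world tag soup tends to trigger plenty.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

// Select only the rows that sit directly below the <tbody>
// of <table id="foo">; the header row in <thead> is not matched.
$xpath = new DOMXPath($doc);
$rows = array();
foreach ($xpath->query('//table[@id="foo"]/tbody/tr') as $tr) {
    $rows[] = trim($tr->textContent);
}

print_r($rows);
```

This only matches when the <tbody> tag is literally present in the markup being parsed, so it fits the table structure shown in the question.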

Also note a slightly unrelated issue that I ran into: the Chrome and Firefox inspectors will correct tag soup regarding <tbody> and <thead>. Be careful: only look at the actual page source, and stay away from the DOM inspectors if you run into unexplainable issues.
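
One quick way to check what the raw source really contains, without trusting an inspector, is to search the fetched markup for a literal <tbody> tag. This is only a minimal sketch: the function name source_has_tbody is made up for illustration, and the two sample strings stand in for whatever you would actually download (for instance with file_get_contents).

```php
<?php
// Hypothetical helper: returns true only if the raw markup itself
// contains an opening <tbody> tag (with or without attributes).
function source_has_tbody($rawHtml)
{
    return preg_match('/<tbody[\s>]/i', $rawHtml) === 1;
}

// Sample strings standing in for downloaded page source.
$withTbody    = '<table><tbody><tr><td>1</td></tr></tbody></table>';
$withoutTbody = '<table><tr><td>1</td></tr></table>';

var_dump(source_has_tbody($withTbody));
var_dump(source_has_tbody($withoutTbody));
```

If this reports false for a page where the inspector shows a <tbody>, the tag was added by the browser and will not be there for the parser to find.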

A.B. Carroll answered Sep 23 '22 18:09
