I'm using PHP Simple HTML DOM Parser to scrape some data off a webshop (running XAMPP 1.7.2 with PHP 5.3.0), and I'm running into problems with the <tbody> tag. The structure of the table is, essentially (details aren't really that important):
<table>
<thead>
<!--text here-->
</thead>
<tbody>
<!--text here-->
</tbody>
</table>
Now, I'm trying to get to the <tbody> section using this code:
$element = $html->find('tbody',0)->innertext;
It doesn't throw any errors; it just prints nothing when I try to echo it. I've tested the code on other elements, <thead>, <table>, even something like <span class="price">, and they all work fine (of course, removing ",0" breaks the code). They all give their correct sections. Ditto for outertext. But it all fails on <tbody>.
Now, I've skimmed through the parser source, but I'm not sure I can figure it out. I've noticed that <thead> isn't even mentioned, but it works fine. (shrug)
I guess I could try and do child navigation, but that seems to glitch as well. I've just tried running:
$el = $html->find('table',0);
$el2 = $el->children(2);
echo $el2->outertext;
and no dice. I tried replacing children with first_child and 2 with 1, and still no dice. Funny, though: if I use ->find instead of children, it works perfectly.
I'm pretty confident I could find a workaround for the whole thing, but this behaviour seems odd enough to post here. My curious mind is happy for all the help it can get.
In the simple_html_dom.php file, comment out or remove line 396:
// if ($m[1]==='tbody') continue;
There is a bug report for this issue here: http://sourceforge.net/p/simplehtmldom/bugs/79/
It is still open at the time of this writing. If you do not wish to modify the source code, there is an alternative fix. Take, for example, a loop that finds <tr>s:
<?php
// The *BROKEN* way to find the <tr>'s
// below the <tbody> below the <table id="foo">
foreach ($dom->find('table#foo tbody tr') as $tr) {
    /* you will get nothing */
}
You can instead check the parent tag name while iterating over all <tr>s, like so:
<?php
// A workaround to find the <tr>'s
// below the <tbody> below the <table id="foo">
foreach ($dom->find('table#foo tr') as $tr) { // note the lack of a tbody selector
    /* you will get all <tr>'s, but let's only work with the ones whose
       parent is a <tbody>! */
    if ($tr->parent->tag == 'tbody') { // our workaround
        /* this part will work as you would expect the above broken code to work */
    }
}
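If the parent check is needed in more than one place, it could be pulled into a small function. This is only a sketch: filter_by_parent_tag is a hypothetical helper name, not part of Simple HTML DOM, and it assumes each node exposes the same parent and tag properties used in the loop above.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): keep only the nodes
// whose direct parent has the given tag name.
function filter_by_parent_tag($nodes, $parentTag)
{
    $kept = array();
    foreach ($nodes as $node) {
        // Assumption: $node->parent is the parent node, or null for the root.
        $parent = $node->parent;
        if ($parent !== null && strtolower($parent->tag) === strtolower($parentTag)) {
            $kept[] = $node;
        }
    }
    return $kept;
}

// Intended use with the parser, assuming $dom is already loaded:
// $rows = filter_by_parent_tag($dom->find('table#foo tr'), 'tbody');
```

The function only reads the parent and tag properties, so it does not depend on the tbody selector working.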
Also note a slightly unrelated issue that I ran into: the Chrome and Firefox inspectors will correct tag soup regarding <tbody> and <thead>. Be careful: only look at the actual page source, and stay away from the DOM inspectors if you run into unexplainable issues.
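One way to check what the server actually sent, rather than what an inspector displays, is to search the raw markup for the opening tag. Below is a minimal sketch, assuming the page source is already available as a string. source_has_tag is a hypothetical helper, and the regex is a rough check that would also match a tag sitting inside an HTML comment.

```php
<?php
// Hypothetical helper: report whether the raw markup contains an opening
// tag with the given name (case-insensitive).
function source_has_tag($html, $tag)
{
    // Match "<tag" only when followed by whitespace, ">" or "/",
    // so that "<tbodyx>" does not count as "<tbody>".
    $pattern = '/<' . preg_quote($tag, '/') . '(?=[\s>\/])/i';
    return preg_match($pattern, $html) === 1;
}

// Markup without an explicit <tbody>; a browser inspector would still
// display one after correcting the table structure.
$raw = '<table><tr><td>1</td></tr></table>';
var_dump(source_has_tag($raw, 'tbody')); // bool(false)
```

If this returns false for the downloaded page, the <tbody> seen in the inspector was added by the browser and is not in the markup the parser receives.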