I'm trying to use Zend_Dom for some very light screen scraping (I want to grab a headline, some body text and a link from a small block of news items on my website) and I'm not sure how to handle the DOMElement that it gives me.
In the manual for Zend_Dom the code says:
foreach ($results as $result) {
// $result is a DOMElement
}
How do I make use of this DOMElement?
A detailed example (looking for the anchor elements on Google):
$url='http://google.com/';
$client = new Zend_Http_Client($url);
$response = $client->request();
$html = $response->getBody();
$dom = new Zend_Dom_Query($html);
$results = $dom->query('a');
foreach($results as $r){
Zend_Debug::dump($r);
}
This gives me:
object(DOMElement)#81 (0) {
}
object(DOMElement)#82 (0) {
}
object(DOMElement)#83 (0) {
}
... etc, etc...
What I find confusing is that this looks like each element contains nothing (0)! This isn't the case, but that is my first impression. So I poke around online and find I can add nodeValue to get something out of this:
Zend_Debug::dump($r->nodeValue);
which gives me:
string(6) "Images"
string(6) "Videos"
string(4) "Maps"
...etc, etc...
But where I run into trouble is getting specific elements and their contents.
For instance given this html:
<div class="newsBlurb">
<span class="newsDate">Mon, 11 October 2010</span>
<h3 class="newsHeadline"><a href="http://foo.com/1/2/">Some text</a></h3>
<a class="newsMore" href="http://foo.com/1/2/">More</a>
</div>
<div class="hr"></div>
<div class="newsBlurb">
<span class="newsDate">Mon, 16 August 2010</span>
<h3 class="newsHeadline"><a href="http://bar.com/pants.html">Stuff is here</a></h3>
<a class="newsMore" href="http://bar.com/pants.html">More</a>
</div>
I can grab the text from each newsBlurb, using the technique I use in the Google example, but cannot get each element by itself. I want to get the date and stick it somewhere, get the headline text and stick it somewhere and get the link to use. But all I get is the actual text in the div.
How do I get what I want from this?
EDIT: Here is another example that does not work as I expect. Any ideas why?
$url = 'http://php.net/manual/en/class.domelement.php';
$client = new Zend_Http_Client($url);
$response = $client->request();
$html = $response->getBody();
$dom = new Zend_Dom_Query($html);
$newsBlurbNode = $dom->query('div.note');
Zend_Debug::dump($newsBlurbNode);
This gives me:
object(Zend_Dom_Query_Result)#867 (7) {
["_count":protected] => NULL
["_cssQuery":protected] => string(8) "div.note"
["_document":protected] => object(DOMDocument)#79 (0) {
}
["_nodeList":protected] => object(DOMNodeList)#864 (0) {
}
["_position":protected] => int(0)
["_xpath":protected] => NULL
["_xpathQuery":protected] => string(33) "//div[contains(@class, ' note ')]"
}
Trying to get anything out of this I used:
$children = $newsBlurbNode->childNodes;
foreach ($children as $child) {
}
Which results in an error because the foreach loop has nothing in it. Ack! What am I not getting?
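You can use something like this to get access to the individual nodes. Note that $newsBlurbNode is a Zend_Dom_Query_Result, not a DOMElement, so it has no childNodes property of its own. Iterate over the result first and read childNodes from each matched element:
foreach ($newsBlurbNode as $node) {
// $node is a DOMElement matched by the query
foreach ($node->childNodes as $child) {
//do something with individual nodes
}
}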
Otherwise I would go through: http://php.net/manual/en/class.domelement.php
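As a rough sketch of what those DOMElement members look like in practice, the following runs PHP's built-in DOMDocument directly on the news markup from the question and pulls out the date, headline and link for each item. It assumes the elements Zend_Dom_Query hands back are the same DOMElement objects (the dumps in the question suggest so), so the same calls should apply inside a Zend loop. The variable names ($items, $headlineLink) are only illustrative.

```php
<?php
// Markup copied from the question.
$html = <<<HTML
<div class="newsBlurb">
<span class="newsDate">Mon, 11 October 2010</span>
<h3 class="newsHeadline"><a href="http://foo.com/1/2/">Some text</a></h3>
<a class="newsMore" href="http://foo.com/1/2/">More</a>
</div>
<div class="hr"></div>
<div class="newsBlurb">
<span class="newsDate">Mon, 16 August 2010</span>
<h3 class="newsHeadline"><a href="http://bar.com/pants.html">Stuff is here</a></h3>
<a class="newsMore" href="http://bar.com/pants.html">More</a>
</div>
HTML;

libxml_use_internal_errors(true); // keep HTML parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$items = array(); // illustrative container, not part of any Zend API
foreach ($doc->getElementsByTagName('div') as $div) {
    if ($div->getAttribute('class') !== 'newsBlurb') {
        continue; // skip the <div class="hr"> separator
    }
    // Search only inside this one blurb, so each item keeps its own values.
    $date = $div->getElementsByTagName('span')->item(0);
    $headlineLink = $div->getElementsByTagName('h3')->item(0)
        ->getElementsByTagName('a')->item(0);

    $items[] = array(
        'date'     => trim($date->nodeValue),
        'headline' => trim($headlineLink->nodeValue),
        'url'      => $headlineLink->getAttribute('href'),
    );
}

print_r($items);
```

With Zend_Dom_Query the outer loop would presumably become foreach ($dom->query('div.newsBlurb') as $div), leaving the body unchanged. Searching from $div rather than from the whole document is what keeps the date, headline and link of one item together.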
Hey I have been messing around with something similar - let me know if this is sufficient to help you out - if not I can explain it some more.
$data = "<p id='p_1'><a href='testing1.html'><span>testing in a span 1</span></a></p>
<p id='p_2'><a href='testing2.html'></a></p>
<p id='p_3'><a href='testing3.html'><span>testing in a span 3</span></a></p>
<p id='p_4'><a href='testing4.html'><span>testing in a span 4</span></a></p>
<p id='p_5'><a href='testing5.html'><span>testing in a span 5</span></a></p>";
$dom = new Zend_Dom_Query();
$dom->setDocumentHtml($data);
//Look for any links inside of paragraph tags
$results = $dom->query('p a');
foreach($results as $r){
echo "Parent Tag: ".$r->nodeName."<br />";
echo $r->nodeValue."<br />";
$children = $r->childNodes;
if($children->length > 0){
foreach($children as $c){
echo "Child Tag: <br />";
echo $c->nodeName."<br />";
echo $c->nodeValue."<br />";
}
}
echo $r->getAttribute('href')."<br /><br />";
}
echo $data;
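When debugging a loop like this, dumping a DOMElement can show an empty-looking object, as in the question, which makes it hard to tell what was matched. One option is to ask the owning document to serialize the node back to markup. The sketch below uses plain DOMDocument with a shortened copy of the test data; $markup is just an illustrative variable name.

```php
<?php
// Shortened copy of the test data used above.
$data = "<p id='p_1'><a href='testing1.html'><span>testing in a span 1</span></a></p>
<p id='p_3'><a href='testing3.html'><span>testing in a span 3</span></a></p>";

libxml_use_internal_errors(true); // keep HTML parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($data);

$markup = array(); // illustrative container
foreach ($doc->getElementsByTagName('a') as $a) {
    // saveXML() with a node argument serializes just that node and its children.
    $markup[] = $a->ownerDocument->saveXML($a);
}

echo implode("\n", $markup), "\n";
```

Because every node carries an ownerDocument reference, the same call should work on the elements inside a Zend_Dom_Query loop without needing access to the document object itself.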