
Retrieve elements with xpath and DOMDocument

I have a list of ads in the HTML code below. What I need is a PHP loop to get the following elements for each ad:

  1. ad URL (href attribute of <a> tag)
  2. ad image URL (src attribute of <img> tag)
  3. ad title (text content of <div class="title"> tag)

<div class="ads">
    <a href="http://path/to/ad/1">
        <div class="ad">
            <div class="image">
                <div class="wrapper">
                    <img src="http://path/to/ad/1/image.jpg">
                </div>
            </div>
            <div class="detail">
                <div class="title">Ad #1</div>
            </div>
        </div>
    </a>
    <a href="http://path/to/ad/2">
        <div class="ad">
            <div class="image">
                <div class="wrapper">
                    <img src="http://path/to/ad/2/image.jpg">
                </div>
            </div>
            <div class="detail">
                <div class="title">Ad #2</div>
            </div>
        </div>
    </a>
</div>

I managed to get the ad URL with the PHP code below.

$d = new DOMDocument();
$d->loadHTML($ads); // the variable $ads contains the HTML code above
$xpath = new DOMXPath($d);
$ls_ads = $xpath->query('//a');

foreach ($ls_ads as $ad) {
    $ad_url = $ad->getAttribute('href');
    print("AD URL : $ad_url");
}

But I didn't manage to get the other two elements (image URL and title). Any ideas?

user1691355 asked Sep 22 '12 20:09





1 Answer

I managed to get what I need with this code (based on Khue Vu's code):

$d = new DOMDocument();
$d->loadHTML($ads); // the variable $ads contains the HTML code above
$xpath = new DOMXPath($d);
$ls_ads = $xpath->query('//a');

foreach ($ls_ads as $ad) {
    // get ad url
    $ad_url = $ad->getAttribute('href');

    // copy the current <a> node into its own DOMDocument so that
    // "//..." queries only see this ad; a separate variable is used
    // so the outer $xpath is not overwritten
    $ad_Doc = new DOMDocument();
    $cloned = $ad->cloneNode(true);
    $ad_Doc->appendChild($ad_Doc->importNode($cloned, true));
    $ad_xpath = new DOMXPath($ad_Doc);

    // get ad title (text content of the first matching div)
    $ad_title_tag = $ad_xpath->query("//div[@class='title']");
    $ad_title = trim($ad_title_tag->item(0)->nodeValue);

    // get ad image (value of the first src attribute)
    $ad_image_tag = $ad_xpath->query("//img/@src");
    $ad_image = $ad_image_tag->item(0)->nodeValue;
}
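
The clone-and-import step above can probably be avoided. DOMXPath::query() and DOMXPath::evaluate() both accept a context node as their second argument, so a relative expression starting with ".//" searches only inside the current <a>. Below is a minimal sketch of that variant, not the code from the answer: the $results array and its keys are names made up for illustration, and the markup is the sample from the question condensed onto fewer lines.

```php
<?php
// Sample markup from the question, condensed (nowdoc, so nothing is interpolated).
$ads = <<<'HTML'
<div class="ads">
    <a href="http://path/to/ad/1">
        <div class="ad">
            <div class="image"><div class="wrapper"><img src="http://path/to/ad/1/image.jpg"></div></div>
            <div class="detail"><div class="title">Ad #1</div></div>
        </div>
    </a>
    <a href="http://path/to/ad/2">
        <div class="ad">
            <div class="image"><div class="wrapper"><img src="http://path/to/ad/2/image.jpg"></div></div>
            <div class="detail"><div class="title">Ad #2</div></div>
        </div>
    </a>
</div>
HTML;

$d = new DOMDocument();
$d->loadHTML($ads);
$xpath = new DOMXPath($d);

// $results is a hypothetical container used only for this example.
$results = [];

foreach ($xpath->query('//a') as $ad) {
    $results[] = [
        // href attribute of the <a> itself
        'url'   => $ad->getAttribute('href'),
        // string() returns the value of the first matching node;
        // the leading "." keeps the search inside the current <a>
        'image' => $xpath->evaluate('string(.//img/@src)', $ad),
        'title' => trim($xpath->evaluate("string(.//div[@class='title'])", $ad)),
    ];
}

print_r($results);
```

Note that the leading dot matters: a plain "//img/@src" would search the whole document even when a context node is passed, and every ad would get the first image on the page.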
user1691355 answered Sep 28 '22 05:09
