Trying to use HTML DOM parser to get main image on Amazon page

Question

I'm trying to use HTML DOM Parser to get the image source of the "main" product image no matter what product page the parser is being pointed to.

On every page it seems that that image has the id "landingImage". You would think that this should do the trick:

    $finalarray[$i][2] = $html->find('img[id="landingImage"]', 0)->src;

But no such luck.

I also tried

    foreach ($html->find('img') as $e) {
        if (strpos($e, 'landingImage') !== false) {
            $finalarray[$i][2] = $e->src;
        }
    }

I noticed that the image source usually contains SY300 or SX300, so I tried this:

    foreach ($html->find('img') as $e) {
        if (strpos($e, 'SX300') !== false || strpos($e, 'SY300') !== false) {
            $finalarray[$i][2] = $e->src;
        }
    }

Unfortunately, some image source links don't contain either of those, for example on this page:

http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20

Henrik · Accepted Answer

Using the Amazon API might be the better solution, but that is not what was asked.

When I downloaded the HTML of the sample page (the content without running JavaScript), I could not find any tag with id="landingImage"^[1]. I did find an image tag with id="main-image", but extracting it with DOMDocument was not successful: neither loadHTML() nor loadHTMLFile() was able to parse the HTML.

But the interesting part can be extracted with a regular expression. The following code will give you the image source:

    $url = 'http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20';
    $html = file_get_contents($url);

    $matches = array();
    if (preg_match('#<img[^>]*id="main-image"[^>]*src="(.*?)"[^>]*>#', $html, $matches)) {
        $src = $matches[1];
    }

    // The source of the image is
    // $src: 'http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg'
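
Note that the pattern above only matches when id appears before src inside the tag and both use double quotes. A minimal sketch of an order-independent variant follows; it first isolates each img tag and then looks for the two attributes separately. The function name find_img_src_by_id is hypothetical, and the approach still assumes that no attribute value contains a ">" character.

```php
<?php
// Hypothetical helper: returns the src of the <img> with the given id, or null.
function find_img_src_by_id(string $html, string $id): ?string
{
    // Step 1: collect every <img ...> tag in the document.
    if (!preg_match_all('#<img\b[^>]*>#i', $html, $tags)) {
        return null;
    }

    // Accept single or double quotes around the id value.
    $idPattern = '#\sid\s*=\s*(["\'])' . preg_quote($id, '#') . '\1#i';

    foreach ($tags[0] as $tag) {
        // Step 2: skip tags whose id attribute does not match.
        if (!preg_match($idPattern, $tag)) {
            continue;
        }
        // Step 3: read src from that tag, wherever it sits.
        // The leading \s keeps attributes such as data-src from matching.
        if (preg_match('#\ssrc\s*=\s*(["\'])(.*?)\1#i', $tag, $m)) {
            return html_entity_decode($m[2]);
        }
        return null; // the tag exists but has no src attribute
    }
    return null;
}
```

Calling find_img_src_by_id($html, 'main-image') on the downloaded page should then give the same kind of result as the single regular expression, without depending on attribute order.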

^[1] The HTML source was downloaded within PHP using the function file_get_contents. Downloading the page with Firefox results in different HTML: in that case you will find an image tag with the id attribute "landingImage", even though JavaScript is NOT enabled. It seems that the HTML returned depends on the client, that is, on the headers in the HTTP request.
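
If the response really does depend on the request headers, one thing to try is sending browser-like headers from PHP through a stream context. The sketch below only builds the options array; the helper name browser_like_context_options and the specific header values are assumptions for illustration, and whether Amazon then returns the "landingImage" markup has not been verified here.

```php
<?php
// Hypothetical helper: builds stream-context options that send
// browser-like headers with file_get_contents().
function browser_like_context_options(string $userAgent): array
{
    return [
        'http' => [
            'method'  => 'GET',
            // Header lines are joined with CRLF, as the http wrapper expects.
            'header'  => implode("\r\n", [
                'User-Agent: ' . $userAgent,
                'Accept: text/html,application/xhtml+xml',
                'Accept-Language: en-US,en;q=0.8',
            ]),
            'timeout' => 10, // seconds
        ],
    ];
}

// Usage (performs a network request, so it is left commented out):
// $context = stream_context_create(browser_like_context_options('Mozilla/5.0'));
// $html = file_get_contents($url, false, $context);
```

Comparing the HTML returned with and without the context should show whether the id of the main image changes with the headers.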

newman · Answer

On the page in your example, the img tag with id="landingImage" doesn't contain a src attribute. That attribute is added by JavaScript.

However, the tag does contain the attribute data-a-dynamic-image with the value {"http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg":[200,200]}

You can read the value of this attribute and then parse it, either with a regular expression or with the strpos and substr functions.
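
Since the attribute value shown above looks like JSON, another option besides a regular expression or strpos/substr is json_decode. The following is a minimal sketch under that assumption; the function name dynamic_image_url is hypothetical, array_key_first needs PHP 7.3 or later, and the code has only been reasoned through against a hand-written snippet, not the live Amazon page.

```php
<?php
// Hypothetical helper: returns the first image URL listed in the
// data-a-dynamic-image attribute of the <img> with the given id, or null.
function dynamic_image_url(string $html, string $id = 'landingImage'): ?string
{
    // Collect libxml warnings internally instead of emitting them,
    // because real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    // Note: $id is inserted as-is, so it must not contain a double quote.
    $xpath = new DOMXPath($doc);
    $img = $xpath->query('//img[@id="' . $id . '"]')->item(0);
    if (!$img instanceof DOMElement) {
        return null;
    }

    // Assumed shape of the attribute: {"<url>": [width, height], ...}
    $data = json_decode($img->getAttribute('data-a-dynamic-image'), true);
    if (!is_array($data) || $data === []) {
        return null;
    }
    return (string) array_key_first($data);
}
```

If the attribute lists several sizes, the keys of the decoded array are the candidate URLs and the values are their dimensions, so a caller could pick the largest one instead of the first.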

Tags: html, dom, php, parsing, amazon

user3312242
