Trying to use HTML DOM parser to get main image on Amazon page

I'm trying to use HTML DOM Parser to get the image source of the "main" product image no matter what product page the parser is being pointed to.

On every page, that image seems to have the id "landingImage". You would think that this should do the trick:

    $finalarray[$i][2] = $html->find('img[id="landingImage"]', 0)->src;

But no such luck.

I also tried

    foreach ($html->find('img') as $e) {
        // $e is cast to its HTML string for the substring check
        if (strpos($e, 'landingImage') !== false) {
            $finalarray[$i][2] = $e->src;
        }
    }

I noticed that usually the image source has SY300 or SX300 so I did this:

    foreach ($html->find('img') as $e) {
        // Match either size marker in the tag's HTML
        if (strpos($e, 'SX300') !== false || strpos($e, 'SY300') !== false) {
            $finalarray[$i][2] = $e->src;
        }
    }

Unfortunately, some image source links contain neither of those, for example on this product page:

http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20

user3312242 asked Feb 18 '14 01:02


2 Answers

Using the Amazon API might be the better solution, but that is not what the question asks.

When I downloaded the HTML of the sample page (the raw content, without running JavaScript), I could not find any tag with id="landingImage"[1]. I did, however, find an image tag with id="main-image". Trying to extract this tag with DOMDocument was not successful: for some reason the methods loadHTML() and loadHTMLFile() weren't able to parse the HTML.

But the interesting part can be extracted with a regular expression. The following code will give you the image source:

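    $url = 'http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20';
    // Fetch the raw HTML as served to PHP (no JavaScript is executed).
    $html = file_get_contents($url);

    // Capture the src of the <img> tag with id="main-image".
    // Note: the pattern assumes that id appears before src inside the tag.
    $src = null;
    $matches = array();
    if (preg_match('#<img[^>]*id="main-image"[^>]*src="(.*?)"[^>]*>#', $html, $matches)) {
        $src = $matches[1];
    }

    // The source of the image is
    // $src: 'http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg'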
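
If you would rather stay with DOMDocument than use a regular expression, one thing that may help is telling libxml to collect parse errors instead of raising warnings on sloppy markup. The following is only a sketch under that assumption; I have not verified that it fixes the parsing problem on the real Amazon page. The helper name findImageSrcById and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the src of the <img> with the given id, or null.
function findImageSrcById(string $html, string $id): ?string
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    // Assumption: $id contains no double quotes.
    $nodes = $xpath->query('//img[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $src = $nodes->item(0)->getAttribute('src');
    return $src !== '' ? $src : null;
}

// Deliberately sloppy sample markup standing in for the downloaded page.
$sample = '<html><body><div><img id="main-image" '
    . 'src="http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg">'
    . '<p>unclosed</body>';

$src = findImageSrcById($sample, 'main-image');
```

Unlike the regular expression, the XPath lookup does not care about the order of the id and src attributes inside the tag.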

[1] The HTML source was downloaded within PHP using the function file_get_contents(). Downloading the page with Firefox yields different HTML: in that case you will find an image tag with the id attribute "landingImage" (even though JavaScript is NOT enabled!). It seems that the HTML that is served depends on the client, i.e. on the headers in the HTTP request.
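
If the response really does depend on the request headers, you could try sending browser-like headers from PHP through a stream context. This is a minimal sketch, not a tested recipe for Amazon: the helper name buildBrowserLikeContext, the User-Agent string, the Accept-Language value and the timeout are all assumptions chosen for illustration, and I cannot promise that the server will answer with the same HTML that Firefox gets.

```php
<?php
// Hypothetical helper: builds a stream context that sends browser-like headers.
function buildBrowserLikeContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'header'     => "Accept-Language: en-US,en;q=0.8\r\n",
            'timeout'    => 10, // seconds
        ],
    ]);
}

// Example User-Agent string (an assumption, use whatever your browser sends).
$userAgent = 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0';
$context = buildBrowserLikeContext($userAgent);

// The download itself needs network access, so it is only shown here:
// $html = file_get_contents($url, false, $context);
```

With the context passed as the third argument, file_get_contents() sends the configured headers instead of PHP's defaults.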

Henrik answered Oct 22 '22 15:10


On your example page, the img tag with id="landingImage" does not contain a src attribute. That attribute is added by JavaScript.

However, the tag does contain the attribute data-a-dynamic-image with the value {"http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg":[200,200]}

You can read the value of this attribute and then parse it, either with a regular expression or with the strpos and substr functions.
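
Since the attribute value looks like JSON, another option is to decode it with json_decode() instead of strpos and substr. Here is a rough sketch of that idea. It assumes that the value is HTML-entity-encoded in the page source (for example &quot; instead of a double quote), that the tag contains no ">" inside attribute values, and that the first key is the URL you want. The function name extractDynamicImageUrl and the sample tag are hypothetical, and array_key_first() needs PHP 7.3 or later.

```php
<?php
// Hypothetical helper: returns the first image URL listed in the
// data-a-dynamic-image attribute of the landingImage tag, or null.
function extractDynamicImageUrl(string $html): ?string
{
    // Isolate the <img> tag that carries id="landingImage".
    if (!preg_match('#<img\b[^>]*\bid="landingImage"[^>]*>#i', $html, $tag)) {
        return null;
    }

    // Capture the attribute value, whichever quote character delimits it.
    if (!preg_match('#\bdata-a-dynamic-image=(["\'])(.*?)\1#s', $tag[0], $attr)) {
        return null;
    }

    // Undo HTML entity encoding (&quot; and friends) before decoding the JSON.
    $json = html_entity_decode($attr[2], ENT_QUOTES, 'UTF-8');
    $images = json_decode($json, true);
    if (!is_array($images) || $images === []) {
        return null;
    }

    // Keys are image URLs, values are [width, height] pairs; take the first key.
    return array_key_first($images);
}

// Sample tag modelled on the attribute value quoted above.
$sample = '<img id="landingImage" data-a-dynamic-image="'
    . '{&quot;http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg&quot;:[200,200]}'
    . '" alt="">';

$url = extractDynamicImageUrl($sample);
```

If the attribute lists several sizes, you could loop over the decoded array and pick the entry with the largest dimensions instead of the first one.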

newman answered Oct 22 '22 15:10