I'm trying to use HTML DOM Parser to get the image source of the "main" product image no matter what product page the parser is being pointed to.
On every page it seems that that image has the id "landingImage". You would think that this should do the trick:
$finalarray[$i][2] = $html->find('img[id="landingImage"]', 0)->src;
But no such luck.
I also tried
foreach ($html->find('img') as $e) {
    if (strpos($e, 'landingImage') !== false) {
        $finalarray[$i][2] = $e->src;
    }
}
I noticed that usually the image source has SY300 or SX300 so I did this:
foreach ($html->find('img') as $e) {
    if (strpos($e, 'SX300') !== false) {
        $finalarray[$i][2] = $e->src;
    } else if (strpos($e, 'SY300') !== false) {
        $finalarray[$i][2] = $e->src;
    }
}
Unfortunately, some image source links contain neither string, for example:
http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20
Using the Amazon API might be the better solution, but this is not the question.
When I downloaded the HTML of the sample page (the content without running JavaScript), I could not find any tag with id="landingImage" [1]. I could, however, find an image tag with id="main-image". Trying to extract this tag with DOMDocument wasn't successful: somehow the methods loadHTML() and loadHTMLFile() weren't able to parse the HTML.
But the interesting part can be extracted with a regular expression. The following code will give you the image source:
$url = 'http://www.amazon.com/gp/product/B001O21H00/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B001O21H00&linkCode=as2&tag=bmref-20';
$html = file_get_contents($url);
$matches = array();
if (preg_match('#<img[^>]*id="main-image"[^>]*src="(.*?)"[^>]*>#', $html, $matches)) {
    $src = $matches[1];
}
// The source of the image is
// $src: 'http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg'
[1] The HTML source was downloaded within PHP with the function file_get_contents. Downloading the HTML source with Firefox results in different HTML. In that case you will find an image tag with the id attribute "landingImage" (JavaScript is NOT enabled!). It seems that the downloaded HTML source depends on the client (the headers in the HTTP request).
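If the markup really does depend on the request headers, one thing to try is sending browser-like headers from PHP. The following is only a sketch under that assumption: the helper names buildBrowserContext and fetchHtmlAsBrowser are hypothetical, the User-Agent string in the usage comment is just an example value, and there is no guarantee the server will answer with the "landingImage" variant of the page.

```php
<?php
// Hypothetical helper: builds a stream context that sends browser-like headers.
function buildBrowserContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => "User-Agent: {$userAgent}\r\n" .
                        "Accept-Language: en-US,en;q=0.8\r\n",
        ],
    ]);
}

// Hypothetical helper: downloads a page with those headers.
// Returns the HTML as a string, or false if the request fails.
function fetchHtmlAsBrowser(string $url, string $userAgent)
{
    return file_get_contents($url, false, buildBrowserContext($userAgent));
}

// Example usage (performs a network request, so it is left commented out):
// $ua   = 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0';
// $html = fetchHtmlAsBrowser('http://www.amazon.com/gp/product/B001O21H00/', $ua);
```

Comparing the result with the plain file_get_contents download should show whether the headers are what makes the difference.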
On the page from your example, the img tag with id="landingImage" doesn't contain a src attribute; that attribute is added by JavaScript.
But the tag does contain the attribute data-a-dynamic-image with the value {"http://ecx.images-amazon.com/images/I/21JzKZ9%2BYGL.jpg":[200,200]}
You can read the value of this attribute and then parse it, either with a regular expression or with the strpos and substr functions.
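A minimal sketch of that idea, using a regular expression to pull out the attribute and json_decode to parse it. The function name extractDynamicImageUrl is hypothetical, and the code assumes the attribute value is a JSON object whose keys are image URLs (as in the value quoted above), possibly with its quotes HTML-escaped as &quot; in the raw source.

```php
<?php
// Hypothetical helper: returns the first image URL found in the
// data-a-dynamic-image attribute of the img tag with id="landingImage",
// or null if the tag or attribute is missing or cannot be parsed.
function extractDynamicImageUrl(string $html): ?string
{
    // Step 1: isolate the img tag that carries id="landingImage".
    if (!preg_match('#<img\b[^>]*\bid="landingImage"[^>]*>#s', $html, $tag)) {
        return null;
    }

    // Step 2: grab the attribute value, whichever quote style delimits it.
    if (!preg_match('#\bdata-a-dynamic-image=(["\'])(.*?)\1#s', $tag[0], $attr)) {
        return null;
    }

    // Step 3: undo HTML escaping (e.g. &quot;) and decode the JSON object.
    $json  = html_entity_decode($attr[2], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $sizes = json_decode($json, true);
    if (!is_array($sizes) || $sizes === []) {
        return null;
    }

    // The keys of the object are the image URLs (array_key_first needs PHP 7.3+).
    return (string) array_key_first($sizes);
}
```

With the $html from the page, the call would look like $finalarray[$i][2] = extractDynamicImageUrl($html); and a null result means the tag or attribute was not present in the downloaded source.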