I'm scraping a site and I can't get the images, because they are loaded with background-image CSS.
Is there a way to get these attributes with Nokogiri without having to use Phantom.js or Sentinel? The background-image is actually set in an inline style, so it should be possible.
I have to get images from an array of URLs:
<div class="zoomLens" style="background-image: url(http://resources1.okadirect.com/assets/en/new/catalogue/1200x1200/EHD005MET-L_01.jpg?version=7); background-position: -14.7368421052632px -977.894736842105px; background-repeat: no-repeat;"> </div>
I'm using Nokogiri via Mechanize, but I don't know how to write this correctly:
image = agent.get(doc.parser.at('.zoomLens')["background-image"]).save("okaimages/f_deco-#{counter}.jpg")
I'd use something like:
require 'nokogiri'
doc = Nokogiri::HTML('<div class="zoomLens" style="background-image: url(http://resources1.okadirect.com/assets/en/new/catalogue/1200x1200/EHD005MET-L_01.jpg?version=7); background-position: -14.7368421052632px -977.894736842105px; background-repeat: no-repeat;"> </div>')
doc.search('.zoomLens').map{ |n| n['style'][/url\((.+)\)/, 1] }
# => ["http://resources1.okadirect.com/assets/en/new/catalogue/1200x1200/EHD005MET-L_01.jpg?version=7"]
The trick is finding the appropriate pattern to grab the contents of the parentheses. n['style'][/url\((.+)\)/, 1] uses String#[], which can take a regular expression with grouping and return a particular group from the captures. See https://www.regex101.com/r/mV6rY6/1 for a breakdown of what it's doing.
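One caveat about that pattern: `.+` is greedy, so if a style value contains a later closing parenthesis (for example an `rgb(...)` color) or wraps the URL in quotes, the capture will include more than the URL. A minimal sketch of a more tolerant variant follows. The method name `extract_background_url` and the constant `BACKGROUND_URL` are hypothetical names made up for illustration, and the example styles use placeholder URLs.

```ruby
# Hypothetical helper, not part of Nokogiri: pulls the first url(...) value
# out of an inline style string using String#[] with a capture group.
BACKGROUND_URL = /url\(\s*['"]?(.*?)['"]?\s*\)/

def extract_background_url(style)
  # to_s guards against elements that have no style attribute (nil).
  style.to_s[BACKGROUND_URL, 1]
end

quoted = "background-image: url('http://example.com/a.jpg'); color: rgb(0, 0, 0);"

# The greedy pattern runs on to the last ")" in the string:
quoted[/url\((.+)\)/, 1]
# => "'http://example.com/a.jpg'); color: rgb(0, 0, 0"

# The lazy pattern stops at the first ")" and drops the quotes:
extract_background_url(quoted)
# => "http://example.com/a.jpg"
```

This still assumes the URL itself contains no unescaped `)`; for the markup in the question, either pattern returns the same result.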
At that point you'd be sitting on an array of image URLs. You can easily iterate over the list and use OpenURI or any number of other HTTP clients to retrieve the images.
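As a rough sketch of that last step, assuming Ruby 2.5+ (where OpenURI provides `URI.open`), something like the following would save each URL to disk. The `save_images` method, its `fetch:` parameter, and `DEFAULT_FETCH` are hypothetical names chosen for illustration; the `okaimages/f_deco-N.jpg` naming is borrowed from the question.

```ruby
require 'open-uri'
require 'fileutils'

# Default fetcher: read the whole response body via OpenURI.
DEFAULT_FETCH = ->(url) { URI.open(url, &:read) }

# Hypothetical helper: downloads each URL into dir and returns the saved paths.
# The fetch: parameter is only there so another HTTP client can be swapped in.
def save_images(urls, dir = 'okaimages', fetch: DEFAULT_FETCH)
  FileUtils.mkdir_p(dir) # create the target directory if it is missing
  urls.each_with_index.map do |url, counter|
    path = File.join(dir, "f_deco-#{counter}.jpg")
    File.binwrite(path, fetch.call(url)) # binary write keeps image bytes intact
    path
  end
end
```

Calling `save_images(urls)` with the array produced by the `map` above would then write one file per image. Since you already have a Mechanize agent, `agent.get(url).save(path)` inside the loop should work as well.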