Get background image with Nokogiri from DOM?

I'm scraping a site and I can't get the images, because they are loaded with background-image CSS.

Is there a way to get these attributes with Nokogiri without having to use Phantom.js or Sentinel? The background-image is actually set with inline styles, so I should be able to.

I have to get images from an array of URLs:

<div class="zoomLens" style="background-image: url(http://resources1.okadirect.com/assets/en/new/catalogue/1200x1200/EHD005MET-L_01.jpg?version=7); background-position: -14.7368421052632px -977.894736842105px; background-repeat: no-repeat;">&nbsp;</div>

I'm using Nokogiri via Mechanize, but don't know how to write this correctly:

image = agent.get(doc.parser.at('.zoomLens')["background-image"]).save("okaimages/f_deco-#{counter}.jpg")

Gibson asked Jan 29 '15 16:01


1 Answer

I'd use something like:

require 'nokogiri'

doc = Nokogiri::HTML('<div class="zoomLens" style="background-image: url(http://resources1.okadirect.com/assets/en/new/catalogue/1200x1200/EHD005MET-L_01.jpg?version=7); background-position: -14.7368421052632px -977.894736842105px; background-repeat: no-repeat;">&nbsp;</div>')

doc.search('.zoomLens').map{ |n| n['style'][/url\((.+)\)/, 1] }
# => ["http://resources1.okadirect.com/assets/en/new/catalogue/1200x1200/EHD005MET-L_01.jpg?version=7"]

The trick is finding the appropriate pattern to grab the contents of the parentheses. n['style'][/url\((.+)\)/, 1] uses String#[], which can take a regular expression with a capture group and return that group's match. See https://www.regex101.com/r/mV6rY6/1 for a breakdown of what it's doing.
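
To make the String#[] behaviour concrete, here is a minimal sketch on plain strings, with no Nokogiri involved. The style values, the URL_IN_STYLE constant and the background_url helper are made up for illustration. The stricter pattern is an optional variation, on the assumption that some pages quote the URL or put more than one url(...) in the same style attribute; the original pattern is fine for the markup in the question.

```ruby
# Stricter than /url\((.+)\)/: tolerates optional quotes and stops at the
# first closing parenthesis instead of the last one.
URL_IN_STYLE = /url\(\s*['"]?([^'")]+)['"]?\s*\)/

# Returns the first url(...) target in a style string, or nil if there is none.
# (background_url is a hypothetical helper name, not part of Nokogiri.)
def background_url(style)
  style.to_s[URL_IN_STYLE, 1]
end

# Made-up style values for illustration.
plain  = 'background-image: url(http://example.com/a.jpg?version=7); background-repeat: no-repeat;'
quoted = "background-image: url('http://example.com/b.jpg'); mask: url(#clip);"

p plain[/url\((.+)\)/, 1]   # the answer's pattern works when there is a single url(...)
p quoted[/url\((.+)\)/, 1]  # greedy .+ runs to the last ")" and keeps the quotes
p background_url(plain)
p background_url(quoted)
p background_url(nil)       # nil.to_s is "", so this returns nil rather than raising
```

The nil case matters because n['style'] returns nil for a node without a style attribute, so calling [] on it directly would raise.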

At that point you'd be sitting on an array of image URLs. You can easily iterate over the list and use OpenURI or any number of other HTTP clients to retrieve the images.
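
The download step could be sketched roughly as follows, assuming OpenURI from the standard library and the okaimages/f_deco-N.jpg naming from the question. save_images and its fetch parameter are hypothetical names introduced here; the injectable fetch is only there so the loop can be tried without hitting the network.

```ruby
require 'open-uri'
require 'fileutils'

# Saves each URL under dir as f_deco-<index>.jpg and returns the paths written.
# By default the body is read with OpenURI; pass fetch: to substitute another
# HTTP client or a stub.
def save_images(urls, dir, fetch: ->(url) { URI.open(url, &:read) })
  FileUtils.mkdir_p(dir)
  urls.each_with_index.map do |url, index|
    path = File.join(dir, "f_deco-#{index}.jpg")
    File.binwrite(path, fetch.call(url)) # binary write, since these are images
    path
  end
end
```

With the array from the snippet above stored in a variable such as urls, the call would be save_images(urls, 'okaimages'). Since the question already uses Mechanize, passing fetch: ->(url) { agent.get(url).body } should work as an alternative to OpenURI.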

the Tin Man answered Nov 15 '22 06:11