
Is it possible to locate elements by CSS properties in Scrapy?

I am wondering whether Scrapy has a way to select elements based on colors defined in CSS. For example, select all elements with background-color: #ff0000.

I have tried this:

response.css('td::attr(background-color)').extract()

I was expecting a list of all the background colors set on the table data elements, but it returns an empty list.

Is it generally possible to locate elements by their CSS properties in Scrapy?

user3445792 asked Sep 24 '14 22:09


1 Answer

The short answer is no: this is not possible with Scrapy alone.

Why not?

  • the ::attr() pseudo-element gives you access to element attributes, but background-color is a CSS property, not an attribute.

  • CSS properties can be defined in several different ways on a page (inline styles, style blocks, external stylesheets, scripts), so to get the actual value of an element's CSS property you need a browser that fully renders the page and applies all the defined stylesheets.

  • Scrapy itself is not a browser and has no JavaScript engine, so it cannot render a page.

Exceptions

Sometimes, though, CSS properties are defined in style attributes of the elements. For instance:

<span style="background-color: green"></span>

If this is the case, then yes, you can use the value of the style attribute to filter elements:

response.xpath("//span[contains(@style, 'background-color: green')]")

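This is quite fragile, though, and may produce false positives: it depends on the exact spacing inside the attribute, and a substring check for 'background-color: green' also matches a value such as 'background-color: greenyellow'.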
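
One way to make the inline-style check less fragile is to parse the style attribute into property/value pairs and compare the value exactly. The sketch below is a minimal illustration, not a full CSS parser: the helper names parse_inline_style and has_inline_property are made up for this example, and splitting on ";" will mishandle values that themselves contain a semicolon (for instance some url(...) values).

```python
def parse_inline_style(style):
    """Split a style attribute value into a {property: value} dict."""
    declarations = {}
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep:  # skip empty or malformed chunks that have no colon
            declarations[name.strip().lower()] = value.strip().lower()
    return declarations


def has_inline_property(style, name, expected):
    """True if the inline style sets `name` to exactly `expected`."""
    return parse_inline_style(style or "").get(name) == expected


# Hypothetical use inside a Scrapy callback (`response` is the usual
# callback argument; only inline styles are visible to this check):
# green_spans = [
#     span for span in response.css("span[style]")
#     if has_inline_property(span.xpath("@style").get(),
#                            "background-color", "green")
# ]
```

This still only sees styles written directly in the style attribute; anything set by a stylesheet or a script remains invisible to Scrapy.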

What can you do?

  • look for other things to base your locators on. In general, locating elements by background color is not the best way to reach them unless, in some unusual circumstances, this property is the only distinguishing factor.
  • the scrapy-splash project lets you drive Splash, a lightweight browser that can render the page. In that case, you would need to execute Lua scripts to access the CSS properties of elements on the rendered page.
  • the selenium browser automation tool is probably the most straightforward option for this problem, as it gives you direct control over the page and access to its elements, their properties and their attributes. Its .value_of_css_property() method returns the value of a CSS property.
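
As a rough sketch of the selenium route, the function below collects the text of every td whose computed background-color matches a hex color such as #ff0000. It rests on a few assumptions: Firefox and geckodriver are installed, the browser reports computed colors as rgb(...) or rgba(...) strings rather than hex (worth checking in your setup), and the helper names hex_to_rgb, css_color_to_rgb and cells_with_background are invented for this example.

```python
import re


def hex_to_rgb(hex_color):
    """Convert a six-digit hex color such as '#ff0000' to (255, 0, 0)."""
    digits = hex_color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def css_color_to_rgb(value):
    """Extract (r, g, b) from an 'rgb(...)' or 'rgba(...)' string, else None.

    The alpha channel is ignored, so a fully transparent
    'rgba(0, 0, 0, 0)' compares equal to black.
    """
    match = re.match(r"rgba?\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)", value)
    return tuple(int(part) for part in match.groups()) if match else None


def cells_with_background(url, hex_color):
    """Return the text of every <td> whose computed background matches."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    target = hex_to_rgb(hex_color)
    driver = webdriver.Firefox()  # assumption: Firefox + geckodriver available
    try:
        driver.get(url)
        return [
            cell.text
            for cell in driver.find_elements(By.CSS_SELECTOR, "td")
            if css_color_to_rgb(
                cell.value_of_css_property("background-color")
            ) == target
        ]
    finally:
        driver.quit()  # always close the browser, even on errors
```

Comparing parsed (r, g, b) tuples rather than raw strings avoids depending on whether a given browser returns the rgb or the rgba form.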
alecxe answered Oct 22 '22 07:10