
Difference between Scrapy selectors "a::text" and "a ::text"

I've created a scraper to grab some product names from a webpage. It is working smoothly. I've used CSS selectors to do the job. However, the only thing I can't understand is the difference between the selectors a::text and a ::text (don't overlook the space between a and ::text in the latter). When I run my script, I get the same exact result no matter which selector I choose.

import requests
from scrapy import Selector

res = requests.get("https://www.kipling.com/uk-en/sale/type/all-sale/?limit=all#")
sel = Selector(res)
for item in sel.css(".product-list-product-wrapper"):
    title = item.css(".product-name a::text").extract_first().strip()
    title_ano = item.css(".product-name a ::text").extract_first().strip()
    print("Name: {}\nName_ano: {}\n".format(title,title_ano))

As you can see, both title and title_ano contain the same selector, bar the space in the latter. Nevertheless, the results are always the same.

My question: is there any substantial difference between the two, and when should I use the former and when the latter?

Asked by SIM on Feb 01 '18 at 06:02




1 Answer

Interesting observation! I spent the past couple of hours investigating this and it turns out, there's a lot more to it than meets the eye.

If you're coming from CSS, you'd probably expect to write a::text in much the same way you'd write a::first-line, a::first-letter, a::before or a::after. No surprises there.

On the other hand, standard selector syntax would suggest that a ::text matches the ::text pseudo-element of a descendant of the a element, making it equivalent to a *::text. However, .product-list-product-wrapper .product-name a doesn't have any child elements, so by rights, a ::text is supposed to match nothing. The fact that it does match suggests that Scrapy is not following the grammar.

Scrapy uses Parsel (itself based on cssselect) to translate selectors into XPath, which is where ::text comes from. With that in mind, let's examine how Parsel implements ::text:

>>> from parsel import css2xpath
>>> css2xpath('a::text')
'descendant-or-self::a/text()'
>>> css2xpath('a ::text')
'descendant-or-self::a/descendant-or-self::text()'

As in cssselect, whatever follows a descendant combinator is translated onto a descendant-or-self axis. But because text nodes are proper children of element nodes in the DOM, ::text is treated as a standalone node and converted directly to text(). Combined with the descendant-or-self axis, that matches any text node that is a descendant of an a element, just as a/text() matches any text node that is a child of an a element. Since a child is also a descendant, the two selectors return the same result whenever the a element contains only text.
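To see the child-versus-descendant distinction without Parsel, here is a minimal standard-library sketch. It assumes well-formed markup, and the helpers child_text_nodes and descendant_text_nodes are hypothetical names that only approximate what a/text() and a/descendant-or-self::text() select; this is not how Parsel works internally. The product name is made up.

```python
import xml.etree.ElementTree as ET


def child_text_nodes(elem):
    """Approximate elem/text(): text nodes that are direct children of elem."""
    nodes = [elem.text] if elem.text else []
    # ElementTree stores text that follows a child element as that child's
    # "tail", but in DOM terms it is still a direct child of elem.
    nodes += [child.tail for child in elem if child.tail]
    return nodes


def descendant_text_nodes(elem):
    """Approximate elem/descendant-or-self::text(): every text node in elem."""
    return list(elem.itertext())


# An <a> with no child elements, like the product links in the question.
leaf = ET.fromstring('<a href="/bag">City Pack</a>')
leaf_child = child_text_nodes(leaf)
leaf_descendant = descendant_text_nodes(leaf)
print(leaf_child, leaf_descendant)
```

With no nested elements, the only text node is both a child and a descendant, so the two lists are identical.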

Egregiously, this happens even when you add an explicit * to the selector:

>>> css2xpath('a *::text')
'descendant-or-self::a/descendant-or-self::text()'

However, the use of the descendant-or-self axis means that a ::text can match all text nodes in the a element, including those in other elements nested within the a. In the following example, a ::text will match two text nodes: 'Link ' followed by 'text':

<a href="https://example.com">Link <span>text</span></a>

So while Scrapy's implementation of ::text is an egregious violation of the Selectors grammar, it seems to have been done this way very much intentionally.

In fact, Scrapy's other pseudo-element, ::attr() [1], behaves similarly. The following selectors all match the id attribute node belonging to the div element when it does not have any descendant elements:

>>> css2xpath('div::attr(id)')
'descendant-or-self::div/@id'
>>> css2xpath('div ::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'
>>> css2xpath('div *::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'

... but div ::attr(id) and div *::attr(id) will match all id attribute nodes within the div's descendants along with its own id attribute, such as in the following example:

<div id="parent"><p id="child"></p></div>

This, of course, is a much less plausible use case, so one has to wonder if this was an unintentional side effect of the implementation of ::text.

Compare the pseudo-element selectors to one that substitutes any simple selector for the pseudo-element:

>>> css2xpath('a [href]')
'descendant-or-self::a/descendant-or-self::*/*[@href]'

This correctly translates the descendant combinator to descendant-or-self::*/* with an additional implicit child axis, ensuring that the [@href] predicate is never tested on the a element.

If you're new to XPath, Selectors, or even Scrapy, this may all seem very confusing and overwhelming. So here's a summary of when to use one selector over the other:

  • Use a::text if your a element contains only text, or if you're only interested in the top-level text nodes of this a element and not its nested elements.

  • Use a ::text if your a element contains nested elements and you want to extract all the text nodes within this a element.

    While you can use a ::text if your a element contains only text, its syntax is confusing, so for the sake of consistency, use a::text instead.


[1] On an interesting note, ::attr() appears in the (abandoned as of 2021) Non-element Selectors spec, where, as you'd expect, it behaves consistently with the Selectors grammar, making its behavior in Scrapy inconsistent with the spec. ::text, on the other hand, is conspicuously missing from the spec; based on this answer, I think you can make a reasonable guess as to why.

Answered by BoltClock on Oct 16 '22 at 16:10