I've created a scraper to grab some product names from a webpage. It is working smoothly. I've used CSS selectors to do the job. However, the only thing I can't understand is the difference between the selectors a::text and a ::text (don't overlook the space between a and ::text in the latter). When I run my script, I get the same exact result no matter which selector I choose.
import requests
from scrapy import Selector

res = requests.get("https://www.kipling.com/uk-en/sale/type/all-sale/?limit=all#")
sel = Selector(res)
for item in sel.css(".product-list-product-wrapper"):
    # Same selector twice; the only difference is the space before ::text.
    title = item.css(".product-name a::text").extract_first().strip()
    title_ano = item.css(".product-name a ::text").extract_first().strip()
    print("Name: {}\nName_ano: {}\n".format(title, title_ano))
As you can see, both title and title_ano contain the same selector, bar the space in the latter. Nevertheless, the results are always the same.
My question: is there any substantial difference between the two, and when should I use the former and when the latter?
Interesting observation! I spent the past couple of hours investigating this and it turns out, there's a lot more to it than meets the eye.
If you're coming from CSS, you'd probably expect to write a::text in much the same way you'd write a::first-line, a::first-letter, a::before or a::after. No surprises there.
On the other hand, standard selector syntax would suggest that a ::text matches the ::text pseudo-element of a descendant of the a element, making it equivalent to a *::text. However, .product-list-product-wrapper .product-name a doesn't have any child elements, so by rights, a ::text is supposed to match nothing. The fact that it does match suggests that Scrapy is not following the grammar.
Scrapy uses Parsel (itself based on cssselect) to translate selectors into XPath, which is where ::text comes from. With that in mind, let's examine how Parsel implements ::text:
>>> from parsel import css2xpath
>>> css2xpath('a::text')
'descendant-or-self::a/text()'
>>> css2xpath('a ::text')
'descendant-or-self::a/descendant-or-self::text()'
So, as in cssselect, anything that follows a descendant combinator is translated into a descendant-or-self axis. But because text nodes are proper children of element nodes in the DOM, ::text is treated as a standalone node and converted directly to text(). Combined with the descendant-or-self axis, this matches any text node that is a descendant of an a element, just as a/text() matches any text node that is a child of an a element. Since a child is also a descendant, both selectors return the same result when the a element contains only text.
Egregiously, this happens even when you add an explicit * to the selector:
>>> css2xpath('a *::text')
'descendant-or-self::a/descendant-or-self::text()'
However, the use of the descendant-or-self axis means that a ::text can match all text nodes in the a element, including those in other elements nested within the a. In the following example, a ::text will match two text nodes: 'Link ' followed by 'text':
<a href="https://example.com">Link <span>text</span></a>
So while Scrapy's implementation of ::text is an egregious violation of the Selectors grammar, it seems to have been done this way very much intentionally.
In fact, Scrapy's other pseudo-element, ::attr() (see footnote 1), behaves similarly. The following selectors all match the id attribute node belonging to the div element when it does not have any descendant elements:
>>> css2xpath('div::attr(id)')
'descendant-or-self::div/@id'
>>> css2xpath('div ::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'
>>> css2xpath('div *::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'
... but div ::attr(id) and div *::attr(id) will match all id attribute nodes within the div's descendants along with its own id attribute, such as in the following example:
<div id="parent"><p id="child"></p></div>
This, of course, is a much less plausible use case, so one has to wonder if this was an unintentional side effect of the implementation of ::text.
Compare the pseudo-element selectors to one that substitutes any simple selector for the pseudo-element:
>>> css2xpath('a [href]')
'descendant-or-self::a/descendant-or-self::*/*[@href]'
This correctly translates the descendant combinator to descendant-or-self::*/* with an additional implicit child axis, ensuring that the [@href] predicate is never tested on the a element.
If you're new to XPath, Selectors, or even Scrapy, this may all seem very confusing and overwhelming. So here's a summary of when to use one selector over the other:
Use a::text if your a element contains only text, or if you're only interested in the top-level text nodes of this a element and not its nested elements.
Use a ::text if your a element contains nested elements and you want to extract all the text nodes within this a element.
While you can use a ::text if your a element contains only text, its syntax is confusing, so for the sake of consistency, use a::text instead.
Footnote 1: On an interesting note, ::attr() appears in the (abandoned as of 2021) Non-element Selectors spec, where, as you'd expect, it behaves consistently with the Selectors grammar, which makes its behavior in Scrapy inconsistent with the spec. ::text, on the other hand, is conspicuously missing from the spec; based on this answer, I think you can make a reasonable guess as to why.