How to target data attribute with Scrapy

Tags:

python

scrapy

I'm using the Scrapy library to crawl a webpage.

But I have a problem: I do not know how to target a data attribute.

I have a link with a data attribute and an href as follows:

<a data-item-name="detail-page-link" href="this-is-some-link">

What I want is the value of href. If the a element had a class, I could do it as follows:

response.css('.some-class::attr(href)')

But the problem is that I do not know how to target the data-item-name attribute.

Any advice?


asked Jun 07 '18 07:06

Boky

2 Answers

Using Scrapy's css selector, you can do:

response.css('a[data-item-name="detail-page-link"]::attr(href)').extract()

answered Oct 05 '22 13:10

Sijan Bhandari

I'm not sure if you can do this with the css method, but with the xpath method you should be able to do:

response.xpath("//a[@data-item-name]/@href")

answered Oct 05 '22 13:10

xystum
