
HXS filtering with scrapy - python

I'm new to this and need more information; I couldn't find anything on the Internet. At the moment I use hxs.select('//div[@id="CategoryBreadcrumb"]//text()').extract(). This div contains a ul whose li elements each hold an anchor, except one. I need the text from the li that has no a tag in it. I'd also be grateful for any educational links on HXS filtering. Thanks in advance! Here is an example in case it's hard to visualize what I need:

<div id='CategoryBreadcrumb'>
<ul>
  <li><a href=#>I dont need</a></li>
  <li><a href=#>I dont need</a></li>
  <li><a href=#>I dont need</a></li>
  <li>Text that i need</li>
</ul>
</div>
Martin asked Mar 04 '26 10:03


1 Answer

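Try the expression below. li/text() selects only the text nodes that are direct children of each li, so the text inside the anchors is skipped: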

hxs.select('//div[@id = "CategoryBreadcrumb"]/ul/li/text()')

To learn more about XPath, see w3schools for the basics and w3.org for the full specification.


PS: Scrapy uses lxml. You can test your XPath expressions using code like this:

import lxml.html as LH

text = '''
<div id='CategoryBreadcrumb'>
<ul>
  <li><a href=#>I dont need</a></li>
  <li><a href=#>I dont need</a></li>
  <li><a href=#>I dont need</a></li>
  <li>Text that i need</li>
</ul>
</div>
'''

doc = LH.fromstring(text)
print(doc.xpath('//div[@id = "CategoryBreadcrumb"]/ul/li/text()'))

# ['Text that i need']
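
If an li could ever contain both an anchor and some loose text, li/text() would pick up that loose text as well. A minimal sketch of a stricter variant, assuming the goal is "only li elements with no a child": the not(a) predicate is standard XPath 1.0, while the extra "trailing text" item and the variable names are made up for illustration. The same expression string should be usable with hxs.select(...).extract() from the question.

```python
import lxml.html as LH

# Sample markup; the second <li> is a hypothetical case with text next to an anchor.
html = '''
<div id="CategoryBreadcrumb">
<ul>
  <li><a href="#">I dont need</a></li>
  <li><a href="#">I dont need</a> trailing text</li>
  <li>Text that i need</li>
</ul>
</div>
'''

doc = LH.fromstring(html)

# Direct text nodes of every li, including text that sits next to an anchor.
all_text = doc.xpath('//div[@id="CategoryBreadcrumb"]/ul/li/text()')

# Only li elements that have no <a> child at all.
no_anchor = doc.xpath('//div[@id="CategoryBreadcrumb"]/ul/li[not(a)]/text()')

print(all_text)
print(no_anchor)
```

The predicate states the condition from the question directly, so the result does not depend on the anchors happening to wrap all of the unwanted text.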
unutbu answered Mar 06 '26 22:03