I'm new to this area and need more information; I couldn't find much on the Internet. At the moment I use hxs.select('//div[@id="CategoryBreadcrumb"]//text()').extract(). Inside this div there is a ul whose li elements each contain an anchor, except for one. I need the text from the li that doesn't have an a tag in it. I'd also be thankful for any educational links on hxs filtering. Thanks in advance!
Here is an example in case you can't visualize what I need:
<div id='CategoryBreadcrumb'>
<ul>
<li><a href=#>I dont need</a></li>
<li><a href=#>I dont need</a></li>
<li><a href=#>I dont need</a></li>
<li>Text that i need</li>
</ul>
</div>
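Try:
hxs.select('//div[@id = "CategoryBreadcrumb"]/ul/li/text()').extract()
The single / before text() selects only text nodes that are direct children of each li, so the link text nested inside the a elements is skipped.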
To learn more about XPath, see w3schools for the basics and w3.org for the full specification.
PS: Scrapy uses lxml. You can test your XPath expressions using code like this:
import lxml.html as LH
text = '''
<div id='CategoryBreadcrumb'>
<ul>
<li><a href=#>I dont need</a></li>
<li><a href=#>I dont need</a></li>
<li><a href=#>I dont need</a></li>
<li>Text that i need</li>
</ul>
</div>
'''
doc = LH.fromstring(text)
print(doc.xpath('//div[@id = "CategoryBreadcrumb"]/ul/li/text()'))
# ['Text that i need']
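If an li might hold both a link and some stray text (a separator, say), li/text() would return that stray text too. A stricter variant, sketched here under that assumption, filters with the predicate li[not(a)]. The modified sample markup below is made up for illustration and is not from the question:

```python
import lxml.html as LH

# Hypothetical markup: the first <li> has a link followed by loose text
html = '''
<div id="CategoryBreadcrumb">
<ul>
<li><a href="#">I dont need</a> &gt; </li>
<li><a href="#">I dont need</a></li>
<li>Text that i need</li>
</ul>
</div>
'''
doc = LH.fromstring(html)

# li/text() also picks up the loose text that sits next to a link
loose = doc.xpath('//div[@id="CategoryBreadcrumb"]/ul/li/text()')

# li[not(a)] keeps only the <li> elements that have no <a> child
strict = doc.xpath('//div[@id="CategoryBreadcrumb"]/ul/li[not(a)]/text()')
print(strict)
```

Which form to use depends on the real page: if the link-bearing items never carry extra text, the shorter li/text() expression is enough.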