I have spent a few hours trying to work out how to search all the text on a page and extract whatever matches a regex. My spider is set up as follows:
    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        units = response.xpath('//body/text()').re(r"Units: (\d)")
        print title, units
I would like to pull out the number after "Units: " on each page. When I run Scrapy on a page with Units: 351 in the body, I get only the page title, surrounded by a bunch of escape characters, and nothing for units.
I am new to Scrapy and have a little Python experience. Any help with extracting the integer after Units: and removing the extra escape characters "u'\r\n\t..." from the title would be much appreciated.
EDIT: As requested in a comment, here is a partial HTML extract of an example page. Note that the value could be inside tags other than the p in this example:
    <body>
    <div> Some content and multiple Divs here </div>
    <h1>This is the count for Dala</h1>
    <p><strong>Number of Units:</strong> 801</p>
    <p>We will have other content here and more divs beyond</p>
    </body>
Based on the answer below, this is what got me most of the way there. I am still working on removing Units: and the extra escape characters.
units = response.xpath('string(//body)').re("(Units: [\d]+)")
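Try string(//body) instead of //body/text(). The expression //body/text() selects only the text nodes that are direct children of body, so text nested inside p, strong or other tags is never searched. string(//body) returns the text of body and all of its descendants as a single string. Also use \d+ rather than \d, which matches only a single digit:
    units = response.xpath('string(//body)').re(r"Units: (\d+)")
Because the pattern contains a capture group, .re() returns only the captured digits, without the "Units: " prefix.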
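The remaining cleanup asked about in the question (getting an integer rather than a string, and stripping the \r\n\t padding from the title) can be sketched with the standard library alone. This is only an illustration: body_text and raw_title are made-up stand-ins for what string(//body) and //title/text() might return on the example page, and extract_units and clean_text are hypothetical helper names, not part of Scrapy.

```python
import re

# Hypothetical stand-ins for the strings the XPath queries would return.
body_text = (
    "\r\n\t Some content and multiple Divs here \r\n"
    "\tThis is the count for Dala\r\n"
    "\tNumber of Units: 801\r\n"
    "\tWe will have other content here and more divs beyond\r\n"
)
raw_title = "\r\n\t  Unit counts \r\n"


def extract_units(text):
    # Capture one or more digits after "Units:" and convert them to int.
    # Returns None when the page has no such label.
    match = re.search(r"Units:\s*(\d+)", text)
    return int(match.group(1)) if match else None


def clean_text(text):
    # str.split() with no arguments splits on any whitespace run,
    # so joining the pieces collapses CR, LF and tabs into single spaces.
    return " ".join(text.split())


units = extract_units(body_text)   # 801
title = clean_text(raw_title)      # "Unit counts"
```

Inside the spider, the same helpers could be applied to the extracted strings. An XPath-side alternative for the title would be normalize-space(//title), which trims and collapses whitespace before the value reaches Python; that is a suggestion on top of the answer above, not part of it.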