
Scrapy Extract number from page text with regex

I have spent a few hours trying to work out how to search all the text on a page and extract whatever matches a regex. My spider is set up as follows:

def parse(self, response):
        title = response.xpath('//title/text()').extract()
        units = response.xpath('//body/text()').re(r"Units: (\d)")
        print title, units

I would like to pull out the number after "Units: " on each page. When I run Scrapy on a page with "Units: 351" in the body, I only get the page title, with a bunch of escape characters before and after it, and nothing for units.

I am new to Scrapy and have a little Python experience. Any help with how to extract the integer after "Units: " and remove the extra escape characters (u'\r\n\t...') from the title would be much appreciated.

EDIT: As per the comment, here is a partial HTML extract of an example page. Note that the text could be inside tags other than the p used in this example:

<body>
<div> Some content and multiple Divs here <div>
<h1>This is the count for Dala</h1>
<p><strong>Number of Units:</strong> 801</p>
<p>We will have other content here and more divs beyond</p>
</body>

Based on the answer below, this is what got me most of the way there. I am still working on removing "Units: " and the extra escape characters.

units = response.xpath('string(//body)').re("(Units: [\d]+)")
Xaxum asked Nov 03 '14 21:11

1 Answer

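Try string(//body) instead of //body/text(). The expression //body/text() selects only the text nodes that are direct children of body, so text nested inside a p or div is never searched; string(//body) returns all of the body's descendant text as one string. Also use \d+ so that the whole number is captured rather than just its first digit:

response.xpath('string(//body)').re(r"Units: (\d+)")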
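
To cover the two remaining parts of the question (getting the number without the "Units: " prefix, and removing the \r\n\t padding from the title), here is a minimal sketch in plain Python 3 that needs no Scrapy install. The strings body_text and raw_title are made-up stand-ins for what string(//body) and //title/text() might return for the sample page, and the helper names extract_units and clean_title are hypothetical.

```python
import re

# Assumed stand-in for the result of string(//body) on the sample page:
# all descendant text concatenated, with the tags removed.
body_text = (
    "\r\n Some content and multiple Divs here \r\n"
    "This is the count for Dala\r\n"
    "Number of Units: 801\r\n"
    "We will have other content here and more divs beyond\r\n"
)

# Assumed raw title, padded with the kind of whitespace seen in the question.
raw_title = "\r\n\t  Dala unit count \r\n\t"


def extract_units(text):
    """Return the integer after 'Units:', or None if it is not present."""
    # Only the parenthesised group is captured, so 'Units: ' is dropped;
    # \d+ matches the whole run of digits, not just the first one.
    match = re.search(r"Units:\s*(\d+)", text)
    return int(match.group(1)) if match else None


def clean_title(title):
    """Collapse every run of whitespace into one space and trim the ends."""
    return " ".join(title.split())


units = extract_units(body_text)
title = clean_title(raw_title)
print(title, units)
```

Inside the spider, the same pattern can be passed straight to the selector, as in response.xpath('string(//body)').re(r"Units:\s*(\d+)"), which returns a list of the captured digit strings. For the title, the XPath function normalize-space(//title) is another option that trims and collapses whitespace before extraction.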
Elias Dorneles answered Sep 24 '22 08:09
