
How can I extract only the text from a Scrapy selector in Python?

I have this code:

    site = hxs.select("//h1[@class='state']")
    log.msg(str(site[0].extract()), level=log.ERROR)

The output is:

 [scrapy] ERROR: <h1 class="state"><strong>             1</strong>             <span> job containing <strong>php</strong> in <strong>region</strong> paying  <strong>$30-40k per year</strong></span>                 </h1> 

Is it possible to get only the text, without any HTML tags?

Mirage asked Nov 21 '12 08:11



1 Answer

//h1[@class='state'] 

This XPath selects the h1 element whose class attribute is "state", so extracting it returns the whole element, markup included.

If you want only the text that sits directly inside the h1 tag, append /text():

//h1[@class='state']/text() 

If you want the text of the h1 tag and of all the tags nested inside it, use //text() instead:

//h1[@class='state']//text() 

So the difference is that /text() returns only the text nodes that are direct children of the element, while //text() returns the text nodes of the element and of every descendant.
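To see the difference without running a spider, here is a minimal sketch that uses the standard library's xml.etree.ElementTree on the h1 from the question. It is not Scrapy's API; the two helper functions (direct_text_nodes and all_text_nodes, names made up for this illustration) only mimic what /text() and //text() would return, and the long runs of whitespace in the markup have been shortened.

```python
import xml.etree.ElementTree as ET

# The <h1> from the question, with inner whitespace shortened (assumption:
# the markup is well formed enough to be parsed as XML).
HTML = (
    '<h1 class="state"><strong> 1</strong> '
    '<span> job containing <strong>php</strong> in <strong>region</strong>'
    ' paying <strong>$30-40k per year</strong></span> </h1>'
)

h1 = ET.fromstring(HTML)


def direct_text_nodes(elem):
    """Rough stand-in for elem/text(): text nodes that are direct children."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [n for n in nodes if n]


def all_text_nodes(elem):
    """Rough stand-in for elem//text(): text of elem and every descendant."""
    return list(elem.itertext())


# Direct children of <h1> hold only whitespace, so this ends up empty.
direct = ''.join(direct_text_nodes(h1)).strip()

# Descendant text holds the real content; split/join collapses whitespace.
full = ' '.join(''.join(all_text_nodes(h1)).split())

print(repr(direct))
print(full)
```

With this markup the direct text is empty after stripping, because everything visible lives inside the strong and span children. The split/join step is optional; it just collapses the padding that the page puts between tags.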

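For the markup in the question, nearly all of the text sits inside the strong and span children, so the //text() form is the one that returns it. The following should work for you:

site = ''.join(hxs.select("//h1[@class='state']//text()").extract()).strip()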
akhter wahab answered Oct 08 '22 13:10