How can I extract only the text with a Scrapy selector in Python?

I have this code:

   site = hxs.select("//h1[@class='state']")
   log.msg(str(site[0].extract()), level=log.ERROR)

The output is:

 [scrapy] ERROR: <h1 class="state"><strong>             1</strong>             <span> job containing <strong>php</strong> in <strong>region</strong> paying  <strong>$30-40k per year</strong></span>                 </h1>

Is it possible to get only the text, without any HTML tags?

asked Nov 21 '12 08:11

Mirage

1 Answer

//h1[@class='state']

The XPath above selects the h1 element whose class attribute is state, so extract() returns the whole element, markup included.

If you want only the text that sits directly inside the h1 tag, append /text():

//h1[@class='state']/text()

If you want the text of the h1 tag as well as the text of the tags nested inside it, use //text():

//h1[@class='state']//text()

So the difference is that /text() selects only the text nodes that are direct children of the tag, while //text() selects the text nodes of the tag and of all its descendants.
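To see the difference on the question's markup, here is a minimal sketch that mimics the two expressions with Python's standard-library ElementTree rather than Scrapy. That swap is an assumption made only so the snippet runs without extra installs; ElementTree is not what Scrapy uses internally, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question; it happens to be well-formed XML,
# so ElementTree can parse it (real-world HTML often is not).
HTML = (
    '<h1 class="state"><strong>             1</strong>             '
    '<span> job containing <strong>php</strong> in <strong>region</strong> '
    'paying  <strong>$30-40k per year</strong></span>                 </h1>'
)

h1 = ET.fromstring(HTML)

# Rough analogue of /text(): only text that is a direct child of <h1>,
# i.e. the text before the first child plus the "tail" after each child.
direct_nodes = [h1.text or ""] + [child.tail or "" for child in h1]
direct_text = "".join(direct_nodes).strip()

# Rough analogue of //text(): text of <h1> and of every descendant,
# in document order.
all_nodes = list(h1.itertext())
all_text = "".join(all_nodes).strip()

print(repr(direct_text))  # direct children hold only whitespace here
print(all_text)
```

For this particular h1, every visible word lives inside a nested strong or span tag, so the direct-child form has nothing but whitespace to return, while the descendant form recovers the full sentence.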

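In your markup all of the visible text sits inside the nested strong and span tags, so /text() alone would return only whitespace. Use the //text() form:

site = ''.join(hxs.select("//h1[@class='state']//text()").extract()).strip()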
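Note that strip() only trims the two ends of the joined string; text gathered from nested tags usually still has runs of spaces in the middle. A small helper can collapse them. This is only a sketch: clean_text is a hypothetical name, and the fragments list is hand-copied from the question's markup as a stand-in for roughly what extract() would return for the descendant text nodes.

```python
def clean_text(fragments):
    """Join text fragments and collapse every whitespace run to one space."""
    return " ".join("".join(fragments).split())


# Hand-written stand-in for the list of text nodes of the question's <h1>.
fragments = [
    "             1",
    "             ",
    " job containing ",
    "php",
    " in ",
    "region",
    " paying  ",
    "$30-40k per year",
    "                 ",
]

print(clean_text(fragments))
```

In current Scrapy releases the same list would normally come from response.xpath("//h1[@class='state']//text()").getall(); hxs.select(...) belongs to the older selector API used in the question.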

answered Oct 08 '22 13:10

akhter wahab
