
Excluding specific child node with XPath and Scrapy/lxml

I have some Python code that scrapes BBCode forums using Scrapy, and I need an XPath expression that gives me just the text of each post, excluding the text inside quotes. The HTML looks like this:

<td class="postbody">
   hi this is a response
   <div class="bbc-block">
      <blockquote>
         blah blah blah here's a quote
         <br>
      </blockquote>
   </div>
   <br>
   and now I'm responding to what I quoted
</td>
<td class="postbody">
   <div class="bbc-block">
      <blockquote>
         and now I'm responding to what I quoted
         <br>
      </blockquote>
   </div>
   <br>
   wow what a great response
</td>

This occurs many times per page, for each post. What I ultimately want is just the text for each of these td nodes with the blockquote excluded:

  1. hi this is a response \n and now I'm responding to what I quoted
  2. wow what a great response

The Python code I have to extract these blocks is below. I first convert scrapy's HtmlResponse to lxml's HtmlElement, because that was the only way I could figure out to use lxml's text_content() method:

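import lxml.html as ht

# Method on the spider class, hence the self parameter.
def posts_from_response(self, response):
    # Re-parse the raw response body with lxml.
    dom = ht.fromstring(response.body)
    # One element per post.
    posts = dom.xpath('//td[@class="postbody"]')
    # text_content() returns all descendant text, quotes included.
    posts_text = [p.text_content() for p in posts]
    return posts_text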

I've searched for solutions extensively for a few days, and tried about a dozen variations of

'//td[@class="postbody"][not(@class="bbc-block")]'

appended to that in various ways, but nothing gets me exactly what I want with the grouping that I want.

Is there either 1. a way to get this with a single expression, or 2. a way to run a second XPath selector on my posts list to exclude the bbc-block nodes?

stuart asked Mar 03 '26 20:03


1 Answer

To get only the text nodes that are direct children of the td, try:

//*[@class='postbody']/text()  
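
To keep the text grouped per post, the same idea can be run as a relative XPath on each td. This is only a rough sketch: the sample HTML is trimmed from the question and wrapped in a table so it parses cleanly, and the helper name direct_text_per_post is made up for illustration.

```python
import lxml.html as ht

# Sample markup trimmed from the question; the <table><tr> wrapper is an
# assumption added so the fragment parses as a single root element.
SAMPLE = """
<table><tr>
<td class="postbody">
   hi this is a response
   <div class="bbc-block"><blockquote>blah blah blah here's a quote<br></blockquote></div>
   <br>
   and now I'm responding to what I quoted
</td>
<td class="postbody">
   <div class="bbc-block"><blockquote>and now I'm responding to what I quoted<br></blockquote></div>
   <br>
   wow what a great response
</td>
</tr></table>
"""

def direct_text_per_post(html):
    """Return one string per post, built only from the td's direct text nodes."""
    dom = ht.fromstring(html)
    results = []
    for post in dom.xpath('//td[@class="postbody"]'):
        # './text()' is relative to this td, so quoted text nested in the
        # bbc-block div is never selected.
        chunks = [t.strip() for t in post.xpath('./text()')]
        # Drop whitespace-only nodes and join the rest with newlines.
        results.append('\n'.join(c for c in chunks if c))
    return results
```

Calling direct_text_per_post(SAMPLE) gives one entry per post. Note that this drops any text nested in other child tags (a bold or italic run, say), not just the quotes.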

To get all text nodes in the td but ignore text inside the div with class 'bbc-block':

//td//text()[not(ancestor::*[@class='bbc-block'])]
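
For the per-post grouping asked about in the question, the ancestor test can also be applied relative to each td. The following is a sketch under assumptions, not a drop-in for the spider: posts_without_quotes is a hypothetical helper, and the sample markup (including the bold tag) is invented to show that nested non-quote text survives.

```python
import lxml.html as ht

def posts_without_quotes(html):
    """Return one string per post, skipping any text under a bbc-block element."""
    dom = ht.fromstring(html)
    texts = []
    for post in dom.xpath('//td[@class="postbody"]'):
        # './/text()' walks every descendant text node of this td; the
        # predicate drops nodes that have a bbc-block element as an ancestor.
        nodes = post.xpath('.//text()[not(ancestor::*[@class="bbc-block"])]')
        cleaned = [n.strip() for n in nodes]
        # Skip whitespace-only nodes and join the remainder with spaces.
        texts.append(' '.join(c for c in cleaned if c))
    return texts

# Invented sample: one post with a bold run and a quoted block.
EXAMPLE = """<table><tr><td class="postbody">
   hi this is a <b>bold</b> response
   <div class="bbc-block"><blockquote>a quote<br></blockquote></div>
   <br>
   and now I'm responding
</td></tr></table>"""
```

Unlike the direct-child version, this keeps text inside other nested tags, so posts_without_quotes(EXAMPLE) retains the word in bold while the quote is left out.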
hr_117 answered Mar 05 '26 10:03