I have some Python code that scrapes BBCode forums using Scrapy, and I need an XPath expression that returns only the text of each post, excluding the text inside quotes. The HTML looks like this:
<td class="postbody">
hi this is a response
<div class="bbc-block">
<blockquote>
blah blah blah here's a quote
<br>
</blockquote>
</div>
<br>
and now I'm responding to what I quoted
</td>
<td class="postbody">
<div class="bbc-block">
<blockquote>
and now I'm responding to what I quoted
<br>
</blockquote>
</div>
<br>
wow what a great response
</td>
This occurs many times per page, once for each post. What I ultimately want is just the text of each of these td nodes, with the blockquote excluded.
The Python code I have to extract these blocks is as follows. I first convert Scrapy's HtmlResponse to an lxml HtmlElement, because that was the only way I could figure out to use lxml's text_content() method:
import lxml.html as ht

def posts_from_response(self, response):
    dom = ht.fromstring(response.body)
    posts = dom.xpath('//td[@class="postbody"]')
    posts_text = [p.text_content() for p in posts]
    return posts_text
I've searched for solutions extensively for a few days, and tried about a dozen variations of
'//td[@class="postbody"][not(@class="bbc-block")]'
appended to that in various ways, but nothing gets me exactly what I want with the grouping that I want.
Is there either (1) a way to get this with a single expression, or (2) a way to run a second XPath selector on my posts list to exclude the bbc-block nodes?
To get only the text that is a direct child of the postbody td, try:
//*[@class='postbody']/text()
To get all text nodes in the td but ignore text inside the div with class 'bbc-block':
//td//text()[not(ancestor::*[@class='bbc-block'])]
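To keep the text grouped per post, one option is to select the postbody cells first and then run the second expression relative to each cell (note the leading dot). The following is only a minimal sketch using lxml on the sample HTML from the question; the names SAMPLE and posts_without_quotes are made up for illustration, and the table/tr wrapper is an assumption so that the fragment parses as a valid table.

```python
import lxml.html as ht

# Sample markup from the question, wrapped in <table><tr> (an assumption)
# so the HTML parser keeps the <td> elements.
SAMPLE = """
<table><tr>
<td class="postbody">
hi this is a response
<div class="bbc-block">
<blockquote>
blah blah blah here's a quote
<br>
</blockquote>
</div>
<br>
and now I'm responding to what I quoted
</td>
<td class="postbody">
<div class="bbc-block">
<blockquote>
and now I'm responding to what I quoted
<br>
</blockquote>
</div>
<br>
wow what a great response
</td>
</tr></table>
"""

def posts_without_quotes(html):
    """Return one string per post, skipping text inside bbc-block quotes."""
    dom = ht.fromstring(html)
    results = []
    for post in dom.xpath('//td[@class="postbody"]'):
        # The leading "." keeps the search inside this one post.
        parts = post.xpath('.//text()[not(ancestor::*[@class="bbc-block"])]')
        # Collapse newlines and repeated spaces left over from the markup.
        results.append(" ".join(" ".join(parts).split()))
    return results

posts = posts_without_quotes(SAMPLE)
for text in posts:
    print(text)
```

If you would rather stay with Scrapy's own selectors, the same relative expression should work on each item of response.xpath('//td[@class="postbody"]'), followed by .getall(), but I have not tested that against your pages.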