I'm trying to parse a webpage using lxml and I'm having trouble trying to bring back all the text elements within a div. Here's what I have so far...
import requests
from lxml import html
page = requests.get("https://www.goodeggs.com/sfbay/missionheirloom/seasonal-chicken-stew-16oz/53c68de974e06f020000073f",verify=False)
tree = html.fromstring(page.text)
foo = tree.xpath('//section[@class="product-description"]/div[@class="description-body"]/text()')
As of now "foo" brings back an empty list []. Some other pages bring back part of the content, but not the content that sits inside tags within the <div>. Others bring back all of the content, because it is at the top level of the div.
How do I bring back all of the text content within that div? Thanks!
The text is inside two <p> tags, so part of the text is in each p.text instead of in div.text. However, you can pull all the text in all the children of <div> by calling the text_content method instead of using the XPath text():
import requests
import lxml.html as LH
url = ("https://www.goodeggs.com/sfbay/missionheirloom/"
       "seasonal-chicken-stew-16oz/53c68de974e06f020000073f")
page = requests.get(url, verify=False)
root = LH.fromstring(page.text)
path = '//section[@class="product-description"]/div[@class="description-body"]'
for div in root.xpath(path):
    print(div.text_content())
yields
We’re super excited about the changing seasons! Because the new season brings wonderful new ingredients, we’ll be changing the flavor profile of our stews. Starting with deliveries on Thursday October 9th, the Chicken and Wild Rice stew will be replaced with a Classic Chicken Stew. We’re sure you’ll love it!Mission: Heirloom is a food company based in Berkeley. All of our food is sourced as locally as possible and 100% organic or biodynamic. We never cook with refined oils, and our food is always gluten-free, grain-free, soy-free, peanut-free, legume-free, and added sugar-free.
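To see the difference without hitting the network, here is a minimal sketch. The SAMPLE markup is a made-up stand-in for the product page (an assumption, not the real HTML), with the text wrapped in two <p> tags as described above:

```python
import lxml.html as LH

# Hypothetical markup standing in for the real page: the div has no text
# of its own, only two <p> children that hold the text.
SAMPLE = (
    '<section class="product-description">'
    '<div class="description-body">'
    '<p>First paragraph.</p><p>Second paragraph.</p>'
    '</div>'
    '</section>'
)

root = LH.fromstring(SAMPLE)
path = '//section[@class="product-description"]/div[@class="description-body"]'

# text() only selects text nodes that are direct children of the div.
direct_text = root.xpath(path + '/text()')

# text_content() collects the text of the div and all of its descendants.
full_text = root.xpath(path)[0].text_content()

print(direct_text)  # []
print(full_text)    # First paragraph.Second paragraph.
```

Note that text_content() joins the paragraphs with no separator, which is why the two paragraphs in the output above run together.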
PS. dfsq has already suggested using the XPath ...//text(). That also works, but in contrast to text_content, the pieces of text are returned as separate items:
In [256]: root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')
In [257]: root.xpath('//a//text()')
Out[257]: ['FOO ', 'BAR ', 'QUX', ' ', ' BAZ']
In [258]: [a.text_content() for a in root.xpath('//a')]
Out[258]: ['FOO BAR QUX  BAZ']
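As a quick check of how the two approaches relate, the sketch below reuses the same toy markup and compares the joined text nodes with text_content(). The variable names are only for illustration:

```python
import lxml.html as LH

root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')
a = root.xpath('//a')[0]

# Every text node under <a>, in document order, as separate items.
pieces = a.xpath('.//text()')

# Concatenating the pieces gives the same string as text_content().
joined = ''.join(pieces)
content = a.text_content()

print(joined == content)  # True
```

So text_content() is the convenient choice when one string is wanted, while //text() is useful when each piece needs to be handled on its own.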
I think the XPath expression should be:
//section[@class="product-description"]/div[@class="description-body"]//text()
UPD. As @unutbu pointed out, the expression above will fetch the text nodes as a list, so you will have to loop over them. If you need the entire text content as a single string, check unutbu's answer for other options.
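For example, one way to loop over the text nodes and put a line break between them might look like this. It is only a sketch: SAMPLE is invented markup standing in for the real page, and the strip-and-join step is one possible cleanup, not something lxml does for you:

```python
import lxml.html as LH

# Hypothetical markup; the real page is assumed to have a similar shape.
SAMPLE = (
    '<section class="product-description">\n'
    '<div class="description-body">\n'
    '<p>First paragraph.</p>\n'
    '<p>Second paragraph.</p>\n'
    '</div>\n'
    '</section>'
)

root = LH.fromstring(SAMPLE)
path = ('//section[@class="product-description"]'
        '/div[@class="description-body"]//text()')

# Strip each text node and drop the ones that were whitespace only.
pieces = [t.strip() for t in root.xpath(path)]
text = '\n'.join(p for p in pieces if p)

print(text)
```

One caveat: if a paragraph contains inline tags such as <b>, each fragment becomes its own text node, so this simple join would also break the line inside that paragraph.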