
What am I doing wrong? Parsing HTML using lxml

Tags: python, html, lxml

I'm trying to parse a webpage using lxml and I'm having trouble trying to bring back all the text elements within a div. Here's what I have so far...

import requests
from lxml import html
page = requests.get("https://www.goodeggs.com/sfbay/missionheirloom/seasonal-chicken-stew-16oz/53c68de974e06f020000073f",verify=False)
tree = html.fromstring(page.text)
foo = tree.xpath('//section[@class="product-description"]/div[@class="description-body"]/text()')

Right now "foo" comes back as an empty list, []. On some other pages I get part of the content, but not the text nested in tags inside the <div>. On others I get all of the content, because it sits at the top level of the div.

How do I bring back all of the text content within that div? Thanks!

asked Dec 20 '14 at 18:12 by jrubins


2 Answers

The text is inside two <p> tags, so part of the text is in each p.text instead of in div.text. However, you can pull all the text in all the children of <div> by calling the text_content method instead of using the XPath text():

import requests
import lxml.html as LH
url = ("https://www.goodeggs.com/sfbay/missionheirloom/" 
       "seasonal-chicken-stew-16oz/53c68de974e06f020000073f")
page = requests.get(url, verify=False)
root = LH.fromstring(page.text)

path = '//section[@class="product-description"]/div[@class="description-body"]'
for div in root.xpath(path):
    print(div.text_content())

yields

We’re super excited about the changing seasons! Because the new season brings wonderful new ingredients, we’ll be changing the flavor profile of our stews. Starting with deliveries on Thursday October 9th, the Chicken and Wild Rice stew will be replaced with a Classic Chicken Stew. We’re sure you’ll love it!Mission: Heirloom is a food company based in Berkeley. All of our food is sourced as locally as possible and 100% organic or biodynamic. We never cook with refined oils, and our food is always gluten-free, grain-free, soy-free, peanut-free, legume-free, and added sugar-free.
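To see the difference without hitting the network, here is a minimal sketch on a made-up fragment. The SAMPLE markup is an assumption that only mimics the structure described above (a div whose text lives entirely inside two <p> children); it is not the real page.

```python
import lxml.html as LH

# Hypothetical stand-in for the product page: all text sits inside <p> children,
# and there is no whitespace between the tags.
SAMPLE = (
    '<section class="product-description">'
    '<div class="description-body">'
    '<p>First paragraph.</p><p>Second paragraph.</p>'
    '</div>'
    '</section>'
)

root = LH.fromstring(SAMPLE)
path = '//section[@class="product-description"]/div[@class="description-body"]'

# /text() selects only text nodes that are direct children of the div.
direct_text = root.xpath(path + '/text()')

# text_content() collects the text of the div and all of its descendants.
full_text = [div.text_content() for div in root.xpath(path)]

print(direct_text)
print(full_text)
```

On this fragment the first list is empty because the div has no text nodes of its own, while text_content() returns both paragraphs run together with no separator, just like the output above.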

PS. dfsq has already suggested using the XPath ...//text(). That also works, but unlike text_content, it returns the pieces of text as separate items:

In [256]: root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')

In [257]: root.xpath('//a//text()')
Out[257]: ['FOO ', 'BAR ', 'QUX', ' ', ' BAZ']

In [258]: [a.text_content() for a in root.xpath('//a')]
Out[258]: ['FOO BAR QUX  BAZ']
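Having the pieces as separate items can be handy when text_content glues paragraphs together (as in "love it!Mission:" above). A rough sketch, assuming you want a single space between pieces; joined_text is a hypothetical helper, not part of lxml, and the markup is made up:

```python
import lxml.html as LH

# Made-up fragment with two adjacent paragraphs and no whitespace between them.
root = LH.fromstring('<div><p>First paragraph.</p><p>Second paragraph.</p></div>')


def joined_text(element, sep=' '):
    """Join the non-empty text nodes under `element`, separated by `sep`."""
    pieces = (t.strip() for t in element.xpath('.//text()'))
    return sep.join(p for p in pieces if p)


glued = root.text_content()   # paragraphs run together
spaced = joined_text(root)    # paragraphs separated by a space

print(glued)
print(spaced)
```

Note that this also inserts the separator around inline tags such as <b>, so it may add spaces you don't want in the middle of a sentence.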
answered Sep 28 '22 at 07:09 by unutbu


I think the XPath expression should be:

//section[@class="product-description"]/div[@class="description-body"]//text()

UPD. As @unutbu pointed out, the above expression fetches the text nodes as a list, so you will have to loop over them. If you need the entire text content as a single string, check unutbu's answer for other options.
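For completeness, a small sketch of two ways to collapse that list into one string, reusing the toy markup from unutbu's answer. The XPath string() function is standard XPath 1.0; keep in mind it only uses the first matching node, so it fits best when the expression selects a single element.

```python
import lxml.html as LH

root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')

# //text() returns a list of text nodes; join them yourself...
pieces = root.xpath('//a//text()')
joined = ''.join(pieces)

# ...or let XPath build the string value of the first matching node.
as_string = root.xpath('string(//a)')

print(pieces)
print(joined)
print(as_string)
```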

answered Sep 28 '22 at 07:09 by dfsq