 

BeautifulSoup - lxml and html5lib parsers scraping differences

I am using BeautifulSoup 4 with Python 2.7. I would like to extract certain elements from a website (quantities, see the example below). For some reason, the lxml parser doesn't let me extract all of the desired elements from the page; it only prints the first three. I am trying the html5lib parser to see if I can extract all of them.

The page contains multiple items with their price and quantity. A portion of the code containing the desired information for each item looks like this:

<td class="size-price last first" colspan="4">
                    <span>453 grams </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>

Let's consider the following three cases:

CASE 1 - DATA:

#! /usr/bin/python
from bs4 import BeautifulSoup
data = """
<td class="size-price last first" colspan="4">
                    <span>453 grams </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""                
soup = BeautifulSoup(data)
print soup.td.span.text

Prints:

453 grams 

CASE 2 - LXML:

#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup=BeautifulSoup(webpage, "lxml")
print soup.find('td', {'class': 'size-price'}).span.text

Prints:

453 grams

CASE 3 - HTML5LIB:

#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup=BeautifulSoup(webpage, "html5lib")
print soup.find('td', {'class': 'size-price'}).span.text

I get the following error:

Traceback (most recent call last):
  File "C:\Users\Dom\Python-Code\src\Testing-Code.py", line 6, in <module>
    print soup.find('td', {'class': 'size-price'}).span.text
AttributeError: 'NoneType' object has no attribute 'span'

How do I have to adapt my code in order to extract the information I want using the html5lib parser? I can see all of the desired information if I simply print the soup to the console after parsing with html5lib, so I figured it would let me get what I want. That is not the case, and I am also curious why the lxml parser doesn't seem to extract all of the quantities when I use:

print [td.span.text for td in soup.find_all('td', {'class': 'size-price'})]
asked Mar 27 '14 by LaGuille


People also ask

Is lxml faster than BeautifulSoup?

It is not uncommon for lxml/libxml2 to parse and fix broken HTML better, but BeautifulSoup has superior support for encoding detection. Which parser works better depends very much on the input. The downside of using BeautifulSoup as the parser is that it is much slower than lxml's HTML parser.
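
A minimal sketch of that trade-off, assuming the three common backends (html.parser, lxml, html5lib) are installed, is to feed the same fragment to each of them and compare what comes back:

# A minimal sketch: parse the same fragment with each parser backend
# and print what each one makes of it; the repaired trees can differ.
from bs4 import BeautifulSoup

snippet = '<td class="size-price last first"><span>453 grams </span></td>'
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(snippet, parser)
    print parser, soup.find('td', {'class': 'size-price'})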

What does lxml do in BeautifulSoup?

To prevent users from having to choose their parser library in advance, lxml can interface to the parsing capabilities of BeautifulSoup through the lxml.html.soupparser module. It provides three main functions: fromstring() and parse(), which parse a string or file using BeautifulSoup into an lxml.html document, and convert_tree(), which converts an existing BeautifulSoup tree into a list of top-level Elements.
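
A minimal sketch of that module, assuming both lxml and BeautifulSoup are installed, lets BeautifulSoup do the parsing and queries the resulting lxml tree with XPath:

# A minimal sketch of lxml.html.soupparser: BeautifulSoup parses the markup,
# the result comes back as an lxml element tree that supports XPath.
from lxml.html import soupparser

root = soupparser.fromstring('<td class="size-price"><span>453 grams </span></td>')
print root.xpath('.//span/text()')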

Is lxml faster than HTML parser?

lxml is faster than html.parser or html5lib. This is because the lxml parser that you invoke in Beautiful Soup is natively written in C (it uses the libxml2 C library), whereas html.parser is written in pure Python.
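
If you want to check the speed claim on your own page, a rough sketch with the standard timeit module looks like this (page.html is a placeholder for any saved HTML file; absolute numbers depend on the input):

# A rough benchmark sketch: time how long each parser takes to build the soup.
import timeit
from bs4 import BeautifulSoup

page = open('page.html').read()   # placeholder: any saved HTML file
for parser in ("lxml", "html.parser", "html5lib"):
    secs = timeit.timeit(lambda: BeautifulSoup(page, parser), number=10)
    print parser, secs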

What is lxml parser?

lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
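
A minimal sketch of the two styles, using a toy document and BytesIO in place of a real file:

# One-step parsing: build the whole tree at once.
from io import BytesIO
from lxml import etree

doc = '<items><item qty="453">grams</item></items>'
root = etree.fromstring(doc)
print root.findall('item')

# Event-driven parsing: react to elements as they are closed.
for event, elem in etree.iterparse(BytesIO(doc), events=('end',)):
    print event, elem.tag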


1 Answer

from lxml import etree

html = 'your html'                       # the raw HTML of the page
tree = etree.HTML(html)                  # lxml repairs the markup and builds a tree
tds = tree.xpath('.//td[@class="size-price last first"]')
for td in tds:
    price = td.xpath('.//span[@class="price"]')[0].text
    strike = td.xpath('.//span[@class="strike"]')[0].text
    spans = td.xpath('.//span')
    # the quantity span has no class, so pick the one whose text mentions grams
    quantity = [i.text for i in spans if i.text and 'grams' in i.text][0].strip()
    print quantity, strike, price
answered Oct 02 '22 by AutomaticStatic
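
As a side note on the original html5lib question: html5lib follows the HTML5 parsing rules strictly, so badly nested table markup can end up in a different place in the tree than where lxml puts it, and soup.find('td', ...) may then return None. A hedged debugging sketch, reusing the question's placeholder URL, is to count what each parser actually finds before drilling into the cells:

# A debugging sketch: compare how many matching cells each parser produces
# before assuming the selector itself is wrong.
from bs4 import BeautifulSoup
from urllib import urlopen

html = urlopen('The URL goes here').read()
for parser in ("lxml", "html5lib"):
    soup = BeautifulSoup(html, parser)
    cells = soup.select('td.size-price')       # CSS selector, built into BS4
    print parser, len(cells)
    for td in cells:
        print td.span.text if td.span else None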