lxml not properly parsing tags with multiple classes

Tags: python, lxml

I am trying to parse HTML using

import lxml.html

a = lxml.html.fromstring('<html><body><span class="cut cross">Text of double class</span><span class="cross">Text of single class</span></body></html>')
s1 = a.xpath('.//span[@class="cross"]')
s2 = a.xpath('.//span[@class="cut cross"]')
s3 = a.xpath('.//span[@class="cut"]')

Output:

s1 => [<Element span at 0x7f0a6807a530>]
s2 => [<Element span at 0x7f0a6807a590>]
s3 => []

The first span tag has the class 'cut', yet s3 is empty, while s2, where I give both classes, returns the tag.

WeaklyTyped asked Jan 21 '13 15:01



2 Answers

XPath's equality operator compares the whole attribute value as a string, so @class="cut" matches only when the attribute is exactly "cut". If you want to search for one of the classes, you can use the contains function:

a.xpath('.//span[contains(@class, "cut")]')

However, it also matches a class like cut2, because contains is a plain substring test.
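
One common XPath 1.0 workaround for the cut2 problem is to pad the normalized class attribute with spaces and search for the class name padded the same way, so only whole class tokens match. The sketch below is an illustration, not part of the original answer: the helper name xpath_has_class and the extra cut2 span are made up for the demo.

```python
import lxml.html

# Same markup as in the question, plus a hypothetical "cut2" span
# added here only to show the difference between the two predicates.
doc = lxml.html.fromstring(
    '<html><body>'
    '<span class="cut cross">Text of double class</span>'
    '<span class="cross">Text of single class</span>'
    '<span class="cut2">Text of similar class</span>'
    '</body></html>'
)


def xpath_has_class(name):
    """Build an XPath predicate matching `name` as a whole class token.

    Hypothetical helper: it pads the whitespace-normalized class value
    and the name with spaces, so "cut" does not match "cut2".
    """
    return 'contains(concat(" ", normalize-space(@class), " "), " %s ")' % name


# Substring test: matches "cut cross" and also "cut2".
loose = doc.xpath('.//span[contains(@class, "cut")]')
# Token test: matches only spans that really carry the class "cut".
strict = doc.xpath('.//span[%s]' % xpath_has_class("cut"))

loose_texts = [el.text for el in loose]
strict_texts = [el.text for el in strict]
print(loose_texts)
print(strict_texts)
```

The padded form is more verbose, but it stays in plain XPath and needs no extra package.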

cssselect is a library that handles CSS selectors, which treat each class in the attribute separately. A wrapper named pyquery mimics the jQuery library in Python.

Scharron answered Oct 14 '22 13:10


I'm pretty sure XPath queries don't adhere to the CSS data model (i.e. classes are space-separated values in a single class attribute); to XPath, class is just one string. To do what you want, you should look at using CSS selectors (for example, via cssselect).

djc answered Oct 14 '22 13:10