I am trying to parse HTML using lxml:
import lxml.html
a = lxml.html.fromstring('<html><body><span class="cut cross">Text of double class</span><span class="cross">Text of single class</span></body></html>')
s1 = a.xpath('.//span[@class="cross"]')
s2 = a.xpath('.//span[@class="cut cross"]')
s3 = a.xpath('.//span[@class="cut"]')
Output:
s1 => [<Element span at 0x7f0a6807a530>]
s2 => [<Element span at 0x7f0a6807a590>]
s3 => []
The first span tag has the class 'cut', yet s3 is empty. In s2, where I give both classes, the tag is returned.
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
lxml is a Python library for handling XML and HTML files, and it can also be used for web scraping. There are many off-the-shelf XML parsers, but lxml is a common choice when both a convenient API and good performance are needed.
The lxml module is an XML toolkit that provides a Pythonic binding for two C libraries, libxml2 and libxslt. It combines the feature set and speed of those libraries with a native Python API.
XPath's equality operator compares the whole attribute value, so @class="cut" only matches an element whose class attribute is exactly "cut".
If you want to match on one of the classes, you can use the contains function:
a.xpath('.//span[contains(@class, "cut")]')
However, contains is a substring test, so it also matches a class like cut2.
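One way around the substring problem, as a minimal sketch rather than the only approach, is the common XPath 1.0 idiom of padding the class attribute with spaces and searching for the class name surrounded by spaces. The third span with class cut2 and the variable names below are added here purely for illustration; they are not part of the original question.

```python
import lxml.html

# Same markup as in the question, plus a hypothetical "cut2" span
# to show the false positive produced by a plain contains().
doc = lxml.html.fromstring(
    '<html><body>'
    '<span class="cut cross">Text of double class</span>'
    '<span class="cross">Text of single class</span>'
    '<span class="cut2">Text of lookalike class</span>'
    '</body></html>'
)

# Substring test: matches "cut cross" and also "cut2".
loose = doc.xpath('.//span[contains(@class, "cut")]')

# Token test: wrap the normalized class list in spaces and look for " cut ".
exact = doc.xpath(
    './/span[contains(concat(" ", normalize-space(@class), " "), " cut ")]'
)

loose_texts = [e.text for e in loose]
exact_texts = [e.text for e in exact]
print(loose_texts)
print(exact_texts)
```

The normalize-space call collapses tabs, newlines and repeated spaces inside the attribute, so the token test still works when the class list is not neatly formatted.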
cssselect is a library that handles CSS selectors. pyquery, a wrapper around lxml, mimics the jQuery API in Python.
I'm pretty sure the CSS data model (i.e. classes are space-separated values in a single class attribute) isn't adhered to for XPath queries: XPath sees the class attribute as one plain string. To do what you want, you should look at using CSS selectors (for example, via cssselect).