python lxml different result on windows and linux

Linux

>>> from lxml import etree
>>> html='''<td><a href=''>a1</a></td>
... <td><a href=''>a2</a></td>
... '''
>>> p=etree.HTML(html)
>>> a=p.xpath("//a[1]")
>>> for i in a:
...    print i.text
... 
a1
a2

Windows

>>> html='''<td><a href=''>a1</a></td>
... <td><a href=''>a2</a></td>
... '''
>>> from lxml import etree
>>> p=etree.HTML(html)
>>> a=p.xpath("//a[1]")
>>> for i in a:
...    print i.text
...
a1
>>> b=p.xpath("//a[2]")
>>> for i in b:
...    print i.text
...
a2

On Windows, I can simply use a[1] and a[2] to get the two values. But on Linux, the xpath //a[1] returns both link texts together.

This makes the program incompatible across the two operating systems, so I have to modify the code for each OS. Is this a bug in the lxml module? Is there a solution?

Niuya asked Jun 06 '14 05:06



1 Answer

I can confirm the result you report on Linux: //a[1] returns a list of two elements rather than a single element.

What is xpath //a[1] asking for

It is asking for every a element that is the first a child within its context.

Since each a element is nested inside its own td, that td is the context for calculating the position, and this situation occurs twice in the document.
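In XPath 1.0, `//` abbreviates `/descendant-or-self::node()/`, so the `[1]` predicate filters the `child::a` step of each parent separately. A minimal sketch of that expansion, assuming the markup from the question is wrapped in a `root` element so it parses as XML (the wrapper and variable names are only for illustration):

```python
from lxml import etree

# Same two links as in the question, wrapped in <root> so the input is well-formed XML.
doc = etree.fromstring(
    "<root><td><a href=''>a1</a></td><td><a href=''>a2</a></td></root>"
)

short_form = doc.xpath("//a[1]")
# "//" abbreviates "/descendant-or-self::node()/", so [1] applies to the child::a step.
long_form = doc.xpath("/descendant-or-self::node()/child::a[1]")

for elm in short_form:
    # Each match is the first <a> child of its own parent.
    print(elm.text, "is a[1] of parent", elm.getparent().tag)

# Both spellings select the same elements.
print([e.text for e in short_form] == [e.text for e in long_form])
```

Both links are printed with `td` as their parent, which is why two elements come back.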

Changing xpath to "(//a)[1]" resolves the problem.

Quoting from MSDN on Operators and Special Characters:

The filter pattern operators ([]) have a higher precedence than the path operators (/ and //). For example, the expression //comment()[3] selects all comments with an index equal to 3 relative to the comment's parent anywhere in the document. This differs from the expression (//comment())[3], which selects the third comment from the set of all comments relative to the parent. The first expression can return more than one comment, while the latter can return only one comment.
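The precedence rule quoted above can be seen directly by comparing the two spellings. This is a small sketch on the XML form of the question's markup; `nth_link_text` is a hypothetical helper introduced here, not part of lxml:

```python
from lxml import etree

doc = etree.fromstring(
    "<root><td><a href=''>a1</a></td><td><a href=''>a2</a></td></root>"
)


def nth_link_text(tree, n):
    """Return the text of the n-th <a> in document order, or None if there is none."""
    # Hypothetical helper: the parentheses make [n] index the whole //a node-set.
    found = tree.xpath("(//a)[%d]" % n)
    return found[0].text if found else None


# Without parentheses, [1] is applied per parent, so both links match.
per_parent = [elm.text for elm in doc.xpath("//a[1]")]
print(per_parent)             # ['a1', 'a2']

print(nth_link_text(doc, 1))  # a1
print(nth_link_text(doc, 2))  # a2
print(nth_link_text(doc, 3))  # None
```

Because `(//a)[n]` counts links in document order, it does not depend on how the links are grouped under their parents.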

Downgrade broken Windows lxml version 3.3.5

The xpath //a[1] returning only one element for the provided document is simply wrong and should be reported to the lxml authors.

Status of lxml on different platforms and OS:

  • Win: lxml 2.3.0 - OK
  • Win: lxml 3.3.5 - BUG
  • Lin: lxml 3.3.5 - OK
  • Lin: lxml 2.3.0 - OK

To make your solution portable, you should require lxml==2.3.0, as this version behaves correctly on both Windows and Linux (other versions may also work well on both platforms; I did not test any more).
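Before pinning a version, it can help to record which lxml and libxml2 builds are actually in use on each machine. The sketch below prints them together with the tree the HTML parser built from the question's markup. `lxml_build_info` is a hypothetical helper name, and the idea that the parsed tree is worth comparing across platforms is an assumption, not something established in this answer:

```python
from lxml import etree


def lxml_build_info():
    """Collect the lxml and libxml2 versions seen by the running interpreter."""
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        "libxml2 (compiled)": etree.LIBXML_COMPILED_VERSION,
    }


for name, version in lxml_build_info().items():
    print("%s: %s" % (name, ".".join(str(part) for part in version)))

# Tree that the HTML parser actually built from the question's markup.
html = "<td><a href=''>a1</a></td>\n<td><a href=''>a2</a></td>\n"
doc = etree.HTML(html)
print(etree.tostring(doc, pretty_print=True).decode())

# Parent of each link; compare this output between the two platforms.
for elm in doc.xpath("//a"):
    print(elm.text, "-> parent:", elm.getparent().tag)
```

If the parents printed on Windows and Linux differ, the positional predicate in //a[1] is being evaluated against different trees.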

Bonus - test suite

Assuming you have nose installed:

$ pip install nose

You can use the following test_xpath.py:

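from __future__ import print_function  # same print behaviour on Python 2 and 3

from lxml import etree
import nose

# Show which lxml version the tests are running against.
print("==================================")
print("lxml version: ", etree.__version__)
print("==================================")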

def test_html():
    html_str = """
    <td><a href=''>a1</a></td>
    <td><a href=''>a2</a></td>
    """
    doc = etree.HTML(html_str.strip())
    elms = doc.xpath("//a[1]")
    assert len(elms) == 2, """xpath `//a[1]` shall return 2 elements"""
    assert all(elm.tag == "a" for elm in elms), "all returned elements shall be `a`"
    assert elms[0].text == "a1"
    assert elms[1].text == "a2"

def test_xml():
    xml_str = """
    <root>
        <td><a href=''>a1</a></td>
        <td><a href=''>a2</a></td>
    </root>
    """
    doc = etree.fromstring(xml_str.strip())
    elms = doc.xpath("//a[1]")
    assert len(elms) == 2, """xpath `//a[1]` shall return 2 elements"""
    assert all(elm.tag == "a" for elm in elms), "all returned elements shall be `a`"
    assert elms[0].text == "a1"
    assert elms[1].text == "a2"

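# Run the tests only when the file is executed as a script, not when it is imported.
if __name__ == "__main__":
    nose.main()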

and quickly run the tests:

$ python test_xpath.py  -v
==================================
lxml version:  2.3.0
==================================
test_xpath.test_html ... ok
test_xpath.test_xml ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.002s

OK
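The suite above checks the bare //a[1]. A companion check for the parenthesised form suggested earlier could look like the sketch below. It uses the standard library's unittest instead of nose, and the class and constant names are made up for illustration:

```python
import unittest

from lxml import etree

# Markup from the question.
HTML_STR = "<td><a href=''>a1</a></td>\n<td><a href=''>a2</a></td>"


class PortableXPathTest(unittest.TestCase):
    """Checks the parenthesised form (//a)[n], not the bare //a[n]."""

    def setUp(self):
        self.doc = etree.HTML(HTML_STR)

    def test_first_link(self):
        elms = self.doc.xpath("(//a)[1]")
        self.assertEqual([elm.text for elm in elms], ["a1"])

    def test_second_link(self):
        elms = self.doc.xpath("(//a)[2]")
        self.assertEqual([elm.text for elm in elms], ["a2"])


if __name__ == "__main__":
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PortableXPathTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
```

Since `(//a)[n]` indexes the links in document order, these assertions do not depend on which parent each link ends up under.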
Jan Vlcinsky answered Sep 18 '22 02:09