Linux:
>>> from lxml import etree
>>> html='''<td><a href=''>a1</a></td>
... <td><a href=''>a2</a></td>
... '''
>>> p=etree.HTML(html)
>>> a=p.xpath("//a[1]")
>>> for i in a:
...     print i.text
...
a1
a2
Windows:
>>> html='''<td><a href=''>a1</a></td>
... <td><a href=''>a2</a></td>
... '''
>>> from lxml import etree
>>> p=etree.HTML(html)
>>> a=p.xpath("//a[1]")
>>> for i in a:
...     print i.text
...
a1
>>> b=p.xpath("//a[2]")
>>> for i in b:
...     print i.text
...
a2
On Windows, I can easily use a[1] and a[2] to get those two values. But on Linux, the xpath //a[1] returns both link texts together. This makes the program incompatible across these OSes, and I have to modify the code for each OS. Is it an lxml module bug? Is there any solution for this?
lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.
BeautifulSoup's default parser is written in pure Python and is slow. The internet is unanimous: one should install and use lxml alongside BeautifulSoup, as lxml is backed by a C parser and should be much, much faster.
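For instance, here is a minimal sketch (my own illustration, not from the original post) of telling BeautifulSoup to use lxml as its parser backend instead of the pure-Python html.parser; it assumes both bs4 and lxml are installed:

from bs4 import BeautifulSoup

html = "<td><a href=''>a1</a></td><td><a href=''>a2</a></td>"

# Pure-Python parser shipped with the standard library (slower).
soup_slow = BeautifulSoup(html, "html.parser")

# C-based libxml2 parser provided by lxml (much faster on large documents).
soup_fast = BeautifulSoup(html, "lxml")

print([a.get_text() for a in soup_fast.find_all("a")])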
lxml is not written in plain Python, because it interfaces with two C libraries: libxml2 and libxslt.
Unless you are using a static binary distribution (e.g. from a Windows binary installer), lxml requires libxml2 and libxslt to be installed, in particular: libxml2 version 2.9.2 or later.
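As a quick sanity check when behaviour differs between machines (an illustrative snippet, not part of the original question or answers), lxml exposes the versions it was built against and is running with:

from lxml import etree

print(etree.__version__)              # lxml version string, e.g. "2.3.0"
print(etree.LXML_VERSION)             # lxml version as a tuple
print(etree.LIBXML_VERSION)           # libxml2 version in use at runtime
print(etree.LIBXML_COMPILED_VERSION)  # libxml2 version lxml was compiled against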
I can confirm the same result on Linux as you report. It returns a list of two elements instead of a single element.
//a[1] is asking for any a element which is first within its context. As you have each a element embedded inside a td, the td is the context for calculating the position, and there are two occurrences of such a situation. Changing the xpath to "(//a)[1]" resolves the problem.
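A minimal sketch of the difference, using the same HTML fragment as the question (the exact output of //a[1] may vary with the libxml2 build, which is what the question observed):

from lxml import etree

html = "<td><a href=''>a1</a></td><td><a href=''>a2</a></td>"
doc = etree.HTML(html)

# //a[1]: the predicate [1] is evaluated relative to each <a>'s own parent,
# so every <a> that is the first <a> child of its parent can match.
print([a.text for a in doc.xpath("//a[1]")])

# (//a)[1]: the parentheses build the complete node-set first and [1] then
# selects the first node of that set, so at most one element is returned.
print([a.text for a in doc.xpath("(//a)[1]")])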
Quoting from MSDN on Operators and Special Characters:
The filter pattern operators ([]) have a higher precedence than the path operators (/ and //). For example, the expression //comment()[3] selects all comments with an index equal to 3 relative to the comment's parent anywhere in the document. This differs from the expression (//comment())[3], which selects the third comment from the set of all comments relative to the parent. The first expression can return more than one comment, while the latter can return only one comment.
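The same precedence rule can be reproduced with lxml directly (an illustrative example of the quoted behaviour, not taken from MSDN):

from lxml import etree

xml = "<root><!--r1--><!--r2--><!--r3--><a><!--a1--><!--a2--><!--a3--></a></root>"
doc = etree.fromstring(xml)

# //comment()[3]: every comment that is the third comment of its parent;
# here both r3 and a3 match.
print([c.text for c in doc.xpath("//comment()[3]")])

# (//comment())[3]: the third comment of the whole document; only r3 matches.
print([c.text for c in doc.xpath("(//comment())[3]")])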
An xpath of //a[1] returning only one element for the provided document is simply wrong and shall be reported to the lxml authors.
Status of lxml on different platforms and OSes:
To make your solution portable, you shall require lxml==2.3.0, as this version behaves correctly on Windows as well as on Linux (there might be another version working well on both platforms; I did not test more).
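For example, pinning it at install time (or the equivalent line in a requirements file):
$ pip install lxml==2.3.0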
Assuming you have nose installed:
$ pip install nose
you can use the following test_xpath.py:
from lxml import etree
import nose

print "=================================="
print "lxml version: ", etree.__version__
print "=================================="


def test_html():
    # HTML fragment from the question; //a[1] must match both <a> elements.
    html_str = """
    <td><a href=''>a1</a></td>
    <td><a href=''>a2</a></td>
    """
    doc = etree.HTML(html_str.strip())
    elms = doc.xpath("//a[1]")
    assert len(elms) == 2, """xpath `//a[1]` shall return 2 elements"""
    assert all(elm.tag == "a" for elm in elms), "all returned elements shall be `a`"
    assert elms[0].text == "a1"
    assert elms[1].text == "a2"


def test_xml():
    # The same check against a well-formed XML document.
    xml_str = """
    <root>
        <td><a href=''>a1</a></td>
        <td><a href=''>a2</a></td>
    </root>
    """
    doc = etree.fromstring(xml_str.strip())
    elms = doc.xpath("//a[1]")
    assert len(elms) == 2, """xpath `//a[1]` shall return 2 elements"""
    assert all(elm.tag == "a" for elm in elms), "all returned elements shall be `a`"
    assert elms[0].text == "a1"
    assert elms[1].text == "a2"


nose.main()
and perform a test quickly:
$ python test_xpath.py -v
==================================
lxml version: 2.3.0
==================================
test_xpath.test_html ... ok
test_xpath.test_xml ... ok
----------------------------------------------------------------------
Ran 2 tests in 0.002s
OK