
Can I supply a URL to lxml.etree.parse on Python 3?

Tags: python, lxml

The documentation says I can:

lxml can parse from a local file, an HTTP URL or an FTP URL. It also auto-detects and reads gzip-compressed XML files (.gz).

(from http://lxml.de/parsing.html under "Parsers")

but a quick experiment seems to imply otherwise:

Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:45:13) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> from urllib.request import urlopen
>>> with urlopen('https://pypi.python.org/simple') as f:
...   tree = etree.parse(f, parser)
...
>>> tree2 = etree.parse('https://pypi.python.org/simple', parser)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src\lxml\lxml.etree.c:72655)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:106263)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106564)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105561)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100456)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94543)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:96003)
  File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:95015)
OSError: Error reading file 'https://pypi.python.org/simple': failed to load external entity "https://pypi.python.org/simple"
>>>

I can use the urlopen method, but the documentation seems to imply that passing a URL is somehow better. Also, I'm a bit concerned about relying on lxml if the documentation is inaccurate, particularly if I start needing to do anything more complex.

What is the correct way to parse HTML with lxml, from a known URL? And where should I be looking to see that documented?

Update: I get the same error if I use an http URL rather than an https one.

asked Oct 02 '14 14:10 by Paul Moore



1 Answer

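The issue is that lxml does not support HTTPS URLs (it hands URL loading to the underlying libxml2 library, which has no HTTPS support), and http://pypi.python.org/simple redirects to an HTTPS version. That is why the plain http URL fails with the same error.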

So for any secure website, you need to read the URL yourself:

from lxml import etree
from urllib.request import urlopen

parser = etree.HTMLParser()

with urlopen('https://pypi.python.org/simple') as f:
    tree = etree.parse(f, parser)
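
If you do this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: parse_html_from_url, its opener parameter, and fake_opener are names made up for illustration and are not part of lxml. The opener hook lets you try the function without network access. Passing base_url tells lxml where the document came from, which should help if you later need to resolve relative links.

```python
from io import BytesIO
from urllib.request import urlopen

from lxml import etree


def parse_html_from_url(url, opener=urlopen):
    """Fetch `url` with Python's own HTTP stack, then let lxml parse it.

    `opener` is a hypothetical hook (not an lxml feature): any callable
    that takes a URL and returns a file-like context manager will do.
    """
    parser = etree.HTMLParser()
    with opener(url) as response:
        # lxml reads from the open response; base_url records the origin.
        return etree.parse(response, parser, base_url=url)


def fake_opener(url):
    # Stand-in for urlopen: serves canned HTML instead of hitting the network.
    return BytesIO(b"<html><body><a href='lxml/'>lxml</a></body></html>")


tree = parse_html_from_url("https://pypi.python.org/simple", opener=fake_opener)
links = [a.get("href") for a in tree.iter("a")]
```

For real use, drop the opener argument so the default urlopen is used; HTTPS and redirects are then handled by urllib rather than by lxml.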
answered Sep 27 '22 17:09 by Paul Moore