Can I supply a URL to lxml.etree.parse on Python 3?

Tags:

lxml

The documentation says I can:

lxml can parse from a local file, an HTTP URL or an FTP URL. It also auto-detects and reads gzip-compressed XML files (.gz).

(from http://lxml.de/parsing.html under "Parsers")

but a quick experiment seems to imply otherwise:

Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:45:13) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> from urllib.request import urlopen
>>> with urlopen('https://pypi.python.org/simple') as f:
...   tree = etree.parse(f, parser)
...
>>> tree2 = etree.parse('https://pypi.python.org/simple', parser)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src\lxml\lxml.etree.c:72655)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:106263)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106564)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105561)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100456)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94543)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:96003)
  File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:95015)
OSError: Error reading file 'https://pypi.python.org/simple': failed to load external entity "https://pypi.python.org/simple"
>>>

I can use the urlopen method, but the documentation seems to imply that passing a URL is somehow better. Also, I'm a bit concerned about relying on lxml if the documentation is inaccurate, particularly if I start needing to do anything more complex.

What is the correct way to parse HTML with lxml, from a known URL? And where should I be looking to see that documented?

Update: I get the same error if I use a http URL rather than a https one.

558

asked Oct 02 '14 14:10

Paul Moore

1 Answers

The issue is that lxml does not support HTTPS urls, and http://pypi.python.org/simple redirects to a HTTPS version.

So for any secure website, you need to read the URL yourself:

from lxml import etree
from urllib.request import urlopen

parser = etree.HTMLParser()

with urlopen('https://pypi.python.org/simple') as f:
    tree = etree.parse(f, parser)

answered Sep 27 '22 17:09

Paul Moore

Related questions
                            
                                Documenting a Python function with a docstring that is both readable in raw form and produces good sphinx output
                            
                                Flask and Python how to make search engine for data from mysql database
                            
                                Python’s argh library: preserve docstring formatting in help message
                            
                                How can I sort a list of Python classes by inheritance depth?
                            
                                Serving static files from root of Django development server
                            
                                Timer cannot restart after it is being stopped in Python
                            
                                How can I use the automatically created implicit through model class in Django in a ForeignKey field?
                            
                                Celery-Django as Daemon: Settings not found
                            
                                Django Ajax Submission with validation and multiple forms handling
                            
                                correct way of using os.path.join() in python
                            
                                Get a list of values from a list of dictionaries?
                            
                                Change default arguments of function in python
                            
                                Why is __len__() called implicitly on a custom iterator
                            
                                Tkinter after_cancel in python
                            
                                Move and zoom a tkinter canvas with mouse
                            
                                Element-wise matrix multiplication in NumPy
                            
                                QQuickView only supports loading of root objects that derive from QQuickItem error?
                            
                                Installing python server for emacs-jedi
                            
                                How to have a percentage chance of a command to run
                            
                                Importing a CSV file in pandas into a pandas dataframe

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Can I supply a URL to lxml.etree.parse on Python 3?

Tags:

python

lxml

Paul Moore

People also ask

1 Answers

Paul Moore

Recent Activity

Donate For Us