What is the deal about https when using lxml?

Tags: python, parsing, lxml

I am using lxml to parse HTML files from URLs.

For example:

import lxml.html

link = 'https://abc.com/def'
htmltree = lxml.html.parse(link)

My code works well in most cases, namely the ones with http:// URLs. However, for every https:// URL, lxml simply raises an IOError. Does anyone know the reason, and possibly how to fix it?

BTW, I want to stick with lxml rather than switch to BeautifulSoup, since I already have a working program that I put together quickly.

449

asked Oct 24 '11 22:10

Flake

2 Answers

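I don't know exactly what's happening, but I get the same error, so HTTPS is probably not supported by lxml's built-in URL loading. You can easily work around this with urllib2, though:

from lxml import html
from urllib2 import urlopen  # Python 2; in Python 3 this is urllib.request.urlopen

# urlopen handles the HTTPS request; lxml then parses the file-like response
tree = html.parse(urlopen('https://duckduckgo.com'))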
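
On Python 3, where urllib2 was merged into urllib.request, the same workaround might look roughly like the sketch below. The helper name parse_url and the 10-second timeout are assumptions made for illustration, not part of the original answer.

```python
from urllib.request import urlopen

from lxml import html


def parse_url(url, timeout=10):
    """Fetch `url` with urllib (which handles HTTPS) and parse it with lxml.

    `parse_url` and the default timeout are hypothetical choices for this sketch.
    """
    # The response is a file-like object, which lxml.html.parse accepts directly.
    with urlopen(url, timeout=timeout) as response:
        return html.parse(response)
```

Calling parse_url('https://duckduckgo.com') should return an ElementTree, the same kind of object that lxml.html.parse(link) returns for an http:// URL, so the rest of an existing program would not need to change.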

189

answered Oct 04 '22 09:10

Fred Foo

From the lxml documentation:

"lxml can parse from a local file, an HTTP URL or an FTP URL"

I don't see HTTPS in that sentence anywhere, so I assume it is not supported.

An easy workaround is to retrieve the document with some other library that does support HTTPS, such as urllib2, and pass the retrieved content to lxml as a string.
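
A minimal sketch of that string-based approach, assuming Python 3 (where urllib2's urlopen lives in urllib.request). The function name fetch_and_parse and the timeout value are hypothetical; lxml.html.fromstring is the call that parses content already held in memory.

```python
from urllib.request import urlopen

from lxml import html


def fetch_and_parse(url, timeout=10):
    """Download `url` first, then hand the content to lxml as a string.

    `fetch_and_parse` and the default timeout are assumptions for this sketch.
    """
    with urlopen(url, timeout=timeout) as response:
        # Raw bytes; lxml works out the encoding from the markup itself.
        content = response.read()
    # base_url records where the document came from.
    return html.fromstring(content, base_url=url)
```

Unlike lxml.html.parse, which returns an ElementTree, fromstring returns the root element directly, so code that previously called .getroot() on the tree would use the returned element as-is.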

answered Oct 04 '22 11:10

kindall
