I am using lxml to parse HTML files from URLs. For example:
import lxml.html
link = 'https://abc.com/def'
htmltree = lxml.html.parse(link)
My code works well for most cases, namely URLs starting with http://. However, for every https:// URL, lxml simply raises an IOError. Does anyone know the reason? And possibly, how to fix this?
BTW, I want to stick with lxml rather than switch to BeautifulSoup, given that I already have a finished programme.
lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers. This is when the lxml library comes into play.
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
lxml aims to provide a Pythonic API by following as much as possible the ElementTree API. We're trying to avoid inventing too many new APIs, or you having to learn new things -- XML is complicated enough.
lxml can make use of BeautifulSoup as a parser backend, just like BeautifulSoup can employ lxml as a parser. When using BeautifulSoup from lxml, however, the default is to use Python's integrated HTML parser in the html.parser module.
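As a rough sketch of that ElementTree-style API (the tiny XML document, tag and attribute names below are invented purely for illustration), one-step and event-driven parsing look roughly like this:
from io import BytesIO
from lxml import etree

# Hypothetical document used only for this example.
xml = b"<root><item name='a'/><item name='b'/></root>"

# One-step parsing: build the whole tree in memory at once.
tree = etree.parse(BytesIO(xml))
root = tree.getroot()
names = [item.get('name') for item in root.findall('item')]  # ['a', 'b']

# Step-by-step (event-driven) parsing, currently available for XML only:
for event, element in etree.iterparse(BytesIO(xml), events=('end',)):
    if element.tag == 'item':
        print(element.get('name'))  # runs as each <item> element is closed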
I don't know what's happening, but I get the same errors. HTTPS is probably not supported. You can easily work around this with urllib2, though:
from lxml import html
from urllib2 import urlopen
html.parse(urlopen('https://duckduckgo.com'))
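This works because html.parse() accepts file-like objects as well as filenames and URLs, so the response returned by urlopen() can be handed to it directly; calling getroot() on the result gives you the document element.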
From the lxml documentation:
lxml can parse from a local file, an HTTP URL or an FTP URL
I don't see HTTPS in that sentence anywhere, so I assume it is not supported.
An easy workaround would be to retrieve the file using some other library that does support HTTPS, such as urllib2, and pass the retrieved document as a string to lxml.
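A minimal sketch of that string-based workaround, reusing the placeholder URL from the question and Python 2's urllib2 as in the other answer:
from lxml import html
from urllib2 import urlopen

# Let urllib2 do the HTTPS request, then hand the raw content to lxml as a string.
content = urlopen('https://abc.com/def').read()
doc = html.fromstring(content)
title = doc.findtext('.//title')  # e.g. pull out the page title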