lxml in Python, parse from URL

I am new to lxml. I want to download a web page and extract the data I'm interested in from it. My code is:

import urllib2
from lxml import etree

url = "http://www.example.com/"

html = urllib2.urlopen(url)

root = etree.parse(html) # the problem is here

Can anyone explain why this is wrong?

The error is:

Traceback (most recent call last):
  File "yatego.py", line 10, in <module>
    root = etree.parse(html)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79703)
  File "parser.pxi", line 1580, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:80012)
  File "parser.pxi", line 1463, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:78908)
  File "parser.pxi", line 1019, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:75905)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Entity 'mdash' not defined, line 4, column 21

This code:

url = "http://www.example.com/"

res = requests.get(url)
doc = lxml.html.parse(res.content)

gives this error:

File "yatego.py", line 11, in <module>
    doc = lxml.html.parse(res.content)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 692, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
  File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
  File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
  File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
IOError: Error reading file '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>IANA &mdash; Example domains</title>

This code:

doc = lxml.html.parse(url)

works fine.

So where is the problem?

asked Mar 20 '12 by user873286

People also ask

Is lxml faster than BeautifulSoup?

It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. Which parser works better depends very much on the input. In the end, the lxml documentation notes that the downside of using BeautifulSoup's parser is that it is much slower than lxml's own HTML parser.

What does lxml do in BeautifulSoup?

To prevent users from having to choose their parser library in advance, lxml can interface to the parsing capabilities of BeautifulSoup through the lxml.html.soupparser module. It provides three main functions: fromstring() and parse() to parse a string or file using BeautifulSoup into an lxml.html document, and convert_tree() to convert an existing BeautifulSoup tree into a list of top-level Elements.
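
For illustration, a minimal sketch of the soupparser route (assuming the BeautifulSoup package is installed; the URL and the title lookup are just placeholders):

import requests
from lxml.html import soupparser

url = "http://www.example.com/"
res = requests.get(url)

# parse the downloaded HTML with BeautifulSoup, but get lxml elements back
root = soupparser.fromstring(res.text)
print root.findtext('.//title')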

Are XML and lxml the same?

lxml is a Pythonic XML toolkit that is internally bound to two C libraries, libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API.


1 Answer

You should use lxml.html to parse HTML instead of lxml.etree.
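
For example, a minimal sketch reworking the original snippet with lxml.html (still fetching the page with urllib2, as in the question):

import urllib2
import lxml.html

url = "http://www.example.com/"
html = urllib2.urlopen(url)

# lxml.html uses libxml2's forgiving HTML parser, so HTML entities
# such as &mdash; no longer raise XMLSyntaxError
root = lxml.html.parse(html).getroot()
print root.findtext('.//title')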

You can also open the url directly with lxml:

doc = lxml.html.parse(url)

Sometimes lxml will have trouble dealing with HTTP's quirks, in which case you'd need to use a more robust solution to fetch pages, like requests:

res = requests.get(url)
# parse() treats a plain string as a filename or URL (which is exactly what
# caused the IOError above), so hand the downloaded content to fromstring()
doc = lxml.html.fromstring(res.content)
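
From there you can pull out the data you are interested in; a small illustrative sketch continuing from the doc above (the expressions are just examples):

# example extractions: the page title and all link targets
title = doc.findtext('.//title')
links = [a.get('href') for a in doc.xpath('//a[@href]')]
print title, links
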
answered by zeekay