My HTML file contains the following line:
<tr><td>&nbsp;</td><tr>
But when I do the parsing with lxml:
from lxml import etree as ET
tree = ET.parse("file.html")
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 3310, in lxml.etree.parse (src/lxml/lxml.etree.c:72517)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105979)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106278)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105277)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100227)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94350)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95786)
File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94853)
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 14, column 159
Use lxml.html, not lxml.etree, for HTML.
&nbsp; is legitimately not predefined in XML, but it's available for HTML. Thus:
>>> import lxml.html
>>> lxml.html.fromstring('''<tr><td>&nbsp;</td><tr>''')
<Element div at 0x10a7a5e68>
...works properly.
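Since the question parses a file rather than a string, here is a minimal sketch of the same idea using lxml.html.parse. The temporary file, its name and the sample markup are assumptions made up for illustration; they are not the asker's actual document.

```python
import os
import tempfile

import lxml.html

# Hypothetical stand-in for the asker's file.html.
html_source = (
    "<html><body><table><tr><td>&nbsp;</td></tr></table></body></html>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html_source)
    # The HTML parser knows the HTML entity set, so &nbsp; is accepted.
    tree = lxml.html.parse(path)

cell = tree.getroot().find(".//td")
cell_text = cell.text
# The entity is decoded to U+00A0 (no-break space).
print(repr(cell_text))
```

The object returned by lxml.html.parse supports the same getroot(), find() and XPath calls as one returned by lxml.etree.parse, so the rest of the code should need little or no change.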
Alternately, you can use the XML equivalent for &nbsp;, which is &#160;, in your document, or you can declare a DOCTYPE in your XML file and include <!ENTITY nbsp "&#160;"> in its contents.
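If the document really has to go through the XML parser, both alternatives can be sketched as follows. The sample strings are invented for illustration, and the sketch assumes lxml's default parser settings, under which entities declared in the internal DTD subset are expanded.

```python
from lxml import etree

# Option 1: use the numeric character reference instead of &nbsp;.
numeric = etree.fromstring("<tr><td>&#160;</td></tr>")

# Option 2: declare the nbsp entity in an internal DTD subset.
declared = etree.fromstring(
    '<!DOCTYPE tr [<!ENTITY nbsp "&#160;">]>'
    "<tr><td>&nbsp;</td></tr>"
)

numeric_text = numeric.findtext("td")
declared_text = declared.findtext("td")

# Both cells should hold the same U+00A0 character.
print(repr(numeric_text), repr(declared_text))
```

Either way the file has to be well-formed XML in every other respect, which is why lxml.html is usually the easier route for real-world HTML.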