
Converting XML illegal &char to utf8 - python

There is a list of XML and HTML character references at: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references.

However, some names that were used in older HTML-era text are not defined in that list at all. While processing the Senseval-2 format (with fixes) dataset from http://www.d.umn.edu/~tpederse/data.html, I ran into the following words, which break my script when it tries to parse the data with xml.etree.ElementTree.

What are the Unicode equivalents of these words?

&and.
&and.A
&and.B
&and.D
&and.L's
&backquote.alim)
&backquote.ulema
&dash
&dash.
&dash."
&dashq.
&degree.
&degree.C
&ellip
&ellip.
&ellip.0
&ellip.1
&ellip.11
&ellip.2
&ellip.23
&ellip.28
&ellip.38
&ellip.4
&ellip.6
&ellip.64
&ellip.?"
&ellip.two
&times.

My script:

import xml.etree.ElementTree as et
s1 = 'train-fix.xml' # from http://www.d.umn.edu/~tpederse/Data/Sval1to2.fix.tar.gz
tree = et.parse(s1)
root = tree.getroot()

gives this traceback:

Traceback (most recent call last):
  File "senseval.py", line 4, in <module>
    tree = et.parse(s1)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
    parser.feed(data)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 41, column 113

asked Sep 26 '13 14:09 by alvas

1 Answer

The "words" look like malformed entity references. A valid entity reference has a semicolon at the end. I looked at test-fix.xml (in Sval1to2.fix.tar.gz) and it seems very likely that &dash (or &dash.) is meant to represent some kind of dash or hyphen. The file has the .xml extension and it would be fairly close to being well-formed XML if the bad entity references were fixed.

On the page that you link to (http://www.d.umn.edu/~tpederse/data.html), it says:

Please note that our converted data will not "parse" as true xml text. This is due to the fact that in the original sense-tagged text, characters that require special handling in xml are not escaped, and so forth. We are considering ways to make this data "true" xml, and would be most grateful for any feedback on how to best do this.

So even though the document looks very much like XML, it is not XML and the people who published it are well aware of that.
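
If you still want a tree out of the file, one option is to rewrite the pseudo-entities as text before handing the data to the parser. The following is only a minimal sketch: the mapping in GUESSED is my own guess at what each name means (the dataset does not document them), and the names GUESSED, PSEUDO_ENTITY and fix_pseudo_entities are made up for this example. Names that are not in the mapping, such as &dashq., are left in the text with the ampersand escaped. Since the publishers say other special characters are unescaped too, this may not be enough on its own to make the real file parse.

```python
import re
import xml.etree.ElementTree as et

# Guessed meanings; none of these are documented by the dataset.
GUESSED = {
    "and": "&amp;",       # an ampersand has to stay escaped in XML
    "backquote": "`",
    "dash": "\u2014",     # em dash (guess, could equally be an en dash)
    "degree": "\u00b0",
    "ellip": "\u2026",
    "times": "\u00d7",
}

# '&name' with an optional '.' terminator, e.g. '&dash.' or '&ellip'.
# The lookahead skips real references such as '&amp;'.
PSEUDO_ENTITY = re.compile(r"&([a-z]+)(?![a-z;])\.?")


def fix_pseudo_entities(text):
    """Replace malformed '&name.' references so the text can be parsed."""
    def repl(match):
        name = match.group(1)
        if name in GUESSED:
            return GUESSED[name]
        # Unknown name: keep it visible, but escape the ampersand.
        return "&amp;" + match.group(0)[1:]
    return PSEUDO_ENTITY.sub(repl, text)


sample = "<s>salt &and. pepper &dash. 20 &degree.C &ellip.11 &dashq. &amp; ok</s>"
root = et.fromstring(fix_pseudo_entities(sample))
print(root.text)
```

For the real data you would read train-fix.xml into a string, pass it through fix_pseudo_entities, and call et.fromstring on the result instead of et.parse on the file name.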

answered Oct 05 '22 07:10 by mzjn
