
lxml: some XML from URL give this lxml.etree.XMLSyntaxError

I have a script which is supposed to extract some terms from XML files fetched from a list of URLs. All the URLs give access to XML data.

It works fine at first, opening, parsing and extracting correctly, but is then interrupted by some XML files with this error:

File "<stdin>", line 18, in <module>
  File "lxml.etree.pyx", line 2953, in lxml.etree.parse (src/lxml/lxml.etree.c:56204)
  File "parser.pxi", line 1555, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82511)
  File "parser.pxi", line 1585, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:82832)
  File "parser.pxi", line 1468, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:81688)
  File "parser.pxi", line 1024, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:78735)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74696)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

From my search, it might be because some XML files contain whitespace, but I'm not sure that is the problem. I can't tell which files give the error. Is there a way to get around this error?

Here is my script:

URLlist = ["http://www.uniprot.org/uniprot/"+x+".xml" for x in IDlist]
for id, item in zip(IDlist, URLlist):
    goterm_location = []
    goterm_function = []
    goterm_process = []
    location_list[id] = []
    function_list[id] = []
    biological_list[id] = []
    try:
        textfile = urllib2.urlopen(item);
    except urllib2.HTTPError:
        print("URL", item, "could not be read.")
        continue
    #Try to solve empty line error#
    tree = etree.parse(textfile);
    #root = tree.getroot()
    for node in tree.iter('{http://uniprot.org/uniprot}dbReference'):
        if node.attrib.get('type') == 'GO':
            for child in node:
                value = child.attrib.get('value');
                if value.startswith('C:'):
                    goterm_C = node.attrib.get('id')
                    if goterm_C:
                        location_list[id].append(goterm_C);
                if value.startswith('F:'):
                    goterm_F = node.attrib.get('id')
                    if goterm_F:
                        function_list[id].append(goterm_F);
                if value.startswith('P:'):
                    goterm_P = node.attrib.get('id')
                    if goterm_P:
                        biological_list[id].append(goterm_P);

I have tried:

tree = etree.iterparse(textfile, events = ("start","end"));
OR
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(textfile, parser)

Neither worked. Any help would be greatly appreciated.

Jérémz asked Oct 06 '15 01:10


2 Answers

I can't tell which files give the error

Debug by printing the name of the file/URL prior to parsing. Then you'll see which file(s) cause the error.

Also, read the error message:

lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

This suggests that the downloaded XML file is empty. Once you have determined which URL(s) cause the problem, download the file and check its contents. I suspect it is empty.
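
As a rough way to check the contents without opening each file by hand, a small helper can summarize what a URL returned. This is only a sketch: describe_payload is a hypothetical name, not something from lxml or urllib, and it works on bytes you have already read from the response.

```python
def describe_payload(url, data):
    """Return a one-line summary of the bytes downloaded from url."""
    stripped = data.strip()
    if not stripped:
        # Nothing but whitespace (or nothing at all) came back.
        return "{}: empty response ({} bytes)".format(url, len(data))
    # Show only the first 60 bytes so large documents stay readable.
    return "{}: {} bytes, starts with {!r}".format(url, len(data), stripped[:60])
```

Printing this summary for each URL before parsing should make the failing ones stand out.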

You can ignore problematic files (empty or otherwise syntactically invalid) by using a try/except block when parsing:

try:
    tree = etree.parse(textfile)
except etree.XMLSyntaxError:    # matches the etree name used in the question
    print('Skipping invalid XML from URL {}'.format(item))
    continue    # go on to the next URL

Or you could check only for empty files, either by looking at the 'Content-Length' header or by reading the resource returned by urlopen() and testing whether it is empty. I think the try/except above is better, though, because it also catches other parse errors.
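
A minimal sketch of the second option, reading the body yourself before parsing. It assumes Python 3 and a file-like response object; read_nonempty is a hypothetical helper, not part of urllib or lxml.

```python
from io import BytesIO


def read_nonempty(response):
    """Return the response body as bytes, or None if it is empty or whitespace-only."""
    data = response.read()
    if not data.strip():
        return None
    return data


# BytesIO stands in here for the object returned by urlopen().
body = read_nonempty(BytesIO(b"<entry/>"))
blank = read_nonempty(BytesIO(b"   \n"))
```

In the loop from the question you would skip the URL when the helper returns None, and otherwise wrap the bytes in BytesIO and pass that to etree.parse. Note that this only catches empty responses, not other malformed XML.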

mhawke answered Nov 02 '22 10:11


I got the same error message in Python 3.6:

lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

In my case the XML file was not empty; the problem was the encoding.

Initially I used utf-8:

from lxml import etree
etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='utf-8')

Changing the encoding to iso-8859-1 solved my issue:

etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='iso-8859-1')
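
If you are unsure which encoding applies, one rough check is to try decoding the raw bytes yourself before handing them to the parser. The sketch below is an illustration only: first_working_encoding is a hypothetical helper, and the candidate list is an assumption you should adapt to your data.

```python
def first_working_encoding(data, candidates=("utf-8", "iso-8859-1")):
    """Return the first candidate encoding that decodes data without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
        return name
    return None
```

Keep in mind that iso-8859-1 accepts any byte sequence, so getting it back only tells you that utf-8 failed, not that the file is really Latin-1.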

John Prawyn answered Nov 02 '22 11:11