Python lxml error "namespace not defined."

Question

I am being driven crazy by some oddly formed xml and would be grateful for some pointers:

The documents are defined like this:

<sphinx:document id="18059090929806848187">
  <url>http://www.some-website.com</url>
  <page_number>104</page_number>
  <size>7865</size>
</sphinx:document>

Now, I need to read lots of these files (500m+), which are all gzip-compressed, and grab the text values from a few of the contained tags.

sample code:

from lxml import objectify, etree
import gzip

with open ('file_list','rb') as file_list:
 for file in file_list:
  in_xml = gzip.open(file.strip('\n'))
  xml2 = etree.iterparse(in_xml)
  for action, elem in xml2:
   if elem.tag == "page_number":
    print elem.text + str(file)

The first value of elem.text is returned, but only for the first file in the list, and it is quickly followed by the error:

lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20

Please excuse my ignorance but xml really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other more intelligent manner?

Thanks

Robᵩ · Accepted Answer

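Your input file is not well-formed XML: it uses the sphinx namespace prefix without ever declaring it (there is no xmlns:sphinx attribute on the element or on any ancestor). I assume that it is a snippet from a larger XML document.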
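To illustrate the diagnosis, here is a minimal sketch (written for Python 3, unlike the question's Python 2 code) that parses a shortened copy of the fragment twice: once as-is, and once wrapped in a root element that declares the prefix. The wrapper element name `root` and the namespace URI are made up for the example; the real URI would have to come from whoever produced the files.

```python
from lxml import etree

# Shortened copy of the fragment from the question: the "sphinx" prefix
# is used but never declared.
FRAGMENT = (
    b'<sphinx:document id="18059090929806848187">'
    b'<page_number>104</page_number>'
    b'</sphinx:document>'
)

# Hypothetical namespace URI, chosen only for illustration.
SPHINX_NS = b"http://example.com/sphinx"


def parses_cleanly(data):
    """Return True if lxml's default (strict) parser accepts the bytes."""
    try:
        etree.fromstring(data)
        return True
    except etree.XMLSyntaxError:
        return False


# Strict parsing of the bare fragment fails on the undeclared prefix.
undeclared_ok = parses_cleanly(FRAGMENT)

# Wrapping the fragment in an element that declares the prefix makes it
# well formed ("root" is an invented wrapper, not part of the original data).
wrapped = b'<root xmlns:sphinx="' + SPHINX_NS + b'">' + FRAGMENT + b'</root>'
declared_ok = parses_cleanly(wrapped)

print(undeclared_ok, declared_ok)
```

This is also roughly what "reconstructing the larger document" amounts to: supplying the missing declaration from the enclosing document.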

Your choices are:

1. Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.
2. Parse the file in spite of its errors. To do that, pass the recover keyword to lxml.etree.iterparse:
```
xml2 = etree.iterparse(in_xml, recover=True)
```
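For context, the second option might look like this in a complete loop. This is only a sketch, written for Python 3 rather than the question's Python 2, and it uses an in-memory gzip stream in place of the real files so that it runs on its own; the `page_numbers` helper and the `SAMPLE` data are invented for the example.

```python
import gzip
import io

from lxml import etree

# Stand-in for one of the gzipped files, using the fragment from the question.
SAMPLE = (
    b'<sphinx:document id="18059090929806848187">\n'
    b'  <url>http://www.some-website.com</url>\n'
    b'  <page_number>104</page_number>\n'
    b'  <size>7865</size>\n'
    b'</sphinx:document>\n'
)


def page_numbers(stream):
    """Collect the text of every <page_number> element in a file-like object."""
    found = []
    # recover=True asks the parser to carry on past the undeclared prefix
    # instead of raising XMLSyntaxError.
    for _event, elem in etree.iterparse(stream, recover=True):
        if elem.tag == "page_number":
            found.append(elem.text)
    return found


# gzip.open accepts an existing file object as well as a filename.
with gzip.open(io.BytesIO(gzip.compress(SAMPLE))) as in_xml:
    numbers = page_numbers(in_xml)

print(numbers)
```

With real files, `gzip.open(path)` would replace the in-memory stream. Note that recover=True also hides other syntax problems, so it is worth spot-checking the extracted values.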

Tags: python, xml, lxml, elementtree

RJJ
