Getting lxml tag attributes with namespaces

My XML looks like:

...
<termEntry id="c1">
    <langSet xml:lang="de">
    ...

And I have this code:

from lxml import etree
...

for term_entry in root.iterfind('.//termEntry'):
    print(term_entry.attrib['id'])
    print(term_entry.nsmap)

    for lang_set in term_entry.iterfind('langSet'):
        print(lang_set.nsmap)
        print(lang_set.attrib)

        for some_stuff in lang_set.iterfind('some_stuff'):
            ...

I get an empty nsmap dict, and the attrib dict looks like {'{http://www.w3.org/XML/1998/namespace}lang': 'en'}

The file may not use the xml: prefix, or it may use a different namespace. How can I find out which namespace was used in the tag declaration? In fact, I just need to get the lang attribute; I don't care which namespace was used. I don't want to resort to ugly hacks like lang_set.attrib.values()[0] or other lookups that rely on the field's position rather than its name.

night-crawler asked Dec 14 '12 02:12


2 Answers

i just need to get a lang attribute, i don't care what namespace was used

Your question is not very clear and you haven't provided any complete runnable code example. But doing some string manipulation as suggested by @mmgp in a comment may be enough.

However, xml:lang is not the same as random_prefix:lang (or just lang). I think you should care about the namespace. If the objective is to identify the natural language that applies to an element's content, then you should be using xml:lang (because that is the explicit purpose of this attribute; see http://www.w3.org/TR/REC-xml/#sec-lang-tag).


I just want to know where is stored the {http://www.w3.org/XML/1998/namespace} string for attributes.

It is important to know that the xml prefix is special. It is reserved (as opposed to almost all other namespace prefixes which are supposed to be arbitrary) and defined to be bound to http://www.w3.org/XML/1998/namespace.

From the Namespaces in XML 1.0 W3C recommendation:

The prefix xml is by definition bound to the namespace name http://www.w3.org/XML/1998/namespace. It MAY, but need not, be declared, and MUST NOT be bound to any other namespace name. Other prefixes MUST NOT be bound to this namespace name, and it MUST NOT be declared as the default namespace.

Other uses of the xml prefix are the xml:space and xml:base attributes.
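This also explains the empty nsmap in the question: because the xml prefix is predefined, a document can use xml:lang without any xmlns:xml declaration, so there is no declaration for lxml to report. A minimal sketch, assuming a made-up two-element document modelled on the snippet in the question:

```python
from lxml import etree

# Hypothetical sample modelled on the question's XML; note that there is
# no xmlns:xml declaration anywhere in it.
doc = b'<termEntry id="c1"><langSet xml:lang="de"/></termEntry>'
root = etree.fromstring(doc)
lang_set = root.find('langSet')

# No namespace declarations are in scope, so nsmap is empty ...
print(lang_set.nsmap)

# ... yet the attribute key still carries the reserved namespace name.
print(dict(lang_set.attrib))
```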


It is really strange, if lxml does not provide any method for namespace processing

lxml processes namespaces just fine, but prefixes are avoided as much as possible. You will need to use the http://www.w3.org/XML/1998/namespace namespace name when doing lookups that involve the xml prefix.
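In practice that means writing the attribute name in the {namespace}localname form that already shows up in the attrib dict. A minimal sketch, assuming a made-up document modelled on the question (the martif root element and the constant names XML_NS and XML_LANG are only illustrative):

```python
from lxml import etree

# The namespace name that the xml prefix is always bound to.
XML_NS = 'http://www.w3.org/XML/1998/namespace'
XML_LANG = '{%s}lang' % XML_NS  # {namespace}localname form

# Hypothetical sample: the second langSet has no xml:lang attribute.
root = etree.fromstring(
    b'<martif><termEntry id="c1">'
    b'<langSet xml:lang="de"/><langSet/>'
    b'</termEntry></martif>'
)

for lang_set in root.iterfind('.//langSet'):
    # .get() returns None instead of raising when xml:lang is absent.
    print(lang_set.get(XML_LANG))
```

Since the namespace name is fixed by the specification, hard-coding it is not fragile in the way that relying on a prefix or on the position of the attribute would be.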

mzjn answered Sep 23 '22 11:09


You could simply use XPath:

lang_set.xpath('./@xml:lang')[0]
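Note that xpath() returns a list, so indexing with [0] raises IndexError for an element that has no xml:lang. A minimal sketch of a more defensive variant, assuming a made-up sample document; as far as I know the xml prefix is resolved by the XPath engine itself, so no namespaces= argument is passed here:

```python
from lxml import etree

# Hypothetical sample: the second langSet has no xml:lang attribute.
root = etree.fromstring(
    b'<termEntry id="c1"><langSet xml:lang="de"/><langSet/></termEntry>'
)

for lang_set in root.iterfind('langSet'):
    # The result list is empty when the attribute is missing, so check it
    # before taking the first item.
    matches = lang_set.xpath('@xml:lang')
    lang = matches[0] if matches else None
    print(lang)
```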

By the way, are you working with TBX files?

altipard answered Sep 24 '22 11:09