I am trying to open an xml file, and get values from certain tags. I have done this a lot but this particular xml is giving me some issues. Here is a section of the xml file:
<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
<provider>filmgroup</provider>
<language>en-GB</language>
<actor name="John Smith" display="Doe John"</actor>
</package>
And here is a sample of my python code:
metadata = '/Users/mylaptop/Desktop/Python/metadata.xml'
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
open(metadata)
tree = etree.parse(metadata, parser)
root = tree.getroot()
for element in root.iter(tag='provider'):
providerValue = tree.find('//provider')
providerValue = providerValue.text
print providerValue
tree.write('/Users/mylaptop/Desktop/Python/metadataDone.xml', pretty_print = True, xml_declaration = True, encoding = 'UTF-8')
When I run this it can't find the provider tag or its value. If I remove xmlns="http://apple.com/itunes/importer"
then all work as expected.
My question is how can I remove this namespace, as i'm not at all interested in this, so I can get the tag values I need using lxml?
The provider
tag is in the http://apple.com/itunes/importer
namespace, so you either need to use the fully qualified name
{http://apple.com/itunes/importer}provider
or use one of the lxml methods that has the namespaces
parameter, such as root.xpath
. Then you can specify it with a namespace prefix (e.g. ns:provider
):
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()
namespaces = {'ns':'http://apple.com/itunes/importer'}
items = iter(root.xpath('//ns:provider/text()|//ns:actor/@name',
namespaces=namespaces))
for provider, actor in zip(*[items]*2):
print(provider, actor)
yields
('filmgroup', 'John Smith')
Note that the XPath used above assumes that <provider>
and <actor>
elements always appear in alternation. If that is not true, then there are of course ways to handle it, but the code becomes a bit more verbose:
for package in root.xpath('//ns:package', namespaces=namespaces):
for provider in package.xpath('ns:provider', namespaces=namespaces):
providerValue = provider.text
print providerValue
for actor in package.xpath('ns:actor', namespaces=namespaces):
print actor.attrib['name']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With