I have an XML file of the form:
<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>
I need to process it so that, for instance, when the user inputs "nd", the program matches it against the <Phonetic> tag and returns "and" from the <Phonemic> tag. I thought that if I converted the XML file to a dictionary, I would be able to iterate over the data and find information when needed.
I searched and found xmltodict which is used for the same purpose:
import xmltodict
with open(r'path\to\1.xml', encoding='utf-8', errors='ignore') as fd:
    obj = xmltodict.parse(fd.read())
Running this gives me an OrderedDict:
>>> obj
OrderedDict([('NewDataSet', OrderedDict([('Root', [OrderedDict([('Phonemic', 'and'), ('Phonetic', 'nd'), ('Description', None), ('Start', '0'), ('End', '8262')]), OrderedDict([('Phonemic', 'comfortable'), ('Phonetic', 'comfetebl'), ('Description', 'adj'), ('Start', '61404'), ('End', '72624')])])]))])
Unfortunately, this hasn't made things simpler, and I am not sure how to implement the program with the new data structure. For example, to access "nd" I'd have to write:
obj['NewDataSet']['Root'][0]['Phonetic']
which is ridiculously complicated. I tried to turn it into a regular dictionary with dict(), but because the structure is nested, the inner layers remain OrderedDicts, and my data is very large.
You can actually avoid the conversion to OrderedDict by passing an additional keyword parameter:
obj = xmltodict.parse(xmldata, dict_constructor=dict)
parse forwards its keyword arguments to _DictSAXHandler, and dict_constructor is set to OrderedDict by default.
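As a quick check of the above, here is a minimal sketch that parses a shortened copy of the sample XML with dict_constructor=dict and looks at the types that come back. The variable names (xmldata, first) are only for illustration, and the Start/End tags are left out to keep it short.

```python
import xmltodict

xmldata = """<NewDataSet>
  <Root>
    <Phonemic>and</Phonemic>
    <Phonetic>nd</Phonetic>
    <Description/>
  </Root>
  <Root>
    <Phonemic>comfortable</Phonemic>
    <Phonetic>comfetebl</Phonetic>
    <Description>adj</Description>
  </Root>
</NewDataSet>"""

# dict_constructor=dict asks xmltodict to build plain dicts at every nesting level
obj = xmltodict.parse(xmldata, dict_constructor=dict)

# Repeated <Root> tags are collected into a list of dicts
first = obj["NewDataSet"]["Root"][0]

print(type(obj).__name__, type(first).__name__)
print(first["Phonetic"])
```

Note that this only changes the container type. The nesting (NewDataSet, then Root, then the list index) stays exactly the same.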
If you are accessing this as obj['NewDataSet']['Root'][0]['Phonetic'], IMO, you are not doing it right. Instead, you can do the following:
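obj = obj["NewDataSet"]
# xmltodict gives a list for repeated <Root> tags but a single dict when
# there is only one, so normalize the value to a list
roots = obj["Root"]
root_elements = roots if isinstance(roots, list) else [roots]
# The step above ensures that root_elements is always a list
for element in root_elements:
    print(element["Phonetic"])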
Even though this code looks longer, the advantage is that it stays compact and modular once you start dealing with sufficiently large XML.
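Since the goal in the question is to go from a Phonetic value to its Phonemic value, one option is to walk the entries once and build a plain lookup dict. The sketch below assumes xmltodict's force_list option to keep Root a list even when there is only one entry. build_phonetic_index is a made-up helper name, not part of xmltodict, and if two entries share the same Phonetic value the later one wins.

```python
import xmltodict

xmldata = """<NewDataSet>
  <Root>
    <Phonemic>and</Phonemic>
    <Phonetic>nd</Phonetic>
    <Description/>
    <Start>0</Start>
    <End>8262</End>
  </Root>
  <Root>
    <Phonemic>comfortable</Phonemic>
    <Phonetic>comfetebl</Phonetic>
    <Description>adj</Description>
    <Start>61404</Start>
    <End>72624</End>
  </Root>
</NewDataSet>"""


def build_phonetic_index(xml_text):
    """Map each Phonetic string to its Phonemic string (hypothetical helper)."""
    # force_list=("Root",) makes "Root" a list even for a single <Root> element
    doc = xmltodict.parse(xml_text, force_list=("Root",))
    # An empty <NewDataSet/> parses to None, so fall back to an empty dict
    dataset = doc["NewDataSet"] or {}
    return {root["Phonetic"]: root["Phonemic"] for root in dataset.get("Root", [])}


index = build_phonetic_index(xmldata)
print(index.get("nd"))
```

After the index is built, each user query is a single dict lookup instead of a scan over all Root entries, which matters when the file is large.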
PS: I had the same issues with xmltodict. But compared with using xml.etree.ElementTree to parse the XML files, xmltodict was much easier to work with, as the code base was smaller and I didn't have to deal with the other inanities of the xml module.
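For comparison only, here is a rough sketch of the same Phonetic-to-Phonemic lookup done with the standard library's xml.etree.ElementTree mentioned above. It is not a recommendation over xmltodict, just a way to see what the alternative looks like on a shortened copy of the sample data.

```python
import xml.etree.ElementTree as ET

xmldata = """<NewDataSet>
  <Root>
    <Phonemic>and</Phonemic>
    <Phonetic>nd</Phonetic>
    <Description/>
  </Root>
  <Root>
    <Phonemic>comfortable</Phonemic>
    <Phonetic>comfetebl</Phonetic>
    <Description>adj</Description>
  </Root>
</NewDataSet>"""

# fromstring returns the top-level <NewDataSet> element
dataset = ET.fromstring(xmldata)

# findall("Root") yields the direct <Root> children, whether there is one or many
index = {
    entry.findtext("Phonetic"): entry.findtext("Phonemic")
    for entry in dataset.findall("Root")
}

print(index["nd"])
```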
EDIT
The following code works for me:
import xmltodict
from collections import OrderedDict
xmldata = """<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>"""
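obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
# xmltodict gives a list for repeated <Root> tags but a single dict when
# there is only one, so normalize the value to a list
roots = obj["Root"]
root_elements = roots if isinstance(roots, list) else [roots]
# The step above ensures that root_elements is always a list
for element in root_elements:
    print(element["Phonetic"])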