
Extracting text from XML node with minidom

I've looked through several posts but I haven't quite found any answers that have solved my problem.

Sample XML:

<TextWithNodes>
<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/>
</TextWithNodes>

So I understand that usually if I had extracted TextWithNodes as a NodeList I would do something like

nodeList = TextWithNodes[0].getElementsByTagName('Node')
for a in nodeList:
    node = a.nodeValue
    print(node)

All I get is None. I've read that you should go through a.childNodes and read nodeValue from there, but the Node elements have no children, since they are all self-closing tags. If I use a.childNodes I get [].

When I check a.nodeType it is 1 (ELEMENT_NODE), whereas TEXT_NODE is 3. I'm not sure if that is helpful.

I would like to extract TEXT1, TEXT2, etc.

Jasmine asked Dec 27 '25 21:12


2 Answers

A solution with lxml right from the docs:

from lxml import etree
from io import StringIO  # on Python 2: from StringIO import StringIO

xml = etree.parse(StringIO('''<TextWithNodes>
<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/></TextWithNodes>'''))

xml.xpath("//text()")
Out[43]: ['\n', 'TEXT1', 'TEXT2 ', 'TEXT3']

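You can also extract the text that follows a specific node. In lxml that text is stored in the element's tail attribute; its text attribute is None because the element is empty:

xml.find(".//Node[@id='19']").tail

The issue here is that the text in the XML is not inside any Node element. It sits between them.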
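To illustrate the text/tail distinction, here is a minimal sketch that maps each Node id to the text following it. It assumes the standard library's xml.etree.ElementTree in place of lxml (both expose the same tail attribute), and the names SAMPLE and tails are only illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = '''<TextWithNodes>
<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/></TextWithNodes>'''

root = ET.fromstring(SAMPLE)

# Text that follows an element's end tag is stored on that element as .tail.
# Nodes with nothing after them (here the last one) have tail None and are skipped.
tails = {node.get("id"): node.tail for node in root.iter("Node") if node.tail}

# All text under the parent, in document order, joined into one string.
all_text = "".join(root.itertext())

print(tails)
print(repr(all_text))
```

Stripping whitespace from each value (for example with str.strip) is left out on purpose so the raw content stays visible.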

Diego Navarro answered Dec 31 '25 00:12


You should use the ElementTree API instead of minidom for this task (as explained in the other answers here), but if you need to use minidom, here is a solution.

What you are looking for was added in DOM Level 3 as the textContent attribute. minidom essentially implements DOM Level 1 (plus some Level 2 namespace features), so it does not provide it.

However you can emulate textContent pretty closely with this function:

def textContent(node):
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.nodeValue
    else:
        return ''.join(textContent(n) for n in node.childNodes)

Which you can then use like so:

from xml.dom import minidom

x = minidom.parseString("""<TextWithNodes>
<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/></TextWithNodes>""")

twn = x.getElementsByTagName('TextWithNodes')[0]

assert textContent(twn) == u'\nTEXT1TEXT2 TEXT3'

Notice how I got the text content of the parent node TextWithNodes. This is because your Node elements are siblings of those text nodes, not parents of them.
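If you want the pieces separately rather than one joined string, the sibling relationship can be used directly. The following is a rough sketch, not part of the answer above: the helper name text_after_nodes is hypothetical, and it assumes each piece of text is a single text node immediately following its Node element.

```python
from xml.dom import minidom

SAMPLE = """<TextWithNodes>
<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/></TextWithNodes>"""


def text_after_nodes(parent):
    """Return (id, text) pairs for each Node element directly followed by a text node."""
    pairs = []
    for child in parent.childNodes:
        sibling = child.nextSibling
        if (child.nodeType == child.ELEMENT_NODE
                and child.tagName == "Node"
                and sibling is not None
                and sibling.nodeType == sibling.TEXT_NODE):
            # The text lives in the sibling text node, not inside the Node element.
            pairs.append((child.getAttribute("id"), sibling.nodeValue))
    return pairs


doc = minidom.parseString(SAMPLE)
twn = doc.getElementsByTagName("TextWithNodes")[0]
pairs = text_after_nodes(twn)
print(pairs)
```

The last Node (id 212) has no following sibling, so it produces no pair.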

Francis Avila answered Dec 30 '25 22:12


