
Create raw text from XML tags

I have some XML that runs through an NLP processor. I have to modify the output in a Python script, so XSLT is not an option. I'm trying to extract all of the raw text between <TXT> and </TXT> as a single string, but I'm stuck on how to pull it out with ElementTree.

My code up to this point is

import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""

tree = ET.parse(xml_doc)  # xml_doc is actually a file, but for reproducibility it's shown as the XML above

From there I want to extract everything within TXT as a string stripped of tags. It must be a string because other processes further down the line depend on it. I'd like the result to look like output_txt below.

output_txt = "George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222."

I imagine this should be fairly straightforward, but I just can't figure it out. I tried using this solution, but I got AttributeError: 'ElementTree' object has no attribute 'itertext', and it would strip all tags in the XML rather than just those between <TXT> and </TXT>.

carousallie asked May 15 '26 05:05


1 Answer

Normally I'd use plain XPath to do this:

normalize-space(//TXT)

However, the XPath support in ElementTree is limited and does not cover XPath functions such as normalize-space(), so you'd only be able to do this in lxml.

To do it in ElementTree, I'd take an approach similar to the answer you linked to in your question: force the element to plain text with tostring using method="text". You'd also want to normalize the whitespace.

Example...

import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""

tree = ET.fromstring(xml_doc)

txt = tree.find(".//TXT")
raw_text = ET.tostring(txt, encoding='utf8', method='text').decode()
normalized_text = " ".join(raw_text.split())
print(normalized_text)

Printed output...

George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222.
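
As a side note on the AttributeError from the question: ET.parse returns an ElementTree, which has no itertext(), but the Element returned by find() does. The following is a minimal sketch of that alternative, not the method used above. The helper name extract_txt is hypothetical, the sample XML is trimmed to shorter attributes, and io.BytesIO stands in for the real file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (attributes shortened for brevity).
xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S sid="s1"><ENAMEX type="PERSON">George Washington</ENAMEX> and <ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S sid="s2"><ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""


def extract_txt(source):
    """Hypothetical helper: whitespace-normalized text of the first <TXT>.

    `source` is a filename or a binary file object, as accepted by ET.parse.
    """
    tree = ET.parse(source)      # ElementTree: has find(), but no itertext()
    txt = tree.find(".//TXT")    # Element: itertext() is available here
    if txt is None:
        raise ValueError("no <TXT> element found")
    # Join every text fragment under <TXT>, then collapse runs of whitespace.
    return " ".join("".join(txt.itertext()).split())


# BytesIO is an assumption standing in for the actual file.
output_txt = extract_txt(io.BytesIO(xml_doc.encode("utf-8")))
print(output_txt)
```

Because the search starts at TXT, text from sibling elements such as DOCID is left out of the result.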

Daniel Haley answered May 16 '26 17:05


