That is, all text and subtags, without the tag of an element itself?
Having
<p>blah <b>bleh</b> blih</p>
I want
blah <b>bleh</b> blih
element.text returns "blah " and etree.tostring(element) returns:
<p>blah <b>bleh</b> blih</p>
Parsing a file object or a filename with parse() returns an instance of the ET.ElementTree class, which represents the whole element hierarchy. On the other hand, parsing a string with fromstring() will return the specific root ET.
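A minimal sketch of that difference, assuming the usual ET alias for xml.etree.ElementTree and using an in-memory io.StringIO as a stand-in for a real file: parse() hands back the tree wrapper, while fromstring() hands back the root element directly.

```python
import io
import xml.etree.ElementTree as ET

source = "<p>blah <b>bleh</b> blih</p>"

# parse() accepts a filename or file object and returns an ElementTree.
tree = ET.parse(io.StringIO(source))
root_from_tree = tree.getroot()

# fromstring() accepts a string and returns the root Element itself.
root = ET.fromstring(source)

print(type(tree).__name__)           # ElementTree
print(type(root).__name__)           # Element
print(root_from_tree.tag, root.tag)  # p p
```

Either way you end up with the same kind of root element; only parse() needs the extra getroot() call.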
ElementTree works perfectly well; you just have to assemble the answer yourself. Something like this:
"".join( [ "" if t.text is None else t.text ] + [ xml.tostring(e, encoding="unicode") for e in t ] )
Thanks to JV and PEZ for pointing out the errors.
Edit.
>>> import xml.etree.ElementTree as xml
>>> s= '<p>blah <b>bleh</b> blih</p>\n'
>>> t=xml.fromstring(s)
>>> "".join( [ t.text ] + [ xml.tostring(e) for e in t.getchildren() ] )
'blah <b>bleh</b> blih'
>>>
The tail does not need separate handling: tostring() already includes each child's tail text.
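A small illustrative sketch of why the tail takes care of itself, assuming Python 3 and the ET alias (not the xml alias used in the session above). It also shows that tostring() returns bytes unless encoding="unicode" is passed.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>blah <b>bleh</b> blih</p>")
child = root[0]  # the <b> element

# The text after </b> is stored as the child's tail...
print(repr(child.tail))                        # ' blih'

# ...and tostring() serializes the tail along with the element.
print(ET.tostring(child))                      # b'<b>bleh</b> blih'
print(ET.tostring(child, encoding="unicode"))  # <b>bleh</b> blih
```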
This is the solution I ended up using:
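def element_to_string(element):
    # etree is assumed to be xml.etree.ElementTree (or a compatible module)
    s = element.text or ""
    for sub_element in element:
        # encoding="unicode" returns str; each child's tail is included
        s += etree.tostring(sub_element, encoding="unicode")
    # element.tail is text after the closing tag, so it is not appended
    return s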
These are good answers, which answer the OP's question, particularly if the question is confined to HTML. But documents are inherently messy, and the depth of element nesting is usually impossible to predict.
To simulate DOM's getTextContent() you would have to use a (very) simple recursive mechanism.
To get just the bare text:
def get_deep_text( element ):
    text = element.text or ''
    for subelement in element:
        text += get_deep_text( subelement )
    text += element.tail or ''
    return text
print( get_deep_text( element_of_interest ))
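For comparison, the standard library's Element.itertext() walks the same text in document order. One difference worth knowing: it does not include the tail of the element you start from. A short sketch, with a made-up nested sample string:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two levels of nesting.
root = ET.fromstring("<p>blah <b>bleh <i>deep</i></b> blih</p>")

# itertext() yields every text and tail inside the element, depth first.
flat_text = "".join(root.itertext())
print(flat_text)  # blah bleh deep blih

# Starting from <b>, its own tail (' blih') is not part of the result.
inner_text = "".join(root[0].itertext())
print(inner_text)  # bleh deep
```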
To get all the details about the boundaries between raw text:
class holder: pass # this is just a way of creating a holder object
holder.element_count = 0
def get_deep_text_w_boundaries(element, depth = 0):
    holder.element_count += 1
    element_no = holder.element_count
    indent = depth * ' '
    text1 = f'{indent}(el {element_no} tag {element.tag}: text |{element.text or ""}| - attribs: {element.attrib})'
    print(text1)
    for subelement in element:
        get_deep_text_w_boundaries(subelement, depth + 1)
    text2 = f'{indent}(el {element_no} tag {element.tag} - tail: |{element.tail or ""}|)'
    print(text2)
get_deep_text_w_boundaries(etree_element)
Example output:
(el 1 tag source: text |DEVANT LE | - attribs: {})
 (el 2 tag g: text |TRIBUNAL JUDICIAIRE| - attribs: {'style_no': '3'})
 (el 2 tag g - tail: ||)
(el 1 tag source - tail: | DE VERSAILLES|)
I doubt ElementTree is the thing to use for this. But assuming you have strong reasons for using it, maybe you could try stripping the root tag from the fragment:
re.sub(r'(^<%s\b.*?>|</%s\b.*?>$)' % (element.tag, element.tag), '', ElementTree.tostring(element, encoding='unicode'))
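Here is a slightly more defensive sketch of the same idea, assuming Python 3. The helper name strip_root_tag is made up for illustration. It escapes the tag name for the regex and trims the element's tail first, since tostring() appends the tail after the closing tag and the `$` anchor would otherwise miss. It does not handle namespaced tags, which ElementTree serializes with a prefix.

```python
import re
import xml.etree.ElementTree as ET

def strip_root_tag(element):
    """Serialize element and remove its own opening and closing tags."""
    xml_text = ET.tostring(element, encoding="unicode")
    # tostring() appends element.tail after the closing tag; cut it off.
    if element.tail:
        xml_text = xml_text[: -len(element.tail)]
    tag = re.escape(element.tag)
    # Opening tag anchored at the start, closing tag anchored at the end.
    pattern = r"^<%s\b.*?>|</%s\s*>$" % (tag, tag)
    return re.sub(pattern, "", xml_text, flags=re.DOTALL)

root = ET.fromstring('<p class="x">blah <b>bleh</b> blih</p>')
print(strip_root_tag(root))  # blah <b>bleh</b> blih
```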
Most of the answers here are based on the XML parser ElementTree; even PEZ's regex-based answer still partially relies on ElementTree.
All of those are good and suitable for most use cases. For the sake of completeness, though, it is worth noting that ElementTree.tostring(...) gives you an equivalent snippet that is not always identical to the original payload. If, for some rare reason, you want to extract the content as-is, you have to use a pure regex-based solution. This example shows how I use a regex-based solution.
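To make the "equivalent but not identical" point concrete, here is a rough sketch (not the linked example, which is not reproduced here). The sample string and the simple pattern are assumptions for illustration; the regex only copes with a single, non-nested <p> element and is not a general XML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical payload with single-quoted attribute and a compact <br/>.
raw = "<p>blah <b class='x'>bleh</b><br/> blih</p>"

# Round trip through ElementTree: equivalent XML, but re-serialized.
root = ET.fromstring(raw)
round_trip = (root.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in root
)

# Pure regex on the raw string: the inner payload comes back untouched.
match = re.search(r"<p\b[^>]*>(.*)</p\s*>", raw, flags=re.DOTALL)
as_is = match.group(1) if match else None

print(round_trip)  # blah <b class="x">bleh</b><br /> blih
print(as_is)       # blah <b class='x'>bleh</b><br/> blih
```

The quote style and the spacing in the empty tag change in the round trip, while the regex result keeps the original bytes.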