How do I get the full XML or HTML content of an element using ElementTree?

Tags:

python

xml

api

elementtree

That is, all text and subtags, without the tag of an element itself?

Having

<p>blah <b>bleh</b> blih</p>

I want

blah <b>bleh</b> blih

element.text returns "blah " and etree.tostring(element) returns:

<p>blah <b>bleh</b> blih</p>


asked Dec 19 '08 10:12

pupeno

5 Answers

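ElementTree works perfectly; you just have to assemble the answer yourself. Something like this...

"".join([t.text or ""] + [xml.tostring(e, encoding="unicode") for e in t])

Iterating over t directly replaces getchildren(), which was removed in Python 3.9, and encoding="unicode" makes tostring() return a str rather than bytes.

Thanks to JV and PEZ for pointing out the errors.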

Edit.

>>> import xml.etree.ElementTree as xml
>>> s = '<p>blah <b>bleh</b> blih</p>\n'
>>> t = xml.fromstring(s)
>>> "".join([t.text or ""] + [xml.tostring(e, encoding="unicode") for e in t])
'blah <b>bleh</b> blih'
>>>

The children's tails need no separate handling: tostring() already writes each child's tail after its closing tag.
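To see why the tail needs no extra step, here is a small check. It assumes Python 3 and the standard-library xml.etree.ElementTree; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>blah <b>bleh</b> blih</p>")
b = p[0]

# The text after </b> is stored on the child element as its tail...
tail = b.tail
# ...and tostring() writes that tail right after the child's closing tag.
serialized = ET.tostring(b, encoding="unicode")

print(repr(tail))
print(serialized)
```

So joining the parent's text with the serialized children already covers every piece of text inside the parent.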


answered Oct 20 '22 02:10

S.Lott

This is the solution I ended up using:

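import xml.etree.ElementTree as etree  # assumed; the original does not show its import

def element_to_string(element):
    # Text before the first child, if any
    s = element.text or ""
    for sub_element in element:
        # encoding="unicode" returns str; each child's tail is included
        s += etree.tostring(sub_element, encoding="unicode")
    # element.tail is deliberately not appended: it sits after the element's
    # own closing tag, and it may be None
    return s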

answered Oct 20 '22 02:10

pupeno

These are good answers, which answer the OP's question, particularly if the question is confined to HTML. But documents are inherently messy, and the depth of element nesting is usually impossible to predict.

To simulate DOM's getTextContent() you would have to use a (very) simple recursive mechanism.

To get just the bare text:

def get_deep_text( element ):
    text = element.text or ''
    for subelement in element:
        text += get_deep_text( subelement )
    text += element.tail or ''
    return text
print( get_deep_text( element_of_interest ))
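For the bare-text case, the standard library may already do the job: Element.itertext() yields the text of an element and all of its descendants in document order. A minimal sketch, assuming xml.etree.ElementTree and a made-up sample document:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>blah <b>bleh <i>deep</i></b> blih</p>")

# itertext() walks the element and its descendants in document order,
# yielding each text and each inner tail; the root's own tail is not included.
deep_text = "".join(root.itertext())
print(deep_text)
```

Unlike the recursive function above, this does not append the tail of the element you start from.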

To get all the details about the boundaries between raw text:

class holder: pass # this is just a way of creating a holder object
holder.element_count = 0
def get_deep_text_w_boundaries(element, depth = 0):
    holder.element_count += 1
    element_no = holder.element_count 
    indent = depth * '  '
    text1 = f'{indent}(el {element_no} tag {element.tag}: text |{element.text or ""}| - attribs: {element.attrib})' 
    print(text1)
    for subelement in element:
        get_deep_text_w_boundaries(subelement, depth + 1)
    text2 = f'{indent}(el {element_no} tag {element.tag} - tail: |{element.tail or ""}|)' 
    print(text2)
get_deep_text_w_boundaries(etree_element)

Example output:

(el 1 tag source: text |DEVANT LE | - attribs: {})
  (el 2 tag g: text |TRIBUNAL JUDICIAIRE| - attribs: {'style_no': '3'})
  (el 2 tag g - tail: ||)
(el 1 tag source - tail: | DE VERSAILLES|)

answered Oct 20 '22 01:10

mike rodent

I doubt ElementTree is the right tool for this. But assuming you have strong reasons for using it, you could try stripping the root tag from the serialized fragment:

 re.sub(r'(^<%s\b.*?>|</%s\b.*?>$)' % (element.tag, element.tag), '', ElementTree.tostring(element, encoding='unicode'))

On Python 3, encoding='unicode' is needed so that tostring() returns a str the pattern can match.
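The same idea can be sketched a little more defensively. The helper name strip_root_tag is hypothetical, and the pattern is a tightened variant rather than the one from the answer: it hides the element's tail while serializing (tostring() would otherwise append it after the closing tag) and anchors on the exact end tag, so nested elements with the same name are left alone.

```python
import re
import xml.etree.ElementTree as ET

def strip_root_tag(element):
    """Serialize element and strip its own start and end tags (hypothetical helper)."""
    # tostring() would append the element's tail, so hide it while serializing.
    saved_tail, element.tail = element.tail, None
    try:
        markup = ET.tostring(element, encoding="unicode")
    finally:
        element.tail = saved_tail
    tag = re.escape(element.tag)
    # Remove the start tag (or a self-closing tag) at the very beginning
    # and the end tag at the very end; inner tags with the same name stay.
    return re.sub(r"\A<%s\b[^>]*>|</%s>\Z" % (tag, tag), "", markup)

root = ET.fromstring("<div><p>blah <b>bleh</b> blih</p> after</div>")
print(strip_root_tag(root[0]))
```

This sketch does not handle namespaced elements, where element.tag has the form {uri}name and no longer matches the serialized prefix.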

answered Oct 20 '22 00:10

PEZ

Most of the answers here are based on the XML parser ElementTree; even PEZ's regex-based answer still partially relies on it.

All of those are good and suitable for most use cases. For the sake of completeness, though, it is worth noting that ElementTree.tostring(...) gives you a snippet that is equivalent, but not always identical, to the original payload. If, for some rare reason, you need to extract the content as-is, you have to use a pure regex-based solution. This example is how I use a regex-based solution.
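As a rough illustration of the difference (not the author's own example), the sketch below compares a parser round trip with a regex over the raw string. The sample payload and variable names are made up, and the regex assumes a single, non-nested p element in the input.

```python
import re
import xml.etree.ElementTree as ET

raw = "<p>blah <b class='x'>bleh</b> blih</p>"

# Round-tripping through the parser is equivalent but not byte-identical:
# ElementTree writes attribute values with double quotes.
root = ET.fromstring(raw)
round_tripped = (root.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in root
)

# A regex over the raw string keeps the payload exactly as written.
# It assumes a single, non-nested <p> element in the input.
match = re.search(r"<p\b[^>]*>(.*)</p>", raw, re.DOTALL)
as_is = match.group(1) if match else None

print(round_tripped)
print(as_is)
```

The regex keeps the original single quotes, while the round trip rewrites them; the price is that a regex knows nothing about nesting, comments, or CDATA sections.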

answered Oct 20 '22 00:10

RayLuo
