How can I access namespaced XML elements using BeautifulSoup?

I have an XML document which reads like this:

<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>

My question is: how do I access these elements using a library like BeautifulSoup in Python?

Something like xmlDom.web["Web"].Total does not work.


asked Jun 17 '10 04:06

demos

1 Answer

Environment

import bs4
bs4.__version__
---
'4.10.0'

import sys
print(sys.version)
---
3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0]

BS4/XML Parser on XML with namespace definition

from bs4 import BeautifulSoup

xbrl_with_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl
    xmlns:dei="http://xbrl.sec.gov/dei/2020-01-31"
>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""

soup = BeautifulSoup(xbrl_with_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant.prettify())
---
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
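
Applied to the XML from the question, the same approach might look like the sketch below. It assumes a namespace declaration can be added to the root element; the URI http://example.com/web is a placeholder invented for this example and does not come from the original document. The 'xml' parser also needs lxml to be installed.

```python
from bs4 import BeautifulSoup

# The question's XML with a namespace declaration added to the root element.
# The URI is a placeholder chosen for this example only.
question_xml = """<xml xmlns:web="http://example.com/web">
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>"""

soup = BeautifulSoup(question_xml, "xml")  # requires lxml

# With the prefix declared, the prefixed tag name can be searched directly.
web = soup.find("web:Web")
total = int(web.find("web:Total").get_text())
offset = int(web.find("web:Offset").get_text())
print(total, offset)
```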

BS4/XML Parser on XML without namespace definition

xbrl_without_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""

soup = BeautifulSoup(xbrl_without_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant)
---
None

BS4/HTML Parser on XML without namespace definition

The BS4 HTML parser treats <namespace>:<tag> as a single tag name, and it also converts the name to lower case.

soup = BeautifulSoup(xbrl_without_namespace, 'html.parser')
registrant = soup.find("dei:EntityRegistrantName".lower()) 

print(registrant)
---
<dei:entityregistrantname>
Hoge, Inc.
</dei:entityregistrantname>

A search that uses capital letters finds nothing, because the tag names have been converted to lower case.

registrant = soup.find("dei:EntityRegistrantName") 
print(registrant)
---
None
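
The same lookup on the XML from the question, as a minimal sketch. With html.parser no namespace declaration is needed, but every tag name has to be written in lower case. The variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# The question's XML, unchanged: the web: prefix is never declared.
question_xml = """<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>"""

soup = BeautifulSoup(question_xml, "html.parser")

# html.parser lower-cases tag names and keeps "web:total" as one name.
total = int(soup.find("web:total").get_text())
offset = int(soup.find("web:offset").get_text())
print(total, offset)
```

Because the original capitalisation is lost, this route suits read-only extraction better than round-tripping the document.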

Conclusion

1. Provide the namespace definitions so that the XML parser can resolve the prefixes, OR
2. Use the HTML parser and write all tag names in lower case.
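
For comparison, and not part of the answer above: when the namespace is declared, the standard library's xml.etree.ElementTree can do the same lookup without third-party packages. This is a sketch under the same assumption as option 1, with http://example.com/web as a placeholder URI invented for the example.

```python
import xml.etree.ElementTree as ET

# The question's XML with a namespace declaration added to the root element.
# The URI is a placeholder chosen for this example only.
question_xml = """<xml xmlns:web="http://example.com/web">
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>"""

# Map the prefix used in search paths to the declared namespace URI.
namespaces = {"web": "http://example.com/web"}

root = ET.fromstring(question_xml)
total = int(root.find("web:Web/web:Total", namespaces).text)
offset = int(root.find("web:Web/web:Offset", namespaces).text)
print(total, offset)
```

Unlike the BeautifulSoup parsers, ElementTree is strict: it raises a ParseError if the web: prefix is not declared.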


answered Oct 06 '22 10:10

mon

Tags: python, xml, xml-parsing, beautifulsoup, xml-namespaces