I have an XML document which reads like this:
<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>
my question is how do I access them using a library like BeautifulSoup in python?
xmlDom.web["Web"].Total ? does not work?
Beautiful Soup provides "find()" and "find_all()" functions to get the specific data from the HTML file by putting the specific tag in the function. find() function - return the first element of given tag. find_all() function - return the all the element of given tag.
bs4: Beautiful Soup is a Python library for pulling data out of HTML and XML files.
To read an XML file using ElementTree, firstly, we import the ElementTree class found inside xml library, under the name ET (common convension). Then passed the filename of the xml file to the ElementTree. parse() method, to enable parsing of our xml file. Then got the root (parent tag) of our xml file using getroot().
An XML namespace is a collection of names that can be used as element or attribute names in an XML document. The namespace qualifies element names uniquely on the Web in order to avoid conflicts between elements with the same name.
import bs4
bs4.__version__
---
4.10.0'
import sys
print(sys.version)
---
3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0]
from bs4 import BeautifulSoup
xbrl_with_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl
xmlns:dei="http://xbrl.sec.gov/dei/2020-01-31"
>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""
soup = BeautifulSoup(xbrl_with_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant.prettify())
---
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
xbrl_without_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""
soup = BeautifulSoup(xbrl_without_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant)
---
None
BS4/HTML parser regards <namespace>:<tag>
as a single tag, besides it lower the letters.
soup = BeautifulSoup(xbrl_without_namespace, 'html.parser')
registrant = soup.find("dei:EntityRegistrantName".lower())
print(registrant)
---
<dei:entityregistrantname>
Hoge, Inc.
</dei:entityregistrantname>
Does not match with capital letters as they have been converted into lower letters.
registrant = soup.find("dei:EntityRegistrantName")
print(registrant)
---
None
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With