How to get BeautifulSoup 4 to respect a self-closing tag?

Tags: python, xml, xml-parsing, beautifulsoup

This question is specific to BeautifulSoup4, which makes it different from the previous questions:

Why is BeautifulSoup modifying my self-closing elements?

selfClosingTags in BeautifulSoup

Since BeautifulStoneSoup (the previous XML parser) is gone, how can I get bs4 to respect a new self-closing tag? For example:

import bs4   
S = '''<foo> <bar a="3"/> </foo>'''
soup = bs4.BeautifulSoup(S, selfClosingTags=['bar'])

print soup.prettify()

This does not self-close the bar tag, but the warning it prints gives a hint. What is this tree builder that bs4 is referring to, and how do I self-close the tag?

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:112: UserWarning: BS4 does not respect the selfClosingTags argument to the BeautifulSoup constructor. The tree builder is responsible for understanding self-closing tags.
  "BS4 does not respect the selfClosingTags argument to the "
<html>
 <body>
  <foo>
   <bar a="3">
   </bar>
  </foo>
 </body>
</html>

asked Feb 19 '13 15:02

Hooked

1 Answer

To parse XML, pass "xml" as the second argument to the BeautifulSoup constructor.

soup = bs4.BeautifulSoup(S, 'xml')

You'll need to have lxml installed, because bs4's XML support is built on lxml's XML parser.
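If lxml is missing, bs4 raises bs4.FeatureNotFound when asked for the "xml" feature. A minimal sketch of a check, assuming Python 3 (the helper name xml_parser_available is hypothetical, not part of bs4):

```python
import bs4


def xml_parser_available():
    """Return True if bs4 can find a tree builder for the "xml" feature.

    Hypothetical helper for illustration; it is not part of bs4.
    """
    try:
        bs4.BeautifulSoup("<a/>", "xml")
    except bs4.FeatureNotFound:
        # No installed parser library provides "xml" (lxml is not installed).
        return False
    return True


print(xml_parser_available())
```

If this prints False, installing lxml (for example with pip install lxml) should make the "xml" feature available.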

You don't need to pass selfClosingTags anymore:

In [1]: import bs4
In [2]: S = '''<foo> <bar a="3"/> </foo>'''
In [3]: soup = bs4.BeautifulSoup(S, 'xml')
In [4]: print soup.prettify()
<?xml version="1.0" encoding="utf-8"?>
<foo>
 <bar a="3"/>
</foo>
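As for what the "tree builder" is: it is the parser-specific object that bs4 selects from the second constructor argument, and it decides which tags may be written as empty elements. The sketch below is a rough Python 3 comparison rather than a definitive reference. It parses the same string with the standard library's html.parser and with "xml", then reports the builder class, whether bs4 treats bar as an empty element, and how the tag is rendered. The results dictionary is an illustrative name only.

```python
import bs4

S = '''<foo> <bar a="3"/> </foo>'''

# Maps parser name -> (builder class name, is_empty_element, rendered tag).
results = {}
for parser in ("html.parser", "xml"):
    try:
        soup = bs4.BeautifulSoup(S, parser)
    except bs4.FeatureNotFound:
        # No installed tree builder provides this feature
        # (for "xml" this means lxml is missing), so skip it.
        continue
    results[parser] = (
        type(soup.builder).__name__,   # the tree builder bs4 picked
        soup.bar.is_empty_element,     # True only if the builder allows <bar/>
        str(soup.bar),                 # how the tag is written back out
    )

for parser, (builder, is_empty, rendered) in results.items():
    print(parser, builder, is_empty, rendered)
```

With an HTML builder, bar is not one of the known void elements (such as br or hr), so it is written with an explicit closing tag. The XML builder has no such fixed list and writes any tag without contents as self-closing.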

answered Sep 20 '22 07:09

Pavel Anossov
