BeautifulSoup has logic for closing consecutive <code> </code> tags that doesn't do quite what I want it to do. For example, <pre class="prettyprint"><code>>>> from bs4 import BeautifulSoup >>> bs = BeautifulSoup('one two three four') </code></pre> The HTML would render as <pre class="prettyprint"><code>one two three four </code></pre> I'd like to parse it into a list of strings, <code>['one','two','three','four']</code>. BeautifulSoup's tag-closing logic means that I get nested tags when I ask for all the <code> </code> elements. <pre class="prettyprint"><code>>>> bs('br') [ two three four, three four, four] </code></pre> Is there a simple way to get the result I want?

<pre class="prettyprint"><code>import bs4 as bs soup = bs.BeautifulSoup('one two three four') print(soup.find_all(text=True)) </code></pre> yields <pre class="prettyprint"><code>[u'one', u'two', u'three', u'four'] </code></pre> <hr> Or, using lxml: <pre class="prettyprint"><code>import lxml.html as LH doc = LH.fromstring('one two three four') print(list(doc.itertext())) </code></pre> yields <pre class="prettyprint"><code>['one', 'two', 'three', 'four'] </code></pre>

Parsing unclosed ` ` tags with BeautifulSoup

Tags:

python

html

beautifulsoup

BeautifulSoup has logic for closing consecutive   tags that doesn't do quite what I want it to do. For example,

Click to copy

>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup('one<br>two<br>three<br>four')

The HTML would render as

Click to copy

one
two
three
four

I'd like to parse it into a list of strings, ['one','two','three','four']. BeautifulSoup's tag-closing logic means that I get nested tags when I ask for all the   elements.

Click to copy

>>> bs('br')
[<br>two<br>three<br>four</br></br></br>,
 <br>three<br>four</br></br>,
 <br>four</br>]

Is there a simple way to get the result I want?

468

asked Nov 20 '12 20:11

Chris Taylor

1 Answers

Click to copy

import bs4 as bs
soup = bs.BeautifulSoup('one<br>two<br>three<br>four')
print(soup.find_all(text=True))

yields

Click to copy

[u'one', u'two', u'three', u'four']

Or, using lxml:

Click to copy

import lxml.html as LH
doc = LH.fromstring('one<br>two<br>three<br>four')
print(list(doc.itertext()))

yields

Click to copy

['one', 'two', 'three', 'four']

199

answered Oct 05 '22 06:10

unutbu

Related questions
                            
                                EnumChildWindows not working in pywin32
                            
                                Difference between pygame.draw and pygame.gfxdraw
                            
                                Pretty Print output in a sideways tree format in console window
                            
                                Plotting elliptical orbits
                            
                                Python threading, threads do not close
                            
                                Circus, running circusd as a daemon?
                            
                                tuple to dict:one key and multiple values
                            
                                Why doesn't __getattr__ work with __exit__?
                            
                                I have an Errno 13 Permission denied with subprocess in python
                            
                                Python: Print next x lines from text file when hitting string
                            
                                Better platform to turn software into VHDL/Verilog for an FPGA
                            
                                threads vs. processes in Python
                            
                                Python quickly hash mutable object
                            
                                python urllib usage
                            
                                Basic Event Loop in Python [duplicate]
                            
                                Code to detect all words that start with a capital letter in a string
                            
                                Correct use of static methods
                            
                                Create and lookup 2D dictionary with multiple keys per value
                            
                                Powershell Python: Change version used
                            
                                matplotlib sequence of figures in the same window

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Parsing unclosed `<br>` tags with BeautifulSoup

Tags:

python

html

beautifulsoup

Chris Taylor

People also ask

1 Answers

unutbu

Recent Activity

Donate For Us