BeautifulSoup has logic for closing consecutive <br> tags that doesn't do quite what I want it to do. For example,
>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup('one<br>two<br>three<br>four')
The HTML would render as
one
two
three
four
I'd like to parse it into a list of strings, ['one', 'two', 'three', 'four']. BeautifulSoup's tag-closing logic means that I get nested tags when I ask for all the <br> elements.
>>> bs('br')
[<br>two<br>three<br>four</br></br></br>,
<br>three<br>four</br></br>,
<br>four</br>]
Is there a simple way to get the result I want?
Rather than searching for the <br> tags, ask BeautifulSoup for the text nodes between them:
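import bs4 as bs
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs.BeautifulSoup('one<br>two<br>three<br>four', 'html.parser')
# string=True matches every text node (older bs4 releases spell this text=True)
print(soup.find_all(string=True))
yields
['one', 'two', 'three', 'four']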
Or, using lxml:
import lxml.html as LH
doc = LH.fromstring('one<br>two<br>three<br>four')
print(list(doc.itertext()))
yields
['one', 'two', 'three', 'four']
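If neither library is available, the same idea can probably be sketched with the standard library's html.parser instead. This is only a rough sketch, not something from the answers above: the names TextCollector and text_chunks are made up for illustration, and it assumes you want each text node stripped of surrounding whitespace and empty ones dropped.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects each non-empty text node as its own string."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; <br> itself produces no data.
        text = data.strip()
        if text:
            self.chunks.append(text)


def text_chunks(html):
    """Return the text nodes of an HTML fragment as a list of strings."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return collector.chunks


print(text_chunks('one<br>two<br>three<br>four'))
```

Unlike the BeautifulSoup and lxml versions, this builds no tree at all, so it only suits cases where the list of strings is all you need.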