
Parsing unclosed `<br>` tags with BeautifulSoup

BeautifulSoup has logic for closing consecutive <br> tags that doesn't do quite what I want it to do. For example,

>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup('one<br>two<br>three<br>four')

The HTML would render as

one
two
three
four

I'd like to parse it into a list of strings, ['one','two','three','four']. BeautifulSoup's tag-closing logic means that I get nested tags when I ask for all the <br> elements.

>>> bs('br')
[<br>two<br>three<br>four</br></br></br>,
 <br>three<br>four</br></br>,
 <br>four</br>]

Is there a simple way to get the result I want?

asked Nov 20 '12 at 20:11 by Chris Taylor



1 Answer

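import bs4 as bs
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = bs.BeautifulSoup('one<br>two<br>three<br>four', 'html.parser')
# string=True matches every text node; bs4 releases before 4.4.0 spell it text=True.
print(soup.find_all(string=True))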

yields (on Python 2; Python 3 prints the same list without the u prefixes)

[u'one', u'two', u'three', u'four']

Or, using lxml:

import lxml.html as LH
doc = LH.fromstring('one<br>two<br>three<br>four')
print(list(doc.itertext()))

yields

['one', 'two', 'three', 'four']
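
Both snippets return one string per text node, so a line that contains inline markup (for example 'one <b>bold</b><br>two') would come back in several pieces. If that matters, a minimal sketch along the following lines groups the text between consecutive <br> tags instead. The helper name split_on_br is hypothetical, and the sketch assumes the 'html.parser' backend, which treats <br> as a void element.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def split_on_br(html):
    """Return the text of each <br>-separated line in an HTML fragment.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    lines, current = [], []
    # .descendants walks every tag and text node in document order.
    for node in soup.descendants:
        if isinstance(node, Tag) and node.name == "br":
            # A <br> ends the current line.
            lines.append("".join(current).strip())
            current = []
        elif isinstance(node, NavigableString) and not isinstance(node, Comment):
            # Collect text nodes, including those nested in inline tags.
            current.append(str(node))
    lines.append("".join(current).strip())
    # Drop empty lines produced by leading, trailing, or doubled <br> tags.
    return [line for line in lines if line]
```

Calling split_on_br('one<br>two<br>three<br>four') gives the same four strings as above, while split_on_br('one <b>bold</b><br>two') keeps 'one bold' together as a single entry.
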

answered Oct 05 '22 at 06:10 by unutbu