Make BeautifulSoup handle line breaks as a browser would

Question

I'm using BeautifulSoup (version '4.3.2' with Python 3.4) to convert HTML documents to text. The problem I'm having is that web pages sometimes contain newline characters "\n" that a browser would not render as a new line, but when BeautifulSoup converts the page to text, it leaves the "\n" in.

Example:

Your browser probably renders the following all on one line (even though I have a newline character in the middle):

This is a paragraph.

And your browser probably renders the following on multiple lines, even though I'm entering it with no newlines:

This is a paragraph.

This is another paragraph.

But when BeautifulSoup converts the same strings to text, the only line breaks it uses are the newline literals, and it always uses them:

from bs4 import BeautifulSoup

doc = "<p>This is a\nparagraph.</p>"
soup = BeautifulSoup(doc)

soup.text
Out[181]: 'This is a \n paragraph.'

doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
soup = BeautifulSoup(doc)

soup.text
Out[187]: 'This is a paragraph.This is another paragraph.'

Does anyone know how to make BeautifulSoup extract text in a more beautiful way (or really just get all the newlines correct)? Are there any other simple ways around the problem?

naoko · Accepted Answer

get_text() with its separator argument might be helpful here:

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
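
Note that the separator only inserts a break between text nodes; it does not remove the "\n" inside the paragraph from the first example. A minimal sketch of one way to approximate what a browser does, assuming only <p> elements matter for the documents in question: collapse whitespace inside each paragraph, then join the paragraphs with a newline. The helper name paragraphs_to_text is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def paragraphs_to_text(html):
    """Hypothetical helper: one output line per <p> element."""
    # Naming the parser explicitly keeps results consistent across installs.
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for p in soup.find_all("p"):
        # str.split() with no arguments splits on any run of whitespace,
        # so a "\n" inside the paragraph collapses to a single space.
        lines.append(" ".join(p.get_text().split()))
    return "\n".join(lines)


print(paragraphs_to_text("<p>This is a\nparagraph.</p>"))
print(paragraphs_to_text(
    "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
))
```

This sketch ignores text outside <p> elements and does not treat <br> or other block-level tags as line breaks, so it would need extending for more varied pages.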

Tags:

python

html

beautifulsoup

line-breaks

KCzar

1 Answer

naoko
