 

Make BeautifulSoup handle line breaks as a browser would

I'm using BeautifulSoup (version '4.3.2' with Python 3.4) to convert HTML documents to text. The problem is that web pages sometimes contain newline characters "\n" that a browser would not render as a line break, but when BeautifulSoup converts the page to text, it leaves the "\n" in.

Example:

Your browser probably renders the following all on one line (even though I have a newline character in the middle):

This is a paragraph.

And your browser probably renders the following on multiple lines, even though I'm entering it with no newlines:

This is a paragraph.

This is another paragraph.

But when BeautifulSoup converts the same strings to text, the only line breaks it uses are the newline literals, and it always uses them:

from bs4 import BeautifulSoup

doc = "<p>This is a\nparagraph.</p>"
soup = BeautifulSoup(doc)

soup.text
Out[181]: 'This is a\nparagraph.'

doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
soup = BeautifulSoup(doc)

soup.text
Out[187]: 'This is a paragraph.This is another paragraph.'

Does anyone know how to make BeautifulSoup extract text in a more beautiful way (or really just get all the line breaks correct)? Are there any other simple ways around the problem?

KCzar asked May 19 '15 22:05



1 Answer

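get_text with a separator argument might be helpful here. It joins the document's text nodes with the string you pass, so adjacent paragraphs end up on separate lines: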

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
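
Note that separator only addresses the second case from the question; a literal "\n" inside a paragraph is still kept as-is. Below is a rough sketch of one way to cover both cases, not a definitive solution: collapse whitespace inside each text node the way a browser would, then put a newline after each block-level element. The helper name html_to_text and the BLOCK_TAGS list are my own assumptions (the list is far from complete), it ignores <pre> and CSS, and it uses find_all(string=True), which needs a newer bs4 than 4.3.2 (older versions spell it text=True).

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Assumption: a small, non-exhaustive set of tags that browsers break lines on.
BLOCK_TAGS = ["p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"]


def html_to_text(html):
    """Hypothetical helper: roughly approximate browser line breaking."""
    soup = BeautifulSoup(html, "html.parser")

    # 1. Collapse runs of whitespace (including "\n") in plain text nodes.
    #    Comments and other special strings are left untouched.
    for node in soup.find_all(string=True):
        if type(node) is NavigableString:
            node.replace_with(re.sub(r"\s+", " ", str(node)))

    # 2. Insert a real newline after every block-level element.
    for tag in soup.find_all(BLOCK_TAGS):
        tag.insert_after("\n")

    # 3. Extract the text, then trim each line and drop empty ones.
    lines = (line.strip() for line in soup.get_text().split("\n"))
    return "\n".join(line for line in lines if line)


print(html_to_text("<p>This is a\nparagraph.</p>"))
print(html_to_text("<p>This is a paragraph.</p><p>This is another paragraph.</p>"))
```

The first call should print a single line and the second should print two. If you need closer fidelity to real rendering (margins, tables, preformatted text), a dedicated HTML-to-text tool is probably a better fit than extending this sketch.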
naoko answered Sep 20 '22 19:09
