How to remove whitespace in BeautifulSoup

Question

I'm parsing a bunch of HTML with BeautifulSoup, and it's been going pretty well except for one minor snag. I want to save the output as a single-line string, but my current output looks like this:

    <li><span class="plaincharacterwrap break">
                    Zazzafooky but one two three!
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky2
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky3
                </span></li>

Ideally I'd like

<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li>

There's a lot of redundant whitespace that I'd like to get rid of, but it isn't necessarily removable with strip(), and I can't blindly delete every space because I need to keep the text intact. How can I do it? It seems like a common enough problem that regex would be overkill, but is that the only way?

I don't have any <pre> tags, so I can afford to be a little more aggressive about it.

Thanks once again!

Andrew Clark · Accepted Answer

Here is how you can do it without regular expressions:

>>> html = """    <li><span class="plaincharacterwrap break">
...                     Zazzafooky but one two three!
...                 </span></li>
... <li><span class="plaincharacterwrap break">
...                     Zazzafooky2
...                 </span></li>
... <li><span class="plaincharacterwrap break">
...                     Zazzafooky3
...                 </span></li>
... """
>>> html = "".join(line.strip() for line in html.split("\n"))
>>> html
'<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li><li><span class="plaincharacterwrap break">Zazzafooky3</span></li>'
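
The same idea can be wrapped in a small function. This is only a sketch: `collapse_lines` and the shortened sample markup are names made up for illustration, and `splitlines()` is swapped in for `split("\n")` so that `\r\n` line endings are handled too.

```python
def collapse_lines(html: str) -> str:
    """Strip every line and join the pieces with no separator."""
    return "".join(line.strip() for line in html.splitlines())


# Shortened stand-in for the markup in the question (assumed sample data).
sample = """    <li><span class="a">
            one two
        </span></li>
<li><span class="a">
            three
        </span></li>
"""

print(collapse_lines(sample))
# Text that wraps across lines is glued together with no space in between.
print(collapse_lines("<p>foo\n   bar</p>"))
```

As the second call shows, this works on lines rather than on the parsed tree, so a sentence that wraps across two lines loses the space at the line break. For markup like the question's, where each text run sits on its own line, that is not a problem.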

twig · Answer

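Old question, I know, but beautifulsoup4 has a helper called stripped_strings: a generator that yields each text string under a tag with leading and trailing whitespace removed, skipping strings that are nothing but whitespace.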

Try this:

# `about` is a tag (or soup) parsed earlier in my own code
description_el = about.find('p', { "class": "description" })
descriptions = list(description_el.stripped_strings)
description = "\n\n".join(descriptions) if descriptions else ""

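To try this on the markup from the question, here is a minimal sketch, assuming bs4 is installed and using the built-in "html.parser". Note that stripped_strings yields text only, so the tags are gone. `strip_text_nodes` is a hypothetical helper, not part of BeautifulSoup, that trims the text nodes in place so the markup survives.

```python
from bs4 import BeautifulSoup, NavigableString

html = """    <li><span class="plaincharacterwrap break">
                    Zazzafooky but one two three!
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky2
                </span></li>
"""

# stripped_strings yields each non-blank text string, already stripped.
soup = BeautifulSoup(html, "html.parser")
texts = list(soup.stripped_strings)


def strip_text_nodes(markup: str) -> str:
    """Hypothetical helper: trim every plain text node, keep the tags."""
    tree = BeautifulSoup(markup, "html.parser")
    for node in list(tree.find_all(string=True)):
        if type(node) is not NavigableString:
            continue  # leave comments, doctypes, etc. untouched
        stripped = node.strip()
        if stripped:
            node.replace_with(stripped)
        else:
            node.extract()  # drop whitespace-only nodes between tags
    return str(tree)


one_line = strip_text_nodes(html)
print(texts)
print(one_line)
```

Be aware that trimming every text node also removes spaces next to inline tags, so "a <b>b</b>" becomes "a<b>b</b>". That matches what the question asks for, but may not suit running prose.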
Tags: python, regex, html-parsing, beautifulsoup

Rio
