
Can't remove line breaks from BeautifulSoup text output (Python 2.7.5)

I'm trying to write a program to parse a series of HTML files and store the resulting data in a .csv spreadsheet, which is incredibly reliant on newlines being in exactly the right place. I've tried every method I can find to strip the linebreaks away from certain pieces of text, to no avail. The relevant code looks like this:

soup = BeautifulSoup(f)
ID = soup.td.get_text()
ID.strip()
ID.rstrip()
ID.replace("\t", "").replace("\r", "").replace("\n", "")
dateCreated = soup.td.find_next("td").get_text()
dateCreated.replace("\t", "").replace("\r", "").replace("\n", "")
dateCreated.strip()
dateCreated.rstrip()
# debug
print('ID:' + ID + 'Date Created:' + dateCreated)

And the resulting output looks like this:

ID:
FOO
Date Created:
BAR

This and another problem with the same program have been driving me up the wall. Help would be fantastic. Thanks.

EDIT: Figured it out, and it was a pretty stupid mistake. Instead of just doing

ID.replace("\t", "").replace("\r", "").replace("\n", "")

I should have done

ID = ID.replace("\t", "").replace("\r", "").replace("\n", "")
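The same reassignment is needed for dateCreated. A minimal sketch of how both fields could be cleaned in one place, assuming a hypothetical helper named clean and made-up sample strings standing in for the get_text() results:

```python
def clean(text):
    """Return text with tabs, carriage returns and newlines removed."""
    # Each replace() returns a new string, so the chained result has to be
    # returned (or assigned); calling it on its own line changes nothing.
    return text.replace("\t", "").replace("\r", "").replace("\n", "").strip()

# Made-up stand-ins for soup.td.get_text() and the next cell's text.
ID = clean("\nFOO\n")
dateCreated = clean("\r\n\tBAR\n")

line = "ID:" + ID + " Date Created:" + dateCreated
print(line)
```

With these sample values everything ends up on a single line, which is what the .csv output needs.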

Ben Forde asked Jul 22 '14 03:07



2 Answers

Your issue at hand is that you're expecting in-place operations from what are actually operations that return new values.

ID.strip() # returns the stripped value, doesn't change ID.
ID = ID.strip() # Would be more appropriate.

You could use regex, though regex is overkill for this process. Realistically, especially if it's beginning and ending characters, just pass them to strip:

ID = ID.strip('\t\r\n')
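One caveat worth illustrating: strip only touches the ends of the string, and strip('\t\r\n') stops at the first character not in that set, such as a space. A rough sketch with a made-up value (the variable names are just for illustration):

```python
raw = " \n\tFOO\nBAR\r\n "  # made-up value with spaces and an interior newline

# Only tabs, CRs and LFs are stripped, and only from the ends. The leading
# and trailing spaces are not in the set, so nothing is removed here.
ends_only = raw.strip("\t\r\n")

# With no argument, strip() removes all leading/trailing whitespace,
# but the newline between FOO and BAR stays.
all_ws = raw.strip()

# split() with no argument splits on any run of whitespace, so joining the
# pieces collapses interior newlines as well.
one_line = " ".join(raw.split())

print(repr(ends_only))
print(repr(all_ws))
print(repr(one_line))
```

If the cell text can contain newlines in the middle, the replace chain from the question or the split/join form is the safer choice for a .csv row.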

g.d.d.c answered Sep 30 '22 05:09


BeautifulSoup 4 has a built-in way to do this: stripped strings.

From the BS4 documentation on .stripped_strings: "These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead."

html_doc="""<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")

result_list = []
for s in soup.select("div.path"):
    result_list.extend(s.stripped_strings)

print(" ".join(result_list))  # parenthesized form works on Python 2 and 3

Output: abc def ghi
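A shorter route to the same result, as far as I can tell, is get_text() with a separator and strip=True, which strips each string and skips the empty ones before joining. A sketch reusing the same sample HTML (the variable names div and text are just for illustration):

```python
from bs4 import BeautifulSoup

html_doc = """<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find("div", class_="path")

# First argument is the separator placed between the stripped strings.
text = div.get_text(" ", strip=True)
print(text)
```

For the question's case that would be something like soup.td.get_text(" ", strip=True), assigned back to ID.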

pymen answered Sep 30 '22 06:09