I'm trying to write a program to parse a series of HTML files and store the resulting data in a .csv spreadsheet, which is incredibly reliant on newlines being in exactly the right place. I've tried every method I can find to strip the linebreaks away from certain pieces of text, to no avail. The relevant code looks like this:
soup = BeautifulSoup(f)
ID = soup.td.get_text()
ID.strip()
ID.rstrip()
ID.replace("\t", "").replace("\r", "").replace("\n", "")
dateCreated = soup.td.find_next("td").get_text()
dateCreated.replace("\t", "").replace("\r", "").replace("\n", "")
dateCreated.strip()
dateCreated.rstrip()
# debug
print('ID:' + ID + 'Date Created:' + dateCreated)
And the resulting output looks like this:
ID:
FOO
Date Created:
BAR
This and another problem with the same program have been driving me up the wall. Help would be fantastic. Thanks.
EDIT: Figured it out, and it was a pretty stupid mistake. Instead of just doing
ID.replace("\t", "").replace("\r", "").replace("\n", "")
I should have done
ID = ID.replace("\t", "").replace("\r", "").replace("\n", "")
We can use the str.strip() method to remove leading and trailing whitespace from text extracted from the HTML document.
BeautifulSoup is a Python package that can parse broken HTML; lxml offers similar support, built on the libxml2 parser.
If you need to remove line breaks from text with Python, you can use the following string method: replace(old, new[, count])
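As a rough illustration of replace(old, new[, count]), here is a minimal sketch. The sample strings are made up and only stand in for whatever get_text() might return; the variable names are hypothetical too.

```python
# Hypothetical sample text standing in for what get_text() might return.
raw = "\n\tFOO\r\n"

# str.replace returns a new string and leaves the original unchanged,
# so the result has to be assigned to a name.
cleaned = raw.replace("\t", "").replace("\r", "").replace("\n", "")
print(repr(cleaned))  # 'FOO'

# The optional count argument limits how many occurrences are replaced.
first_only = "a\nb\nc".replace("\n", " ", 1)
print(repr(first_only))  # 'a b\nc'
```

Note that replace removes every occurrence, including line breaks in the middle of the text, which may or may not be what you want for a CSV field.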
Your issue at hand is that you're expecting in-place operations from what are actually operations that return new values.
ID.strip() # returns the stripped value, doesn't change ID.
ID = ID.strip() # Would be more appropriate.
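To make the difference concrete, here is a small sketch with a made-up value in place of the parsed text. Python strings are immutable, so a string method can only hand back a new string.

```python
ID = "\nFOO\n"  # hypothetical value, not taken from a real HTML file

ID.strip()      # the stripped copy is created and immediately discarded
unchanged = ID  # ID still holds the original text, newlines included
print(repr(unchanged))  # '\nFOO\n'

ID = ID.strip()  # rebinding the name keeps the stripped copy
print(repr(ID))  # 'FOO'
```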
You could use regex, though regex is overkill for this process. Realistically, especially if it's beginning and ending characters, just pass them to strip:
ID = ID.strip('\t\r\n')
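One thing to keep in mind: strip only touches the two ends of the string, and with an explicit character set it removes only those characters. A quick sketch with invented sample text; the split/join line is one possible way to collapse interior whitespace without regex, not something the original code used.

```python
text = "\r\n\tFOO\n  BAR\t\r\n"  # hypothetical sample with an interior newline

# Only leading and trailing tabs, carriage returns and newlines are removed.
ends_trimmed = text.strip("\t\r\n")
print(repr(ends_trimmed))  # 'FOO\n  BAR'

# Spaces are not in the character set, so they survive at the ends.
keeps_spaces = " FOO \n".strip("\t\r\n")
print(repr(keeps_spaces))  # ' FOO '

# split() with no arguments splits on any run of whitespace,
# so joining the pieces collapses interior line breaks as well.
collapsed = " ".join(text.split())
print(repr(collapsed))  # 'FOO BAR'
```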
There is an internal implementation of Stripped Strings for BeautifulSoup4
These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead (see stripped_strings in the BS4 documentation):
html_doc="""<div class="path">
<a href="#"> abc</a>
<a href="#"> def</a>
<a href="#"> ghi</a>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
result_list = []
for s in soup.select("div.path"):
    result_list.extend(s.stripped_strings)
print(" ".join(result_list))
Output: abc def ghi