I ran the following Python code to pull certain emails out of an IMAP folder. The extraction works fine and the BeautifulSoup step mostly works, but the output contains a lot of '\r' and '\n' sequences.
I tried to remove them with re.sub(), but it has no effect and does not even raise an error. Any idea what is wrong? The code is below. It is not the complete script, but everything above the part I'm posting works. The output is still printed and it is "prettified", but the \r and \n are still there. I also tried find_all(), but that doesn't work either.
mail.list() # Lists all labels in GMail
mail.select('INBOX/Personal') # Connected to inbox.
resp, items = mail.search(None, '(SEEN)')
items = items[0].split() # getting the mails id
for emailid in items:
    # getting the mail content
    resp, data = mail.fetch(emailid, '(UID BODY[TEXT])')
    text = str(data[0]) # [1] don't forget to add this back
    soup = bs(text, 'html.parser')
    soup = soup.prettify()
    soup = re.sub('\\r\\n', '', soup)
    print(soup)
Recipe objective: working with specific strings using regular expressions and Beautiful Soup. To work with strings, we use Python's "re" library, which implements regular expressions. A regular expression (regex) is a pattern that matches specified text in the data.
find returns the first matching element found on the page, or None if there is no match. find_all scans the entire document and returns all matches.
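The difference between the two calls can be sketched as follows. This is a minimal illustration, assuming beautifulsoup4 is installed; the HTML snippet and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only for illustration.
html = "<ul><li>first</li><li>second</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li")      # first matching tag only
all_items = soup.find_all("li")   # every matching tag, in document order
missing = soup.find("table")      # no <table> in the fragment, so None

print(first_item.get_text())
print([li.get_text() for li in all_items])
print(missing)
```

Because find can return None, it is usually worth checking the result before calling methods on it.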
Beautiful Soup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.
To use Beautiful Soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser. Python's built-in html.parser works without any extra installation; lxml is an optional, faster parser that Beautiful Soup prefers when it is installed. You may already have it, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python3-lxml .
You can do this with a one-line regex that matches any run of carriage returns and line feeds:
soup = re.sub('[\\r\\n]+', '', soup)
or you can use this:
soup = re.sub('\\r', '', soup)
soup = re.sub('\\n', '', soup)
https://regexr.com/3nnp1
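Here is a small sketch of how a character-class pattern behaves on a string that contains real CR and LF characters. The sample string is made up. The second substitution is included only as a caution: if the backslash before n goes missing, the regex treats n as a literal letter.

```python
import re

# Made-up sample containing real carriage returns and line feeds.
sample = "<p>\r\n  Line one\r\n</p>\nnormal text\r"

# Character class: removes every run of CR and/or LF characters.
cleaned = re.sub(r'[\r\n]+', '', sample)

# Caution: here `n` is the letter n, not a newline, so this strips
# carriage returns and the letter n while leaving line feeds in place.
letters_lost = re.sub(r'\r*n*', '', sample)

print(repr(cleaned))
print(repr(letters_lost))
```

Printing with repr() makes any leftover control characters visible, which helps when checking whether a substitution did anything.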
What about calling replace directly? Since it does not use a regex, it should be faster. Note that replace returns a new string, so you have to assign the result:
soup = soup.replace("\n", "").replace("\r", "")
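One possible reason why neither approach seems to change anything in the question's code: imaplib returns bytes, and calling str() on bytes (or on a tuple containing bytes) produces their repr, in which each line break shows up as a literal backslash followed by r or n rather than as a real control character. The sketch below assumes that is what is happening. The `data` value is a hypothetical stand-in for what mail.fetch() returns, and the UTF-8 decoding is also an assumption, since the real charset depends on the message.

```python
# Hypothetical stand-in for the `data` list returned by mail.fetch().
data = [(b'1 (UID 7 BODY[TEXT] {28}', b'<p>Hello</p>\r\n<p>World</p>\r\n'), b')']

def strip_newlines(s):
    """Remove real CR and LF characters from a str."""
    return s.replace('\r', '').replace('\n', '')

# str() on the tuple gives its repr: backslash + 'r', not a real CR.
as_repr = str(data[0])
# Decoding the payload (index [1]) gives text with real CR/LF characters.
# UTF-8 is assumed here; a real message may use another charset.
as_text = data[0][1].decode('utf-8', errors='replace')

cleaned_repr = strip_newlines(as_repr)   # unchanged: nothing to remove
cleaned_text = strip_newlines(as_text)   # line breaks actually removed

print(cleaned_repr)
print(cleaned_text)
```

If this is the cause, decoding data[0][1] before handing the text to BeautifulSoup should let either re.sub or replace work as expected.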