beautiful soup regex

I ran the following Python code to pull certain emails out of an IMAP folder. The extraction works fine and the BeautifulSoup part mostly works, but the output contains a lot of '\r' and '\n' sequences.

I tried to remove these with re.sub, but it has no effect and does not raise an error either. Any idea what is wrong? The code is below. It is not the complete script, but everything above the part I'm posting works. The output still prints and is "prettified", but the \r and \n are still there. I also tried find_all(), but that doesn't help either.

mail.list()  # Lists all labels in GMail
mail.select('INBOX/Personal')  # Connected to inbox.

resp, items = mail.search(None, '(SEEN)')

items = items[0].split()  # getting the mails id        
for emailid in items:
    # getting the mail content
    resp, data = mail.fetch(emailid, '(UID BODY[TEXT])')
    text = str(data[0])  # [1] don't forget to add this back
    soup = bs(text, 'html.parser')
    soup = soup.prettify()
    soup = re.sub('\\r\\n', '', soup)

print(soup)

Obie asked Apr 11 '18 08:04



2 Answers

You can do this with a single regex call. A character class removes both carriage returns and newlines:

soup = re.sub(r'[\r\n]+', '', soup)

Or use two separate calls:

soup = re.sub('\\r', '', soup)
soup = re.sub('\\n', '', soup)

https://regexr.com/3nnp1
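
One possible reason the substitution appears to do nothing: calling str() on a bytes object returns its repr, in which each line break is written as two visible characters (a backslash followed by r or n) rather than a real control character. A pattern that targets real CR/LF characters then finds nothing to replace. The sketch below illustrates the difference on made-up sample bytes; the variable names are hypothetical, and whether this applies depends on what data[0] actually holds in the question's code.

```python
import re

# Hypothetical sample standing in for a fetched message body.
raw = b"<p>Hello</p>\r\n<p>World</p>\r\n"

# str() on bytes gives the repr: line breaks become backslash + letter.
as_repr = str(raw)
# Decoding gives real text with real CR/LF characters.
decoded = raw.decode("utf-8")

# Matches real carriage-return and line-feed characters.
strip_real = re.compile(r"[\r\n]+")
# Matches a literal backslash followed by 'r' or 'n'.
strip_literal = re.compile(r"\\[rn]")

cleaned_decoded = strip_real.sub("", decoded)
unchanged_repr = strip_real.sub("", as_repr)    # nothing to match here
cleaned_repr = strip_literal.sub("", as_repr)

print(cleaned_decoded)
print(unchanged_repr)
print(cleaned_repr)
```

Removing the literal sequences works, but the result still carries the b'...' wrapper from the repr, so decoding the bytes first is usually the cleaner route.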

MasOOd.KamYab answered Nov 18 '22 13:11


What about using str.replace directly? Since it does not involve regex, it should be faster. Note that replace returns a new string, so assign the result back:

soup = soup.replace("\n", "").replace("\r", "")
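
For str.replace to see real line breaks, the fetched payload has to be decoded rather than passed through str(). The comment in the question ("[1] don't forget to add this back") suggests the payload index was dropped. Below is a minimal sketch of that step, assuming the usual imaplib response shape of (envelope, payload) tuples and a UTF-8 body. The helper names and sample data are hypothetical, and real messages may need the charset from their headers.

```python
def extract_body(data):
    """Return the decoded payload of the first (envelope, payload) tuple."""
    for part in data:
        if isinstance(part, tuple):
            # part[0] is the IMAP envelope line, part[1] is the raw body.
            return part[1].decode("utf-8", errors="replace")
    return ""


def strip_line_breaks(text):
    """Remove real CR and LF characters without using regex."""
    return text.replace("\r", "").replace("\n", "")


# Hypothetical stand-in for the `data` returned by mail.fetch(...).
sample = [
    (b"1 (UID 5 BODY[TEXT] {28}", b"<p>Hello</p>\r\n<p>World</p>\r\n"),
    b")",
]

body = extract_body(sample)
cleaned = strip_line_breaks(body)
print(cleaned)
```

The decoded body can then be passed to BeautifulSoup as in the question. Stripping line breaks after prettify() undoes most of its formatting, so it may be simpler to skip prettify() when a single-line result is wanted.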
silgon answered Nov 18 '22 12:11