I'm creating a basic system that allows users to reply to a thread on the website via email. However, most email clients include the text of the previous emails in their reply emails. This text is unwanted on the website.
Is there a reliable way to extract only the new message, without prior knowledge of the earlier emails? I'm using Python's email package.
Content-Type: text/plain; charset=ISO-8859-1
test message! This is the part I want.
On Thu, Mar 24, 2011 at 3:51 PM, <[email protected]> wrote:
> Hi!
>
> Herman just posted a comment on the website:
>
>
> From: Herman
> "Hi there! I might be interested"
>
>
> Regards,
> The Website Team
> http://www.test.com
>
This is a reply message from Gmail; I'm sure other clients might do it differently. A good start would probably be to ignore the lines that start with >, but there could also be lines like that in between the new message, and then they should probably be kept. I'll also still have the Content-Type line and the date line.
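For the Content-Type line and the trailing quote block, a minimal sketch might look like the following. It assumes the raw message is available as a string; the helper names `get_plain_text` and `strip_trailing_quote`, the sample text, and the placeholder address are made up for illustration. Only the block of > lines at the very end is dropped, so quoted lines in between the new message are kept, and the date line is still left in place.

```python
import email
from email import policy


def get_plain_text(raw_message):
    """Parse the raw message and return the decoded text/plain body (no headers)."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


def strip_trailing_quote(body):
    """Drop only the run of '>' lines (and blank lines) at the end of the body."""
    lines = body.splitlines()
    while lines and (not lines[-1].strip() or lines[-1].lstrip().startswith(">")):
        lines.pop()
    return "\n".join(lines)


# Hypothetical sample message; the address is a placeholder.
raw = (
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "> earlier question\n"
    "My inline answer.\n"
    "\n"
    "On Thu, Mar 24, 2011 at 3:51 PM, <someone@example.com> wrote:\n"
    "> Hi!\n"
    ">\n"
    "> Regards,\n"
)

cleaned = strip_trailing_quote(get_plain_text(raw))
print(cleaned)
```

The inline "> earlier question" line survives because the loop only walks backwards from the end and stops at the first line that is neither blank nor quoted.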
The answer @LAMRIN TAWSRAS gave works for extracting the text before the Gmail date expression only if a match is found; otherwise an error is thrown (an IndexError from string_list[0], because re.findall() returns an empty list). Also, there is no need to search the entire message for multiple date expressions; you only need the first one found. Therefore, I would refine his solution to use re.search():
import re

def get_body_before_gmail_reply_date(msg):
    # fall back to the whole message when no date expression is found
    body_before_gmail_reply = msg
    # regex for date format like "On Thu, Mar 24, 2011 at 3:51 PM"
    matching_string_obj = re.search(r"\w+\s+\w+[,]\s+\w+\s+\d+[,]\s+\d+\s+\w+\s+\d+[:]\d+\s+\w+.*", msg)
    if matching_string_obj:
        # split on that match, group() returns full matched string
        body_before_gmail_reply_list = msg.split(matching_string_obj.group())
        # string before the regex match, so the body of the email
        body_before_gmail_reply = body_before_gmail_reply_list[0]
    return body_before_gmail_reply
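As a quick check of this approach, here is a self-contained sketch that applies the same pattern to a shortened copy of the sample reply from the question. It is a variant, not the function above: it slices at `match.start()` instead of splitting, and it strips trailing whitespace, which the original does not do. The names `GMAIL_DATE_RE`, `body_before_reply_header`, and `sample`, and the placeholder address, are made up for illustration.

```python
import re

# Pattern copied from the function above; it matches an attribution line
# such as "On Thu, Mar 24, 2011 at 3:51 PM, <...> wrote:".
GMAIL_DATE_RE = re.compile(
    r"\w+\s+\w+[,]\s+\w+\s+\d+[,]\s+\d+\s+\w+\s+\d+[:]\d+\s+\w+.*"
)


def body_before_reply_header(text):
    """Return the text before the first Gmail-style date line, or all of it."""
    match = GMAIL_DATE_RE.search(text)
    if match is None:
        # No date expression found: keep the message unchanged.
        return text
    # Everything before the match is the new message; rstrip() is an addition
    # that removes the blank lines left in front of the date line.
    return text[:match.start()].rstrip()


# Shortened sample; the address is a placeholder.
sample = (
    "test message! This is the part I want.\n"
    "\n"
    "On Thu, Mar 24, 2011 at 3:51 PM, <someone@example.com> wrote:\n"
    "> Hi!\n"
    ">\n"
    "> Herman just posted a comment on the website:\n"
)

print(body_before_reply_header(sample))
print(body_before_reply_header("No quoted text here."))
```

Note that the pattern is tied to the English Gmail format shown in the question; a reply from a client or locale that writes the date differently will not match, and the text is then returned unchanged.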
I think this should work:
import re
# strings holds the text of the email
# regex for "On Thu, Mar 24, 2011 at 3:51 PM"
string_list = re.findall(r"\w+\s+\w+[,]\s+\w+\s+\d+[,]\s+\d+\s+\w+\s+\d+[:]\d+\s+\w+.*", strings)
# split on that match
res = strings.split(string_list[0])
# the string before the regex match
print(res[0])