I'm trying to write an application that periodically receives e-mails and writes every message into a database. Sometimes, though, I get a 'Re:' e-mail that looks something like this:
New message
On September 21, 2010 24:26 Someone wrote (a):
| Old message |
The format depends on the e-mail provider.
Is there any library that helps remove the quoted 'Re:' part from an e-mail message? Maybe the IMAP server can do that? I have all the previous e-mails from the thread in the database, so I can take them and search for them in the new message.
If you are able to associate a reply (RE:) message with the original/previous message that it replies to, then I would think that you could grab the body text of the original/previous message from your database and remove that text from the body of the reply. However, this method will not be 100% accurate, because clients can convert an HTML/Rich Text email into plain text, or vice versa, and in that case the stored text will no longer match the quoted text. Even so, this technique is generic and will probably work the majority of the time.
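A minimal Python sketch of that idea, assuming both bodies are already available as plain text. The function names `strip_known_quote` and `_normalize` are my own for illustration, and the line-by-line comparison is just one possible way to do the matching:

```python
import re


def _normalize(line: str) -> str:
    """Drop leading/trailing quote markers ('>', '|') and collapse whitespace."""
    stripped = re.sub(r"^[>|\s]+", "", line)
    stripped = re.sub(r"[|\s]+$", "", stripped)
    return re.sub(r"\s+", " ", stripped).lower()


def strip_known_quote(reply_body: str, previous_body: str) -> str:
    """Remove every line of the reply that also occurs in the previous message."""
    previous_lines = {_normalize(line) for line in previous_body.splitlines()}
    previous_lines.discard("")  # never treat blank lines as quoted text
    kept = [
        line
        for line in reply_body.splitlines()
        if _normalize(line) not in previous_lines
    ]
    return "\n".join(kept).strip()
```

Note that this leaves the "On ... wrote:" line in place, since that line is not part of the previous message. It will also drop a line the user typed themselves if it happens to be identical to a line in the old message.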
In addition, the email provider may add certain header fields, or preambles, to the beginnings of a quoted message in a reply. In this case, I don't think there is any "catch all" solution.
My recommendation would be to target a few of the really huge web-mail providers (Gmail, Yahoo, Microsoft, etc), learn the formats that they use for their replies and parse the messages accordingly. In addition, you could likely handle a few generic formats as well. For instance, the '>' character is commonly used at the beginning of each line of quoted text in a reply.
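As a rough Python sketch of the generic case: the function below drops lines that start with '>' and cuts the message at the first line that looks like "On ... wrote:". The regular expression is an assumption based on the example in the question and only covers English-language clients. Real providers use other wording, so treat it as a starting point.

```python
import re

# Assumed attribution pattern, e.g. "On September 21, 2010 24:26 Someone wrote (a):"
ATTRIBUTION = re.compile(r"^On .+ wrote.*:\s*$")


def strip_generic_quote(body: str) -> str:
    """Remove '>'-quoted lines and everything after an 'On ... wrote:' line."""
    kept = []
    for line in body.splitlines():
        if ATTRIBUTION.match(line.strip()):
            break  # everything below the attribution line is treated as quoted
        if line.lstrip().startswith(">"):
            continue  # conventional quoted line
        kept.append(line)
    return "\n".join(kept).strip()
```

This assumes the new text comes first (top-posting). A reply written below or between the quoted parts would be cut off at the attribution line.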
If you're going to be developing in a language like C#, create yourself an interface like IReplyFormat, with a corresponding implementation for each provider, and possibly some generic formats.
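The same structure can be sketched in Python (used here instead of C# purely for illustration) with an abstract base class. `ReplyFormat`, the two implementations and `strip_reply` are hypothetical names, and the two formats shown are only the generic ones. Provider-specific classes would be added in the same way once you know their formats.

```python
import re
from abc import ABC, abstractmethod
from typing import Iterable, Optional


class ReplyFormat(ABC):
    """One known way of quoting the previous message in a reply."""

    @abstractmethod
    def strip_quote(self, body: str) -> Optional[str]:
        """Return the body without quoted text, or None if this format does not apply."""


class AttributionLineFormat(ReplyFormat):
    # Assumed English attribution line such as "On <date> Someone wrote:"
    _header = re.compile(r"^On .+ wrote.*:\s*$", re.MULTILINE)

    def strip_quote(self, body: str) -> Optional[str]:
        match = self._header.search(body)
        if match is None:
            return None
        return body[: match.start()].strip()


class AngleBracketFormat(ReplyFormat):
    def strip_quote(self, body: str) -> Optional[str]:
        lines = body.splitlines()
        if not any(line.lstrip().startswith(">") for line in lines):
            return None
        return "\n".join(
            line for line in lines if not line.lstrip().startswith(">")
        ).strip()


def strip_reply(body: str, formats: Iterable[ReplyFormat]) -> str:
    """Try each format in order; fall back to the untouched body."""
    for fmt in formats:
        result = fmt.strip_quote(body)
        if result is not None:
            return result
    return body
```

The order of the list passed to `strip_reply` matters: the first format that recognises the message wins, so more specific formats should come before generic ones.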
I don't think you will find any catch-all/perfect solution to this problem, as there are simply too many mail providers with too many different formats. However, I think you can at the very least find some techniques, like the ones mentioned above, that will work for you more times than not, which is the best you can hope for at this point.
Personally I think that you are out of luck here, as the message copy is part of the body. So in order to remove it you will have to process the message's body and write an extraction method for each known format (obviously the problem is that you cannot know all possible formats).
So, instead of parsing the body, why don't you persist the whole message in the database? Normally the size of the message should not be a problem with a modern DBMS. If it really is a problem, you can always compress the body and store it in a BLOB.
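For what it's worth, compressing the body before storing it takes only a few lines. This is a Python sketch using the standard `zlib` and `sqlite3` modules; the in-memory database, the `messages` table and its column names are assumptions made up for the example, and your real schema and DBMS will differ.

```python
import sqlite3
import zlib
from typing import Optional

# Hypothetical schema: one row per message, body stored compressed in a BLOB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (message_id TEXT PRIMARY KEY, body BLOB NOT NULL)"
)


def store_message(conn: sqlite3.Connection, message_id: str, raw_body: str) -> None:
    """Compress the full body (quoted part included) and insert it."""
    compressed = zlib.compress(raw_body.encode("utf-8"))
    conn.execute(
        "INSERT INTO messages (message_id, body) VALUES (?, ?)",
        (message_id, compressed),
    )
    conn.commit()


def load_message(conn: sqlite3.Connection, message_id: str) -> Optional[str]:
    """Fetch and decompress a stored body, or return None if it is missing."""
    row = conn.execute(
        "SELECT body FROM messages WHERE message_id = ?", (message_id,)
    ).fetchone()
    if row is None:
        return None
    return zlib.decompress(row[0]).decode("utf-8")
```

Replies in a long thread repeat a lot of text, so they tend to compress well, although the actual saving depends on your mail.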