
Is it possible to programmatically 'clean' emails?

Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may also be present, so that I am left with just the body text content. My guess is that it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?

flesh asked Dec 23 '22 13:12


2 Answers

Email has a couple of agreed-upon markers for the kinds of content you want to strip, and you can look for them using regular expressions. I doubt you can "sanitize" your emails completely reliably, but here are some things to look for:

  1. A line starting with "> " (a greater-than sign followed by a space) marks quoted text.
  2. A line consisting of "-- " (two hyphens, then a space, then the line break) marks the beginning of a signature; see "Signature block" on Wikipedia.
  3. In multipart messages, boundary lines start with "--". Beyond that, you need to do some searching to separate the message body parts from the unwanted parts (such as base64-encoded images).

As for an actual C# implementation, I leave that for you or other SOers.
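
In the meantime, here is a rough sketch of points 1 and 2 in Python rather than C#. The function name `strip_quotes_and_signature` is made up for illustration. It assumes the body has already been extracted as plain text, and that any line starting with ">" is a quote, which will occasionally be wrong.

```python
import re

QUOTE_LINE = re.compile(r"^>")        # quoted text from an earlier message
SIG_DELIMITER = re.compile(r"^-- $")  # "-- " alone on a line starts the signature


def strip_quotes_and_signature(body: str) -> str:
    """Drop quoted lines and everything from the signature delimiter onwards."""
    kept = []
    for line in body.splitlines():  # splitlines() handles both \n and \r\n
        if SIG_DELIMITER.match(line):
            break  # the rest of the message is the signature block
        if QUOTE_LINE.match(line):
            continue  # skip quoted lines
        kept.append(line)
    return "\n".join(kept).strip()
```

The same two patterns should carry over to C#'s `Regex` class more or less unchanged.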

Tuminoid answered Dec 25 '22 04:12


A few obvious things to look at:

  1. If the mail is anything but pure plain text, the message will be multipart MIME. Any part whose type is "image/*" (image/jpeg, etc.) can probably be dropped. In all likelihood, any part whose type is not "text/*" can go.
  2. An HTML message will probably have a part of type "multipart/alternative" (I think), containing two parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML one, you may have to do an HTML-to-plain-text conversion.
  3. The usual format for quoted text is to precede each line with a ">" character. You should be able to drop these lines, unless the line starts with ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that "From " is the start of a new mail.
  4. The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
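
The four points above can be sketched with Python's standard `email` package. This is a minimal illustration, not a robust cleaner: the names `html_to_text`, `extract_plain_body`, `drop_quotes_and_signature` and `clean_email` are made up here, the HTML conversion simply concatenates text nodes (so script or style content would leak through), and accepting a bare "--" as a signature delimiter can cut off legitimate text.

```python
import re
from email import message_from_string, policy
from html.parser import HTMLParser

# Signature delimiter, tolerating a missing trailing space (point 4).
SIG_DELIMITER = re.compile(r"^-- ?$")


class _TextCollector(HTMLParser):
    """Very rough HTML-to-text: keeps text nodes, ignores all markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def extract_plain_body(raw: str) -> str:
    """Pick the text/plain part, falling back to text/html (points 1 and 2)."""
    msg = message_from_string(raw, policy=policy.default)
    # get_body() skips attachments and non-text parts such as images.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    text = part.get_content()
    if part.get_content_type() == "text/html":
        text = html_to_text(text)
    return text


def drop_quotes_and_signature(text: str) -> str:
    """Remove ">" quotes (but keep ">From " lines) and the signature (points 3 and 4)."""
    kept = []
    for line in text.splitlines():
        if SIG_DELIMITER.match(line):
            break  # everything after the delimiter is the signature
        if line.startswith(">") and not line.startswith(">From "):
            continue  # quoted text; ">From " is an escaped body line, so keep it
        kept.append(line)
    return "\n".join(kept).strip()


def clean_email(raw: str) -> str:
    return drop_quotes_and_signature(extract_plain_body(raw))
```

Calling `clean_email(raw_message)` on the full message source returns the remaining body text. Disclaimers are not handled at all, since they have no standard marker.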

Simon Callan answered Dec 25 '22 02:12