
Coding a Gmail-style "hide quoted text" for a web-based mailing list archive

I'm working on a web application that parses and displays email messages in a threaded format (among other things). Emails may come from any number of different mail clients, and in either text or HTML format.

Given that most people tend to top-post, I'd like to be able to hide the duplicated message in an email reply in a manner similar to how Gmail does it (e.g. "show quoted text").

Determining which part of the message is the reply is somewhat challenging. Personally, I use "> " delimiters at the beginning of the quoted text when replying. I created a regexp that looks for these lines and wraps a div around them to allow some JS to hide or show this block of text.
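A minimal sketch of that first approach, assuming plain-text bodies and the "> " convention only. It groups lines rather than using a single regexp, and the function names and the `quoted` CSS class are made up for illustration:

```python
import html
from itertools import groupby


def is_quoted(line: str) -> bool:
    """Treat a line as quoted if it starts with '>' after optional spaces."""
    return line.lstrip().startswith(">")


def wrap_quoted(text: str, css_class: str = "quoted") -> str:
    """Escape a plain-text body and wrap each run of quoted lines in a div."""
    out = []
    # groupby yields consecutive runs of quoted / unquoted lines.
    for quoted, lines in groupby(text.splitlines(), key=is_quoted):
        block = "\n".join(html.escape(line) for line in lines)
        if quoted:
            block = f'<div class="{css_class}">{block}</div>'
        out.append(block)
    return "\n".join(out)
```

The JS side can then toggle visibility on every element with that class.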

I then noticed that Outlook doesn't use the "> " characters by default. It simply adds a header block above the quoted message with a summary of the headers (From, Subject, Date, etc.), and the quoted text itself is left untouched. I can match on this block and hide the rest of the email, working on the assumption that the reply is top-posted.
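Roughly, that match could look like the sketch below. It assumes English header labels and an optional "-----Original Message-----" separator; localized clients use other wording, so the patterns, the `split_top_post` name and the `min_headers` threshold are all assumptions to tune:

```python
import re

# Assumed English-language labels; other locales need their own list.
HEADER_LINE = re.compile(r"^(From|Sent|Date|To|Cc|Subject):", re.IGNORECASE)
ORIGINAL_MSG = re.compile(r"^-{2,}\s*Original Message\s*-{2,}$", re.IGNORECASE)


def split_top_post(text: str, min_headers: int = 3):
    """Return (reply, quoted), split at the first header-like block."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if ORIGINAL_MSG.match(line.strip()):
            return "\n".join(lines[:i]), "\n".join(lines[i:])
        if line.lower().startswith("from:"):
            # Count consecutive header-looking lines starting at "From:".
            run = 0
            for candidate in lines[i:i + 6]:
                if not HEADER_LINE.match(candidate):
                    break
                run += 1
            if run >= min_headers:
                return "\n".join(lines[:i]), "\n".join(lines[i:])
    # No header block found: treat the whole message as the reply.
    return text, ""
```

Requiring several header lines in a row keeps a stray sentence that happens to start with "From:" from being mistaken for a quoted message.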

I then looked at Thunderbird, and it uses "> " for text, and <blockquote> for HTML mails. I still haven't looked at what Apple Mail does, what Notes does, or what any of the other millions of mail clients out there do.

Will I be writing a special-case regexp for every single client out there? Or is there something I'm missing?

Any suggestions, sample code or pointers to third party libraries much appreciated!

Darren asked Feb 11 '09 06:02



2 Answers

It'll be pretty hard to duplicate the way Gmail does it, since it doesn't seem to care whether a piece was quoted or not. Like Zac says, it just seems to care about the diff.

It's actually pretty hard to get this right 100% of the time. Plain-text email is "lossy"; it's entirely possible for you to send

> Here is my long line that is over 74 chars (email line length limit)

Which can get encoded as something like

> Here is my long line that is over 74 chars (email=
 line length limit)

And then is decoded as

> Here is my long line that is over 74 chars (email
line length limit)

Making it indistinguishable from an inline reply.

This is email, so variations abound. Email usually line-wraps at something like 74 characters (the exact width depends on the client), and encoding schemes can differ. It's a real PITA. If you can access the HTML version, you will probably have better luck looking for quote tags and the like. Another idea would be to parse both the plain-text and HTML versions to try to determine the boundaries.

Additionally, it's best to just plan for client-specific hacks. They all construct MIME messages differently, both in structure and header content.

Edit: I say this with the experience of writing an email processing system, as well as seeing several people try to do the -exact- thing you're doing. It only ever got "ok" results.
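For the HTML side, here is a rough sketch of "looking for quote tags" using only Python's standard-library `html.parser`. It assumes the client marks quotes with `<blockquote>`, which is not universal; the `QuoteSplitter` class and `split_html_quotes` function are illustrative names:

```python
from html.parser import HTMLParser


class QuoteSplitter(HTMLParser):
    """Collect text inside and outside <blockquote> elements separately."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current blockquote nesting level
        self.reply = []     # text outside any blockquote
        self.quoted = []    # text inside at least one blockquote

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        (self.quoted if self.depth else self.reply).append(data)


def split_html_quotes(markup: str):
    """Return (reply_text, quoted_text) for an HTML message body."""
    parser = QuoteSplitter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.reply).strip(), "".join(parser.quoted).strip()
```

Clients that quote without `<blockquote>` would still need their own handling, which is where the client-specific hacks come in.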

Richard Levasseur answered Sep 21 '22 11:09



From what I can tell, Gmail does not bother about prefixed lines or section headings, except to ignore them. If text lines appeared earlier in the thread and then reappear, they are considered quoted. Thus, for example, if you send multiple messages and don't change your signature, the signature is considered quoted. If you've already dealt with the '>' prefix, a simple diff should do most of the rest. No need to get fancy.
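A sketch of that idea, with the caveat that it is a guess at the behaviour described and not Gmail's actual algorithm. It uses set membership on normalized lines instead of a real diff, and the `normalize` and `mark_quoted` helpers are invented for illustration:

```python
def normalize(line: str) -> str:
    """Drop leading '>' prefixes and collapse whitespace and case."""
    return " ".join(line.lstrip("> \t").split()).lower()


def mark_quoted(message: str, earlier_messages: list) -> list:
    """Return (is_quoted, line) pairs for each line of `message`.

    A line counts as quoted if the same normalized text already
    appeared in any earlier message of the thread.
    """
    seen = {
        normalize(line)
        for earlier in earlier_messages
        for line in earlier.splitlines()
    }
    seen.discard("")  # blank lines should never count as quoted
    return [(normalize(line) in seen, line) for line in message.splitlines()]
```

For example, with an earlier message `"Can we meet Tuesday?\n--\nDarren"`, the reply line `"> Can we meet Tuesday?"` is flagged as quoted, while a new line such as `"Tuesday works."` is not. Lines that were re-wrapped in transit will not match exactly, so a fuzzier comparison may be needed in practice.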

Zac Thompson answered Sep 22 '22 11:09
