
How to remove the quoted text from an email and only show the new text

I am parsing emails. When I see a reply to an email, I would like to remove the quoted text so that I can append only the new text to the previous email (even if it's a reply itself).

Typically, you'll see this:

1st email (start of conversation)

This is the first email

2nd email (reply to first)

This is the second email

Tim said:
This is the first email

The output for the second email should be "This is the second email" only. Different email clients quote text differently, so a method that recovers mostly just the new text would also be acceptable.

asked Mar 05 '10 08:03 by Tim



2 Answers

I use the following regexes to match the lead-in for quoted text (the last one, QUOTED_TEXT_BEGINNING, is the one that counts; the others are building blocks for it):

  // These fields belong inside a class; the file needs: import java.util.regex.Pattern;

  /** general spacers for time and date */
  private static final String spacers = "[\\s,/\\.\\-]";

  /** matches times */
  private static final String timePattern  = "(?:[0-2])?[0-9]:[0-5][0-9](?::[0-5][0-9])?(?:(?:\\s)?[AP]M)?";

  /** matches day of the week */
  private static final String dayPattern   = "(?:(?:Mon(?:day)?)|(?:Tue(?:sday)?)|(?:Wed(?:nesday)?)|(?:Thu(?:rsday)?)|(?:Fri(?:day)?)|(?:Sat(?:urday)?)|(?:Sun(?:day)?))";

  /** matches day of the month (number and st, nd, rd, th) */
  private static final String dayOfMonthPattern = "[0-3]?[0-9]" + spacers + "*(?:(?:th)|(?:st)|(?:nd)|(?:rd))?";

  /** matches months (numeric and text) */
  private static final String monthPattern = "(?:(?:Jan(?:uary)?)|(?:Feb(?:ruary)?)|(?:Mar(?:ch)?)|(?:Apr(?:il)?)|(?:May)|(?:Jun(?:e)?)|(?:Jul(?:y)?)" +
                                              "|(?:Aug(?:ust)?)|(?:Sep(?:tember)?)|(?:Oct(?:ober)?)|(?:Nov(?:ember)?)|(?:Dec(?:ember)?)|(?:[0-1]?[0-9]))";

  /** matches years (only 1000's and 2000's, because we are matching emails) */
  private static final String yearPattern  = "(?:[1-2]?[0-9])[0-9][0-9]";

  /** matches a full date */
  private static final String datePattern     = "(?:" + dayPattern + spacers + "+)?(?:(?:" + dayOfMonthPattern + spacers + "+" + monthPattern + ")|" +
                                                "(?:" + monthPattern + spacers + "+" + dayOfMonthPattern + "))" +
                                                 spacers + "+" + yearPattern;

  /** matches a date and time combo (in either order) */
  private static final String dateTimePattern = "(?:" + datePattern + "[\\s,]*(?:(?:at)|(?:@))?\\s*" + timePattern + ")|" +
                                                "(?:" + timePattern + "[\\s,]*(?:on)?\\s*"+ datePattern + ")";

  /** matches a leading line such as
   * ----Original Message----
   * or simply
   * ------------------------
   */
  private static final String leadInLine    = "-+\\s*(?:Original(?:\\sMessage)?)?\\s*-+\n";

  /** matches a header line indicating the date */
  private static final String dateLine    = "(?:(?:date)|(?:sent)|(?:time)):\\s*"+ dateTimePattern + ".*\n";

  /** matches a subject or address line */
  private static final String subjectOrAddressLine    = "((?:from)|(?:subject)|(?:b?cc)|(?:to)):.*\n";

  /** matches gmail style quoted text beginning, i.e.
   * On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:
   */
  private static final String gmailQuotedTextBeginning = "(On\\s+" + dateTimePattern + ".*wrote:\n)";


  /** matches the start of a quoted section of an email */
  private static final Pattern QUOTED_TEXT_BEGINNING = Pattern.compile("(?i)(?:(?:" + leadInLine + ")?" +
                                                                        "(?:(?:" +subjectOrAddressLine + ")|(?:" + dateLine + ")){2,6})|(?:" +
                                                                        gmailQuotedTextBeginning + ")"
                                                                      );

I know that in some ways this is overkill (and might be slow!) but it works pretty well. Please let me know if you find anything that doesn't match this so I can improve it!
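As a rough illustration of how a lead-in pattern like this can be applied, the sketch below cuts a reply at the first match and keeps only the text above it. The class name `QuoteStripper`, the method `stripQuoted`, and the much-simplified `LEAD_IN` pattern are assumptions made for this example; they stand in for QUOTED_TEXT_BEGINNING rather than reproduce it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: cuts a reply body at the first quoted-text lead-in. */
final class QuoteStripper {

    // Simplified stand-in for QUOTED_TEXT_BEGINNING (an assumption, not the
    // full pattern above). It matches a whole line that is either an
    // "-----Original Message-----" separator or a Gmail-style "On ... wrote:".
    // (?i) = case-insensitive, (?m) = ^ and $ match at line boundaries.
    private static final Pattern LEAD_IN = Pattern.compile(
            "(?im)^(?:-+[ \\t]*Original(?:[ \\t]+Message)?[ \\t]*-+"
            + "|On[ \\t].+wrote:)[ \\t]*$");

    /** Returns the text above the first lead-in line, or the whole body if none is found. */
    static String stripQuoted(String body) {
        Matcher m = LEAD_IN.matcher(body);
        return m.find() ? body.substring(0, m.start()).trim() : body.trim();
    }

    public static void main(String[] args) {
        String reply = "This is the second email\n\n"
                + "On Mon Jun 7, 2010 at 8:50 PM, Tim wrote:\n"
                + "> This is the first email\n";
        System.out.println(stripQuoted(reply));
    }
}
```

Everything from the first matching line onward is discarded, so a false positive (for example, a new sentence that happens to start with "On" and end with "wrote:") would cut off new text. Swapping in the full pattern only requires replacing `LEAD_IN`.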

answered Oct 03 '22 19:10 by smurthas


Check out the Google patent on this: http://www.google.com/patents/US7222299

In summary, they hash portions of the text (presumably something like sentences) and then look for matching hashes in the previous messages. This is very fast, and they probably use it as input to the threading algorithm too. What a great idea!
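The hashing idea can be sketched roughly as follows. This is not the patent's actual algorithm, only an illustration of the summary above, and it assumes the earlier messages of the thread are available. The class `HashedQuoteFilter`, its method names, the choice of lines as the hashed unit, and the use of `String.hashCode()` are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: drops reply lines whose hash was already seen in earlier messages. */
final class HashedQuoteFilter {

    // Strips leading '>' quote markers, collapses whitespace and lowercases,
    // so that a quoted copy of a line hashes the same as the original line.
    static String normalize(String line) {
        return line.replaceFirst("^[>\\s]+", "")
                   .replaceAll("\\s+", " ")
                   .trim()
                   .toLowerCase(Locale.ROOT);
    }

    // Hashes every non-empty line of the earlier messages.
    // String.hashCode() is 32-bit, so rare collisions are possible.
    static Set<Integer> hashLines(List<String> previousMessages) {
        Set<Integer> seen = new HashSet<>();
        for (String message : previousMessages) {
            for (String line : message.split("\\R")) {
                String key = normalize(line);
                if (!key.isEmpty()) {
                    seen.add(key.hashCode());
                }
            }
        }
        return seen;
    }

    // Keeps only the reply lines that do not match a previously seen hash.
    static String newTextOnly(String reply, List<String> previousMessages) {
        Set<Integer> seen = hashLines(previousMessages);
        List<String> kept = new ArrayList<>();
        for (String line : reply.split("\\R")) {
            String key = normalize(line);
            if (key.isEmpty() || !seen.contains(key.hashCode())) {
                kept.add(line);
            }
        }
        return String.join("\n", kept).trim();
    }
}
```

Calling `newTextOnly` with the second email from the question and a list containing the first email removes the quoted line but leaves the attribution line ("Tim said:"), because that line never appeared in an earlier message. In practice this approach would probably be combined with a lead-in pattern to catch such lines.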

answered Oct 03 '22 20:10 by adam.lofts