Extract the email message itself from its prior messages and metadata (SendGrid Parse API/PHP)?

I'm using SendGrid and its Parse API to send and receive email. The Parse API lets a web app receive email as a $_POST, but the problem is that I want to extract the message itself from the prior messages and metadata that get chained along in that $_POST.

To show you what I mean: in the following picture, I'd just like to capture the text "trying sending from 12373 to 12373 from GMAIL" and not all the junk below it. If that is not possible, does anyone have suggestions on how to parse the email body ($_POST['text']) so that I can separate out the message itself?

The problem I see is that, depending on the email client (Gmail, Outlook, etc.), it's not clear to me that the date information, in this case "On Wed, Jan 23, 2013...", will always follow the message itself. If all email clients put the date beneath the message, then it would seem I could design a fancy regex to look for a line break followed by a date or something. Thoughts?

**Entire** Message body containing prior messages

tim peterson asked Feb 17 '13 00:02



2 Answers

You have a couple of options:

1) Insert a token that splits the emails

You could insert a line such as "--- reply above this line ---" in the emails you send, and then discard everything from that token onward when a reply comes in.
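
A minimal sketch of the token approach, assuming your outgoing emails contain the marker text. The constant REPLY_MARKER and the function extractReplyAboveMarker() are hypothetical names chosen for illustration, not part of SendGrid or any library:

```php
<?php
// Hypothetical marker text; use whatever string your outgoing emails embed.
const REPLY_MARKER = '--- reply above this line ---';

/**
 * Return the part of an email body that precedes the reply marker.
 * If the marker is missing, the whole (trimmed) body is returned.
 */
function extractReplyAboveMarker(string $body, string $marker = REPLY_MARKER): string
{
    // Normalise line endings so the line handling below is consistent.
    $body = str_replace(["\r\n", "\r"], "\n", $body);

    $pos = strpos($body, $marker);
    if ($pos === false) {
        return trim($body);
    }

    // Keep only the text before the marker.
    $above = substr($body, 0, $pos);

    // Drop the partial line that holds the marker (e.g. a "> " quote prefix).
    $lineStart = strrpos($above, "\n");
    $above = $lineStart === false ? '' : substr($above, 0, $lineStart);

    return trim($above);
}
```

In the webhook handler this could be called as `$message = extractReplyAboveMarker($_POST['text'] ?? '');`. Most clients put an attribution line such as "On Jan 23...wrote:" just above the quoted marker, so that line can still be left at the end of the result and may need separate handling.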

2) Use an email reply parsing library

GitHub wrote a really good one, but it's in Ruby. There is a PHP port, EmailReplyParser (the library loaded in the code below), that might be good for what you need.

Fully working code:

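<?php
  // Load the EmailReplyParser library (adjust the path to where you unpacked it).
  require_once 'application/third_party/EmailReplyParser-master/src/autoload.php';

  $email = new \EmailReplyParser\Email();
  // Split the posted plain-text body into fragments; the first one is the text at the top of the email.
  $reply = $email->read($_POST['text'] ?? '');
  $message = $reply[0]->getContent();

  // Needed for some email clients, e.g., some university e-mails: [email protected] but not Gmail or Hotmail,
  // to get rid of "On Jan 23...wrote:". Anchoring at a line start and using \b keeps words like "only" from matching.
  // This failure to remove "On Jan 23...wrote:" is a known issue and is documented in their README.
  $message = preg_replace('~^On\b.*?wrote:.*$~msi', '', $message);
?>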
Swift answered Oct 13 '22 00:10


There's simply no guaranteed way to parse quoted message threads out of an email message, so you won't find a regex or any other code that works in all cases. No standard defines how replies are formatted, and as you've already observed, different mail clients use different conventions. Many, in fact, allow the user to edit the quoted text. Users can also paste in unrelated messages, with or without headers, resulting in a mix-and-match of formats.
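
To illustrate how much the conventions differ, here is a best-effort sketch that cuts the body at the earliest of a few common reply markers. The function name stripQuotedHistory() and the pattern list are assumptions made for this example; they are neither exhaustive nor reliable, which is exactly the limitation described above:

```php
<?php
/**
 * Best-effort removal of quoted history from a plain-text email body.
 * The patterns below cover only a few common conventions (an assumption
 * of this sketch), so expect both misses and false positives.
 */
function stripQuotedHistory(string $body): string
{
    // Normalise line endings so "^" and "$" behave the same for every client.
    $body = str_replace(["\r\n", "\r"], "\n", $body);

    $markers = [
        // "On <date>, <sender> wrote:", possibly wrapped onto a second line.
        '/^On\b[^\n]*(?:\n[^\n]*)?\bwrote:[ \t]*$/m',
        // "-----Original Message-----" style separator.
        '/^-{2,}[ \t]*Original Message[ \t]*-{2,}[ \t]*$/mi',
        // Start of a copied header block ("From: ...").
        '/^From:[ \t].+$/m',
        // First line quoted with ">".
        '/^>/m',
    ];

    // Cut at the earliest marker found, if any.
    $cut = strlen($body);
    foreach ($markers as $pattern) {
        if (preg_match($pattern, $body, $m, PREG_OFFSET_CAPTURE)) {
            $cut = min($cut, $m[0][1]);
        }
    }

    return rtrim(substr($body, 0, $cut));
}
```

A user who starts a line of their own message with "From: " or ">" will have that text cut off, so a heuristic like this should be treated as a fallback rather than a solution.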

If you can record and keep the history of all messages as they are sent and received, then you can (usually, but not always) use the In-Reply-To header (see RFC 5322) to locate the previous message by matching its Message-ID header, then do a diff on the bodies and remove duplicate text runs. It's apparent that some email systems do this to improve their presentation, but I'm not aware of any available open source code.
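
The header-matching idea could be sketched as follows. This is a rough illustration, not an existing library: the helper names and the $store array (previous message bodies keyed by Message-ID) are hypothetical, and a line-by-line set difference stands in for a real diff. It assumes the raw headers arrive as one string, for example in the `headers` field of the Parse API POST:

```php
<?php
/** Return one header value from a raw header block, or null if it is absent. */
function headerValue(string $rawHeaders, string $name): ?string
{
    // Unfold headers: a line break followed by whitespace continues the previous line.
    $unfolded = preg_replace('/\r?\n[ \t]+/', ' ', $rawHeaders);
    if (preg_match('/^' . preg_quote($name, '/') . ':[ \t]*(.*)$/mi', $unfolded, $m)) {
        return trim($m[1]);
    }
    return null;
}

/** Normalise a body line for comparison: drop ">" quote prefixes and outer whitespace. */
function normaliseLine(string $line): string
{
    return trim(preg_replace('/^(?:[ \t]*>)+[ \t]?/', '', $line));
}

/** Remove from $replyBody every non-empty line that also appears in $parentBody. */
function removeQuotedLines(string $replyBody, string $parentBody): string
{
    $parentLines = [];
    foreach (preg_split('/\r\n|\r|\n/', $parentBody) as $line) {
        $key = normaliseLine($line);
        if ($key !== '') {
            $parentLines[$key] = true;
        }
    }

    $kept = [];
    foreach (preg_split('/\r\n|\r|\n/', $replyBody) as $line) {
        if (!isset($parentLines[normaliseLine($line)])) {
            $kept[] = $line;
        }
    }
    return trim(implode("\n", $kept));
}

/**
 * Look up the parent message via In-Reply-To and subtract its text.
 * $store is a hypothetical map of Message-ID => stored body.
 */
function extractNewText(string $rawHeaders, string $text, array $store): string
{
    $parentId = headerValue($rawHeaders, 'In-Reply-To');
    if ($parentId === null || !isset($store[$parentId])) {
        return trim($text); // no known parent: nothing to subtract
    }
    return removeQuotedLines($text, $store[$parentId]);
}
```

A call might look like `extractNewText($_POST['headers'] ?? '', $_POST['text'] ?? '', $store)`. Attribution lines such as "On Jan 23...wrote:" are not part of the parent body, so they survive this step, and any line the user retyped verbatim from the earlier message is removed along with the quotes.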

Richard Schwartz answered Oct 12 '22 23:10