How do I get just the text content from a multipart email?

Question

    #!/usr/bin/php -q
    <?php
    $savefile = "savehere.txt";
    $sf = fopen($savefile, 'a') or die("can't open file");
    ob_start();

    // read from stdin
    $fd = fopen("php://stdin", "r");
    $email = "";
    while (!feof($fd)) {
        $email .= fread($fd, 1024);
    }
    fclose($fd);
    // handle email
    $lines = explode("
", $email);

    // empty vars
    $from = "";
    $subject = "";
    $headers = "";
    $message = "";
    $splittingheaders = true;

    for ($i=0; $i < count($lines); $i++) {
        if ($splittingheaders) {
            // this is a header
            $headers .= $lines[$i]."
";

            // look out for special headers
            if (preg_match("/^Subject: (.*)/", $lines[$i], $matches)) {
                $subject = $matches[1];
            }
            if (preg_match("/^From: (.*)/", $lines[$i], $matches)) {
                $from = $matches[1];
            }
            if (preg_match("/^To: (.*)/", $lines[$i], $matches)) {
                $to = $matches[1];
            }
        } else {
            // not a header, but message
            $message .= $lines[$i]."
";




        }

        if (trim($lines[$i])=="") {
            // empty line, header section has ended
            $splittingheaders = false;
        }
    }
/*$headers is ONLY included in the result at the last section of my question here*/
    fwrite($sf,"$message");
    ob_end_clean();
    fclose($sf);
    ?>

That is an example of my attempt. The problem is I am getting too much in the file. Here is what is being written to the file: (I just sent a bunch of garbage to it as you can see)

From xxxxxxxxxxxxx Tue Sep 07 16:26:51 2010
Received: from xxxxxxxxxxxxxxx ([xxxxxxxxxxx]:3184 helo=xxxxxxxxxxx)
    by xxxxxxxxxxxxx with esmtpa (Exim 4.69)
    (envelope-from <xxxxxxxxxxxxxxxx>)
    id 1Ot4kj-000115-SP
    for xxxxxxxxxxxxxxxxxxx; Tue, 07 Sep 2010 16:26:50 -0400
Message-ID: <EE3B7E26298140BE8700D9AE77CB339D@xxxxxxxxxxx>
From: "xxxxxxxxxxxxx" <xxxxxxxxxxxxxx>
To: <xxxxxxxxxxxxxxxxxxxxx>
Subject: stackoverflow is helping me
Date: Tue, 7 Sep 2010 16:26:46 -0400
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="----=_NextPart_000_0169_01CB4EA9.773DF5E0"
X-Priority: 3
X-MSMail-Priority: Normal
Importance: Normal
X-Mailer: Microsoft Windows Live Mail 14.0.8089.726
X-MIMEOLE: Produced By Microsoft MimeOLE V14.0.8089.726

This is a multi-part message in MIME format.

------=_NextPart_000_0169_01CB4EA9.773DF5E0
Content-Type: text/plain;
    charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

111
222
333
444
------=_NextPart_000_0169_01CB4EA9.773DF5E0
Content-Type: text/html;
    charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content=3Dtext/html;charset=3Diso-8859-1 =
http-equiv=3DContent-Type>
<META name=3DGENERATOR content=3D"MSHTML 8.00.6001.18939"></HEAD>
<BODY style=3D"PADDING-LEFT: 10px; PADDING-RIGHT: 10px; PADDING-TOP: =
15px"=20
id=3DMailContainerBody leftMargin=3D0 topMargin=3D0 =
CanvasTabStop=3D"true"=20
name=3D"Compose message area">
<DIV><FONT face=3DCalibri>111</FONT></DIV>
<DIV><FONT face=3DCalibri>222</FONT></DIV>
<DIV><FONT face=3DCalibri>333</FONT></DIV>
<DIV><FONT face=3DCalibri>444</FONT></DIV></BODY></HTML>

------=_NextPart_000_0169_01CB4EA9.773DF5E0--

I found this while searching around but have no idea how to implement or where to insert in my code or if it would work.

preg_match("/boundary=\".*?\"/i", $headers, $boundary);
$boundaryfulltext = $boundary[0];

if ($boundaryfulltext!="")
{
$find = array("/boundary=\"/i", "/\"/i");
$boundarytext = preg_replace($find, "", $boundaryfulltext);
$splitmessage = explode("--" . $boundarytext, $message);
$fullmessage = ltrim($splitmessage[1]);
preg_match('/

(.*)/is', $fullmessage, $splitmore);

if (substr(ltrim($splitmore[0]), 0, 2)=="--")
{
$actualmessage = $splitmore[0];
}
else
{
$actualmessage = ltrim($splitmore[0]);
}

}
else
{
$actualmessage = ltrim($message);
}

$clean = array("/
--.*/is", "/=3D
.*/s");
$cleanmessage = trim(preg_replace($clean, "", $actualmessage));

So, how can I get just the plain text area of the email into my file or script for furthr handling??

Thanks in advance. stackoverflow is great!

Daniel Vandersluis · Accepted Answer

There are four steps that you will have to take in order to isolate the plain text part of your email body:

1. Get the MIME boundary string

We can use a regular expression to search your headers (let's assume they're in a separate variable, $headers):

$matches = array();
preg_match('#Content-Type: multipart/[^;]+;\s*boundary="([^"]+)"#i', $headers, $matches);
list(, $boundary) = $matches;

The regular expression will search for the Content-Type header that contains the boundary string, and then capture it into the first capture group. We then copy that capture group into variable $boundary.

2. Split the email body into segments

Once we have the boundary, we can split the body into its various parts (in your message body, the body will be prefaced by -- each time it appears). According to the MIME spec, everything before the first boundary should be ignored.

$email_segments = explode('--' . $boundary, $message);
array_shift($email_segments); // drop everything before the first boundary

This will leave us with an array containing all the segments, with everything before the first boundary ignored.

3. Determine which segment is plain text.

The segment that is plain text will have a Content-Type header with the MIME-type text/plain. We can now search each segment for the first segment with that header:

foreach ($email_segments as $segment)
{
  if (stristr($segment, "Content-Type: text/plain") !== false)
  {
    // We found the segment we're looking for!
  }
}

Since what we're looking for is a constant, we can use stristr (which finds the first instance of a substring in a string, case insensitively) instead of a regular expression. If the Content-Type header is found, we've got our segment.

4. Remove any headers from the segment

Now we need to remove any headers from the segment we found, as we only want the actual message content. There are four MIME headers that can appear here: Content-Type as we saw before, Content-ID, Content-Disposition and Content-Transfer-Encoding. Headers are terminated by so we can use that to determine the end of the headers:

$text = preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?
/is', "", $segment);

The s modifier at the end of the regular expression makes the dot match any newlines. .*? will collect as few characters as possible (ie. everything up to ); the ? is a lazy modifier on .*.

And after this point, $text will contain your email message content.

So to put it all together with your code:

<?php
// read from stdin
$fd = fopen("php://stdin", "r");
$email = "";
while (!feof($fd))
{
    $email .= fread($fd, 1024);
}
fclose($fd);

$matches = array();
preg_match('#Content-Type: multipart/[^;]+;\s*boundary="([^"]+)"#i', $email, $matches);
list(, $boundary) = $matches;

$text = "";
if (isset($boundary) && !empty($boundary)) // did we find a boundary?
{
  $email_segments = explode('--' . $boundary, $email);

  foreach ($email_segments as $segment)
  {
    if (stristr($segment, "Content-Type: text/plain") !== false)
    {
      $text = trim(preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?
/is', "", $segment));
      break;
    }
  }
}

// At this point, $text will either contain your plain text body,
// or be an empty string if a plain text body couldn't be found.

$savefile = "savehere.txt";
$sf = fopen($savefile, 'a') or die("can't open file");
fwrite($sf, $text);
fclose($sf);
?>

How do I get just the text content from a multipart email?

Tags:

php

email-parsing

Jimbo

1 Answers

Daniel Vandersluis

Recent Activity

Donate For Us

How do I get just the text content from a multipart email?

Tags:

php

email-parsing

Jimbo

1 Answers

Daniel Vandersluis

Related questions

Recent Activity

Donate For Us