 

Extracting mail's content

Tags: php, email, imap

I need to create an app that extracts the VAT numbers our clients send us for verification. Their e-mails contain nothing else. The goal is to build extended statistics.

What I need is the mail's body without any headers before the content I need, that is, the VAT number. As simple as that.

This is my script that lists the 30 most recent e-mails:

<?php
if (!function_exists('imap_open')) { die('No function'); }

if ($mbox = imap_open(<confidential>)) {
    $output = "";
    $messageCount = imap_num_msg($mbox);
    $x = 1;     
    for ($i = 0; $i < 30; $i++) {
        $message_id = ($messageCount - $i);
        $fetch_message = imap_header($mbox, $message_id);
        $mail_content = quoted_printable_decode(imap_fetchbody($mbox,$message_id, 1));
        // iconv() returns the converted string, so the result has to be assigned
        $mail_content = iconv(mb_detect_encoding($mail_content, mb_detect_order(), true), "UTF-8", $mail_content);

        $output .= "<tr>
        <td>".$x.".</td>
        <td>
            ".$fetch_message->from[0]->mailbox."@".$fetch_message->from[0]->host."
        </td>
        <td>
            ".$fetch_message->date."
        </td>
        <td>
            ".$fetch_message->subject."
        </td>
        <td>
            <textarea cols=\"40\">".$mail_content."</textarea>
        </td>
        </tr>";
        $x++;
    }
    $smarty->assign("enquiries", $output);
    $smarty->display("module_mail");
    imap_close($mbox);
} else {
    print_r(imap_errors());
}
?>

I've worked with imap_fetchbody, imap_header and so on to retrieve the desired content, but it turns out that most e-mails have something else (like headers) before the content, e.g.:

--=-Dbl2eWTUl0Km+Tj46Ww1
Content-Type: text/plain;

------=_NextPart_001_003A_01D14F7A.F25AB3D0
Content-Type: text/plain;

--=-ucRIRGamiKb0Ot1/AkNc
Content-Type: text/plain;

I need to get rid of everything that comes before the VAT number in the mail's message, but I don't know how. Some e-mails have these headers, some don't. And since we're working with clients from all over Europe, it really confuses me and leaves me powerless.

Another problem is that some clients just copy-paste VAT numbers from various websites and that means these VAT numbers are often pasted with the original style (bold/background/changed colour et cetera). That might be the reason for my PS below.

I would appreciate any help that leads me to solving this problem.

Thank you in advance.

PS. Just for the record: with imap_fetchbody($mbox, $message_id, 1) I need to use 1 to get the whole content. Changing 1 to anything else results in NO email content being displayed at all. Literally.

Sates asked Jan 15 '16 11:01



1 Answer

The parts of the email that you see as "noise" are simply part of the email's format.
In a way, it is like reading the HTML source of a web page.

All those bits are boundaries. They work like tags in HTML: each one opens a section, and a final one closes it.

So in your case:

Content-Type: multipart/alternative; boundary="=-Dbl2eWTUl0Km+Tj46Ww1" // defines the type of email structure and the boundary

--=-Dbl2eWTUl0Km+Tj46Ww1    // starts a section
Content-Type: text/plain;   // defines the type of content in the section
// your VAT number is presumably here

--=-Dbl2eWTUl0Km+Tj46Ww1--  // closes the last section
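
To make the structure above concrete, here is a minimal sketch that splits a raw multipart body on its boundary and returns the first text/plain section. The function name extractPlainTextPart is hypothetical. It assumes you already know the boundary string (it comes from the top-level Content-Type header) and that every section carries its own Content-Type header. It does not decode quoted-printable or base64, so treat it as an illustration of the format rather than a complete parser.

```php
<?php
// Hypothetical helper: return the body of the first text/plain section of a
// raw multipart body. $boundary is the boundary value WITHOUT the leading "--".
function extractPlainTextPart(string $rawBody, string $boundary): ?string
{
    // Normalise line endings so the header/body split below is predictable.
    $rawBody = str_replace("\r\n", "\n", $rawBody);

    // Every section follows a "--boundary" line; the last marker is "--boundary--".
    $chunks = explode('--' . $boundary, $rawBody);

    foreach ($chunks as $chunk) {
        $chunk = ltrim($chunk, "\n");

        // A blank line separates the section headers from the section body.
        $pieces = explode("\n\n", $chunk, 2);
        if (count($pieces) !== 2) {
            continue; // preamble, closing "--" marker, or a section without headers
        }

        [$headers, $body] = $pieces;
        if (stripos($headers, 'Content-Type: text/plain') !== false) {
            return trim($body);
        }
    }

    return null; // no text/plain section found
}

// Usage with a made-up message body (the VAT number is sample data).
$boundary = '=-Dbl2eWTUl0Km+Tj46Ww1';
$raw = "--$boundary\r\nContent-Type: text/plain;\r\n\r\nPL1234567890\r\n"
     . "--$boundary\r\nContent-Type: text/html;\r\n\r\n<b>PL1234567890</b>\r\n"
     . "--$boundary--\r\n";

echo extractPlainTextPart($raw, $boundary), "\n"; // prints PL1234567890
```

Messages that are not multipart have no boundary at all, which is why some of your e-mails show these lines and some do not.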

Possible solutions

You have at least two options:
write a custom parser yourself, or use the PECL extension Mailparse.

Writing a parser manually:

// explode() takes the delimiter first, then the string
$mail_lines = explode("\n", $mail_content);

foreach ($mail_lines as $key => $line) {
     // skip most of the headers
     if ($key < 5) {
         continue;
     }

     // skip boundary lines; they start with "--", so strpos() returns 0
     if (strpos($line, "--") === 0) {
        continue;
     }

     // skip Content-* header lines
     if (stripos($line, "Content") === 0) {
        continue;
     }

     // skip empty lines
     if (trim($line) === "") {
        continue;
     }

     ////////////////////////////////////////////////////
     // here you have to insert the logic for the parser
     // and extend the guard clauses
     ////////////////////////////////////////////////////
}
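
As one possible shape for the "logic for the parser" step, the sketch below pulls VAT-looking tokens out of text that has already had its headers removed. The function name findVatCandidates and the pattern are assumptions: the regex is a deliberately loose guess (two-letter country prefix, then 8 to 12 letters or digits, at least one of them a digit), not the official per-country VAT rules. It also strips HTML tags first, since you mention that some clients paste styled text.

```php
<?php
// Hypothetical helper: collect VAT-looking tokens from a message body.
// The pattern is an assumed, loose shape and NOT a real per-country validation.
function findVatCandidates(string $text): array
{
    // Drop pasted styling such as <b> or <span style="..."> and decode entities.
    $plain = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');

    // Compare in upper case so "pl1234567890" and "PL1234567890" are the same.
    $plain = strtoupper($plain);

    // Two letters, then 8-12 alphanumerics; the lookahead requires at least one
    // digit so that ordinary long words are not reported.
    preg_match_all('/\b[A-Z]{2}(?=[0-9A-Z]*[0-9])[0-9A-Z]{8,12}\b/', $plain, $matches);

    // Remove duplicates and reindex the result.
    return array_values(array_unique($matches[0]));
}

// Usage with made-up sample data.
print_r(findVatCandidates("Hello,\nour VAT number is pl1234567890.\nThanks"));
```

Anything this returns is only a candidate. If the numbers must be valid, a proper per-country check would still be needed afterwards.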

Mailparse:

Install Mailparse with sudo pecl install mailparse.

Extract the VAT:

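// $mail_content must hold the complete raw message (headers and body),
// otherwise the parser cannot see the boundary definition
$mail = mailparse_msg_create();
mailparse_msg_parse($mail, $mail_content);

// list of part identifiers, e.g. "1", "1.1", "1.2"
$struct = mailparse_msg_get_structure($mail);

foreach ($struct as $st) {
    $section = mailparse_msg_get_part($mail, $st);
    $info = mailparse_msg_get_part_data($section);

    // shows content type, charset, encoding and body offsets of each part
    print_r($info);
}

mailparse_msg_free($mail);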
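
Going one step further than printing the part data, the following sketch returns the decoded body of the first text/plain part. The function name firstPlainTextBody is hypothetical. It assumes the mailparse extension is loaded and that $rawMessage is the complete message source, which with the imap functions could be built as imap_fetchheader($mbox, $message_id) . imap_body($mbox, $message_id). If the extension is missing, the function simply returns null.

```php
<?php
// Hypothetical helper: decoded body of the first text/plain part, or null.
// $rawMessage must be the complete message source (headers and body).
function firstPlainTextBody(string $rawMessage): ?string
{
    if (!function_exists('mailparse_msg_create')) {
        return null; // mailparse extension is not loaded
    }

    $mail = mailparse_msg_create();
    mailparse_msg_parse($mail, $rawMessage);

    $result = null;
    foreach (mailparse_msg_get_structure($mail) as $partId) {
        $part = mailparse_msg_get_part($mail, $partId);
        $info = mailparse_msg_get_part_data($part);

        if (strcasecmp($info['content-type'] ?? '', 'text/plain') === 0) {
            // With null as the callback, the part body is returned as a string
            // with its transfer encoding (quoted-printable, base64) decoded.
            $result = trim(mailparse_msg_extract_part($part, $rawMessage, null));
            break;
        }
    }

    mailparse_msg_free($mail);
    return $result;
}
```

The returned text can then be searched for the VAT number without any boundary or Content-Type lines in the way.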
borracciaBlu answered Oct 14 '22 17:10