
parsing raw email in php

Tags:

php

email

I'm looking for good/working/simple to use PHP code for parsing raw email into parts.

I've written a couple of brute force solutions, but every time, one small change/header/space/something comes along and my whole parser fails and the project falls apart.

And before I get pointed at PEAR/PECL, I need actual code. My host has some screwy config or something, I can never seem to get the .so's to build right. If I do get the .so made, some difference in path/environment/php.ini doesn't always make it available (apache vs cron vs CLI).

Oh, and one last thing, I'm parsing the raw email text, NOT POP3, and NOT IMAP. It's being piped into the PHP script via a .qmail email redirect.

I'm not expecting SOF to write it for me, I'm looking for some tips/starting points on doing it "right". This is one of those "wheel" problems that I know has already been solved.

Uberfuzzy asked Aug 15 '08 23:08


2 Answers

What are you hoping to end up with at the end? The body, the subject, the sender, an attachment? You should spend some time with RFC 2822 to understand the format of the mail, but here are the simplest rules for well-formed email:

    HEADERS\n
    \n
    BODY

That is, the first blank line (double newline) is the separator between the HEADERS and the BODY. A HEADER looks like this:

HSTRING:HTEXT 

HSTRING always starts at the beginning of a line and doesn't contain any whitespace or colons. HTEXT can contain a wide variety of text, including newlines, as long as each newline is followed by whitespace (a continuation line).

The "BODY" is really just any data that follows the first double newline. (There are different rules if you are transmitting mail via SMTP, but when you are processing it over a pipe you don't have to worry about that.)

So, in really simple, circa-1982 RFC822 terms, an email looks like this:

    HEADER: HEADER TEXT
    HEADER: MORE HEADER TEXT
      INCLUDING A LINE CONTINUATION
    HEADER: LAST HEADER

    THIS IS ANY ARBITRARY DATA (FOR THE MOST PART)
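To make those rules concrete, here is a minimal sketch in plain PHP (no extensions needed). It only covers the simple RFC 822 view described above: split at the first blank line, unfold continuation lines, and split each header at its first colon. The function name parseRawEmail and the sample message are made up for illustration; they don't come from any library.

```php
<?php
// Hypothetical helper, not from any library: split a raw message into
// a list of headers and a body string, following the simple rules above.
function parseRawEmail(string $raw): array
{
    // Normalise line endings so "\r\n" and "\n" input behave the same.
    $raw = str_replace(["\r\n", "\r"], "\n", $raw);

    // The first blank line separates the headers from the body.
    $parts = explode("\n\n", $raw, 2);
    $headerBlock = $parts[0];
    $body = $parts[1] ?? '';

    // Unfold continuation lines: a newline followed by whitespace
    // belongs to the previous header.
    $headerBlock = preg_replace('/\n[ \t]+/', ' ', $headerBlock);

    $headers = [];
    foreach (explode("\n", $headerBlock) as $line) {
        $pos = strpos($line, ':');
        if ($pos === false) {
            continue; // no colon, so not a header line; skipped in this sketch
        }
        $name = strtolower(trim(substr($line, 0, $pos)));
        $value = trim(substr($line, $pos + 1));
        // Headers such as Received can repeat, so keep a list per name.
        $headers[$name][] = $value;
    }

    return ['headers' => $headers, 'body' => $body];
}

// Made-up sample message for illustration.
$raw = "From: alice@example.com\n"
     . "Subject: A long subject\n"
     . "  that continues here\n"
     . "To: bob@example.com\n"
     . "\n"
     . "Hello Bob.\n";

$mail = parseRawEmail($raw);
echo $mail['headers']['subject'][0], "\n";
echo $mail['body'];
```

For mail piped in from a .qmail redirect, the raw text could be read with file_get_contents('php://stdin') and passed to the same function. Header names are lowercased here because they are case-insensitive; the values are left undecoded.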

Most modern email is more complex than that, though. Headers can be encoded for charsets as RFC 2047 MIME words, or a ton of other stuff I'm not thinking of right now. The bodies are really hard to roll your own code for these days, too, if you want them to be meaningful. Almost all email that's generated by an MUA will be MIME encoded. That might be quoted-printable text, it might be HTML, it might be a base64-encoded Excel spreadsheet.
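As a rough illustration of that extra layer, the sketch below decodes an RFC 2047 header value and undoes a body's Content-Transfer-Encoding. The helper names decodeHeaderValue and decodeBody are hypothetical. It assumes the iconv extension is loaded for iconv_mime_decode (it usually is by default, but that is an assumption about your host), and it deliberately ignores multipart messages, which need boundary splitting first.

```php
<?php
// Hypothetical helper: decode RFC 2047 "encoded words" in a header value.
// Assumes the iconv extension; falls back to the raw value without it.
function decodeHeaderValue(string $value): string
{
    if (!function_exists('iconv_mime_decode')) {
        return $value; // iconv not available: leave the header as-is
    }
    $decoded = iconv_mime_decode($value, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, 'UTF-8');
    return $decoded === false ? $value : $decoded;
}

// Hypothetical helper: undo the Content-Transfer-Encoding of a single,
// non-multipart body. Both decoders used here are core PHP functions.
function decodeBody(string $body, string $transferEncoding): string
{
    switch (strtolower(trim($transferEncoding))) {
        case 'base64':
            $decoded = base64_decode($body);
            return $decoded === false ? $body : $decoded;
        case 'quoted-printable':
            return quoted_printable_decode($body);
        default:
            // 7bit, 8bit, binary or unknown: return unchanged.
            return $body;
    }
}

echo decodeHeaderValue('=?UTF-8?B?SGVsbG8gV29ybGQ=?='), "\n";
echo decodeBody('SGVsbG8gV29ybGQ=', 'base64'), "\n";
```

The transfer encoding would come from the message's Content-Transfer-Encoding header. Anything beyond a single-part message (attachments, multipart/alternative, nested parts) is where a tested parser earns its keep over hand-rolled code.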

I hope this helps provide a framework for understanding some of the very elemental buckets of email. If you provide more background on what you are trying to do with the data I (or someone else) might be able to provide better direction.

jj33 answered Sep 21 '22 06:09


Try the Plancake PHP Email parser: https://github.com/plancake/official-library-php-email-parser

I have used it for my projects. It works great, it is just one class, and it is open source.

Dan answered Sep 22 '22 06:09