
Convert inline specified UTF-8 mail subject

I want to convert the following raw mail subject to normal UTF-8 text:

=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=

The real text for that is:

Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet

My first approach to convert this:

$mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?=  =?utf-8?Q?eldet?=';
mb_internal_encoding("UTF-8");
echo mb_decode_mimeheader($mime);

This gives me the following result:

Schuker_hat_sich_vom_Übungsabend_(01.01.2012)_abgemeldet

(Questions here: What am I doing wrong? Why do those underscores occur?)

My second approach to convert this:

$mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?=  =?utf-8?Q?eldet?=';
echo imap_utf8($mime);

This gives me the following (correct) result:

Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet

Why does this work? Which method should I rely on?

The reason I ask is that I previously asked another question about decoding mail subjects, where mb_decode_mimeheader was the solution, whereas here imap_utf8 seems to be the way to go. How can I make sure that both of these examples are decoded correctly:

=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=

and

=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?= =?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=

These should give the expected results:

Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet

and

Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"

hbit asked Feb 19 '12 16:02


4 Answers

Based on hbit's answer, I've improved the imapUtf8() function so that it converts each part of the subject to UTF-8 using that part's charset information. The result looks like this:

function imapUtf8($str){
    $convStr = '';
    $subLines = preg_split('/[\r\n]+/', $str); // split multi-line subjects
    foreach ($subLines as $subLine) {
        $convLine = '';
        // split the line into parts, each tagged with its charset
        $linePartArr = imap_mime_header_decode($subLine);
        foreach ($linePartArr as $part) {
            if ($part->charset === 'default') {
                // unencoded part: skip the lone space that separates two encoded parts
                if ($part->text != " ") {
                    $convLine .= $part->text;
                }
            } else {
                // encoded part: convert from its declared charset to UTF-8
                $convLine .= iconv($part->charset, 'UTF-8', $part->text);
            }
        }
        $convStr .= $convLine;
    }

    return $convStr;
}
Gabriel Gcia Fdez answered Oct 06 '22 07:10


This function works for both examples:

function imapUtf8($str){
    $convStr = '';
    $subLines = preg_split('/[\r\n]+/',$str); // split multi-line subjects
    for($i=0; $i < count($subLines); $i++){ // go through lines
        $convLine = '';
        $linePartArr = imap_mime_header_decode(trim($subLines[$i])); // split and decode by charset
        for($j=0; $j < count($linePartArr); $j++){
            $convLine .= ($linePartArr[$j]->text); // append sub-parts of line together
        }
        $convStr .= $convLine; // append to whole subject
    }
    return $convStr; // return converted subject
} 

Tests:

$sub1 = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?=  =?utf-8?Q?eldet?=';
$sub2 = '=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?= =?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=';
echo imapUtf8($sub1);
echo imapUtf8($sub2);

Result:

Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet

Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"

hbit answered Oct 06 '22 09:10


This behaviour is also mentioned in the user comments on the manual page for mb_decode_mimeheader, and I assume it is a bug. I found no matching report in the bug database, so I'd file it as a new one.

However, AFAIK imap_mime_header_decode will cope with both your encodings without a problem, so that will keep your code going.
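
To make it clearer what a correct decoder has to do with both subjects, here is a minimal hand-written sketch that does not use the imap extension. The function name decodeMimeSubject is made up for this illustration, and the sketch assumes well-formed encoded-words; it only converts non-UTF-8 charsets when mbstring happens to be loaded. It is not a replacement for imap_mime_header_decode.

```php
<?php
// Hypothetical helper for illustration only: decodes RFC 2047 encoded-words
// ("=?charset?B?...?=" and "=?charset?Q?...?=") in a header value.
function decodeMimeSubject(string $header): string
{
    // Whitespace between two adjacent encoded-words only separates them
    // and is not part of the text, so drop it before decoding.
    $header = preg_replace('/(\?=)\s+(?==\?)/', '$1', $header) ?? $header;

    $decoded = preg_replace_callback(
        '/=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=/',
        function (array $m): string {
            [, $charset, $encoding, $payload] = $m;
            if (strtoupper($encoding) === 'B') {
                // "B" encoding is plain base64.
                $bytes = (string) base64_decode($payload);
            } else {
                // "Q" encoding: "_" means a space, "=XX" is a hex-escaped byte.
                $bytes = quoted_printable_decode(str_replace('_', ' ', $payload));
            }
            // Assumption: other charsets are converted only if mbstring is loaded
            // and knows the charset name; UTF-8 payloads are returned unchanged.
            if (strcasecmp($charset, 'UTF-8') !== 0 && function_exists('mb_convert_encoding')) {
                $bytes = mb_convert_encoding($bytes, 'UTF-8', $charset);
            }
            return $bytes;
        },
        $header
    );

    return $decoded ?? $header;
}

$sub1 = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?=  =?utf-8?Q?eldet?=';
$sub2 = '=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?= =?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=';
echo decodeMimeSubject($sub1), "\n";
echo decodeMimeSubject($sub2), "\n";
```

The first step is the important one for split subjects: without it, the separator between the two encoded-words would end up in the middle of "abgemeldet" and "Slovak".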

Wrikken answered Oct 06 '22 08:10


About the mysterious underscore in the Subject header field:

RFC2047 4.2(2) states explicitly:

The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be represented as "_" (underscore, ASCII 95.). (This character may not pass through some internetwork mail gateways, but its use will greatly enhance readability of "Q" encoded data with mail readers that do not support this encoding.) Note that the "_" always represents hexadecimal 20, even if the SPACE character occupies a different code position in the character set in use.

The encoding rules for the Subject line are documented in that same RFC 2047.
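
The underscore rule quoted above can be sketched in a few lines of PHP. decodeQPayload is a hypothetical helper name, not a built-in function, and it expects only the payload of a single "Q" encoded-word (the text between "?Q?" and "?="), not the whole header:

```php
<?php
// Hypothetical helper for illustration: decodes the payload of one
// RFC 2047 "Q" encoded-word.
function decodeQPayload(string $payload): string
{
    // Per RFC 2047 section 4.2(2), "_" always stands for byte 0x20 (space).
    $withSpaces = str_replace('_', ' ', $payload);
    // The remaining "=XX" escapes follow the quoted-printable rules.
    return quoted_printable_decode($withSpaces);
}

echo decodeQPayload('Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem'), "\n";
// prints: Schuker hat sich vom Übungsabend (01.01.2012) abgem
```

A decoder that handles the "=XX" escapes but skips the underscore step produces exactly the output shown in the question, with underscores left where the spaces should be.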

Jimm Chen answered Oct 06 '22 09:10