I want to convert the following raw mail subject to normal UTF-8 text:
=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=
The real text for that is:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
My first approach to convert this:
$mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
mb_internal_encoding("UTF-8");
echo mb_decode_mimeheader($mime);
This gives me the following result:
Schuker_hat_sich_vom_Übungsabend_(01.01.2012)_abgemeldet
(Questions here: What am I doing wrong? Why do those underscores occur?)
My second approach to convert this:
$mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
echo imap_utf8($mime);
This gives me the following (correct) result:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
Why does this work? Which method should I rely on?
The reason I ask is that I previously asked another question about decoding mail subjects, where mb_decode_mimeheader was the solution, whereas here imap_utf8 would be the way to go. How can I make sure that both of the following examples are decoded correctly?
=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=
and
=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?= =?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=
These should give me the expected results:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
and
Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"
Based on hbit's answer, I've improved the imapUtf8() function so that it converts the subject text to UTF-8 using the charset information of each part. The result looks like this:
function imapUtf8($str) {
    $convStr = '';
    $subLines = preg_split('/[\r\n]+/', $str); // split multi-line subjects
    for ($i = 0; $i < count($subLines); $i++) {
        $convLine = '';
        // split the line into parts, each carrying its own charset
        $linePartArr = imap_mime_header_decode($subLines[$i]);
        for ($j = 0; $j < count($linePartArr); $j++) {
            if ($linePartArr[$j]->charset === 'default') {
                // unencoded text: keep it as-is, but skip a part that is only a single space
                if ($linePartArr[$j]->text != " ") {
                    $convLine .= $linePartArr[$j]->text;
                }
            } else {
                // encoded text: convert from the declared charset to UTF-8
                $convLine .= iconv($linePartArr[$j]->charset, 'UTF-8', $linePartArr[$j]->text);
            }
        }
        $convStr .= $convLine;
    }
    return $convStr;
}
This function works for both examples:
function imapUtf8($str) {
    $convStr = '';
    $subLines = preg_split('/[\r\n]+/', $str); // split multi-line subjects
    for ($i = 0; $i < count($subLines); $i++) { // go through lines
        $convLine = '';
        $linePartArr = imap_mime_header_decode(trim($subLines[$i])); // split and decode by charset
        for ($j = 0; $j < count($linePartArr); $j++) {
            $convLine .= $linePartArr[$j]->text; // append sub-parts of line together
        }
        $convStr .= $convLine; // append to whole subject
    }
    return $convStr; // return converted subject
}
Tests:
$sub1 = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
$sub2 = '=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?= =?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=';
echo imapUtf8($sub1), "\n";
echo imapUtf8($sub2), "\n";
Result:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"
The underscore problem is also mentioned in the comments on the manual page for mb_decode_mimeheader, and I assume it is a bug. I found no matching report in the bug database, so I'd file it as a new one.
However, AFAIK imap_mime_header_decode will cope with both of your encodings without a problem, so that will keep your code going.
About the mysterious underscore in the Subject header field:
RFC 2047, section 4.2 (2), states explicitly:
The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be represented as "_" (underscore, ASCII 95.). (This character may not pass through some internetwork mail gateways, but its use will greatly enhance readability of "Q" encoded data with mail readers that do not support this encoding.) Note that the "_" always represents hexadecimal 20, even if the SPACE character occupies a different code position in the character set in use.
The encoding rules for the Subject line are documented in that same RFC 2047.
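To make the underscore rule concrete, here is a minimal sketch of a decoder in plain PHP that does not need the IMAP extension. It is only an illustration under some assumptions: decodeMimeWords() is a hypothetical helper name (not a PHP built-in), the mbstring extension is assumed to be loaded, and each encoded-word is assumed to contain only complete characters. For "Q" words it turns every "_" into a space before quoted-printable decoding, and it drops the whitespace between two adjacent encoded-words, which RFC 2047 says should be ignored when displaying.

```php
<?php
// Sketch only: decodeMimeWords() is a hypothetical helper, not part of PHP.
function decodeMimeWords(string $header): string
{
    // One encoded-word: =?charset?encoding?encoded-text?=
    $word = '=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=';

    // Whitespace that only separates two encoded-words is not part of the text.
    $header = preg_replace('/(' . $word . ')\s+(?=' . $word . ')/', '$1', $header);

    return preg_replace_callback('/' . $word . '/', function (array $m): string {
        [, $charset, $encoding, $text] = $m;
        if (strtoupper($encoding) === 'B') {
            // "B" encoding is plain base64.
            $bytes = base64_decode($text);
        } else {
            // "Q" encoding: "_" always means a space (0x20);
            // everything else is quoted-printable.
            $bytes = quoted_printable_decode(str_replace('_', ' ', $text));
        }
        // Convert from the declared charset to UTF-8 (assumes mbstring knows it).
        return mb_convert_encoding($bytes, 'UTF-8', $charset);
    }, $header);
}
```

Calling decodeMimeWords() on the first subject from the question should give the text with real spaces instead of underscores. Unlike imap_mime_header_decode, this sketch does no error handling for unknown charsets or malformed input, so treat it as a fallback idea rather than a replacement.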