I have a UTF-8 encoded XML file which is emailed as an attachment. When the email recipient opens the email and saves the attachment, the XML file is no longer UTF-8 (it's instead reporting ANSI encoding). In this instance, the recipient used Microsoft Outlook, if it matters.
I am programming this in an environment where I cannot rely on the availability of suitable MIME libraries, so I need to understand where I am going wrong.
Before emailing the XML file, after creating it on the server, I can see using the Linux file command that it's a UTF-8 file. Separately, the XML also has a version header <?xml version="1.0" encoding="UTF-8"?> (which isn't really relevant to my problem, but I'm including it for completeness). I'm pretty sure that my code which emails the file is the problem here, but I'm uncertain as to the "right" way to do this.
The headers I'm sending are:
"Mime-Version" "1.0"
"Content-Type" "multipart/mixed; boundary="__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___"\n\n"
The body of the email is:
--__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___\n
Content-Type: text/plain; charset="utf-8"; format=flowed\n
Content-Transfer-Encoding: 7bit\n\n
Please find attached the data file generated
--__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___\n
Content-Type: text/plain; charset="utf-8"\n
Content-Disposition: attachment; filename="My_File_Name"\n\n
XML FILE CONTENTS GO HERE
--__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___--\n
Questions:
Should I be using quoted-printable, 8bit or some other type of Content-Transfer-Encoding here? I have tried all of them, but it hasn't changed the result.
Is Content-Type: text/plain correct for an XML attachment?
Content transfer encoding defines encoding methods for transforming binary email message data into the US-ASCII plain text format. This transformation allows the message to travel through older SMTP messaging servers that only support messages in US-ASCII text. Content transfer encoding is defined in RFC 2045.
XML encoding refers to how the Unicode characters of a document are serialized as bytes. When an XML processor reads the document, it decodes those bytes using the character encoding declared in the 'encoding' attribute of the XML declaration.
MIME encoding methods: the Content-Transfer-Encoding header specifies how a MIME message or body part has been encoded, so that it can be decoded by its recipient.
By specifying text/plain you basically surrender control to the remote client's text-handling abilities, which are apparently limited in this particular case. XML is Unicode by spec, so by choosing a better content-type, you are more likely to succeed. Try text/xml or application/xml instead, or even the completely opaque application/octet-stream, which should only allow the recipient to save it on disk in byte-for-byte identical form.
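As a rough illustration of the suggestion above, here is a minimal sketch of an attachment part assembled by hand with application/xml as the content type. The function name build_xml_attachment_part, the file name data.xml and the boundary string are hypothetical, and base64 is just one possible transfer encoding for the payload; only the standard base64 module is used, no MIME library.

```python
import base64


def build_xml_attachment_part(xml_bytes: bytes, filename: str, boundary: str) -> str:
    """Return one MIME body part (CRLF line endings) carrying xml_bytes as an attachment."""
    # base64.encodebytes wraps its output at 76 characters per line using "\n";
    # convert those line breaks to CRLF, which is what SMTP expects.
    encoded = base64.encodebytes(xml_bytes).decode("ascii").replace("\n", "\r\n")
    headers = [
        f"--{boundary}",
        'Content-Type: application/xml; charset="utf-8"',
        "Content-Transfer-Encoding: base64",
        f'Content-Disposition: attachment; filename="{filename}"',
        "",  # the blank line that separates part headers from the part body
    ]
    return "\r\n".join(headers) + "\r\n" + encoded


# Hypothetical usage: the XML is encoded to UTF-8 bytes before it is attached.
xml = '<?xml version="1.0" encoding="UTF-8"?><g>hëllö wörld</g>'.encode("utf-8")
part = build_xml_attachment_part(xml, "data.xml", "BOUNDARY")
print(part)
```

The closing boundary line (the boundary followed by two hyphens) still has to be appended after the last part, as in the original message.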
The content transfer encoding should not affect this behavior at all, but since you seem to be unclear on its significance, here is a brief discussion.
The content-transfer-encoding is completely transparent; it will not affect what is delivered or what the remote client can do with it. Which content transfer encoding to choose depends on the nature of your data and the capabilities of the email system which it needs to be transported through. If it's not 8-bit clean, you need a 7-bit CTE to encapsulate it into. If the content has lines which are too long to fit into SMTP, it needs to be encapsulated into something with shorter lines. But the remote client will extract whatever is inside the encapsulation at the other end. Use whatever circumstances dictate.
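The decision described above can be sketched roughly as follows. This is only a heuristic under stated assumptions: the function name choose_cte and the 20% threshold for preferring quoted-printable over base64 are invented for illustration, and the sketch assumes the transport may not be 8-bit clean, so it never returns 8bit or binary.

```python
MAX_LINE = 998  # SMTP line length limit in octets, not counting the CRLF


def choose_cte(data: bytes, qp_threshold: float = 0.2) -> str:
    """Pick a Content-Transfer-Encoding label for data (hypothetical heuristic)."""
    lines = data.split(b"\n")
    longest = max(len(line.rstrip(b"\r")) for line in lines)
    non_ascii = sum(1 for byte in data if byte > 0x7F)

    # Pure ASCII, no NUL bytes, short lines: can be sent as-is.
    if non_ascii == 0 and longest <= MAX_LINE and b"\0" not in data:
        return "7bit"

    # Otherwise the data must be encapsulated. Mostly-ASCII data stays
    # fairly readable and compact as quoted-printable; anything else
    # goes to base64. The threshold is an arbitrary assumption.
    if non_ascii / max(len(data), 1) <= qp_threshold:
        return "quoted-printable"
    return "base64"


print(choose_cte(b"<a>plain ascii</a>\r\n"))
print(choose_cte('<?xml version="1.0" encoding="UTF-8"?><g>hëllö wörld</g>'.encode("utf-8")))
```

Whatever label is chosen, the body must then actually be encoded that way; the label on its own does nothing.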
There is a hierarchy of content transfer encodings for different circumstances:
7bit is appropriate if your data is completely 7-bit ASCII and has no lines longer than 998 characters (not counting the terminating CRLF). Then it can survive even a crude old SMTP transfer without modification. In the absence of any explicit Content-Transfer-Encoding: header, this is the default according to the standards (although you frequently see stuff with 8-bit data in it without an explicit CTE, or even with an explicit 7bit declaration).
8bit relaxes the requirement for the data to be 7-bit clean. If all systems which transport this message support the ESMTP 8BITMIME extension, this should be fine for data with restricted line lengths.
binary additionally allows for unlimited line length. In theory, you should be able to use this to pass through unrestricted content, but in practice, this seems to trigger glitches when systems don't strictly adhere to specifications. A typical symptom is that overlong lines are truncated or folded in transit, violating the integrity of the payload. To avoid problems like that (and to better adhere to the letter and the spirit of the standards for interoperability) you're better off with one of the following.
base64 accepts unrestricted content, but encodes it in a format which meets strict requirements for restricted line length and a severely constrained 7-bit character repertoire. It expands the payload to a bit more than 4/3 of the original size. Example:
ugqcA7R5cPq667vNaSifRUH9HsW00NqZ1gwICk0pNrUkXFpNIFOpbf3o
5ml8cqqSygkp8KBgPbHrqnDXvZTEBOkNo7ThE+BAvexa75Tm0Ebo/Yjl
y697pMp1+dnSlk3YTqxkPI9vqpple13dXLHlvnFDmSi0gqIMSwo7kUFD
SivAWhyCBR6tFO3lY1Pk6lz78+zgL28VthI72kVRkrWWtzoFef/4u5Ip
GR00CtsNNEJo01GAQGpkTNFT9U9Q/UI9CMGgaI9E9RkMaTDTQICBEyaE
woSCQOrNGA==
quoted-printable similarly accepts arbitrary content, but encodes each byte that needs escaping as three characters (an equals sign followed by two hexadecimal digits), so those bytes grow to 3x their original size. When most of the input is ASCII, this is a tolerable amount of overhead. In other words, this is suitable for roughly textual formats with occasional non-ASCII content, such as text in many Western languages using an 8-bit encoding, or formats like HTML where the ASCII markup dominates over the actual content, in pretty much any language. Example:
<?xml version=3D"1.0" encoding=3D"UTF-8"?>h=C3=ABll=C3=B6 =
w=C3=B6rld
Quoted printable is not hard to implement at all, and would seem suitable for your scenario.
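To give an idea of how little code is involved, here is a minimal hand-written sketch of a quoted-printable encoder following the rules in RFC 2045. The function name qp_encode is hypothetical, and the sketch assumes the input is text: LF and CRLF in the input are both emitted as CRLF line breaks rather than being escaped.

```python
def qp_encode(data: bytes, max_len: int = 76) -> str:
    """Encode text-like bytes as quoted-printable with CRLF line endings."""
    out_lines = []
    for raw in data.split(b"\n"):
        if raw.endswith(b"\r"):
            raw = raw[:-1]  # treat CRLF in the input the same as LF

        # Turn each byte into either a literal character or an =XX escape.
        tokens = []
        for i, byte in enumerate(raw):
            is_last = i == len(raw) - 1
            if byte == 0x3D or (byte < 0x20 and byte != 0x09) or byte > 0x7E:
                tokens.append(f"={byte:02X}")
            elif byte in (0x20, 0x09) and is_last:
                # Whitespace at the end of a line must be escaped.
                tokens.append(f"={byte:02X}")
            else:
                tokens.append(chr(byte))

        # Fold long lines with a soft line break ("=" at end of line),
        # never splitting an =XX escape across two lines.
        line = ""
        for tok in tokens:
            if len(line) + len(tok) > max_len - 1:  # keep one column for "="
                out_lines.append(line + "=")
                line = ""
            line += tok
        out_lines.append(line)
    return "\r\n".join(out_lines)


sample = '<?xml version="1.0" encoding="UTF-8"?>hëllö wörld'.encode("utf-8")
print(qp_encode(sample))
```

The soft line breaks are removed again by the recipient's decoder, so where they fall does not affect the decoded content.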
All of this is codified in the MIME RFCs 2045 through 2049; the content transfer encodings themselves are specified in RFC 2045. Wikipedia has nice readable articles about e.g. base64 and quoted-printable.
It's not clear from your description whether you just declared your content to be quoted-printable, or actually encoded it. I've seen people do the former and act surprised when it didn't work, but hope you did the latter. Just a cautionary tale.