
Confused about Content-Transfer-Encoding when emailing an XML file as an attachment

Tags: email, xml, encoding, mime, utf-8

I have a UTF-8 encoded XML file which is emailed as an attachment. When the email recipient opens the email and saves the attachment, the XML file is no longer UTF-8 (it's instead reporting ANSI encoding). In this instance, the recipient used Microsoft Outlook, if it matters.

I am programming this in an environment where I cannot rely on the availability of suitable MIME libraries, so I need to understand where I am going wrong.

Before emailing the XML file, after creating it on the server, I can see using the Linux file command that it's a UTF-8 file. Separate to this, the XML also has a version header <?xml version="1.0" encoding="UTF-8"?> (which isn't really relevant to my problem, but I'm including it for completeness). I'm pretty sure that my code which emails the file is the problem here, but I'm uncertain as to the "right" way to do this.

The headers I'm sending are:

    "Mime-Version" "1.0"
    "Content-Type" "multipart/mixed; boundary="__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___"\n\n"

The body of the email is:

    --__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___\n
    Content-Type: text/plain; charset="utf-8"; format=flowed\n
    Content-Transfer-Encoding: 7bit\n\n
    Please find attached the data file generated 
    --__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___\n
    Content-Type: text/plain; charset="utf-8"\n
    Content-Disposition: attachment; filename="My_File_Name"\n\n
    XML FILE CONTENTS GO HERE
    --__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___--\n

Questions:

Should I be using quoted-printable, 8bit or some other type of Content-Transfer-Encoding here? I have tried all of them, but it hasn't changed the result.
Is Content-Type: text/plain correct for an XML attachment?
Any other suggestions?


asked Jan 25 '16 18:01

Leroy

1 Answer

By specifying text/plain, you hand control to the receiving client's text handling, which is apparently limited in this particular case. XML is Unicode by specification, so a more specific content type is more likely to succeed. Try text/xml or application/xml instead, or even the completely opaque application/octet-stream, which should leave the recipient no option but to save the file to disk in byte-for-byte identical form.

The content transfer encoding should not affect this behavior at all, but since you seem to be unclear on its significance, here is a brief discussion.

The content transfer encoding is completely transparent; it does not affect what is delivered or what the remote client can do with it. Which one to choose depends on the nature of your data and the capabilities of the email systems it has to travel through. If the transport is not 8-bit clean, you need a 7-bit CTE to encapsulate the data in. If the content has lines which are too long for SMTP, it needs to be encapsulated in something with shorter lines. Either way, the remote client extracts whatever is inside the encapsulation at the other end. Use whatever the circumstances dictate.
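To make the transparency point concrete, here is a minimal sketch using Python's standard `base64` and `quopri` modules (an illustration only; the asker's environment may not have them). The sample payload is made up. Both wire forms are plain 7-bit ASCII, and decoding either one gives back exactly the original bytes.

```python
import base64
import quopri

# Hypothetical sample payload: UTF-8 XML with a few non-ASCII characters.
original = '<?xml version="1.0" encoding="UTF-8"?><g>h\u00ebll\u00f6 w\u00f6rld</g>'.encode("utf-8")

# Two different encapsulations of the same bytes.
as_base64 = base64.encodebytes(original)
as_qp = quopri.encodestring(original)

# Both wire forms contain only 7-bit ASCII...
assert all(b < 128 for b in as_base64)
assert all(b < 128 for b in as_qp)

# ...and both decode to exactly the bytes that went in.
assert base64.decodebytes(as_base64) == original
assert quopri.decodestring(as_qp) == original
```

Whatever the recipient saves should therefore be identical regardless of which encoding carried it.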

There is a hierarchy of content transfer encodings for different circumstances:

7bit is appropriate if your data is completely 7-bit ASCII and has no lines longer than 998 characters (not counting the terminating CRLF). Then it can survive even a crude old SMTP transfer without modification. In the absence of any explicit Content-Transfer-Encoding: header, this is the default according to the standards (although you frequently see messages with 8-bit data in them without an explicit CTE, or even with an explicit 7bit declaration).
8bit relaxes the requirement for the data to be 7-bit clean. If all systems which transport this message support the ESMTP 8BITMIME extension, this should be fine for data with restricted line lengths.
binary additionally allows for unlimited line length. In theory, you should be able to use this to pass through unrestricted content, but in practice, this seems to trigger glitches when systems don't strictly adhere to specifications. A typical symptom is that overlong lines are truncated or folded in transit, violating the integrity of the payload. To avoid problems like that (and to better adhere to the letter and the spirit of the standards for interoperability) you're better off with one of the following.
base64 accepts unrestricted content, but encodes it in a format which meets strict requirements for restricted line length and a severely constrained 7-bit character repertoire. It expands the payload to a bit more than 4/3 of the original size. Example:

    ugqcA7R5cPq667vNaSifRUH9HsW00NqZ1gwICk0pNrUkXFpNIFOpbf3o
    5ml8cqqSygkp8KBgPbHrqnDXvZTEBOkNo7ThE+BAvexa75Tm0Ebo/Yjl
    y697pMp1+dnSlk3YTqxkPI9vqpple13dXLHlvnFDmSi0gqIMSwo7kUFD
    SivAWhyCBR6tFO3lY1Pk6lz78+zgL28VthI72kVRkrWWtzoFef/4u5Ip
    GR00CtsNNEJo01GAQGpkTNFT9U9Q/UI9CMGgaI9E9RkMaTDTQICBEyaE
    woSCQOrNGA==
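A block like the one above can be produced with a few lines of code. This is a sketch, assuming Python's standard `base64` module is available; the function name `base64_body` and the sample payload are made up for illustration. It wraps the output at 76 characters per line, the limit MIME sets for encoded lines.

```python
import base64

def base64_body(payload: bytes, width: int = 76) -> str:
    """Return payload as base64 text, wrapped to `width` chars, CRLF line ends."""
    encoded = base64.b64encode(payload).decode("ascii")
    lines = [encoded[i:i + width] for i in range(0, len(encoded), width)]
    return "\r\n".join(lines) + "\r\n"

# Hypothetical payload, repeated so the output spans several lines.
payload = "<g>h\u00ebll\u00f6 w\u00f6rld</g>".encode("utf-8") * 20
body = base64_body(payload)

# Every 3 input bytes become 4 output characters, plus the line breaks.
print(len(payload), len(body))
```

The "a bit more than 4/3" comes from the padding at the end and the CRLF added after every 76 characters.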

quoted-printable similarly accepts arbitrary content, but encodes each byte outside a safe subset of ASCII as three characters (an equals sign followed by two hex digits), tripling the size of those bytes. When most of the input is ASCII, this is a tolerable amount of overhead. In other words, it is suitable for roughly textual formats with occasional non-ASCII content, such as text in many Western languages using an 8-bit encoding, or formats like HTML where the ASCII markup dominates over the actual content, in pretty much any language. Example:

    <?xml version=3D"1.0" encoding=3D"UTF-8"?>h=C3=ABll=C3=B6 =
    w=C3=B6rld
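The hierarchy above can be sketched as a small decision function. This is only an illustration: the name `choose_cte`, the `transport_is_8bit_clean` flag and the size heuristic are assumptions, not anything the standards prescribe. The heuristic is a rough estimate that quoted-printable adds two characters per non-ASCII byte while base64 adds about a third overall, so quoted-printable is smaller when fewer than roughly one byte in six is non-ASCII.

```python
MAX_LINE = 998  # SMTP line length limit, not counting CRLF

def choose_cte(payload: bytes, transport_is_8bit_clean: bool = False) -> str:
    """Pick the least invasive Content-Transfer-Encoding for payload (a sketch)."""
    lines = payload.split(b"\n")
    short_lines = all(len(line.rstrip(b"\r")) <= MAX_LINE for line in lines)
    ascii_only = all(b < 128 for b in payload)
    has_nul = b"\x00" in payload

    if ascii_only and short_lines and not has_nul:
        return "7bit"
    if transport_is_8bit_clean and short_lines and not has_nul:
        return "8bit"

    # Rough size heuristic (an assumption): ignores '=' and control
    # characters, which quoted-printable also has to escape.
    non_ascii = sum(1 for b in payload if b >= 128)
    if non_ascii * 6 < len(payload):
        return "quoted-printable"
    return "base64"

xml = '<?xml version="1.0"?><g>h\u00ebll\u00f6</g>'.encode("utf-8")
print(choose_cte(xml))
```

For a mostly-ASCII XML file with a handful of accented characters, this lands on quoted-printable unless every hop is known to support 8BITMIME.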

Quoted printable is not hard to implement at all, and would seem suitable for your scenario.
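As a rough illustration of how little is involved, here is a hand-rolled encoder in Python that uses no MIME library. It is a sketch under some assumptions: the function name `qp_encode` is made up, the input is treated as text whose line breaks are LF or CRLF (they are emitted as CRLF), and it follows the basic rules as I understand them from RFC 2045: printable ASCII except `=` passes through, a space or tab is escaped only at the end of a line, everything else becomes `=XX`, and lines are kept to 76 characters with a trailing `=` as a soft line break.

```python
def qp_encode(text: bytes, width: int = 76) -> bytes:
    """Quoted-printable encode text; hard line breaks come out as CRLF."""
    out_lines = []
    for line in text.replace(b"\r\n", b"\n").split(b"\n"):
        tokens = []
        for i, byte in enumerate(line):
            at_end = i == len(line) - 1
            printable = 33 <= byte <= 126 and byte != 0x3D  # 0x3D is '='
            inner_blank = byte in (0x20, 0x09) and not at_end
            if printable or inner_blank:
                tokens.append(chr(byte))
            else:
                tokens.append("=%02X" % byte)
        # Pack tokens into physical lines, leaving one column for the
        # soft line break marker '='. Tokens are never split.
        current = ""
        for token in tokens:
            if len(current) + len(token) > width - 1:
                out_lines.append(current + "=")
                current = ""
            current += token
        out_lines.append(current)
    return "\r\n".join(out_lines).encode("ascii")

print(qp_encode("h\u00ebll\u00f6 w\u00f6rld".encode("utf-8")))
```

A decoder is the mirror image: drop every `=` that is immediately followed by a line break, then turn each remaining `=XX` back into one byte.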

All of this is codified in the MIME RFCs 2045 through 2049; the content transfer encodings themselves are defined in RFC 2045. Wikipedia has nice readable articles about e.g. base64 and quoted-printable.

It's not clear from your description whether you just declared your content to be quoted-printable, or actually encoded it. I've seen people do the former and act surprised when it didn't work, but I hope you did the latter. Just a cautionary tale.
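To illustrate the difference between declaring and encoding, here is a sketch that builds one attachment part where the header and the body agree. The function name `xml_attachment_part` and the filename are hypothetical, the filename is assumed to be plain ASCII without quotes, and base64 is used only because it keeps the example short; the same principle applies to quoted-printable.

```python
import base64

def xml_attachment_part(xml_bytes: bytes, filename: str) -> str:
    """Build the headers and body of one MIME part (without boundary lines)."""
    # Actually apply the encoding that the header below declares.
    encoded = base64.b64encode(xml_bytes).decode("ascii")
    wrapped = "\r\n".join(encoded[i:i + 76] for i in range(0, len(encoded), 76))
    headers = [
        'Content-Type: application/xml; name="%s"' % filename,
        "Content-Transfer-Encoding: base64",
        'Content-Disposition: attachment; filename="%s"' % filename,
    ]
    # A blank line separates the part headers from the part body.
    return "\r\n".join(headers) + "\r\n\r\n" + wrapped + "\r\n"

xml = '<?xml version="1.0" encoding="UTF-8"?><g>h\u00ebll\u00f6</g>'.encode("utf-8")
part = xml_attachment_part(xml, "data.xml")
print(part)
```

The returned text goes between two boundary lines of the multipart/mixed message, in place of the raw XML.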

answered Oct 19 '22 20:10

tripleee
