While trying to address this issue, I'm attempting to wrap my head around the various functions in the Python standard library aimed at supporting RFC 2231. The main aim of that RFC appears to be three-fold: allowing non-ASCII encoding in header parameters, noting the language of a given value, and allowing header parameters to span multiple lines. The email.utils module provides several functions to deal with various aspects of this. As far as I can tell, they work as follows:
decode_rfc2231 only splits the value of such a parameter into its parts (charset, language, and the still percent-encoded text), like this:
>>> import email.utils
>>> email.utils.decode_rfc2231("utf-8''T%C3%A4st.txt")
['utf-8', '', 'T%C3%A4st.txt']
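To make the splitting behaviour a bit more concrete, here is a small sketch of the neighbouring cases, assuming Python 3. It shows a value without apostrophes, a value with a language tag, the percent-decoding that is left to the caller, and the reverse direction via email.utils.encode_rfc2231. The variable names are only illustrative.

```python
import email.utils
from urllib.parse import unquote

# A value without the two apostrophes cannot be split, so charset and
# language come back as None and the text is returned unchanged.
plain = email.utils.decode_rfc2231("plain.txt")

# A value with a language tag between the apostrophes.
charset, language, text = email.utils.decode_rfc2231("utf-8'en'T%C3%A4st.txt")

# decode_rfc2231 does not percent-decode; that step is up to the caller.
decoded = unquote(text, encoding=charset)

# encode_rfc2231 goes the other way and builds charset'language'text.
encoded = email.utils.encode_rfc2231("Täst.txt", charset="utf-8", language="en")

print(plain, (charset, language, text), decoded, encoded)
```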
decode_params takes care of detecting RFC 2231-encoded parameters. It collects the parts that belong together and also decodes the URL-encoded string to a byte sequence. This byte sequence, however, is then turned back into a string by decoding it as latin1, so each character stands for one raw byte rather than for the intended character. All values are enclosed in quotation marks. Furthermore, there is some special handling for the first element of the list, which still has to be a tuple of two elements, but is passed to the result without modification.
>>> email.utils.decode_params([
... (1,2),
... ("foo","bar"),
... ("name*","utf-8''T%C3%A4st.txt"),
... ("baz*0","two"),("baz*1","-part")])
[(1, 2), ('foo', '"bar"'), ('baz', '"two-part"'), ('name', ('utf-8', '', '"TÃ¤st.txt"'))]
collapse_rfc2231_value can be used to convert this triple of encoding, language and byte sequence into a proper unicode string. What has me confused, though, is the fact that if the input was such a triple, then the quotes will be carried over to the output. If, on the other hand, the input was a single quoted string, then these quotes will be removed.
>>> [(k, email.utils.collapse_rfc2231_value(v)) for k, v in
... email.utils.decode_params([
... (1,2),
... ("foo","bar"),
... ("name*","utf-8''T%C3%A4st.txt"),
... ("baz*0","two"),("baz*1","-part")])[1:]]
[('foo', 'bar'), ('baz', 'two-part'), ('name', '"Täst.txt"')]
So it seems that in order to use all this machinery, I'd have to add yet another step to unquote the third element of any tuple I'd encounter. Is this true, or am I missing some point here? I had to figure out a lot of the above with help from the source code, since the docs are a bit vague on the details. I cannot imagine what could be the point behind this selective unquoting. Is there a point to it?
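For illustration, the extra unquoting step described above could be sketched as follows, assuming Python 3. Note that collapse_param is a hypothetical helper name of my own and not part of email.utils; it only combines email.utils.unquote with collapse_rfc2231_value.

```python
import email.utils


def collapse_param(value):
    """Hypothetical helper: collapse a decode_params value to an unquoted str."""
    if isinstance(value, tuple) and len(value) == 3:
        charset, language, text = value
        # Strip the quotes from the third element before collapsing,
        # because collapse_rfc2231_value leaves a triple's text as it is.
        value = (charset, language, email.utils.unquote(text))
    # Plain strings are unquoted by collapse_rfc2231_value itself.
    return email.utils.collapse_rfc2231_value(value)


params = email.utils.decode_params([
    (1, 2),
    ("foo", "bar"),
    ("name*", "utf-8''T%C3%A4st.txt"),
    ("baz*0", "two"), ("baz*1", "-part"),
])
# Skip the first element, which decode_params passes through untouched.
result = {k: collapse_param(v) for k, v in params[1:]}
print(result)
```

With this wrapper both kinds of values come out without surrounding quotes, at the cost of special-casing the triple.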
What is the best reference on how to use these functions?
The best I found so far is the email.message.Message implementation. There, the process seems to be roughly the one outlined above, but every field gets unquoted via _unquotevalue after the decode_params call, and only get_filename and get_boundary collapse their values; all others return a tuple instead. I hope there is something more useful.
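The behaviour of email.message.Message described above can be observed with a small hand-written message. This is a sketch assuming Python 3 and the default (compat32) policy used by email.message_from_string; the header content is made up for the example.

```python
import email
import email.utils

# A made-up message with an RFC 2231-encoded filename parameter.
raw = (
    "Content-Type: application/octet-stream\n"
    "Content-Disposition: attachment; filename*=utf-8''T%C3%A4st.txt\n"
    "\n"
    "body\n"
)
msg = email.message_from_string(raw)

# get_param unquotes the value but returns the (charset, language, text)
# triple, with the text still holding one character per raw byte.
param = msg.get_param("filename", header="content-disposition")

# get_filename additionally collapses the triple into a decoded str.
filename = msg.get_filename()

# Collapsing the get_param result by hand yields the same string.
collapsed = email.utils.collapse_rfc2231_value(param)

print(param, filename, collapsed)
```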
Currently the functions from email.utils are rarely used outside of email.message. Most users seem to prefer using email.message.Message directly. There's even a somewhat old issue report on adding unit tests (which would certainly be usable as examples) to Python, although I'm not sure how it relates to email.utils.
A short example I found is this blog post which, however, doesn't contain more than one sentence and a few SLOCs of information about RFC 2231 parsing. The author notes, however, that many MTAs use RFC 2047 instead. Depending on your use case, that might also be an issue.
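If RFC 2047 encoded words do show up in parameter values, they need a different decoder than the RFC 2231 functions. A minimal sketch, assuming Python 3 and a made-up encoded filename, using email.header.decode_header and make_header:

```python
from email.header import decode_header, make_header

# A made-up RFC 2047 "encoded word", as some senders put into filename="...".
raw_value = "=?utf-8?Q?T=C3=A4st.txt?="

# decode_header splits the value into (bytes, charset) chunks ...
chunks = decode_header(raw_value)

# ... and make_header joins them back into a Header that converts to str.
filename = str(make_header(chunks))

print(chunks, filename)
```

This only covers the encoded-word syntax; it does not handle the name*0 / name* continuations of RFC 2231.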
Judging from the few examples I could find, I assume your way of parsing using email.utils is the only way to go, even if the long list comprehension is somewhat ugly.
Given the lack of examples, it could be wise to write a new RFC 2231 parser (if you really need a better, maybe faster or more beautiful codebase). A new implementation could be based on existing implementations like the Dovecot RFC 2231 parser for compatibility reasons (you could even use the Dovecot unit test). As the C code seems quite complex to me, and since I can't find any Python implementation besides email.utils and Python 2 backports of email.utils, the task of porting to Python won't be easy (note that Dovecot is LGPL-licensed, which might be an issue in your project).
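To give an idea of the scope of such a parser, here is a rough standalone sketch in Python 3. It is not a port of the Dovecot code and has not been checked against its tests; the function name decode_rfc2231_params, the regular expression and the error handling are all my own assumptions, and many edge cases of the RFC are ignored.

```python
import re
from urllib.parse import unquote_to_bytes

# name, optional "*N" section number, optional trailing "*" for extended values
_PARAM_NAME = re.compile(r"^(?P<name>[^*]+)(?:\*(?P<num>\d+))?(?P<ext>\*)?$")


def decode_rfc2231_params(pairs, fallback_charset="us-ascii"):
    """Hypothetical helper: map parameter names to decoded, unquoted strings."""
    sections = {}
    for raw_name, value in pairs:
        match = _PARAM_NAME.match(raw_name)
        if match is None:
            continue  # sketch: silently skip names we cannot parse
        if len(value) > 1 and value[0] == value[-1] == '"':
            value = value[1:-1]  # drop surrounding quotes
        num = int(match.group("num") or 0)
        extended = match.group("ext") is not None
        sections.setdefault(match.group("name").lower(), []).append(
            (num, value, extended))

    result = {}
    for name, parts in sections.items():
        parts.sort(key=lambda part: part[0])  # order continuations by number
        if not any(extended for _, _, extended in parts):
            result[name] = "".join(value for _, value, _ in parts)
            continue
        charset = None
        chunks = []
        for num, value, extended in parts:
            if extended and num == 0:
                # Only the first section carries charset'language'.
                pieces = value.split("'", 2)
                if len(pieces) == 3:
                    charset, _language, value = pieces
            if extended:
                chunks.append(unquote_to_bytes(value))
            else:
                chunks.append(value.encode("ascii", "replace"))
        try:
            result[name] = b"".join(chunks).decode(
                charset or fallback_charset, "replace")
        except LookupError:
            # Unknown charset: fall back to a lossless byte-to-char mapping.
            result[name] = b"".join(chunks).decode("latin-1")
    return result


decoded = decode_rfc2231_params([
    ("foo", "bar"),
    ("name*", "utf-8''T%C3%A4st.txt"),
    ("baz*0", "two"), ("baz*1", "-part"),
])
print(decoded)
```

Unlike decode_params, this returns plain strings directly and drops the language tag, which may or may not be what you want.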
I think the email.utils RFC 2231 API has not been designed for easy standalone usage, but more as a pile of utility functions for use in email.message.Message.