
Python - email header decoding UTF-8

Is there a Python module that helps decode the various forms of encoded mail headers, mainly Subject, into plain strings (say, UTF-8)?

Here are example Subject headers from mail files that I have:

    Subject: [ 201105311136 ]=?UTF-8?B?IMKnIDE2NSBBYnM=?=. 1 AO;
    Subject: [ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=
    Subject: [ 201105191633 ]   =?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=   =?UTF-8?B?Z2VuIGVpbmVzIFNlZW1hbm5z?=

text - encoded string - text

text - encoded string

text - encoded string - encoded string

The encoding could also be something else, such as ISO 8859-15.

Update 1: I forgot to mention, I tried email.header.decode_header

    for item in message.items():
        if item[0] == 'Subject':
            sub = email.header.decode_header(item[1])
            logging.debug('Subject is %s' % sub)

This outputs

    DEBUG:root:Subject is [('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]

which does not really help.

Update 2: Thanks to Ingmar Hupp in the comments.

The second example decodes to a list of two tuples:

    print decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
    [('[ 201105161048 ] GewSt:', None), (' Wegfall der Vorl\xc3\xa4ufigkeit', 'utf-8')]

Is the result always of the form [(string, encoding), (string, encoding), ...], so that I need a loop to concatenate all the [0] items into one string? Or is there another way to get it all in one string?

    Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011

does not decode well:

    print decode_header("""[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011""")

    [('[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011', None)]

Hans Moser asked Sep 07 '11 09:09


2 Answers

This type of encoding is known as MIME encoded-word and the email module can decode it:

    from email.header import decode_header
    print decode_header("""=?UTF-8?B?IERyZWltb25hdHNmcmlzdCBmw7xyIFZlcnBmbGVndW5nc21laHJhdWZ3ZW5kdW4=?=""")

This outputs a list of tuples, each containing a decoded string and the encoding used. The result is a list because the format allows different encodings within a single header. To merge the parts into a single string, convert them to a shared encoding and concatenate them, which can be done with Python's unicode object:

    from email.header import decode_header
    dh = decode_header("""[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?=""")
    default_charset = 'ASCII'
    print ''.join([ unicode(t[0], t[1] or default_charset) for t in dh ])
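The snippet above is Python 2 (print statement, unicode object). A minimal Python 3 sketch of the same join follows. The helper name decode_mime_header and the default_charset parameter are made up for illustration, and the isinstance check assumes that decode_header returns bytes for headers containing encoded words and a plain str otherwise.

```python
from email.header import decode_header


def decode_mime_header(value, default_charset="ascii"):
    """Decode an RFC 2047 header value into one str (hypothetical helper)."""
    parts = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            # Parts of a header that contains encoded words arrive as bytes.
            parts.append(text.decode(charset or default_charset, errors="replace"))
        else:
            # A header without any encoded word comes back as str, unchanged.
            parts.append(text)
    return "".join(parts)


subject = decode_mime_header(
    "[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?="
)
print(subject)
```

An unknown charset name in the header would raise LookupError here; whether to catch that and fall back to default_charset is a design choice this sketch leaves open.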

Update 2:

This Subject line does not decode:

    Subject: [ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011
                                                                         ^

This is actually the sender's fault: the header violates the requirement that encoded-words in a header be separated by whitespace, as specified in RFC 2047, section 5, paragraph 1: an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.

If need be, you can work around this by pre-processing such malformed headers with a regex that inserts whitespace after the encoded-word part (unless it is at the end of the value), like so:

    import re
    header_value = re.sub(r"(=\?.*\?=)(?!$)", r"\1 ", header_value)
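Note that the .* in that pattern is greedy, so a value with several encoded words is treated as one span, and no whitespace is added in front of the encoded word. A possible variation, sketched below, matches each encoded word on its own and pads it on either side only where whitespace is missing. The pattern and the function name pad_encoded_words are assumptions made for this example, not part of the email module.

```python
import re

# One encoded word: =?charset?B-or-Q?payload?= (illustrative pattern,
# assuming neither charset nor payload contains '?' or whitespace).
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def pad_encoded_words(header_value):
    """Insert a space before/after each encoded word where none exists."""
    out = []
    pos = 0
    for match in ENCODED_WORD.finditer(header_value):
        start, end = match.span()
        out.append(header_value[pos:start])
        if start > 0 and not header_value[start - 1].isspace():
            out.append(" ")  # missing separator before the encoded word
        out.append(match.group(0))
        if end < len(header_value) and not header_value[end].isspace():
            out.append(" ")  # missing separator after the encoded word
        pos = end
    out.append(header_value[pos:])
    return "".join(out)


fixed = pad_encoded_words(
    "[ 201101251025 ] ELStAM;=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011"
)
print(fixed)
```

The inserted spaces become part of the header value, so the decoded subject may contain slightly different spacing than the sender intended.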
Ingmar Hupp answered Sep 21 '22 23:09


I was just testing with encoded headers in Python 3.3, and I found that this is a very convenient way to deal with them:

    >>> from email.header import Header, decode_header, make_header
    >>> subject = '[ 201105161048 ] GewSt:=?UTF-8?B?IFdlZ2ZhbGwgZGVyIFZvcmzDpHVmaWdrZWl0?='
    >>> h = make_header(decode_header(subject))
    >>> str(h)
    '[ 201105161048 ] GewSt:  Wegfall der Vorläufigkeit'

As you can see, it automatically adds whitespace around the encoded words.

Internally it keeps the encoded and ASCII header parts separate, as you can see when it re-encodes the non-ASCII parts:

    >>> h.encode()
    '[ 201105161048 ] GewSt: =?utf-8?q?_Wegfall_der_Vorl=C3=A4ufigkeit?='

If you want the whole header re-encoded you could convert the header to a string and then back into a header:

    >>> h2 = Header(str(h))
    >>> str(h2)
    '[ 201105161048 ] GewSt:  Wegfall der Vorläufigkeit'
    >>> h2.encode()
    '=?utf-8?q?=5B_201105161048_=5D_GewSt=3A__Wegfall_der_Vorl=C3=A4ufigkeit?='
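To connect this with the question's setup, where the Subject comes from a parsed message, here is a small sketch for Python 3.3 or later. The raw message text is invented for the example, and it assumes that decode_header in these versions also recognises an encoded word that is not surrounded by whitespace.

```python
from email import message_from_string
from email.header import decode_header, make_header

# Invented sample message; only the Subject value is taken from the question.
raw = (
    "Subject: [ 201101251025 ] ELStAM;"
    "=?UTF-8?B?IFZlcmbDvGd1bmcgdm9tIDIx?=. Januar 2011\n"
    "\n"
    "body\n"
)

msg = message_from_string(raw)

# decode_header splits the value into (bytes, charset) parts and
# make_header joins them into a Header that str() renders as text.
subject = str(make_header(decode_header(msg["Subject"])))
print(subject)
```

The exact spacing around the decoded part depends on how Header joins the chunks, so it is safer to compare on the words than on the whitespace.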
Sander Steffann answered Sep 18 '22 23:09