
Python email.header.decode_header fails for multiline headers

Tags: python, email

I'm building a system that reads emails from a Gmail account and fetches the subjects, using Python's imaplib and email modules. Sometimes emails received from a Hotmail account have line breaks in their headers, for instance:

In [4]: message['From']
Out[4]: '=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>'

If I try to decode that header, it does nothing:

In [5]: email.header.decode_header(message['From'])
Out[5]: [('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>', None)]

However, if I replace the line break and tab with a space, it works:

In [6]: email.header.decode_header(message['From'].replace('\r\n\t', ' '))
Out[6]: [('isabel mar\xc3\xada tocino garc\xc3\xada', 'utf-8'), ('<[email protected]>', None)]

Is this a bug in decode_header? If not, I would like to know what other special cases like this I should be aware of.

José Tomás Tocino asked Dec 28 '13 16:12


2 Answers

It is a bug in decode_header. The bug is present in Python 2.7 and fixed in Python 3.3.

>>> import sys, email.header
>>> sys.version_info
sys.version_info(major=3, minor=3, micro=2, releaselevel='final', serial=0)
>>> email.header.decode_header('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>')
[(b'isabel mar\xc3\xada tocino garc\xc3\xada', 'utf-8'), (b'<[email protected]>', None)]

vs

>>> import sys, email.header
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
>>> email.header.decode_header('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>')
[('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>', None)]
Robᵩ answered Oct 16 '22 08:10


This bug still occurs in some Python 2.7 versions, so the following workaround can be used:

>>> email.header.decode_header('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>'.replace('\r\n\t', ' '))
[('isabel mar\xc3\xada tocino garc\xc3\xada', 'utf-8'), ('<[email protected]>', None)]

It replaces the CRLF and the tab with a single space. With this, decode_header parses the header correctly.
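
The literal '\r\n\t' replacement only covers a fold made with a tab. Headers can also be folded with a line break followed by spaces, so a slightly more general variant collapses any such fold before decoding. The following is a minimal sketch for Python 3, not a definitive fix: unfold_header and decode_header_text are hypothetical helper names, and user@example.com is a placeholder standing in for the redacted address.

```python
import re
from email.header import decode_header, make_header

# A folded line break: CRLF (or a bare LF) followed by one or more spaces/tabs.
_FOLD_RE = re.compile(r'\r?\n[ \t]+')


def unfold_header(value):
    """Collapse each folded line break in a raw header value into one space."""
    return _FOLD_RE.sub(' ', value)


def decode_header_text(value):
    """Unfold the header, decode its encoded words, and return one str.

    Assumes Python 3, where str(make_header(...)) yields the decoded text.
    """
    return str(make_header(decode_header(unfold_header(value))))


if __name__ == '__main__':
    # Placeholder address; the name part is the encoded word from the question.
    raw = ('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?='
           '\r\n\t<user@example.com>')
    print(decode_header_text(raw))
```

The str(make_header(...)) step assumes Python 3. On Python 2.7 the unfolding step is the relevant part, applied before calling decode_header.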

Benjy Malca answered Oct 16 '22 06:10