Python mailbox encoding errors

Question

First, let me say that I'm a complete beginner at Python. I've never learned the language, I just thought "how hard can it be" when Google turned up nothing but Python snippets to solve my problem. :)

I have a bunch of mailboxes in Maildir format (a backup from the mail server on my old web host), and I need to extract the emails from these. So far, the simplest way I've found has been to convert them to the mbox format, which Thunderbird supports, and it seems Python has a few classes for reading/writing both formats. Seems perfect.

The Python docs even have this little code snippet doing exactly what I need:

import mailbox

src = mailbox.Maildir('maildir', factory=None)
dest = mailbox.mbox('/tmp/mbox')

for msg in src:   #1
    dest.add(msg) #2

Except it doesn't work. And here's where my complete lack of knowledge about Python sets in. On a few messages, I get a UnicodeDecodeError during the iteration (that is, when it's trying to read msg from src, on line #1). On others, I get a UnicodeEncodeError when trying to add msg to dest (line #2).

Clearly it makes some wrong assumptions about the encoding used. But I have no clue how to specify an encoding on the mailbox (For that matter, I don't know what the encoding should be either, but I can probably figure that out once I find a way to actually specify an encoding).

I get stack traces similar to the following:

 File "E:\Python30\lib\mailbox.py", line 102, in itervalues
    value = self[key]
  File "E:\Python30\lib\mailbox.py", line 74, in __getitem__
    return self.get_message(key)
  File "E:\Python30\lib\mailbox.py", line 317, in get_message
    msg = MaildirMessage(f)
  File "E:\Python30\lib\mailbox.py", line 1373, in __init__
    Message.__init__(self, message)
  File "E:\Python30\lib\mailbox.py", line 1345, in __init__
    self._become_message(email.message_from_file(message))
  File "E:\Python30\lib\email\__init__.py", line 46, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "E:\Python30\lib\email\parser.py", line 68, in parse
    data = fp.read(8192)
  File "E:\Python30\lib\io.py", line 1733, in read
    eof = not self._read_chunk()
  File "E:\Python30\lib\io.py", line 1562, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "E:\Python30\lib\io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
  File "E:\Python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 37: character maps to <undefined>

And on the UnicodeEncodeErrors:

  File "E:\Python30\lib\email\message.py", line 121, in __str__
    return self.as_string()
  File "E:\Python30\lib\email\message.py", line 136, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "E:\Python30\lib\email\generator.py", line 76, in flatten
    self._write(msg)
  File "E:\Python30\lib\email\generator.py", line 108, in _write
    self._write_headers(msg)
  File "E:\Python30\lib\email\generator.py", line 141, in _write_headers
    header_name=h, continuation_ws='	')
  File "E:\Python30\lib\email\header.py", line 189, in __init__
    self.append(s, charset, errors)
  File "E:\Python30\lib\email\header.py", line 262, in append
    input_bytes = s.encode(input_charset, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 16: ordinal not in range(128)

Is anyone able to help me out here? (Suggestions for completely different solutions not involving Python are obviously welcome too. I just need a way to import the mails from these Maildir files.)

Updates:

sys.getdefaultencoding() returns 'utf-8'.

I uploaded sample messages which cause both errors. This one throws UnicodeEncodeError, and this throws UnicodeDecodeError.

I tried running the same script in Python 2.6, and got TypeErrors instead:

  File "c:\python26\lib\mailbox.py", line 529, in add
    self._toc[self._next_key] = self._append_message(message)
  File "c:\python26\lib\mailbox.py", line 665, in _append_message
    offsets = self._install_message(message)
  File "c:\python26\lib\mailbox.py", line 724, in _install_message
    self._dump_message(message, self._file, self._mangle_from_)
  File "c:\python26\lib\mailbox.py", line 220, in _dump_message
    raise TypeError('Invalid message type: %s' % type(message))
TypeError: Invalid message type: <type 'instance'>

JV. · Accepted Answer

Note

@Jimmy2Times may well be right in saying that this module has not been updated for 3.0.
This is not so much an answer as a probable explanation of what is going on, why, and how to reproduce it, so that other people can benefit from it. I am still working on completing this answer.

I have put whatever I could find under Edit below.

=====

I think this is what is happening

Among many other characters in your data, you have the two bytes \x9d and \xe5, and these are encoded in some encoding, say iso-8859-1.

When Python 3.0 reads the encoded string, it has to decode it into unicode (the form in which it keeps strings internally). It does not know the real encoding, so it falls back on a default one, which on your machine appears to be cp1252 (see the first traceback).

I think this choice of encoding is where it goes wrong.

To show what's most likely going on -

Let's say the encoding was iso-8859-1 and the wrong guess was cp1252 (as from the first traceback).

The decode for \x9d fails.

In [290]: unicode(u'\x9d'.encode('iso-8859-1'), 'cp1252')
---------------------------------------------------------------------------
<type 'exceptions.UnicodeDecodeError'>    Traceback (most recent call last)

/home/jv/<ipython console> in <module>()

/usr/lib/python2.5/encodings/cp1252.py in decode(self, input, errors)
     13 
     14     def decode(self,input,errors='strict'):
---> 15         return codecs.charmap_decode(input,errors,decoding_table)
     16 
     17 class IncrementalEncoder(codecs.IncrementalEncoder):

<type 'exceptions.UnicodeDecodeError'>: 'charmap' codec can't decode byte 0x9d in position 0: character maps to <undefined>
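
For readers on Python 3, where unicode() no longer exists, the same failure can be reproduced directly on a bytes object. This is only a minimal sketch of the effect; the variable names are made up for illustration.

```python
# The single byte 0x9d has no mapping in cp1252, so decoding fails,
# while iso-8859-1 maps every byte value to some code point.
raw = b"\x9d"

try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

latin1_text = raw.decode("iso-8859-1")

print("cp1252 can decode it:", cp1252_ok)
print("iso-8859-1 result:", repr(latin1_text))
```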

The decode for \xe5 passes, but later, when the message is written back out, Python tries to encode it as ascii somewhere, which fails:

In [291]: unicode(u'\xe5'.encode('iso-8859-1'), 'cp1252').encode('ascii')
---------------------------------------------------------------------------
<type 'exceptions.UnicodeEncodeError'>    Traceback (most recent call last)

/home/jv/<ipython console> in <module>()

<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)
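
The Python 3 equivalent of this second step looks roughly like the sketch below (illustrative names only). The byte decodes fine, but the resulting character cannot be expressed in ascii, whereas an encoding such as utf-8 can represent it.

```python
# 0xe5 is a valid cp1252 byte (the letter a-with-ring), so decoding works...
decoded = b"\xe5".decode("cp1252")

# ...but the strict ascii codec cannot encode the resulting character.
try:
    decoded.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# A wider encoding has no such problem.
utf8_bytes = decoded.encode("utf-8")

print("ascii can encode it:", ascii_ok)
print("utf-8 bytes:", utf8_bytes)
```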

============

EDIT:

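Both of your problems come from the same round trip: the message is first decoded into unicode (while it is read, line #1 in your tracebacks) and then encoded into ascii (while it is written, line #2).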

First do easy_install chardet

The decode error:

In [75]: decd=open('jalf_decode_err','r').read()

In [76]: chardet.detect(decd)
Out[76]: {'confidence': 0.98999999999999999, 'encoding': 'utf-8'}
## this is what is tried behind the scenes (my guess)
In [77]: unicode(decd, 'cp1252') 
---------------------------------------------------------------------------
<type 'exceptions.UnicodeDecodeError'>    Traceback (most recent call last)

/home/jv/<ipython console> in <module>()

/usr/lib/python2.5/encodings/cp1252.py in decode(self, input, errors)
     13 
     14     def decode(self,input,errors='strict'):
---> 15         return codecs.charmap_decode(input,errors,decoding_table)
     16 
     17 class IncrementalEncoder(codecs.IncrementalEncoder):

<type 'exceptions.UnicodeDecodeError'>: 'charmap' codec can't decode byte 0x9d in position 2812: character maps to <undefined>

## this is a fix: this way all your messages are accepted
In [78]: unicode(decd, chardet.detect(decd)['encoding']) 
Out[78]: u'Return-path: <root@apps2.servage.net>
Envelope-to: public@jalf.dk
Delivery-date: Fri, 22 Aug 2008 16:49:53 -0400
Received: from [77.232.66.102] (helo=apps2.servage.net)
	by c1p.hostingzoom.com with esmtp (Exim 4.69)
	(envelope-from <root@apps2.servage.net>)
	id 1KWdZu-0003VX-HP
	for public@jalf.dk; Fri, 22 Aug 2008 16:49:52 -0400
Received: from apps2.servage.net (apps2.servage.net [127.0.0.1])
	by apps2.servage.net (Postfix) with ESMTP id 4A87F980026
	for <public@jalf.dk>; Fri, 22 Aug 2008 21:49:46 +0100 (BST)
Received: (from root@localhost)
	by apps2.servage.net (8.13.8/8.13.8/Submit) id m7MKnkrB006225;
	Fri, 22 Aug 2008 21:49:46 +0100
Date: Fri, 22 Aug 2008 21:49:46 +0100
Message-Id: <200808222049.m7MKnkrB006225@apps2.servage.net>
To: public@jalf.dk
Subject: =?UTF-8?B?WW5ncmVzYWdlbnMgTnloZWRzYnJldiAyMi44LjA4?=
From: Nyhedsbrev fra Yngresagen <info@yngresagen.dk>
Reply-To: info@yngresagen.dk
Content-type: text/plain; charset=UTF-8
X-Abuse: Servage.net Listid 16329
Mime-Version: 1.0
X-mailer: Servage Maillist System
X-Spam-Status: No, score=0.1
X-Spam-Score: 1
X-Spam-Bar: /
X-Spam-Flag: NO
X-ClamAntiVirus-Scanner: This mail is clean


K\xe6re medlem

H\xe5ber du har en god sommer og er klar p\xe5 at l\xe6se seneste nyt i Yngresagen. God forn\xf8jelse!


::. KOM TIL YS-CAF\xc8 .::
Flere og billigere ungdomsboliger, afskaf 24-\xe5rs-reglen eller hvad synes du? Yngresagen indbyder dig til en \xe5ben debat over kaffe og snacks. Yngresagens Kristian Lauta, Mette Marb\xe6k, og formand Steffen M\xf8ller fort\xe6ller om tidligere projekter og vil gerne diskutere, hvad Yngresagen skal bruge sin tid p\xe5 fremover.  
Vil du diskutere et emne, du br\xe6nder for, eller vil du bare v\xe6re med p\xe5 en lytter?
S\xe5 kom torsdag d. 28/8 kl. 17-19, Kulturhuset 44, 2200 KBH N 
 
::. VIND GAVEKORT & BLIV H\xd8RT .:: 
Yngresagen har lavet et sp\xf8rgeskema, s\xe5 du har direkte mulighed for at sige din mening, og v\xe6re med til at forme Yngresagens arbejde. Brug 5 min. p\xe5 at dele dine holdninger om f.eks. uddannelse, arbejde og unges vilk\xe5r - og vind et gavekort til en musikbutik. Vi tr\xe6kker lod blandt alle svarene og finder tre heldige vindere. Sp\xf8rgeskemaet er her: www.yngresagen.dk

::. YS SPARKER NORDJYLLAND I GANG .::
Nordjylland bliver Yngresagens sunde region. Her er regionsansvarlig Andreas M\xf8ller Stehr ved at starte tre projekter op: 1) L\xf8beklub, 2) F\xf8rstehj\xe6lpskursus, 3) Mad til unge-program.
Vi har brug for flere frivillige til at sparke projekterne i gang. Vi tilbyder gratis fede aktiviteter, gratis t-shirts og ture til K\xf8benhavn, hvor du kan m\xf8de andre unge i YS. Har det fanget din interesse, s\xe5 t\xf8v ikke med at kontakte os: nordjylland@yngresagen.dk tlf. 21935185. 

::. YNGRESAGEN I PRESSEN .::
L\xe6s her et udsnit af sidste nyt om Yngresagen i medierne. L\xe6s og lyt mere p\xe5 hjemmesiden under \u201dYS i pressen\u201d.

:: Radionyhederne: Unge skal informeres bedre om l\xe5n 
Unge ved for lidt om at l\xe5ne penge. Det udnytter banker og rejseselskaber til at give dem l\xe5n med t\xe5rnh\xf8je renter. S\xe5dan lyder det fra formand Steffen M\xf8ller fra landsforeningen Yngresagen. 

:: Danmarks Radio P1: Dansk Folkeparti - de \xe6ldres parti? 
Hvorfor er det kun fattige \xe6ldre og ikke alle fattige, der kan s\xf8ge om at f\xe5 nedsat medielicens?
Dansk Folkepartis ungeordf\xf8rer, Karin N\xf8dgaard, og Yngresagens formand Steffen M\xf8ller debatterer medielicens, \xe6ldrecheck og indflydelse til unge 

:: Frederiksborg Amts Avis: Turen til Roskilde koster en holdning!
For at skabe et m\xf8de mellem politikere og unge fragter Yngresagen unge gratis til \xe5rets Roskilde Festival. Det sker med den s\xe5kaldte Yngrebussen, der kan l\xe6ses mere om p\xe5 www.yngrebussen.dk

 
 
Med venlig hilsen 
Yngresagen

Landsforeningen Yngresagen
Kulturhuset Kapelvej 44
2200 K\xf8benhavn N

tlf. 29644960
info@yngresagen.dk
www.yngresagen.dk


-------------------------------------------------------
Unsubscribe Link: 
http://apps.corecluster.net/apps/ml/r.php?l=16329&e=public%40jalf.dk%0D%0A&id=40830383
-------------------------------------------------------

'

Now it's in unicode, so it shouldn't give you any problem.
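
If chardet is not available, a rough stand-in is to try a fixed list of candidate encodings in order. This is a different technique from statistical detection, and the helper name decode_with_fallback and the candidate list are assumptions made for this sketch (written for Python 3), not something the mailbox module provides.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the candidate list has no catch-all such as iso-8859-1.
    return raw.decode("ascii", errors="replace"), "ascii"


utf8_result = decode_with_fallback("Håber".encode("utf-8"))
cp1252_result = decode_with_fallback(b"p\xe5")
latin1_result = decode_with_fallback(b"\x9d")

print(utf8_result, cp1252_result, latin1_result)
```

The order matters: utf-8 goes first because it rejects most byte sequences that are not really utf-8, while iso-8859-1 accepts anything and therefore only makes sense as the last resort.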

Now the encode problem, which really is a problem:

In [129]: encd=open('jalf_encode_err','r').read()

In [130]: chardet.detect(encd)
Out[130]: {'confidence': 0.78187650822865284, 'encoding': 'ISO-8859-2'}

## even after the unicode conversion, the encoding to ascii fails, because the error handling is strict by default
In [131]: unicode(encd, chardet.detect(encd)['encoding']).encode('ascii')
---------------------------------------------------------------------------
<type 'exceptions.UnicodeEncodeError'>    Traceback (most recent call last)

/home/jv/<ipython console> in <module>()

<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\u0159' in position 557: ordinal not in range(128)

## changing the error handling to ignore
In [132]: unicode(encd, chardet.detect(encd)['encoding']).encode('ascii', 'ignore')
Out[132]: 'Return-path: <info@kollegierneskontor.dk>
Envelope-to: alf@5elements.net
Delivery-date: Tue, 21 Aug 2007 06:10:08 -0400
Received: from pfepc.post.tele.dk ([195.41.46.237]:52065)
	by c1p.hostingzoom.com with esmtp (Exim 4.66)
	(envelope-from <info@kollegierneskontor.dk>)
	id 1INQgX-0003fI-Un
	for alf@5elements.net; Tue, 21 Aug 2007 06:10:08 -0400
Received: from local.com (ns2.datadan.dk [195.41.7.21])
	by pfepc.post.tele.dk (Postfix) with SMTP id ADF4C8A0086
	for <alf@5elements.net>; Tue, 21 Aug 2007 12:10:04 +0200 (CEST)
From: "Kollegiernes Kontor I Kbenhavn" <info@kollegierneskontor.dk>
To: "Jesper Alf Dam" <alf@5elements.net>
Subject: Fornyelse af profil
Date: Tue, 21 Aug 2007 12:10:03 +0200
X-Mailer: Dundas Mailer Control 1.0
MIME-Version: 1.0
Content-Type: Multipart/Alternative;
	boundary="Gark=_20078211010346yhSD0hUCo"
Message-Id: <20070821101004.ADF4C8A0086@pfepc.post.tele.dk>
X-Spam-Status: No, score=0.0
X-Spam-Score: 0
X-Spam-Bar: /
X-Spam-Flag: NO
X-ClamAntiVirus-Scanner: This mail is clean



--Gark=_20078211010346yhSD0hUCo
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: Quoted-Printable

Hej Jesper Alf Dam=0D=0A=0D=0AHusk at forny din profil hos KKIK inden 28.=
 august 2007=0D=0ALog ind p=E5 din profil og benyt ikonet "forny".=0D=0A=0D=
=0AVenlig hilsen=0D=0AKollegiernes Kontor i K=F8benhavn=0D=0A=0D=0Ahttp:/=
/www.kollegierneskontor.dk/=0D=0A=0D=0A

--Gark=_20078211010346yhSD0hUCo
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: Quoted-Printable

<html>=0D=0A<head>=0D=0A=0D=0A<style>=0D=0ABODY, TD {=0D=0Afont-family: v=
erdana, arial, helvetica; font-size: 12px; color: #666666;=0D=0A}=0D=0A</=
style>=0D=0A=0D=0A<title></title>=0D=0A=0D=0A</head>=0D=0A<body bgcolor=3D=
#FFFFFF>=0D=0A<hr size=3D1 noshade>=0D=0A<table cellpadding=3D0 cellspaci=
ng=3D0 border=3D0 width=3D100%>=0D=0A<tr><td >=0D=0AHej Jesper Alf Dam<br=
><br>Husk at forny din profil inden 28. august 2007<br>=0D=0ALog ind p=E5=
 din profil og benyt ikonet "forny".=0D=0A<br><br>=0D=0A<a href=3D"http:/=
/www.kollegierneskontor.dk/">Klik her</a> for at logge ind.<br><br>Venlig=
 hilsen<br>Kollegiernes Kontor i K=F8benhavn=0D=0A</td></tr>=0D=0A</table=
>=0D=0A<hr size=3D1 noshade>=0D=0A</body>=0D=0A</html>=0D=0A

--Gark=_20078211010346yhSD0hUCo--

'

In [133]: len(encd)
Out[133]: 2303

In [134]: len(unicode(encd, chardet.detect(encd)['encoding']).encode('ascii', 'ignore'))
Out[134]: 2302

CAUTION: as you can see, this procedure can lose a minor to moderate amount of data (here one character was dropped: 2303 vs. 2302). So it is up to the user whether to use it or not.
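
To make the loss concrete, here is a small Python 3 sketch comparing the standard codec error handlers on the sender name from the sample message. With 'ignore' the ø silently disappears, which is exactly the "Kbenhavn" visible in the output above; the other handlers at least leave a visible trace.

```python
text = "Kollegiernes Kontor i København"

# Standard error handlers accepted by str.encode().
handlers = ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")
results = {name: text.encode("ascii", errors=name) for name in handlers}

for name, encoded in results.items():
    print(name.ljust(18), encoded)
```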

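So the code (Python 2) should look like:

import chardet
import mailbox

src = mailbox.Maildir('maildir', factory=None)
dest = mailbox.mbox('/tmp/mbox')

for msg in src:
    raw = msg.as_string()  # chardet needs the raw byte string, not the Message object
    text = unicode(raw, chardet.detect(raw)['encoding']).encode('ascii', 'ignore')
    dest.add(text)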
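
An alternative to guessing, where it applies, is to use the charset each MIME part declares about itself (the second sample says charset=ISO-8859-1, while chardet guessed ISO-8859-2). The sketch below is for Python 3 and uses only the standard email package; the helper name decoded_text_parts is made up for illustration, and it assumes the declared charset is one Python knows (an unknown name would raise LookupError).

```python
import email


def decoded_text_parts(raw_message):
    """Return the text parts of a raw message, decoded with their declared charsets."""
    msg = email.message_from_bytes(raw_message)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)  # undoes quoted-printable / base64
        charset = part.get_content_charset() or "ascii"
        texts.append(payload.decode(charset, errors="replace"))
    return texts


sample = (
    b"Content-Type: text/plain; charset=ISO-8859-1\n"
    b"Content-Transfer-Encoding: Quoted-Printable\n"
    b"\n"
    b"Log ind p=E5 din profil\n"
)
parts = decoded_text_parts(sample)
print(parts)
```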

Jimmy2Times · Answer

Try it in Python 2.5 or 2.6 instead of 3.0. 3.0 has completely different Unicode handling and this module may not have been updated for 3.0.
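
For anyone reading this with a reasonably recent Python 3, where the mailbox module can work on bytes, the conversion can avoid decoding altogether by copying each message as raw bytes. This is only a sketch under that assumption: the function name maildir_to_mbox and the temporary demo paths are made up, and it relies on Maildir.get_bytes() and on mbox.add() accepting a bytes object.

```python
import mailbox
import os
import tempfile


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message from a Maildir into an mbox file as raw bytes."""
    src = mailbox.Maildir(maildir_path, factory=None, create=False)
    dest = mailbox.mbox(mbox_path)
    copied = 0
    try:
        for key in src.iterkeys():
            # get_bytes() returns the stored message undecoded,
            # so no charset has to be guessed during the copy.
            dest.add(src.get_bytes(key))
            copied += 1
    finally:
        dest.close()  # close() flushes pending writes to the mbox file
    return copied


# Demo on a throwaway Maildir holding one message with raw 8-bit bytes.
workdir = tempfile.mkdtemp()
demo_maildir = os.path.join(workdir, "maildir")
demo_mbox = os.path.join(workdir, "mbox")

seed = mailbox.Maildir(demo_maildir, create=True)
seed.add(b"Subject: test\n\nH\xe5ber \x9d\n")
seed.close()

copied_count = maildir_to_mbox(demo_maildir, demo_mbox)
print("messages copied:", copied_count)
```

Because nothing is decoded or re-encoded, the bytes that caused both errors in the question pass through unchanged, and the mail client reading the mbox can apply each message's own declared charset.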

Python mailbox encoding errors

Tags:

python

encoding

email-formats

jalf

2 Answers

JV.

Jimmy2Times
