
What is the normal way to create a unique ID for POP3 emails?

IMAP messages have a UID for which we all rejoice. However, I'm trying to figure out how to generate a unique ID for a POP3 message and having trouble (old systems like hotmail.com only allow POP3).

Available messages to the client are fixed when a POP session opens the maildrop, and are identified by message-number local to that session or, optionally, by a unique identifier assigned to the message by the POP server. This unique identifier is permanent and unique to the maildrop and allows a client to access the same message in different POP sessions. Mail is retrieved and marked for deletion by message-number. When the client exits the session, the mail marked for deletion is removed from the maildrop. - wikipedia

It seems, however, that the basic LIST command simply returns session-local message numbers (with sizes) that let you fetch the email. Those numbers are in no way unique across sessions, so POP3 also defines an optional UIDL command, which servers can advertise through CAPA (the POP3 Extension Mechanism).
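For illustration, here is a minimal sketch of reading the UIDL listing with Python's standard `poplib` module. The helper names `parse_uidl_lines` and `fetch_uidls` are hypothetical, and the sketch assumes the server supports UIDL and accepts plain USER/PASS login over SSL.

```python
import poplib


def parse_uidl_lines(lines):
    """Map session-local message numbers to server-assigned unique-ids.

    `lines` has the shape of the second element returned by
    poplib's POP3.uidl(), e.g. [b"1 abc", b"2 def"].
    """
    mapping = {}
    for raw in lines:
        # Unique-ids contain no spaces, so split on the first space only.
        number, _, unique_id = raw.decode("ascii").strip().partition(" ")
        mapping[int(number)] = unique_id
    return mapping


def fetch_uidls(host, user, password):
    """Hypothetical helper: log in over POP3S and return {msg_number: uidl}."""
    conn = poplib.POP3_SSL(host)
    try:
        conn.user(user)
        conn.pass_(password)
        _response, lines, _octets = conn.uidl()
        return parse_uidl_lines(lines)
    finally:
        conn.quit()
```

The message numbers in the result are only meaningful for the session that produced them; the unique-id strings are the part that can be compared across sessions.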

POP3 states that a UIDL is unique as long as the message exists.

The unique-id of a message is an arbitrary server-determined string, consisting of one to 70 characters in the range 0x21 to 0x7E, which uniquely identifies a message within a maildrop and which persists across sessions. This persistence is required even if a session ends without entering the UPDATE state. The server should never reuse an unique-id in a given maildrop, for as long as the entity using the unique-id exists.

Note that messages marked as deleted are not listed.

While it is generally preferable for server implementations to store arbitrarily assigned unique-ids in the maildrop, this specification is intended to permit unique-ids to be calculated as a hash of the message. Clients should be able to handle a situation where two identical copies of a message in a maildrop have the same unique-id.

This makes me think that I might download another message a year later (after the first one was deleted) that has the same UIDL and clashes with the earlier one in my system.

Should I just hash the whole message body and use that as an ID?

Rather than fetching the whole email to hash it, perhaps I should just use TOP [id] 1 and hash the headers (plus the first body line). That should never match an existing email, since the receiving server will always add some information of its own, correct? An attacker could then never cause a collision, since the Received header or something similar will have been modified, right?
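As a rough sketch of the TOP idea, the function below hashes the lines that a TOP command returns. The name `fingerprint_top_output` is hypothetical, SHA-256 is just one reasonable choice of hash, and whether server-added headers always differ between messages is an assumption of the question, not something this code guarantees.

```python
import hashlib


def fingerprint_top_output(lines):
    """Return a SHA-256 hex digest over the lines of a TOP response.

    `lines` is a list of byte strings without line endings, which is the
    shape poplib's POP3.top() returns as the second element of its result.
    """
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line)
        # Re-add a line terminator so line boundaries affect the digest.
        digest.update(b"\r\n")
    return digest.hexdigest()
```

With `poplib`, the input could come from `_resp, lines, _octets = conn.top(msg_number, 1)`, where `conn` is an already authenticated connection.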

The MDaemon program seems to tackle the issue with partial header hashing:

MDaemon constructs the UIDL results using the message name, date stamp, size, and a few other details about the messages. As a result, if a message is modified on the server, it will appear as “new” to mail clients even if you don’t rename it.

What is the proper way to make an ID for a POP3 email?

Note: Emails often contain a Message-ID header, but I can't rely on that because it could be used as an attack vector to confuse my system. It is also left out by some email clients.

Xeoncross asked May 01 '13 14:05


2 Answers

Personally, I would just hash a small subset of the email headers: something like Date, From, Subject, and Message-ID if available.
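A minimal sketch of that approach, using only the Python standard library. The function name `header_fingerprint`, the whitespace normalization, and the choice of SHA-256 are assumptions for illustration; a missing header is simply hashed as an empty value.

```python
import hashlib
from email import message_from_bytes

# The subset of headers suggested above.
FINGERPRINT_HEADERS = ("Date", "From", "Subject", "Message-ID")


def header_fingerprint(raw_headers):
    """Hash a small, fixed subset of headers from a raw message or header block."""
    msg = message_from_bytes(raw_headers)
    parts = []
    for name in FINGERPRINT_HEADERS:
        value = str(msg.get(name, ""))
        # Collapse folding and extra whitespace so equivalent headers match.
        parts.append(f"{name}:{' '.join(value.split())}")
    joined = "\n".join(parts)
    return hashlib.sha256(joined.encode("utf-8", "replace")).hexdigest()
```

Because headers such as Received are ignored, two copies of the same message that arrived by different routes produce the same fingerprint, which is the deduplication behaviour described in the next paragraph.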

I often subscribe to mailing lists where you tend to receive multiple copies of the same message when someone is replying to you: one that comes directly from them, and another via the mail server. Under those circumstances, many of the headers are different, but I'd really rather not receive two copies of the message.

And the chance of me receiving two different emails, from the same person at the same time, with the same subject and the same message-id seems extremely unlikely.

Of course, it's not impossible. They might not generate message-ids, they might have a blank subject line, they might have a broken clock, and they might have all of those things at the same time. But then again, the router through which their email is passing might be wiped out by a giant meteor from space.

Frankly, the most likely scenario is that the email will end up being caught by a spam filter and I'll never see it anyway. Email just isn't that reliable a form of communication. You need something that works reasonably well, but if it doesn't handle that 1 in a million edge case, you'll probably still be ok.

James Holderness answered Sep 28 '22 01:09


Excuse me for questioning your question, but – the real question is: why do you care? It seems to me you are trying real hard to come up with a natural primary key for emails. You shouldn’t need to – and there isn’t really one, anyway. What’s the real problem you are trying to solve?

Your understanding of UIDLs is correct. A message must keep the same UIDL while it is in a particular mailbox, identical messages can have identical UIDLs (but don’t need to), and UIDLs should not repeat within the context of a mailbox, but are not strictly required to. The last requirement in particular highlights the scope and purpose of UIDLs. Once the client has deleted a message from a mailbox, it must (and can) forget about its UIDL, because that value, should it appear again, will henceforth never convey any relationship to the former message.
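One way a client might apply this, sketched under the assumption that it keeps a per-mailbox set of UIDLs it has already downloaded (the function name `reconcile_uidls` is hypothetical):

```python
def reconcile_uidls(known, listed):
    """Compare remembered UIDLs with the server's current listing.

    Returns (to_download, forgotten):
      to_download - UIDLs on the server that the client has not seen yet
      forgotten   - remembered UIDLs that are no longer on the server
    """
    known = set(known)
    listed = set(listed)
    to_download = listed - known
    forgotten = known - listed
    return to_download, forgotten
```

After each session the client would store the current listing as its new remembered set. A UIDL that disappears and later reappears is then treated as a new message, so it carries no link to the former one.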

Stefan Paletta answered Sep 28 '22 00:09