Email deduplication

Is it true that e-mails can be deduplicated using just some of their headers, given that, according to the RFC, their Message-ID should be unique?

Is there any way to calculate the chance of a single email being missed by the deduplication method below (a SHA-512 hash of these 3 headers)?

    // $email is a parsed array containing 3 keys (MIME headers): message_id, subject and date.
    $hashStr  = $email['message_id'];
    $hashStr .= $email['subject'];
    $hashStr .= $email['date'];
    $uniqueEmailId = hash('sha512', $hashStr);

It is mission-critical that not a single email is missed, and we will likely have to deduplicate several billion (>2 billion) MIME files.

Floris asked Apr 03 '14 15:04


1 Answer

SHA-512 produces a 512-bit hash value. Assuming the bits are uniformly distributed, this works out to 2^512, or more than 1.34e+154, possible values. Even with over 2e+9 samples, the chance of an accidental collision is very near zero.

However, your input for the hash isn't quite that random. If message_id is generated as a globally unique identifier (GUID), it "only" has 5.3e+36 possible values, and the randomness depends on the implementation. According to the wiki link, the odds of a collision are about 50% at 4.2e+18 samples. The number of possible email addresses and dates is likely significantly higher than that.

That said, without actually doing the probability math, I would say that the odds are negligible.
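For anyone who does want the numbers, here is a rough sketch of that probability math in Python, using the standard birthday approximation P ≈ 1 - exp(-n² / (2N)). It assumes identifiers are drawn uniformly at random, which real Message-ID generators may not do, and it only covers accidental collisions. The function and constant names are made up for illustration.

```python
import math

def collision_probability(samples: float, space: float) -> float:
    """Birthday approximation: P(at least one collision) ~= 1 - exp(-n^2 / (2N))."""
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(-(samples * samples) / (2.0 * space))

SAMPLES = 2e9              # "several billion" MIME files, per the question
SHA512_SPACE = 2.0 ** 512  # about 1.34e154 possible digests
GUID_SPACE = 5.3e36        # figure quoted above for a GUID-style message_id (assumption)

p_hash = collision_probability(SAMPLES, SHA512_SPACE)
p_guid = collision_probability(SAMPLES, GUID_SPACE)

print(f"SHA-512 output space:  {p_hash:.2e}")
print(f"GUID-style message_id: {p_guid:.2e}")
```

Under those assumptions the first value comes out on the order of 1e-136 and the second on the order of 4e-19, so the identifier space, not the hash, is the limiting factor, and both are tiny at 2e+9 samples.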

klugerama answered Oct 02 '22 03:10