Why is PHP imap_headerinfo() function much slower on large mailboxes?

Tags: php, email, imap

I did some testing with the imap_headerinfo() function and I am a little confused by the results.

On small mailboxes, getting data for 30 messages takes 0.5 secs. On mailboxes with approximately 500 messages it takes about 7 secs to retrieve data for the same number of messages (30 messages).

Why would the size of the mailbox have anything to do with the time needed to retrieve the header of a single email message? Is this normal?

I used this code to test time:

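$message_header = [];
$time_start = microtime(true);
// Fetch the headers of the first 30 messages (message numbers are 1-based)
for ($i = 0; $i < 30; $i++) {
    $message_header[$i] = imap_headerinfo($mbox, $i + 1);
}
$time = microtime(true) - $time_start;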

Edit:

Mailboxes are on the same account.

I took Christian Gollhardt's advice and I have measured every call to imap_headerinfo() function.

The results are even stranger! The first call, and then every 22nd call, to the imap_headerinfo() function takes about 10000 times longer than the others. Example: the first call takes about 0.39 secs, the next 20 calls take about 0.0001 secs each, then the 22nd call takes about 0.47 secs, then the next 20 calls take about 0.00004 secs each, and so on.
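A minimal sketch of how each call can be timed separately (not necessarily the exact code used for the numbers above). The helper names time_each_call and slow_calls are hypothetical, and $fetch stands for any callable that takes a message number, for example a wrapper around imap_headerinfo().

```php
<?php
// Hypothetical helper: times $fetch(1), $fetch(2), ..., $fetch($count) one by one.
// Returns an array mapping message number => elapsed seconds.
function time_each_call(callable $fetch, int $count): array
{
    $timings = [];
    for ($msgno = 1; $msgno <= $count; $msgno++) {
        $start = microtime(true);
        $fetch($msgno);
        $timings[$msgno] = microtime(true) - $start;
    }
    return $timings;
}

// Hypothetical helper: message numbers whose call took longer than $threshold seconds.
function slow_calls(array $timings, float $threshold): array
{
    return array_keys(array_filter($timings, fn($t) => $t > $threshold));
}
```

With an open connection in $mbox, this could be used as slow_calls(time_each_call(fn($n) => imap_headerinfo($mbox, $n), 30), 0.1) to list the calls that stand out.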

Edit 2:

After some more research there is something else that came up.

If you use:

$message_header[$i] = imap_headerinfo($mbox, $i + 1);

it takes about 0.4 secs for every 22nd call and about 0.0001 secs for the other calls.

However, you would expect the same results with:

$message_header[$i] = imap_headerinfo($mbox, 30 - $i);

But, in this case it takes about 0.2 secs for every call!

The only difference here is that in the second example headers are retrieved in the reversed message order (from the 30th to the 1st) and for some reason it greatly affects the time needed for the operation. Why?

Note: Tested on a Gmail account too. Exactly the same ratio between the numbers, so I guess it is not server related.

Thank you in advance!

milosh asked Oct 01 '22 04:10


1 Answer

When looking into the PHP sources for the IMAP module, you will find that the imap_headerinfo function uses mail_fetchstructure, which is a function from the c-client library.

The documentation for c-client explains the workings of the mail_fetchstructure function like this:

This function causes a fetch of all the structured information (envelope, internal date, RFC 822 size, flags, and body structure) for the given msgno and, in the case of IMAP, up to MAPLOOKAHEAD (a parameter in IMAP2.H) subsequent messages which are not yet in the cache. No fetch is done if the envelope for the given msgno is already in the cache. The ENVELOPE and the BODY for this msgno is returned. It is possible for the BODY to be NIL, in which case no information is available about the structure of the message body.

The one IMAP header file I found defined this lookahead value as 20, so the first call to the function causes it to fetch 20 additional messages from the mailbox. This explains the behaviour you observed that every 22nd call to the function takes a lot more time than all the other ones.
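The effect of the lookahead can be illustrated with a toy model. This is only a sketch of the behaviour described above, not c-client's actual code: it assumes that a cache miss on a message fetches that message plus up to 20 following ones, and the function name calls_that_hit_server is made up for this example.

```php
<?php
// Toy model (an assumption, not c-client's real logic): a cache miss on message
// $msgno fetches $msgno plus up to $lookahead following messages from the server.
// Returns the 1-based positions in $order whose call needed a server fetch.
function calls_that_hit_server(array $order, int $mailboxSize, int $lookahead = 20): array
{
    $cached = [];       // message number => true once its envelope is cached
    $serverCalls = [];  // positions in $order that caused a round trip
    foreach (array_values($order) as $i => $msgno) {
        if (isset($cached[$msgno])) {
            continue;   // served from the cache, no round trip
        }
        $serverCalls[] = $i + 1;
        $last = min($msgno + $lookahead, $mailboxSize);
        for ($n = $msgno; $n <= $last; $n++) {
            $cached[$n] = true;
        }
    }
    return $serverCalls;
}

// Messages 1..30 in ascending order: only calls 1 and 22 reach the server.
print_r(calls_that_hit_server(range(1, 30), 500));
```

Under the same model, passing range(30, 1) reports a server fetch on every one of the 30 calls.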

If you fetch the messages in reverse order, the library first loads up to 21 messages, beginning with the one you specified in the function call. The next call checks whether the requested message is already cached. It isn't, because it comes before the ones loaded previously, so the cache is discarded and the process repeats. Therefore, every call in the reverse loop loads up to 21 messages.
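If this explanation is right, one possible workaround is to request the messages in ascending order and reverse the result afterwards. The following is a sketch only, untested against a real server. The name fetch_headers_descending and the $fetch callable are hypothetical; in practice $fetch would wrap imap_headerinfo().

```php
<?php
// Hypothetical helper: returns headers for messages $last down to $first,
// but requests them in ascending order so the lookahead cache can be reused.
// $fetch is any callable that takes a message number and returns its header.
function fetch_headers_descending(callable $fetch, int $first, int $last): array
{
    $headers = [];
    for ($msgno = $first; $msgno <= $last; $msgno++) {
        $headers[$msgno] = $fetch($msgno);
    }
    // Reorder after fetching; the second argument keeps message numbers as keys.
    return array_reverse($headers, true);
}
```

With an open connection in $mbox, fetch_headers_descending(fn($n) => imap_headerinfo($mbox, $n), 1, 30) would return the headers keyed from 30 down to 1.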

However, this doesn't really explain the performance difference between mailbox sizes. My explanation for this is more guesswork than research: the c-client library also pre-maps message numbers to their UIDs, and the IMAP header defines a UID lookahead count of 1000. This would explain some performance loss, though I can't see why it would cause such a large difference. It is the only explanation I can come up with at the moment.

Trying this out on mailboxes with 1000 and 2000 messages might show whether this UID lookup has something to do with it. If it does, performance should drop significantly between the 500-message and the 1000-message mailbox, and the 2000-message mailbox should be about as slow as the 1000-message one. Using a network sniffer to check what data is actually requested from the server may also be worth a try. Unfortunately, I don't have a suitable test environment to try this out myself.

patlkli answered Nov 01 '22 01:11