
Internal representation of strings in PHP

Tags: string, php, memory

I'm writing a simple website parser in PHP 5.2.10.
When using the default internal encoding (ISO-8859-1), I always get an error at the same function call:

$start = mb_strpos($index, '<a name=gr1>');

Fatal error: Allowed memory size of 50331648 bytes exhausted (tried to allocate 11924760 bytes)

The length of the string $index in this case was 2981190 bytes, exactly one quarter of what PHP tried to allocate.
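The numbers in the error message can be checked directly. This is only a sketch of the arithmetic; the variable names are made up for illustration, and the idea that the failed allocation is a buffer holding four bytes per input byte is an assumption, not something confirmed from the PHP sources.

```php
<?php
// Figures copied from the error message and the question.
$stringBytes    = 2981190;   // length of $index in bytes
$attemptedBytes = 11924760;  // size of the allocation that failed
$memoryLimit    = 50331648;  // memory_limit in bytes

// The failed allocation is exactly four times the string length.
$ratio = $attemptedBytes / $stringBytes;

// The limit expressed in MiB.
$limitMiB = $memoryLimit / (1024 * 1024);

echo "ratio: $ratio, limit: $limitMiB MiB\n";
```

A factor of exactly 4 would fit a copy that stores each character in a fixed 4-byte unit, but that is a guess until it is checked against the mbstring sources.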

Now, if I use

mb_internal_encoding('UTF-8')

the error disappears. Does that mean that PHP uses more memory for single-byte strings than for multibyte ones? How is that possible? Any ideas?

UPD: Memory usage doesn't seem to depend on the encoding: the average memory_get_usage() is almost the same with UTF-8 and ISO-8859-1. I think the problem might be in mb_strpos. In fact, the string $index is encoded in Windows-1251 (Cyrillic), so it contains byte sequences that are not valid UTF-8. This may cause mb_strpos to try to convert the string, or simply to use additional memory for some other purpose. I will try to find the answer in the sources of mb_strpos.
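To illustrate the encoding mismatch described in the update, here is a minimal sketch. The sample string is hypothetical (three Windows-1251 Cyrillic bytes followed by the anchor tag), and converting to UTF-8 before searching is one possible approach, not necessarily what the original parser should do.

```php
<?php
// Hypothetical sample: the Windows-1251 bytes for "При" followed by the ASCII anchor tag.
$index = "\xCF\xF0\xE8" . '<a name=gr1>';

// Raw Windows-1251 Cyrillic bytes do not form valid UTF-8 sequences.
$validUtf8 = mb_check_encoding($index, 'UTF-8');

// Convert explicitly from the real source encoding to UTF-8.
$utf8 = mb_convert_encoding($index, 'UTF-8', 'Windows-1251');

// Search with the encoding stated explicitly instead of relying on
// mb_internal_encoding(); the result is a character offset, not a byte offset.
$start = mb_strpos($utf8, '<a name=gr1>', 0, 'UTF-8');

var_dump($validUtf8, $start);
```

Passing the encoding as the fourth argument keeps the call independent of whatever mb_internal_encoding() happens to be set to.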

asked Aug 25 '12 20:08 by Dmitry


1 Answer

Sorry if you've already thought of these potential issues.

The multibyte string functions check UTF-8 input for errors and, if there are invalid characters, return an empty string or false (as in the case of mb_strpos()): http://www.serverphorums.com/read.php?7,552099

Are you checking the result you're getting using the === operator to ensure that you're not receiving false instead of 0?
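A small sketch of why the comparison operator matters here. The haystack below is a made-up example in which the needle sits at the very start, so mb_strpos() returns the integer 0.

```php
<?php
// Hypothetical page content with the anchor at offset 0.
$index = '<a name=gr1>rest of the page';

$start = mb_strpos($index, '<a name=gr1>', 0, 'UTF-8');

// Loose comparison: integer 0 compares equal to false,
// so a match at offset 0 looks like "not found".
$looseSaysNotFound = ($start == false);

// Strict comparison: only a real boolean false means "not found".
$strictSaysNotFound = ($start === false);

var_dump($start, $looseSaysNotFound, $strictSaysNotFound);
```

The same strict check also distinguishes a genuine failure (false) from a valid position when the function gives up on invalid input.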

The mb_strpos() function uses mbfl_strpos(), which makes copies of the strings (needle, haystack) when it has to perform conversions (leading to increases in memory, as you observed): https://github.com/php/php-src/blob/master/ext/mbstring/libmbfl/mbfl/mbfilter.c#L811
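One way to observe the extra copies is to compare peak memory before and after the call. This is a rough sketch: measureMbStrpos() is a hypothetical helper, the test string is synthetic, and the numbers will differ between PHP versions because the mbstring internals have changed since 5.2. If the script's peak was already higher before the call, the reported difference can be 0.

```php
<?php
// Hypothetical helper: runs mb_strpos() and reports how much the peak
// memory usage grew during the call.
function measureMbStrpos($haystack, $needle, $encoding)
{
    $peakBefore = memory_get_peak_usage();
    $pos = mb_strpos($haystack, $needle, 0, $encoding);
    $peakAfter = memory_get_peak_usage();

    return array(
        'pos'        => $pos,
        'peak_delta' => $peakAfter - $peakBefore,
    );
}

// Synthetic haystack of the same size as in the question (2981190 bytes):
// single-byte filler followed by the needle at the very end.
$needle   = '<a name=gr1>';
$haystack = str_repeat("\xE0", 2981190 - strlen($needle)) . $needle;

$result = measureMbStrpos($haystack, $needle, 'ISO-8859-1');

echo 'position: ', $result['pos'], "\n";
echo 'peak memory growth: ', $result['peak_delta'], " bytes\n";
```

Running the same measurement with different encoding arguments shows whether a given PHP build allocates conversion buffers proportional to the haystack size.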

So, I'm wondering whether the default internal encoding (ISO-8859-1) let everything through, so that the memory limit was hit, whereas the UTF-8 encoding short-circuited on the illegal characters and returned false (which, if you were testing with ==, would make it appear that the function merely didn't find a match).

Worth a shot :)

answered Sep 18 '22 16:09 by AdamJonR