My application makes extensive use of the mb_ string functions, and switching to PHP 7 made it slower overall. I tracked the slowdown to the mb_ string functions. Here is the benchmark code, followed by the results:
$time = microtime();
$time = explode(' ', $time);
$start = $time[1] + $time[0];
$startms = $time[0];
for ($i=0; $i<100000; $i++) {
$a = mb_strlen("fdsfdssdfoifjosdifjosdifjosdij:ά", "UTF-8");
}
$time = microtime();
$time = explode(' ', $time);
$finish = $time[1] + $time[0];
$finishms = $time[0];
$total_time = round(($finish - $start), 4);
echo "mb_strlen: " . $total_time*1000 ." milliseconds<br/>";
$time = microtime();
$time = explode(' ', $time);
$start = $time[1] + $time[0];
$startms = $time[0];
for ($i=0; $i<100000; $i++) {
$a = mb_stripos("fdsfdssdfoifjosdifjosdifjosdij:ά", "α", 0, "UTF-8");
}
$time = microtime();
$time = explode(' ', $time);
$finish = $time[1] + $time[0];
$finishms = $time[0];
$total_time = round(($finish - $start), 4);
echo "mb_stripos: " . $total_time*1000 ." milliseconds<br/>";
$time = microtime();
$time = explode(' ', $time);
$start = $time[1] + $time[0];
$startms = $time[0];
for ($i=0; $i<100000; $i++) {
$a = mb_substr("fdsfdssdfoifjosdifjosdifjosdij:ά", $i, 1, "UTF-8");
}
$time = microtime();
$time = explode(' ', $time);
$finish = $time[1] + $time[0];
$finishms = $time[0];
$total_time = round(($finish - $start), 4);
echo "mb_substr: " . $total_time*1000 ." milliseconds<br/>";
The platform is Windows 7 64bit, IIS 7.5:
php 5.3.28
mb_strlen: 250 milliseconds
mb_stripos: 3078.1 milliseconds
mb_substr: 281.3 milliseconds
php 7.1.1
mb_strlen: 406.3 milliseconds
mb_stripos: 4796.9 milliseconds
mb_substr: 421.9 milliseconds
I don't know if my setup is wrong, but it seems inconceivable that the multibyte functions should be slower. Any ideas as to why, and what I can do to solve this? Thank you in advance.
Edit: as apokryfos' comment suggests, this may be a Windows only problem.
I can confirm that your result is reproducible on Windows 7. After some experiments, I found a quick solution that IMO should not even have an effect.
As you can see from mb_strlen() function signature, it will use internal encoding if you omit the encoding parameter. This also applies to other functions that you use.
mixed mb_strlen ( string $str [, string $encoding = mb_internal_encoding() ] )
What I found odd is that if you set the internal encoding to UTF-8 by calling mb_internal_encoding("UTF-8") and omit the encoding parameter, the functions get faster.
PHP 5.5 result:
5.5.12
with encoding parameter:
- mb_strlen: 172 ms, result: 5
- mb_substr: 218 ms, result: う
- mb_strpos: 218 ms, result: 3
- mb_stripos: 1,669 ms, result: 3
- mb_strrpos: 234 ms, result: 3
- mb_strripos: 1,685 ms, result: 3
with internal encoding:
- mb_strlen: 47 ms, result: 5
- mb_substr: 78 ms, result: う
- mb_strpos: 62 ms, result: 3
- mb_stripos: 1,669 ms, result: 3
- mb_strrpos: 94 ms, result: 3
- mb_strripos: 1,669 ms, result: 3
PHP 7.0 result:
7.0.12
with encoding parameter:
- mb_strlen: 640 ms, result: 5
- mb_substr: 702 ms, result: う
- mb_strpos: 686 ms, result: 3
- mb_stripos: 7,067 ms, result: 3
- mb_strrpos: 749 ms, result: 3
- mb_strripos: 7,130 ms, result: 3
with internal encoding:
- mb_strlen: 31 ms, result: 5
- mb_substr: 31 ms, result: う
- mb_strpos: 47 ms, result: 3
- mb_stripos: 7,270 ms, result: 3
- mb_strrpos: 62 ms, result: 3
- mb_strripos: 7,116 ms, result: 3
Unfortunately, this quick solution isn't perfect, as mb_stripos() and mb_strripos() don't seem to be affected. They are still slow.
This is the code (shortened):
echo PHP_VERSION."\n";
echo "\nwith encoding parameter:\n";
$t = microtime(true)*1000;
for($i=0; $i<100000; $i++){
$n = mb_strlen("あえいおう","UTF-8");
}
$t = microtime(true)*1000-$t;
echo "- mb_strlen: ".number_format($t)." ms, result: {$n}\n";
$t = microtime(true)*1000;
for($i=0; $i<100000; $i++){
$n = mb_substr("あえいおう",-1,1,"UTF-8");
}
$t = microtime(true)*1000-$t;
echo "- mb_substr: ".number_format($t)." ms, result: {$n}\n";
//set internal encoding
//and omit encoding parameter
mb_internal_encoding("UTF-8");
echo "\nwith internal encoding:\n";
$t = microtime(true)*1000;
for($i=0; $i<100000; $i++){
$n = mb_strlen("あえいおう");
}
$t = microtime(true)*1000-$t;
echo "- mb_strlen: ".number_format($t)." ms, result: {$n}\n";
$t = microtime(true)*1000;
for($i=0; $i<100000; $i++){
$n = mb_substr("あえいおう",-1,1);
}
$t = microtime(true)*1000-$t;
echo "- mb_substr: ".number_format($t)." ms, result: {$n}\n";
This sounds like a performance regression bug. You should probably file a bug report at bugs.php.net so the PHP core devs can take a look at it.
Meanwhile, I see that your snippets use UTF-8 exclusively. As long as that holds, you might be able to speed things up with the preg_ functions, whose Unicode mode (the u modifier) supports only one encoding: UTF-8. Here's my attempt:
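function _mb_strlen(string $str, string $encoding = 'UTF-8'): int {
    assert ( $encoding === 'UTF-8' );
    // Count one match per character. PREG_OFFSET_CAPTURE offsets are in bytes,
    // so an offset-based count is wrong for multibyte text. /s lets "." match newlines.
    $count = preg_match_all ( '/./us', $str );
    return $count === false ? 0 : $count;
}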
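function _mb_stripos(string $haystack, string $needle, int $offset = 0, string $encoding = 'UTF-8') {
    assert ( $encoding === 'UTF-8' );
    if ($offset !== 0) {
        throw new LogicException ( 'NOT IMPLEMENTED' );
    }
    // Pass the delimiter to preg_quote() so a "/" in the needle is escaped too.
    preg_match ( '/' . preg_quote ( $needle, '/' ) . '/ui', $haystack, $matches, PREG_OFFSET_CAPTURE );
    if (empty ( $matches )) {
        return false;
    }
    // preg_match() reports a byte offset; count the characters before it
    // to get a character position like mb_stripos() returns.
    return preg_match_all ( '/./us', substr ( $haystack, 0, $matches [0] [1] ) );
}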
function _mb_substr(string $str, int $start, int $length = NULL, string $encoding = 'UTF-8'): string {
    assert ( $encoding === 'UTF-8' );
    // The pattern is anchored with ^ so a failed match is not retried at every position.
    if ($start < 0) {
        throw new LogicException ( 'NOT IMPLEMENTED' );
    } elseif ($start > 0) {
        $rex = '/^.{' . $start . '}(.{0,';
    } else {
        $rex = '/^(.{0,';
    }
    if ($length !== NULL) {
        $rex .= $length;
    }
    // /s lets "." match newlines as well.
    $rex .= '})/us';
    preg_match ( $rex, $str, $matches );
    return empty ( $matches ) ? '' : $matches [1];
}
Here are my benchmark results for 100,000 iterations on PHP 7.0 on Debian 9 Linux (kernel 4.9):
mb_strlen got slower, from about 60 ms to about 100 ms
mb_stripos got A LOT FASTER, from about 1400 ms to 75 ms
mb_substr got A LOT SLOWER, from about 47 ms to about 800 ms
Also note that these functions are not feature complete, as you can see from the LogicExceptions they throw.
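Since the replacements are not feature complete, it may be worth checking a preg-based helper against the real mb_ function before swapping it in. The following is only a sketch: utf8_length() is a hypothetical name, not part of the answer's code, and it assumes the mbstring extension is loaded so that mb_strlen() can serve as the reference.

```php
<?php
// Hypothetical helper (illustration only): counts UTF-8 characters
// by counting how many times "." matches.
function utf8_length(string $str): int
{
    // /u = treat pattern and subject as UTF-8, /s = "." also matches newlines.
    $count = preg_match_all('/./us', $str);
    if ($count === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $count;
}

// Compare against mb_strlen() on a few inputs, including the strings
// used in the benchmarks above.
$samples = ['', 'abc', "line\nbreak", 'あえいおう', 'fdsfdssdfoifjosdifjosdifjosdij:ά'];
foreach ($samples as $sample) {
    $expected = mb_strlen($sample, 'UTF-8');
    $actual = utf8_length($sample);
    echo ($expected === $actual ? 'ok  ' : 'FAIL'), " expected={$expected} actual={$actual}\n";
}
```

A check like this is most useful on strings that contain multibyte characters before the end, and on strings with newlines, because byte offsets and "." without /s both go wrong there.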
Also note that, due to a limitation in preg_, I had to cap the mb_substr benchmark at 65,000 iterations:
for($i = 0; $i < 65000; $i ++) {
$a = mb_substr ( "fdsfdssdfoifjosdifjosdifjosdij:ά", $i, 1, "UTF-8" );
}
because the loop counter ends up as the {n} repeat count in the generated pattern, and PCRE refuses to compile a pattern whose quantifier is larger than 65,535.
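The limit can be seen in isolation with a short subject string. This is a minimal sketch, assuming the patterns have the same shape as the ones _mb_substr builds; the variable names are made up for the illustration.

```php
<?php
$subject = 'abc';

// 65535 is the largest repeat count PCRE accepts in a {n} quantifier.
// The pattern compiles; it simply does not match this short subject.
$compiles = @preg_match('/^.{65535}(.{0,1})/us', $subject);

// One more and the pattern no longer compiles: preg_match() returns false.
// The "@" only hides the warning that PHP raises alongside it.
$tooBig = @preg_match('/^.{65536}(.{0,1})/us', $subject);

var_dump($compiles); // int(0)
var_dump($tooBig);   // bool(false)
```

A replacement for mb_substr() built this way would therefore need a fallback (for example, calling the real mb_substr()) whenever the start or length exceeds 65,535.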
Also note that your benchmark code can be made a lot simpler. All of this
$time = microtime();
$time = explode(' ', $time);
$start = $time[1] + $time[0];
$startms = $time[0];
for ($i=0; $i<100000; $i++) {
$a = mb_strlen("fdsfdssdfoifjosdifjosdifjosdij:ά", "UTF-8");
}
$time = microtime();
$time = explode(' ', $time);
$finish = $time[1] + $time[0];
$finishms = $time[0];
$total_time = round(($finish - $start), 4);
echo "mb_strlen: " . $total_time*1000 ." milliseconds<br/>";
can simply be replaced with
$starttime=microtime(true);
for ($i=0; $i<100000; $i++) {
$a = mb_strlen("fdsfdssdfoifjosdifjosdifjosdij:ά", "UTF-8");
}
$endtime=microtime(true);
echo "mb_strlen: " . number_format(($endtime-$starttime),3) ." seconds<br/>";
which outputs something like: mb_strlen: 0.085 seconds
(which means about 85 milliseconds)
or, to print milliseconds instead:
echo "mb_strlen: " . number_format(($endtime - $starttime) * 1000, 2) . " milliseconds<br/>";
(My wild guess is that it has something to do with realloc() performance, where Linux stomps Windows, but I have no proof.)
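The microtime(true) pattern can also be wrapped in a small helper so each function under test needs only one line. This is a sketch rather than a rigorous benchmark: bench() is a hypothetical name, the closures add a little call overhead to every measurement, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (illustration only): calls $fn $iterations times,
// passing the loop counter, and returns the elapsed time in milliseconds.
function bench(callable $fn, int $iterations = 100000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($i);
    }
    return (microtime(true) - $start) * 1000;
}

$subject = "fdsfdssdfoifjosdifjosdifjosdij:ά";

// Same three calls as in the question's benchmark.
$cases = [
    'mb_strlen'  => function ($i) use ($subject) { return mb_strlen($subject, 'UTF-8'); },
    'mb_stripos' => function ($i) use ($subject) { return mb_stripos($subject, 'α', 0, 'UTF-8'); },
    'mb_substr'  => function ($i) use ($subject) { return mb_substr($subject, $i, 1, 'UTF-8'); },
];

foreach ($cases as $label => $fn) {
    echo $label, ': ', number_format(bench($fn), 2), " milliseconds\n";
}
```

Because the closure overhead is the same for every case, the numbers remain comparable with each other, though they will read slightly higher than the inline loops.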