Parsing multibyte string in PHP

Question

I would like to write an (HTML) parser based on a state machine, but I am unsure how to actually read the input. I decided to load the whole input into one string, treat it like an array, and keep an index as the current parsing position.

That would be no problem with a single-byte encoding, but in a multi-byte encoding each index refers to one byte of a character rather than to a whole character.

Example:

$mb_string = 'žščř'; //4 multi-byte characters in UTF-8

for($i=0; $i < 4; $i++)
{
   echo $mb_string[$i], PHP_EOL;
}

Outputs:

Ĺ
ž
Ĺ
Ą

This means I cannot iterate through the string in a loop to check single characters, because I never know whether I am in the middle of a character or not.
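To make the byte/character mismatch concrete, here is a small sketch that compares the two lengths and dumps the raw bytes. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names other than $mb_string are illustrative.

```php
<?php
$mb_string = 'žščř'; // 4 characters, 2 bytes each in UTF-8

$byteCount = strlen($mb_string);             // counts bytes
$charCount = mb_strlen($mb_string, 'UTF-8'); // counts characters
$hex       = bin2hex($mb_string);            // raw bytes as hexadecimal

echo $byteCount, PHP_EOL; // 8
echo $charCount, PHP_EOL; // 4
echo $hex, PHP_EOL;       // c5bec5a1c48dc599
```

$mb_string[0] through $mb_string[3] are therefore the bytes c5, be, c5 and a1, that is, the two halves of 'ž' followed by the two halves of 'š', which is consistent with the four odd characters printed above.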

So the questions are:

1. How do I read a single character from a string in a multi-byte-safe and performance-friendly way?
2. Is it a good idea to work with the string as if it were an array in this case?
3. How would you read the input?

Your Common Sense · Accepted Answer

The mbstring extension (http://php.net/mb_string) is the thing you're looking for.

1. Just read the characters one by one with mb_substr.
2. Not until PHP 6.
3. What input exactly? In general, the usual way.
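On the second question, one way to get array-style access without landing in the middle of a character is to split the string into an array of characters once and index that array instead. This is only a minimal sketch, assuming valid UTF-8 input and a PCRE build with UTF-8 support; $chars and $length are illustrative names.

```php
<?php
$mb_string = 'žščř';

// With the /u modifier PCRE treats the subject as UTF-8, so the empty
// pattern matches between code points rather than between bytes.
$chars = preg_split('//u', $mb_string, -1, PREG_SPLIT_NO_EMPTY);

if ($chars === false) {
    // preg_split() fails when the input is not valid UTF-8
    throw new InvalidArgumentException('Input is not valid UTF-8');
}

$length = count($chars);

for ($i = 0; $i < $length; $i++) {
    echo $chars[$i], PHP_EOL; // one whole character per element
}
```

On PHP 7.4 or later, mb_str_split($mb_string, 1, 'UTF-8') should produce the same array. Either way the elements are code points, so a base letter followed by a combining mark ends up as two elements.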

zaf · Answer

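mb_internal_encoding("UTF-8"); // make the mb_* functions treat strings as UTF-8

$mb_string = 'žščř';

$l = mb_strlen($mb_string); // length in characters, not bytes

for ($i = 0; $i < $l; $i++) {
    // mb_substr() takes a character offset and a character count
    print(mb_substr($mb_string, $i, 1) . "<br/>");
}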
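If calling mb_substr for every position turns out to be slow on long inputs (for UTF-8 it may have to walk the string from the start to find the $i-th character), a parser can instead keep a byte offset and work out each character's length from its first byte. The sketch below is not taken from the answers above: nextUtf8Char is a hypothetical helper, and it assumes valid UTF-8 input and PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper: returns the UTF-8 character that starts at byte
// offset $pos and advances $pos past it, or returns null at end of input.
// Assumes $pos sits on a character boundary.
function nextUtf8Char(string $s, int &$pos): ?string
{
    if ($pos >= strlen($s)) {
        return null;
    }

    $lead = ord($s[$pos]);

    if ($lead < 0x80) {
        $len = 1; // 0xxxxxxx: ASCII
    } elseif (($lead & 0xE0) === 0xC0) {
        $len = 2; // 110xxxxx: two-byte sequence
    } elseif (($lead & 0xF0) === 0xE0) {
        $len = 3; // 1110xxxx: three-byte sequence
    } elseif (($lead & 0xF8) === 0xF0) {
        $len = 4; // 11110xxx: four-byte sequence
    } else {
        $len = 1; // invalid lead byte: hand it back as a single byte
    }

    $char = substr($s, $pos, $len);
    $pos += strlen($char); // may be shorter than $len if the input is truncated

    return $char;
}

$input = 'žščř';
$pos   = 0;
$chars = [];

while (($c = nextUtf8Char($input, $pos)) !== null) {
    $chars[] = $c; // a state machine would inspect $c here
}
```

The parsing position stays a plain byte index, as in the original plan, but it only ever moves in whole-character steps.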

Tags: string, php, parsing, multibyte

Petr Peller
