I would like to write a (HTML) parser based on state machine but I have doubts how to acctually read/use an input. I decided to load the whole input into one string and then work with it as with an array and hold its index as current parsing position.
There would be no problems with single-byte encoding, but in multi-byte encoding each value does not represent a character, but a byte of a character.
Example:
$mb_string = 'žščř'; //4 multi-byte characters in UTF-8
for($i=0; $i < 4; $i++)
{
echo $mb_string[$i], PHP_EOL;
}
Outputs:
Ĺ
ž
Ĺ
Ą
This means I cannot iterate through the string in a loop to check single characters, because I never know if I am in the middle of an character or not.
So the questions are:
http://php.net/mb_string is the thing you're looking for
mb_internal_encoding("UTF-8");
$mb_string = 'žščř';
$l=mb_strlen($mb_string);
for($i=0;$i<$l;$i++){
print(mb_substr($mb_string,$i,1)."<br/>");
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With