
Is there a simple way to get a character from a multibyte string in PHP?

This is my problem: my language (Portuguese) uses the ISO-8859-1 character encoding. When I want to access a character in a string like 'coração' (heart), I use:

mb_internal_encoding('ISO-8859-1');
$str = "coração";

$len = mb_strlen($str,'UTF-8');

for($i=0;$i<$len;++$i)
    echo mb_substr($str, $i, 1, 'UTF-8')."<br/>";

This produces:

c
o
r
a
ç
ã
o

This works fine, but my concern is that mb_substr is not as fast as plain string access. I would like a simple way to do this, like normal string character access: echo $str[$pos]. Is that possible?

Lucas Batistussi asked Apr 28 '12 05:04



2 Answers

mb_substr function is not fast as [...] like in normal string character access: echo $str[$pos].... It is possible?

No.

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
  • Premature optimization

The multibyte functions have to check every character to determine how many bytes it occupies (1 to 4 in UTF-8). That is the reason why character indexing ($a[n]) won't work: you don't know which byte(s) make up the nth character until you have read all the characters before it.
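A minimal sketch of that point, assuming the string is UTF-8 encoded. The bytes are written as explicit escapes so the result does not depend on how the source file is saved; the variable names are only illustrative.

```php
<?php
// "coração" as explicit UTF-8 bytes: ç = C3 A7, ã = C3 A3 (two bytes each).
$str = "cora\xC3\xA7\xC3\xA3o";

// strlen() counts bytes, mb_strlen() counts characters.
$byteLen = strlen($str);
$charLen = mb_strlen($str, 'UTF-8');

// Byte offset 4 holds only the first byte of "ç", not the whole character.
$firstByte = bin2hex($str[4]);
$wholeChar = bin2hex(mb_substr($str, 4, 1, 'UTF-8'));

echo "$byteLen bytes, $charLen characters\n";          // 9 bytes, 7 characters
echo "byte 4: $firstByte, character 4: $wholeChar\n";  // byte 4: c3, character 4: c3a7
```

Because byte offsets and character offsets drift apart after the first multibyte character, $str[$pos] only lines up with characters while everything before $pos is plain ASCII.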

To speed things up a bit, you can look at the answers here: How to iterate UTF-8 string in PHP?

However, since you use ISO 8859-1 or Latin-1, you don't have to use the mb_ functions at all, since in that encoding all characters are encoded in one byte.
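As a rough illustration, assuming the data really is ISO-8859-1 (or is converted to it once up front), plain byte indexing then returns whole characters. The conversion step and variable names here are assumptions for the example, not part of the original question.

```php
<?php
// Start from UTF-8 bytes (what a UTF-8 source file or page would give you)...
$utf8 = "cora\xC3\xA7\xC3\xA3o";

// ...and convert once to ISO-8859-1, where every character is exactly one byte.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

$len = strlen($latin1); // bytes and characters now coincide: 7

for ($i = 0; $i < $len; ++$i) {
    // $latin1[$i] is a complete character; convert back to UTF-8 only for display.
    echo mb_convert_encoding($latin1[$i], 'UTF-8', 'ISO-8859-1'), "\n";
}
```

This only works for text that fits in ISO-8859-1; characters outside that set cannot be represented after the conversion.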

CodeCaster answered Oct 07 '22 10:10


Try:

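// The /u modifier makes the pattern treat $str as UTF-8, so "." matches one whole character.
preg_match_all( "/./u", $str, $ar_chars );
// The characters end up in $ar_chars[0]; note that "." skips newlines unless the s modifier is added.
print_r( $ar_chars );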
tty01 answered Oct 07 '22 09:10