
Check if a string is multibyte in PHP

I want to check whether a string is multibyte in PHP. Any idea how to accomplish this?

Example:

<?php
$string = "I dont have idea that is what i am...";
if( is_multibyte( $string ) )
{
    echo 'yes!!';
}else{
    echo 'ups!';
}
?>

Maybe something like this (the 8-bit rule)?

<?php
if( mb_strlen( $string ) < strlen( $string ) )
{
    return true;
}
else
{
    return false;
}
?>

I have read the Wikipedia articles on variable-width encoding and UTF-8.

Jorge Olaf asked May 29 '13 18:05


3 Answers

There are two interpretations. The first is that every character in the string is multibyte. The second is that the string contains at least one multibyte character. If you are interested in handling invalid byte sequences, see https://stackoverflow.com/a/13695364/531320 for details.

function is_all_multibyte($string)
{
    // Reject strings that contain an invalid UTF-8 byte sequence.
    if (mb_check_encoding($string, 'UTF-8') === false) return false;

    $length = mb_strlen($string, 'UTF-8');

    // Note: for an empty string the loop never runs and the result is true.
    for ($i = 0; $i < $length; $i += 1) {
        $char = mb_substr($string, $i, 1, 'UTF-8');

        // A character that is valid ASCII is single-byte,
        // so not every character is multibyte.
        if (mb_check_encoding($char, 'ASCII')) {
            return false;
        }
    }

    return true;
}

function contains_any_multibyte($string)
{
    return !mb_check_encoding($string, 'ASCII') && mb_check_encoding($string, 'UTF-8');
}

$data = ['東京', 'Tokyo', '東京(Tokyo)'];

var_dump(
    [true, false, false] ===
    array_map(function($v) {
        return is_all_multibyte($v);
    },
    $data),
    [true, false, true] ===
    array_map(function($v) {
        return contains_any_multibyte($v);
    },
    $data)
);
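To see which of the two interpretations applies to a given string, it can help to look at the byte width of each character. This is only a minimal sketch and not part of the answer above: `char_byte_widths` is a hypothetical helper name, and it assumes PHP 7.4 or later (for `mb_str_split`) and valid UTF-8 input.

```php
<?php
// Hypothetical helper: byte width of each character in a UTF-8 string.
// Assumes PHP >= 7.4 (mb_str_split) and valid UTF-8 input.
function char_byte_widths(string $string): array
{
    // Split into characters, then measure each character in bytes.
    return array_map('strlen', mb_str_split($string, 1, 'UTF-8'));
}

// "\u{6771}\u{4EAC}" is 東京, written with escapes so the file encoding does not matter.
$widths = char_byte_widths("\u{6771}\u{4EAC}(Tokyo)");
print_r($widths);

// "All multibyte": every width is greater than 1.
// "Any multibyte": at least one width is greater than 1.
$all = $widths !== [] && min($widths) > 1;
$any = $widths !== [] && max($widths) > 1;
var_dump($all, $any);
```

Unlike `is_all_multibyte`, this sketch treats the empty string as not "all multibyte"; that is a design choice made here, not something the answer specifies.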
masakielastic answered Oct 28 '22 22:10


I'm not sure if there's a better way, but a quick way that comes to mind is:

if (mb_strlen($str) != strlen($str)) {
    echo "yes";
} else {
    echo "no";
}
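One caveat with the quick check: when `mb_strlen()` is called without an encoding argument, it uses the value of `mb_internal_encoding()`, so the result can change with configuration. The sketch below tries to illustrate that, assuming PHP 7.0 or later (for the `\u{...}` escape); the variable names are only illustrative.

```php
<?php
$str = "caf\u{00E9}"; // "café" as UTF-8: 5 bytes, 4 characters

// With a single-byte internal encoding, every byte counts as one character,
// so the two lengths are equal and the check reports "no".
mb_internal_encoding('ISO-8859-1');
$underLatin1 = mb_strlen($str) !== strlen($str);

// With UTF-8 as the internal encoding, the same bytes yield 4 characters.
mb_internal_encoding('UTF-8');
$underUtf8 = mb_strlen($str) !== strlen($str);

var_dump($underLatin1, $underUtf8);

// Passing the encoding explicitly avoids depending on the global setting.
$explicit = mb_strlen($str, 'UTF-8') !== strlen($str);
var_dump($explicit);
```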
periklis answered Oct 28 '22 22:10


To determine if something is multibyte or not you need to be specific about which character set you're using. If your character set is Latin1, for example, no strings will be multibyte. If your character set is UTF-16, every string is multibyte.

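That said, if you only care about a specific character set, say UTF-8, you can test whether mb_strlen is less than strlen, provided you specify the encoding parameter explicitly. Fewer characters than bytes means at least one character takes more than one byte.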

function is_multibyte($s) {
  return mb_strlen($s,'utf-8') < strlen($s);
}
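The point about character sets can be illustrated by encoding the same text in different ways and comparing byte counts with character counts. This is a rough sketch rather than part of the answer; it assumes the mbstring extension is loaded and PHP 7.0 or later (for the `\u{...}` escape).

```php
<?php
$utf8 = "Tokyo"; // pure ASCII, so 5 bytes in UTF-8

// The same text re-encoded as UTF-16LE uses 2 bytes per character,
// so even plain ASCII text becomes "multibyte".
$utf16 = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

var_dump(strlen($utf8), mb_strlen($utf8, 'UTF-8'));
var_dump(strlen($utf16), mb_strlen($utf16, 'UTF-16LE'));

// "é" takes 2 bytes in UTF-8 but a single byte in ISO-8859-1 (Latin-1).
$eUtf8 = "\u{00E9}";
$eLatin1 = mb_convert_encoding($eUtf8, 'ISO-8859-1', 'UTF-8');

var_dump(strlen($eUtf8), strlen($eLatin1));
```

So whether a string counts as multibyte depends on the encoding its bytes are in, not on the text alone.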
James Holderness answered Oct 28 '22 21:10