How do I find the number of bytes within UTF-8 string with PHP?

I have the following function from the php.net site to determine the number of bytes in an ASCII or UTF-8 string:

<?php 
/** 
 * Count the number of bytes of a given string. 
 * Input string is expected to be ASCII or UTF-8 encoded. 
 * Warning: the function doesn't return the number of chars 
 * in the string, but the number of bytes. 
 * 
 * @param string $str The string to compute number of bytes 
 * 
 * @return The length in bytes of the given string. 
 */ 
function strBytes($str) 
{ 
  // STRINGS ARE EXPECTED TO BE IN ASCII OR UTF-8 FORMAT 

  // Number of characters in string 
  $strlen_var = strlen($str); 

  // string bytes counter 
  $d = 0; 

 /* 
  * Iterate over every character in the string, 
  * escaping with a slash or encoding to UTF-8 where necessary 
  */ 
  for ($c = 0; $c < $strlen_var; ++$c) { 

      $ord_var_c = ord($str{$d}); 

      switch (true) { 
          case (($ord_var_c >= 0x20) && ($ord_var_c <= 0x7F)): 
              // characters U-00000000 - U-0000007F (same as ASCII) 
              $d++; 
              break; 

          case (($ord_var_c & 0xE0) == 0xC0): 
              // characters U-00000080 - U-000007FF, mask 110XXXXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=2; 
              break; 

          case (($ord_var_c & 0xF0) == 0xE0): 
              // characters U-00000800 - U-0000FFFF, mask 1110XXXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=3; 
              break; 

          case (($ord_var_c & 0xF8) == 0xF0): 
              // characters U-00010000 - U-001FFFFF, mask 11110XXX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=4; 
              break; 

          case (($ord_var_c & 0xFC) == 0xF8): 
              // characters U-00200000 - U-03FFFFFF, mask 111110XX 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=5; 
              break; 

          case (($ord_var_c & 0xFE) == 0xFC): 
              // characters U-04000000 - U-7FFFFFFF, mask 1111110X 
              // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
              $d+=6; 
              break; 
          default: 
            $d++;    
      } 
  } 

  return $d; 
} 
?> 

However, when I try this with Russian text (e.g. По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число.), it doesn't seem to return the correct number of bytes.

The switch statement is falling through to the default case. Any ideas why Russian characters would not be working as expected? Or are there better options for this?

I am asking this because I need to shorten a UTF-8 string to a certain number of bytes: in my situation I can send at most 169 bytes of JSON data to the iPhone APNS (excluding the other packet data).

Reference: PHP strlen - Manual (Paolo Comment on 10-Jan-2007 03:58)

Luke asked Mar 05 '10 02:03



2 Answers

"I am asking this as I need to shorten a utf-8 string to a certain number of bytes."

mb_strcut() does exactly this, though you might not be able to tell from the barely comprehensible documentation.
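
As a minimal sketch of what that looks like, assuming the mbstring extension is loaded, the input is valid UTF-8, and the source file itself is saved as UTF-8 (the sample text and variable names are only illustrative):

```php
<?php
// Sample text: 6 Cyrillic letters, 2 bytes each in UTF-8 (12 bytes total).
$text = "Привет";

// Ask for at most 5 bytes. mb_strcut() takes its offset and length in bytes,
// but it does not split a multibyte character, so the result ends after the
// second letter (4 bytes) rather than in the middle of the third.
$cut = mb_strcut($text, 0, 5, 'UTF-8');

echo $cut, ' (', strlen($cut), " bytes)\n"; // prints "Пр (4 bytes)"
```

For the 169-byte case in the question, the same call with 169 as the length would apply. Note that mb_substr(), by contrast, takes its offset and length in characters, not bytes.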

Michael Borgwardt answered Sep 18 '22 00:09



strlen() returns the number of bytes.
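
A quick way to see the difference between bytes and characters, as a sketch that assumes the mbstring extension is available and the source file is saved as UTF-8 (the sample strings are illustrative):

```php
<?php
$ascii   = "Hello";
$russian = "Привет";

$asciiBytes   = strlen($ascii);               // 5 bytes, 1 byte per letter
$russianBytes = strlen($russian);             // 12 bytes, 2 bytes per Cyrillic letter
$russianChars = mb_strlen($russian, 'UTF-8'); // 6 characters

echo "$asciiBytes $russianBytes $russianChars\n"; // prints "5 12 6"
```

One caveat: on PHP versions before 8.0, the mbstring.func_overload setting can replace strlen() with its multibyte counterpart, in which case strlen() would count characters instead.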

Shortening a multibyte string to a certain number of bytes is a separate task. You will need to take care not to chop the string off in the middle of a multibyte sequence as you shorten it.
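
One way to avoid cutting inside a sequence without mbstring is to back up while the first dropped byte is a UTF-8 continuation byte (bit pattern 10xxxxxx). The helper below is a hypothetical sketch, not part of PHP, and it assumes the input is valid UTF-8:

```php
<?php
/**
 * Hypothetical helper: cut $str to at most $maxBytes bytes without
 * splitting a UTF-8 sequence. Assumes $str is valid UTF-8.
 */
function truncateUtf8Bytes(string $str, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($str) <= $maxBytes) {
        return $str; // already short enough
    }

    $cut = $maxBytes;
    // $str[$cut] is the first byte that would be dropped. If it is a
    // continuation byte (10xxxxxx), the cut falls inside a character,
    // so step back until it lands on that character's first byte.
    while ($cut > 0 && (ord($str[$cut]) & 0xC0) === 0x80) {
        $cut--;
    }

    return substr($str, 0, $cut);
}

echo truncateUtf8Bytes("Привет", 5), "\n"; // prints "Пр"
```

The result can be shorter than the limit (4 bytes here for a limit of 5), because a character is either kept whole or dropped.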

The other thing you need to handle is that when you put a string into JSON notation, it might need more bytes to represent it as JSON. For example, if your string contains a double quote character, it needs to be escaped, and the backslash character will add one byte. There are other characters that need to be escaped too. The point is, it can get larger. I assume the byte limit is on the total JSON payload, so you do need to account for the JSON syntax itself, as well as any escaping that JSON will impose on your string.
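
To make the growth concrete, here is a small sketch comparing raw and JSON-encoded sizes with json_encode(). The sample strings are illustrative, and the source file is assumed to be saved as UTF-8:

```php
<?php
$quoted  = 'say "hi"'; // 8 bytes raw
$russian = "Привет";   // 12 bytes raw

// Two surrounding quotes plus one backslash per escaped quote: 8 + 2 + 2.
$quotedJson = strlen(json_encode($quoted));

// By default json_encode() writes non-ASCII characters as \uXXXX escapes,
// so each 2-byte Cyrillic letter becomes 6 bytes: 6 * 6 + 2 quotes.
$escapedJson = strlen(json_encode($russian));

// With JSON_UNESCAPED_UNICODE the letters stay as UTF-8: 12 + 2 quotes.
$unescapedJson = strlen(json_encode($russian, JSON_UNESCAPED_UNICODE));

echo "$quotedJson $escapedJson $unescapedJson\n"; // prints "12 38 14"
```

So for non-ASCII text, the choice of encoding flags can matter more for the byte budget than the escaping of quotes.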

An unoptimized, kinda hacky way to do it is to chop the string at, say, 5 bytes more than your limit, using substr(). Now use mb_strlen() to get the number of characters, and mb_substr() to remove the last character. Now encode it as JSON, and measure the bytes via strlen(). Enter a loop that keeps chopping off the last character using mb_substr(), encoding as JSON, and measuring the bytes again using strlen(). The loop terminates when the number of bytes is acceptable.
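
The loop described above could be sketched roughly as follows. Several things here are assumptions rather than part of the answer: the function name is hypothetical, the `aps`/`alert` payload shape is only an illustration, the initial chop uses mb_strcut() instead of substr() so the string stays valid UTF-8 for json_encode(), and JSON_UNESCAPED_UNICODE is used to keep non-ASCII text compact. It also assumes the mbstring extension and valid UTF-8 input.

```php
<?php
/**
 * Hypothetical helper: shorten $text until the whole JSON payload
 * fits in $maxBytes, and return the encoded payload.
 */
function fitAlertToJsonLimit(string $text, int $maxBytes): string
{
    // Illustrative payload shape; adjust to whatever is actually sent.
    $encode = function (string $s): string {
        return json_encode(['aps' => ['alert' => $s]], JSON_UNESCAPED_UNICODE);
    };

    // First rough chop on a character boundary. The text alone can never
    // be longer than the limit for the whole payload.
    $text = mb_strcut($text, 0, max($maxBytes, 0), 'UTF-8');

    // Drop one character at a time until the encoded payload fits.
    while ($text !== '' && strlen($encode($text)) > $maxBytes) {
        $text = mb_substr($text, 0, -1, 'UTF-8');
    }

    return $encode($text);
}

$json = fitAlertToJsonLimit(str_repeat('Привет ', 50), 169);
echo strlen($json), " bytes\n";
```

If even an empty string does not fit, this sketch returns the oversized empty payload, so a caller would still need to check the final length. Re-encoding on every iteration is wasteful, which matches the "unoptimized" description above.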

goat answered Sep 22 '22 00:09
