Corrupted data using UTF-8 and mb_substr

Question

I'm getting data from a MySQL database (a varchar(255) field with utf8_general_ci collation) and trying to write the text to a PDF with PHP. I need to determine the string length in the PDF so I can limit how much text is output in a table. But I noticed that the output of mb_substr/substr is really strange.

For example:

mb_internal_encoding("UTF-8");

$_tmpStr = $vfrow['title'];
$_tmpStrLen = mb_strlen($vfrow['title']);
// Log the full title, then the title cut to $i characters, for every length down to 0.
for ($i = $_tmpStrLen; $i >= 0; $i--) {
    file_put_contents('cutoffattributes.txt', $vfrow['field'] . " " . $_tmpStr . "\n", FILE_APPEND);
    file_put_contents('cutoffattributes.txt', $vfrow['field'] . " " . mb_substr($_tmpStr, 0, $i) . "\n", FILE_APPEND);
}

outputs this:

[Screenshot of the output file in Notepad++ (npp), with a link to the file]

Database:

[Screenshot of the database contents]

My question is where does the extra character come from?

deceze · Accepted Answer

You need to ensure you're actually getting the data from the database in UTF-8 encoding by setting your connection encoding appropriately. How to do that depends on your database adapter; see "UTF-8 all the way through" for details.
You need to tell your mb_ functions that the data is in UTF-8 so they can treat it correctly. Either set this globally for all functions using mb_internal_encoding, or pass the $encoding parameter to your function when you call it:
```
mb_substr($_tmpStr, 0, $i, 'UTF-8')
```
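As a rough illustration of both points, here is a minimal sketch that assumes PDO with the MySQL driver. The helper names `buildDsn`, `connectUtf8` and `truncateUtf8`, the host and database names, and the choice of `utf8mb4` as connection charset are assumptions for the example, not something taken from the question. An older server may need `utf8` instead, and other adapters have their own way of setting the charset.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: builds a PDO MySQL DSN that sets the connection charset.
function buildDsn(string $host, string $db, string $charset = 'utf8mb4'): string
{
    return "mysql:host={$host};dbname={$db};charset={$charset}";
}

// Hypothetical helper: opens the connection. Not called here, because it
// needs a real server and real credentials.
function connectUtf8(string $host, string $db, string $user, string $pass): PDO
{
    return new PDO(buildDsn($host, $db), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// Hypothetical helper: cuts by characters, with the encoding passed
// explicitly so the result does not depend on mb_internal_encoding().
function truncateUtf8(string $text, int $maxChars): string
{
    return mb_substr($text, 0, $maxChars, 'UTF-8');
}

echo buildDsn('localhost', 'example_db'), "\n";

// "Gdańsk", with ń written as its UTF-8 bytes C5 84.
echo truncateUtf8("Gda\xC5\x84sk", 4), "\n"; // Gdań
```

With the connection charset set, the bytes PHP receives should already be UTF-8, so the explicit `'UTF-8'` argument describes them correctly.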

Michas · Answer

The extra character is the first byte of a two-byte UTF-8 sequence. You may have a problem with the internal encoding of the Multibyte String functions: your code is treating the text as a fixed-width, single-byte encoding. The ń in UTF-8 is the byte pair C5 84 (hex). Read as single-byte text, those two bytes become two characters: Ĺ„ in CP-1250, or Ĺ followed by the IND control character in ISO-8859-2.
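The effect can be reproduced without a database. The sketch below uses a made-up sample title, "Gdańsk", with ń written as the bytes C5 84; the variable names are only for illustration. It compares a byte-based cut with a character-based one.

```php
<?php
// "Gdańsk": 6 characters, 7 bytes, because ń is the two bytes C5 84.
$title = "Gda\xC5\x84sk";

$byteCount  = strlen($title);                   // 7
$charCount  = mb_strlen($title, 'UTF-8');       // 6
$singleByte = mb_strlen($title, 'ISO-8859-2');  // 7, every byte counts as a character

// Cutting at 4 "characters" in a single-byte view keeps only the first
// byte of ń, which is the stray extra character.
$broken = substr($title, 0, 4);
$clean  = mb_substr($title, 0, 4, 'UTF-8');

echo $byteCount, ' ', $charCount, ' ', $singleByte, "\n"; // 7 6 7
echo bin2hex($broken), "\n"; // 476461c5
echo bin2hex($clean), "\n";  // 476461c584

// The byte-based cut is no longer valid UTF-8.
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
```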

Try running this at the top of your script:

mb_internal_encoding("UTF-8");

http://php.net/manual/en/function.mb-internal-encoding.php
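To check what the setting changes, a small sketch like the following can help. It uses a made-up sample string and switches the internal encoding to ISO-8859-2 only to imitate a single-byte default.

```php
<?php
// "Gdańsk" with ń as the UTF-8 bytes C5 84 (sample data for illustration).
$title = "Gda\xC5\x84sk";

// Imitate a single-byte internal encoding: each byte is one character.
mb_internal_encoding('ISO-8859-2');
$lengthSingleByte = mb_strlen($title);       // 7
$cutSingleByte    = mb_substr($title, 0, 4); // "Gda" plus a lone C5 byte

// With UTF-8 as internal encoding the same calls work on characters.
mb_internal_encoding('UTF-8');
$lengthUtf8 = mb_strlen($title);             // 6
$cutUtf8    = mb_substr($title, 0, 4);       // "Gdań"

// Called without arguments, the function reports the current setting.
echo mb_internal_encoding(), "\n";           // UTF-8
echo $lengthSingleByte, ' ', $lengthUtf8, "\n"; // 7 6
echo bin2hex($cutSingleByte), ' ', bin2hex($cutUtf8), "\n"; // 476461c5 476461c584
```

If the output is still wrong with the internal encoding set to UTF-8, the bytes arriving from the database are probably not UTF-8 in the first place, which points back at the connection charset.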

