I am working for international clients who all use very different alphabets, so I am trying to finally get an overview of a complete workflow between PHP and MySQL that would ensure all character encodings are inserted correctly. I have read a bunch of tutorials on this but still have questions (there is much to learn) and thought I might just put it all together here and ask.
PHP
header('Content-Type:text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
HTML
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<form accept-charset="UTF-8"> .. </form>
(though the latter is optional and only a suggestion to the browser, I believe making the suggestion is better than doing nothing)
MySQL
CREATE DATABASE database_name DEFAULT CHARACTER SET utf8;
or ALTER DATABASE database_name DEFAULT CHARACTER SET utf8;
and/or use utf8_general_ci as MySQL connection collation.
(it is important to note here that this will increase the database size if it uses varchar)
Connection
mysql_query("SET NAMES 'utf8'");
mysql_query("SET CHARACTER SET utf8");
Business logic
detect if not UTF8 with mb_detect_encoding() and convert with iconv().
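A minimal sketch of the conversion step, assuming the source encoding is already known (here ISO-8859-1) rather than guessed. The helper name to_utf8 is hypothetical and not part of PHP; only iconv() is a real function.

```php
<?php
// Hypothetical helper: convert text from a known source encoding to UTF-8.
function to_utf8($text, $sourceEncoding)
{
    // iconv() returns false when the input cannot be converted.
    $converted = iconv($sourceEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException('Cannot convert from ' . $sourceEncoding);
    }
    return $converted;
}

// "café" in ISO-8859-1: the "é" is the single byte 0xE9.
$latin1 = "caf\xE9";

// After conversion the "é" becomes the two-byte UTF-8 sequence 0xC3 0xA9.
$utf8 = to_utf8($latin1, 'ISO-8859-1');
```

The sketch only works if the caller really knows the source encoding, for example from an HTTP header or a file format specification.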
validating overly long sequences of UTF8 and UTF16
$body=preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]|(?<=^|[\x00-\x7F])[\x80-\xBF]+|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/','�',$body);
$body=preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $body);
Questions
is mb_internal_encoding('UTF-8') necessary in PHP 5.3 and higher, and if so, does this mean I have to use the multibyte functions instead of the core functions, e.g. mb_substr() instead of substr()?
is it still necessary to check for malformed input strings, and if so, what is a reliable function/class to do so? I would prefer not to strip bad data, and I don't know enough about transliteration.
should it really be utf8_general_ci or rather utf8_bin?
is there something missing in the above workflow?
sources:
http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/
http://webcollab.sourceforge.net/unicode.html
http://stackoverflow.com/a/3742879/1043231
http://www.adayinthelifeof.nl/2010/12/04/about-using-utf-8-fields-in-mysql/
http://akrabat.com/php/utf8-php-and-mysql/
mb_internal_encoding('UTF-8') doesn't do anything by itself; it only sets the default encoding parameter for each mb_ function. If you're not using any mb_ function, it doesn't make any difference. If you are, it makes sense to set it so you don't have to pass the $encoding parameter each time individually.
mb_detect_encoding is mostly useless since it's fundamentally impossible to accurately detect the encoding of unknown text. You should either know what encoding a blob of text is in because you have a specification about it, or you need to parse appropriate metadata like headers or meta tags where the encoding is specified.
Using mb_check_encoding to check whether a blob of text is valid in the encoding you expect it to be in is typically sufficient. If it's not, discard it and throw an appropriate error.
Regarding:
does this mean I have to use all multi byte functions instead of its core functions
If you are manipulating strings that contain multibyte characters, then yes, you need to use the mb_
functions to avoid getting wrong results. The core string functions only work on a byte level, not a character level, which is what you typically want when working with strings.
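To illustrate the difference between byte-level and character-level functions, here is a small sketch. The sample word is written with explicit byte escapes so the result does not depend on how the source file is saved; the variable names are only for illustration.

```php
<?php
// Default encoding for all mb_ functions that follow.
mb_internal_encoding('UTF-8');

// "Grüße" as UTF-8 bytes: "ü" is 0xC3 0xBC, "ß" is 0xC3 0x9F.
$word = "Gr\xC3\xBC\xC3\x9Fe";

$byteLength = strlen($word);     // counts bytes
$charLength = mb_strlen($word);  // counts characters

// substr() cuts after 3 bytes, i.e. in the middle of "ü".
$byteCut = substr($word, 0, 3);

// mb_substr() cuts after 3 characters and keeps "ü" intact.
$charCut = mb_substr($word, 0, 3);

// The byte-level cut is no longer valid UTF-8.
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');
```

Here strlen() reports 7 while mb_strlen() reports 5, and the substr() result ends in half a character.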
utf8_general_ci vs. utf8_bin only makes a difference when collating, i.e. sorting and comparing strings. With utf8_bin data is treated in binary form, i.e. only identical data is identical. With utf8_general_ci some logic is applied, e.g. "é" sorts together with "e" and upper case is considered equal to lower case.
should it really be utf8_general_ci or rather utf8_bin?
You must use utf8_bin for case-sensitive search; otherwise use utf8_general_ci.
is mb_internal_encoding('UTF-8') necessary in PHP 5.3 and higher and if so does this mean I have to use all multi byte functions instead of its core functions like mb_substr() instead of substr()?
Yes, of course. If you have a multibyte string, you need the mb_* family of functions to work with it, except for binary-safe standard PHP functions like str_replace() (and a few others).
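A rough sketch of why some core functions are safe on UTF-8 and others are not, assuming both the subject and the search string are valid UTF-8. str_replace() matches whole byte sequences, and UTF-8 sequences never overlap partially, whereas strrev() reorders individual bytes.

```php
<?php
// "Grüße" as UTF-8 bytes: "ü" is 0xC3 0xBC, "ß" is 0xC3 0x9F.
$word = "Gr\xC3\xBC\xC3\x9Fe";

// Safe: replaces the complete two-byte sequence for "ü" with "ue".
$replaced = str_replace("\xC3\xBC", "ue", $word);

// Not safe: reverses bytes, so multibyte sequences end up back to front.
$reversed = strrev($word);

$replacedIsValid = mb_check_encoding($replaced, 'UTF-8');
$reversedIsValid = mb_check_encoding($reversed, 'UTF-8');
```

The replaced string is still valid UTF-8, while the reversed one is not.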
is it still necessary to check for malformed input strings and if so what is a reliable function/class to do so? I possibly do not want to strip bad data and don't know enough about transliteration.
Hmm, no you can't check it.
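For comparison, the validity check that the earlier answer recommends, mb_check_encoding, can be sketched as follows. It cannot tell which encoding unknown text is in, but it can reject input that is not well-formed UTF-8. The helper name require_valid_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: return the input unchanged if it is well-formed
// UTF-8, otherwise throw instead of silently stripping or repairing it.
function require_valid_utf8($input)
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $input;
}

$candidates = array(
    "na\xC3\xAFve", // "naïve", well-formed
    "\xC0\xAF",     // overlong encoding of "/", malformed
    "na\xC3",       // truncated two-byte sequence, malformed
);

// Record for each candidate whether it passed the check.
$results = array();
foreach ($candidates as $candidate) {
    try {
        require_valid_utf8($candidate);
        $results[] = true;
    } catch (InvalidArgumentException $e) {
        $results[] = false;
    }
}
```

Only the first candidate passes; the other two are rejected with an exception that the caller can turn into an appropriate error message.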