
How to normalize encoding names, like ks_c_5601-1987 to CP949?

I am fetching emails from a mail server, converting each message to UTF-8, and saving it in a database. To convert the charset I am using mb_convert_encoding, but it fails to convert gb2312 and ks_c_5601-1987. After some searching I found that I can use CP936 in place of gb2312 and CP949 in place of ks_c_5601-1987.

That approach would mean maintaining a separate list of charset mappings in my code. Is there a way to normalize encoding names to the names PHP supports internally, so that I don't need to maintain a map locally?

Nidhi Kaushal asked Dec 10 '12 09:12


1 Answer

According to PHP's list of supported character encodings, only a small number of encodings are listed explicitly by code page. This is not the built-in normalisation you asked for, but with so few cases, maintaining a short list of mappings is not unreasonable.

The relevant ones appear to be the following (the code-page name on the left is the one PHP lists; the name on the right is the one you'll need to convert from):

  • CP932 shift_jis
  • CP51932 euc_jp
  • CP50220 iso-2022-jp
  • CP50221 csISO2022JP
  • CP50222 iso-2022-jp
  • CP936 gb2312
  • CP950 big5
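
A list like this can be kept as a small lookup table. The following is a minimal sketch, not a built-in PHP facility: the names `CHARSET_MAP`, `normalize_charset` and `to_utf8` are hypothetical, only a subset of the entries is shown, and the `ks_c_5601-1987` entry comes from the question rather than from the list. It assumes PHP 7 or later with the mbstring extension loaded.

```php
<?php
// Hypothetical lookup table: lowercase incoming charset label => name passed to mbstring.
// Entries follow the list in this answer; ks_c_5601-1987 => CP949 comes from the question.
const CHARSET_MAP = [
    'shift_jis'      => 'CP932',
    'euc_jp'         => 'CP51932',
    'gb2312'         => 'CP936',
    'big5'           => 'CP950',
    'ks_c_5601-1987' => 'CP949',
];

// Return the mapped name if one exists, otherwise the (trimmed) label unchanged.
function normalize_charset(string $label): string
{
    $label = trim($label);
    return CHARSET_MAP[strtolower($label)] ?? $label;
}

// Convert a message body to UTF-8 using the normalized source charset.
function to_utf8(string $body, string $label): string
{
    return mb_convert_encoding($body, 'UTF-8', normalize_charset($label));
}
```

Lowercasing the label before the lookup means headers such as `KS_C_5601-1987` and `ks_c_5601-1987` resolve to the same entry, and labels that are not in the table are passed through for mbstring to handle.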

The following are also listed by code page in the PHP documentation but appear to have suitable synonyms already:

  • CP866 (IBM866)
  • UHC (CP949)
  • Windows-1251 (CP1251)
  • Windows-1252 (CP1252)
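
Whether a given name already has a synonym can also be checked at runtime instead of against the documentation. A minimal sketch, assuming the mbstring extension is loaded; the helper name `is_known_encoding` is hypothetical. It relies on `mb_encoding_aliases()`, which returns false (with a warning) for an unknown name on PHP 7 and throws a ValueError on PHP 8, so both cases are handled. The result for any particular name may differ between PHP versions.

```php
<?php
// Hypothetical helper: true if mbstring recognizes $name as an encoding name or alias.
function is_known_encoding(string $name): bool
{
    try {
        // PHP 7: unknown names return false and emit a warning (suppressed here).
        return @mb_encoding_aliases($name) !== false;
    } catch (\ValueError $e) {
        // PHP 8: unknown names raise ValueError instead.
        return false;
    }
}

// Report which incoming labels would still need an entry in a local map.
foreach (['CP949', 'Windows-1251', 'ks_c_5601-1987'] as $name) {
    printf("%-16s %s\n", $name, is_known_encoding($name) ? 'supported' : 'needs a mapping');
}
```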
borrible answered Sep 18 '22 15:09