
Converting Mac Roman character to equivalent UTF-8

I have been given some HTML files that use the Mac OS Roman file encoding. The files contain French text, but in an editor many of the accented characters look wrong (i.e. they are not the characters French would use):

Si cette option est sÈlectionnÈe, <removed> tentera de communiquer avec votre tÈlescope seulement ‡ líaide díun ...

The capital E with accent does display properly in the browser as é, as do the other strange characters.

I also have some UTF-8 French files that look normal in an editor (é looks like é). What I'd like to do is convert all the Mac Roman files to UTF-8 for easier maintenance.

Simply changing the file encoding in the editor doesn't do this. The strange characters are still strange.

Short of making a conversion dictionary and doing a Find/Replace on all the files, is there a way to do this?

btschumy asked Jul 09 '13 22:07


2 Answers

If your editor isn’t showing it correctly when you specify the encoding, you have given it the wrong encoding. You need to figure out what encoding you really have.

You appear to have a byte valued 0xE9 where you need a Unicode LATIN SMALL LETTER E WITH ACUTE character. A MacRoman 0xE9 byte is a LATIN CAPITAL LETTER E WITH GRAVE character, which is what your editor is displaying because you said it was MacRoman. But it is not.

However, Unicode code point U+00E9 is indeed LATIN SMALL LETTER E WITH ACUTE.

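Therefore, it is not MacRoman that you have there, but almost certainly ISO-8859-1 or ISO-8859-15. (The í in your sample is byte 0x92 when read as MacRoman; that byte is a right single quotation mark in Windows-1252, so Windows-1252 is also worth trying if the apostrophes come out wrong.)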

So use something like

$ iconv -f ISO-8859-1 -t UTF-8 < input.latin1 > output.utf8

to do the conversion.
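One way to test a guess about the real encoding before converting anything is to take the text exactly as the editor showed it under MacRoman, map it back to the raw bytes, and decode those bytes with the candidate encoding. The following is a minimal sketch of that idea, not a definitive tool: the function name `reinterpret_macroman_view` is hypothetical, Windows-1252 is an assumed candidate (chosen because byte 0x92 is an apostrophe there), and it assumes your `iconv` accepts the names `MACINTOSH` and `WINDOWS-1252`.

```shell
#!/bin/sh
# Hypothetical helper: reads UTF-8 text on stdin that shows how the file
# LOOKED when the editor was told it was Mac OS Roman, and prints what the
# same underlying bytes mean in a candidate encoding.
# $1 = candidate encoding (assumption: WINDOWS-1252 if omitted).
reinterpret_macroman_view() {
    candidate=${1:-WINDOWS-1252}
    # Step 1: turn the displayed characters back into the original raw bytes.
    # Step 2: decode those bytes with the candidate encoding and emit UTF-8.
    iconv -f UTF-8 -t MACINTOSH | iconv -f "$candidate" -t UTF-8
}

# Sample taken from the question text.
printf '%s\n' 'sÈlectionnÈe ‡ líaide díun' | reinterpret_macroman_view WINDOWS-1252
# prints: sélectionnée à l’aide d’un
```

If the output reads as proper French, the candidate is probably the encoding to pass to `iconv -f` for the real conversion.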

tchrist answered Nov 17 '22 18:11


To actually answer the question in the title, "Converting Mac Roman character to equivalent UTF-8", convert the encoding of the file from Mac OS Roman to UTF-8:

$ iconv -f macintosh -t UTF-8 < INPUT_FILE_PATH > OUTPUT_FILE_PATH
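Since the question asks about converting a whole set of files, the single-file command can be wrapped in a loop. This is only a sketch under stated assumptions: the function name `convert_dir`, the `*.html` pattern, and the directory layout are all hypothetical, and the source encoding is passed in as a parameter because it has to match what the files really contain. Output goes to a separate directory so the originals stay untouched.

```shell
#!/bin/sh
# Hypothetical helper: convert every .html file in a directory to UTF-8.
# Usage: convert_dir SOURCE_ENCODING SOURCE_DIR OUTPUT_DIR
# Example (assumed directory names): convert_dir macintosh ./french ./french-utf8
convert_dir() {
    from_enc=$1
    src_dir=$2
    out_dir=$3

    mkdir -p "$out_dir" || return 1

    for f in "$src_dir"/*.html; do
        # Skip the literal pattern when no file matches.
        [ -e "$f" ] || continue

        # Write to a new file; never redirect iconv output onto its own input.
        if ! iconv -f "$from_enc" -t UTF-8 < "$f" > "$out_dir/$(basename "$f")"; then
            echo "conversion failed: $f" >&2
            return 1
        fi
    done
}
```

It is worth opening a few converted files in an editor set to UTF-8 before replacing the originals. A wrong source encoding usually does not make `iconv` fail; it just produces the wrong characters.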
Richard de Wit answered Nov 17 '22 17:11