We have a web app that exports CSV files containing foreign characters with UTF-8, no BOM. Both Windows and Mac users get garbage characters in Excel. I tried converting to UTF-8 with BOM; Excel/Win is fine with it, Excel/Mac shows gibberish. I'm using Excel 2003/Win, Excel 2011/Mac. Here's all the encodings I tried:
Encoding BOM Win Mac -------- --- ---------------------------- ------------ utf-8 -- scrambled scrambled utf-8 BOM WORKS scrambled utf-16 -- file not recognized file not recognized utf-16 BOM file not recognized Chinese gibberish utf-16LE -- file not recognized file not recognized utf-16LE BOM characters OK, same as Win row data all in first field
The best one is UTF-16LE with BOM, but the CSV is not recognized as such. The field separator is comma, but semicolon doesn't change things.
Is there any encoding that works in both worlds?
Excel Read csv, set UTF-8 as default for all csv files. Also, you can include the UTF-8 BOM (0xEF,0xBB,0xBF) at the beginning of the file. That will get Microsoft software to recognize it as UTF-8.
UTF-8, or "Unicode Transformation Format, 8 Bit" is a marketing operations pro's best friend when it comes to data imports and exports. It refers to how a file's character data is encoded when moving files between systems.
Open your CSV file in Microsoft Excel. Click File in the top-left corner of your screen. Click the drop-down menu next to File format. Select CSV UTF-8 (Comma delimited) (.
I found the WINDOWS-1252
encoding to be the least frustrating when dealing with Excel. Since its basically Microsofts own proprietary character set, one can assume it will work on both the Mac and the Windows version of MS-Excel. Both versions at least include a corresponding "File origin" or "File encoding" selector which correctly reads the data.
Depending on your system and the tools you use, this encoding could also be named CP1252
, ANSI
, Windows (ANSI)
, MS-ANSI
or just Windows
, among other variations.
This encoding is a superset of ISO-8859-1
(aka LATIN1
and others), so you can fallback to ISO-8859-1
if you cannot use WINDOWS-1252
for some reason. Be advised that ISO-8859-1
is missing some characters from WINDOWS-1252
as shown here:
| Char | ANSI | Unicode | ANSI Hex | Unicode Hex | HTML entity | Unicode Name | Unicode Range | | € | 128 | 8364 | 0x80 | U+20AC | € | euro sign | Currency Symbols | | ‚ | 130 | 8218 | 0x82 | U+201A | ‚ | single low-9 quotation mark | General Punctuation | | ƒ | 131 | 402 | 0x83 | U+0192 | ƒ | Latin small letter f with hook | Latin Extended-B | | „ | 132 | 8222 | 0x84 | U+201E | „ | double low-9 quotation mark | General Punctuation | | … | 133 | 8230 | 0x85 | U+2026 | … | horizontal ellipsis | General Punctuation | | † | 134 | 8224 | 0x86 | U+2020 | † | dagger | General Punctuation | | ‡ | 135 | 8225 | 0x87 | U+2021 | ‡ | double dagger | General Punctuation | | ˆ | 136 | 710 | 0x88 | U+02C6 | ˆ | modifier letter circumflex accent | Spacing Modifier Letters | | ‰ | 137 | 8240 | 0x89 | U+2030 | ‰ | per mille sign | General Punctuation | | Š | 138 | 352 | 0x8A | U+0160 | Š | Latin capital letter S with caron | Latin Extended-A | | ‹ | 139 | 8249 | 0x8B | U+2039 | ‹ | single left-pointing angle quotation mark | General Punctuation | | Œ | 140 | 338 | 0x8C | U+0152 | Œ | Latin capital ligature OE | Latin Extended-A | | Ž | 142 | 381 | 0x8E | U+017D | | Latin capital letter Z with caron | Latin Extended-A | | ‘ | 145 | 8216 | 0x91 | U+2018 | ‘ | left single quotation mark | General Punctuation | | ’ | 146 | 8217 | 0x92 | U+2019 | ’ | right single quotation mark | General Punctuation | | “ | 147 | 8220 | 0x93 | U+201C | “ | left double quotation mark | General Punctuation | | ” | 148 | 8221 | 0x94 | U+201D | ” | right double quotation mark | General Punctuation | | • | 149 | 8226 | 0x95 | U+2022 | • | bullet | General Punctuation | | – | 150 | 8211 | 0x96 | U+2013 | – | en dash | General Punctuation | | — | 151 | 8212 | 0x97 | U+2014 | — | em dash | General Punctuation | | ˜ | 152 | 732 | 0x98 | U+02DC | ˜ | small tilde | Spacing Modifier Letters | | ™ | 153 | 8482 | 0x99 | U+2122 | ™ | trade mark sign | Letterlike Symbols | | š | 154 | 353 | 0x9A | U+0161 | š | Latin small letter s with caron | Latin Extended-A | | › | 155 | 8250 | 0x9B | U+203A | › | single right-pointing angle quotation mark | General Punctuation | | œ | 156 | 339 | 0x9C | U+0153 | œ | Latin small ligature oe | Latin Extended-A | | ž | 158 | 382 | 0x9E | U+017E | | Latin small letter z with caron | Latin Extended-A | | Ÿ | 159 | 376 | 0x9F | U+0178 | Ÿ | Latin capital letter Y with diaeresis | Latin Extended-A |
Note that the euro sign is missing. This table can be found at Alan Wood.
Conversion is done differently in every tool and language. However, suppose you have a file query_result.csv
which you know is UTF-8
encoded. Convert it to WINDOWS-1252
using iconv
:
iconv -f UTF-8 -t WINDOWS-1252 query_result.csv > query_result-win.csv
For UTF-16LE with BOM if you use tab characters as your delimiters instead of commas Excel will recognise the fields. The reason it works is that Excel actually ends up using its Unicode *.txt parser.
Caveat: If the file is edited in Excel and saved, it will be saved as tab-delimited ASCII. The problem now is that when you re-open the file Excel assumes it's real CSV (with commas), sees that it's not Unicode, so parses it as comma-delimited - and hence will make a hash of it!
Update: The above caveat doesn't appear to be happening for me today in Excel 2010 (Windows) at least, although there does appear to be a difference in saving behaviour if:
compared to:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With