Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which encoding opens CSV files correctly with Excel on both Mac and Windows?

We have a web app that exports CSV files containing foreign characters with UTF-8, no BOM. Both Windows and Mac users get garbage characters in Excel. I tried converting to UTF-8 with BOM; Excel/Win is fine with it, Excel/Mac shows gibberish. I'm using Excel 2003/Win, Excel 2011/Mac. Here's all the encodings I tried:

Encoding  BOM      Win                            Mac --------  ---      ----------------------------   ------------ utf-8     --       scrambled                      scrambled utf-8     BOM      WORKS                          scrambled utf-16    --       file not recognized            file not recognized utf-16    BOM      file not recognized            Chinese gibberish utf-16LE  --       file not recognized            file not recognized utf-16LE  BOM      characters OK,                 same as Win                    row data all in first field 

The best one is UTF-16LE with BOM, but the CSV is not recognized as such. The field separator is comma, but semicolon doesn't change things.

Is there any encoding that works in both worlds?

like image 410
Timm Avatar asked Jul 05 '11 19:07

Timm


People also ask

What encoding does Excel use for CSV?

Excel Read csv, set UTF-8 as default for all csv files. Also, you can include the UTF-8 BOM (0xEF,0xBB,0xBF) at the beginning of the file. That will get Microsoft software to recognize it as UTF-8.

What is UTF-8 encoding for a CSV?

UTF-8, or "Unicode Transformation Format, 8 Bit" is a marketing operations pro's best friend when it comes to data imports and exports. It refers to how a file's character data is encoded when moving files between systems.

How do I open a CSV file with UTF-8 in Excel?

Open your CSV file in Microsoft Excel. Click File in the top-left corner of your screen. Click the drop-down menu next to File format. Select CSV UTF-8 (Comma delimited) (.


2 Answers

Excel Encodings

I found the WINDOWS-1252 encoding to be the least frustrating when dealing with Excel. Since its basically Microsofts own proprietary character set, one can assume it will work on both the Mac and the Windows version of MS-Excel. Both versions at least include a corresponding "File origin" or "File encoding" selector which correctly reads the data.

Depending on your system and the tools you use, this encoding could also be named CP1252, ANSI, Windows (ANSI), MS-ANSI or just Windows, among other variations.

This encoding is a superset of ISO-8859-1 (aka LATIN1 and others), so you can fallback to ISO-8859-1 if you cannot use WINDOWS-1252 for some reason. Be advised that ISO-8859-1 is missing some characters from WINDOWS-1252 as shown here:

| Char | ANSI | Unicode | ANSI Hex | Unicode Hex | HTML entity | Unicode Name                               | Unicode Range            | | €    | 128  | 8364    | 0x80     | U+20AC      | €      | euro sign                                  | Currency Symbols         | | ‚    | 130  | 8218    | 0x82     | U+201A      | ‚     | single low-9 quotation mark                | General Punctuation      | | ƒ    | 131  | 402     | 0x83     | U+0192      | ƒ      | Latin small letter f with hook             | Latin Extended-B         | | „    | 132  | 8222    | 0x84     | U+201E      | „     | double low-9 quotation mark                | General Punctuation      | | …    | 133  | 8230    | 0x85     | U+2026      | …    | horizontal ellipsis                        | General Punctuation      | | †    | 134  | 8224    | 0x86     | U+2020      | †    | dagger                                     | General Punctuation      | | ‡    | 135  | 8225    | 0x87     | U+2021      | ‡    | double dagger                              | General Punctuation      | | ˆ    | 136  | 710     | 0x88     | U+02C6      | ˆ      | modifier letter circumflex accent          | Spacing Modifier Letters | | ‰    | 137  | 8240    | 0x89     | U+2030      | ‰    | per mille sign                             | General Punctuation      | | Š    | 138  | 352     | 0x8A     | U+0160      | Š    | Latin capital letter S with caron          | Latin Extended-A         | | ‹    | 139  | 8249    | 0x8B     | U+2039      | ‹    | single left-pointing angle quotation mark  | General Punctuation      | | Œ    | 140  | 338     | 0x8C     | U+0152      | Œ     | Latin capital ligature OE                  | Latin Extended-A         | | Ž    | 142  | 381     | 0x8E     | U+017D      |             | Latin capital letter Z with caron          | Latin Extended-A         | | ‘    | 145  | 8216    | 0x91     | U+2018      | ‘     | left single quotation mark                 | General Punctuation      | | ’    | 146  | 8217    | 0x92     | U+2019      | ’     | right single quotation mark                | General Punctuation      | | “    | 147  | 8220    | 0x93     | U+201C      | “     | left double quotation mark                 | General Punctuation      | | ”    | 148  | 8221    | 0x94     | U+201D      | ”     | right double quotation mark                | General Punctuation      | | •    | 149  | 8226    | 0x95     | U+2022      | •      | bullet                                     | General Punctuation      | | –    | 150  | 8211    | 0x96     | U+2013      | –     | en dash                                    | General Punctuation      | | —    | 151  | 8212    | 0x97     | U+2014      | —     | em dash                                    | General Punctuation      | | ˜    | 152  | 732     | 0x98     | U+02DC      | ˜     | small tilde                                | Spacing Modifier Letters | | ™    | 153  | 8482    | 0x99     | U+2122      | ™     | trade mark sign                            | Letterlike Symbols       | | š    | 154  | 353     | 0x9A     | U+0161      | š    | Latin small letter s with caron            | Latin Extended-A         | | ›    | 155  | 8250    | 0x9B     | U+203A      | ›    | single right-pointing angle quotation mark | General Punctuation      | | œ    | 156  | 339     | 0x9C     | U+0153      | œ     | Latin small ligature oe                    | Latin Extended-A         | | ž    | 158  | 382     | 0x9E     | U+017E      |             | Latin small letter z with caron            | Latin Extended-A         | | Ÿ    | 159  | 376     | 0x9F     | U+0178      | Ÿ      | Latin capital letter Y with diaeresis      | Latin Extended-A         | 

Note that the euro sign is missing. This table can be found at Alan Wood.

Conversion

Conversion is done differently in every tool and language. However, suppose you have a file query_result.csv which you know is UTF-8 encoded. Convert it to WINDOWS-1252 using iconv:

iconv -f UTF-8 -t WINDOWS-1252 query_result.csv > query_result-win.csv 
like image 121
mikezter Avatar answered Sep 25 '22 04:09

mikezter


For UTF-16LE with BOM if you use tab characters as your delimiters instead of commas Excel will recognise the fields. The reason it works is that Excel actually ends up using its Unicode *.txt parser.

Caveat: If the file is edited in Excel and saved, it will be saved as tab-delimited ASCII. The problem now is that when you re-open the file Excel assumes it's real CSV (with commas), sees that it's not Unicode, so parses it as comma-delimited - and hence will make a hash of it!

Update: The above caveat doesn't appear to be happening for me today in Excel 2010 (Windows) at least, although there does appear to be a difference in saving behaviour if:

  • you edit and quit Excel (tries to save as 'Unicode *.txt')

compared to:

  • editing and closing just the file (works as expected).
like image 26
Duncan Smart Avatar answered Sep 25 '22 04:09

Duncan Smart