Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Microsoft Excel mangles Diacritics in .csv files?

People also ask

Can a CSV file have special characters?

So if you're working with CSV, what special characters are actually supported? Technically, all of them! There are no specified limits of what characters can be used in a CSV file.


A correctly formatted UTF8 file can have a Byte Order Mark as its first three octets. These are the hex values 0xEF, 0xBB, 0xBF. These octets serve to mark the file as UTF8 (since they are not relevant as "byte order" information).1 If this BOM does not exist, the consumer/reader is left to infer the encoding type of the text. Readers that are not UTF8 capable will read the bytes as some other encoding such as Windows-1252 and display the characters  at the start of the file.

There is a known bug where Excel, upon opening UTF8 CSV files via file association, assumes that they are in a single-byte encoding, disregarding the presence of the UTF8 BOM. This can not be fixed by any system default codepage or language setting. The BOM will not clue in Excel - it just won't work. (A minority report claims that the BOM sometimes triggers the "Import Text" wizard.) This bug appears to exist in Excel 2003 and earlier. Most reports (amidst the answers here) say that this is fixed in Excel 2007 and newer.

Note that you can always* correctly open UTF8 CSV files in Excel using the "Import Text" wizard, which allows you to specify the encoding of the file you're opening. Of course this is much less convenient.

Readers of this answer are most likely in a situation where they don't particularly support Excel < 2007, but are sending raw UTF8 text to Excel, which is misinterpreting it and sprinkling your text with à and other similar Windows-1252 characters. Adding the UTF8 BOM is probably your best and quickest fix.

If you are stuck with users on older Excels, and Excel is the only consumer of your CSVs, you can work around this by exporting UTF16 instead of UTF8. Excel 2000 and 2003 will double-click-open these correctly. (Some other text editors can have issues with UTF16, so you may have to weigh your options carefully.)


* Except when you can't, (at least) Excel 2011 for Mac's Import Wizard does not actually always work with all encodings, regardless of what you tell it. </anecdotal-evidence> :)


Prepending a BOM (\uFEFF) worked for me (Excel 2007), in that Excel recognised the file as UTF-8. Otherwise, saving it and using the import wizard works, but is less ideal.


Below is the PHP code I use in my project when sending Microsoft Excel to user:

  /**
   * Export an array as downladable Excel CSV
   * @param array   $header
   * @param array   $data
   * @param string  $filename
   */
  function toCSV($header, $data, $filename) {
    $sep  = "\t";
    $eol  = "\n";
    $csv  =  count($header) ? '"'. implode('"'.$sep.'"', $header).'"'.$eol : '';
    foreach($data as $line) {
      $csv .= '"'. implode('"'.$sep.'"', $line).'"'.$eol;
    }
    $encoded_csv = mb_convert_encoding($csv, 'UTF-16LE', 'UTF-8');
    header('Content-Description: File Transfer');
    header('Content-Type: application/vnd.ms-excel');
    header('Content-Disposition: attachment; filename="'.$filename.'.csv"');
    header('Content-Transfer-Encoding: binary');
    header('Expires: 0');
    header('Cache-Control: must-revalidate, post-check=0, pre-check=0');
    header('Pragma: public');
    header('Content-Length: '. strlen($encoded_csv));
    echo chr(255) . chr(254) . $encoded_csv;
    exit;
  }

UPDATED: Filename improvement and BUG fix correct length calculation. Thanks to TRiG and @ivanhoe011


The answer for all combinations of Excel versions (2003 + 2007) and file types

Most other answers here concern their Excel version only and will not necessarily help you, because their answer just might not be true for your version of Excel.

For example, adding the BOM character introduces problems with automatic column separator recognition, but not with every Excel version.

There are 3 variables that determines if it works in most Excel versions:

  • Encoding
  • BOM character presence
  • Cell separator

Somebody stoic at SAP tried every combination and reported the outcome. End result? Use UTF16le with BOM and tab character as separator to have it work in most Excel versions.

You don't believe me? I wouldn't either, but read here and weep: http://wiki.sdn.sap.com/wiki/display/ABAP/CSV+tests+of+encoding+and+column+separator


select UTF-8 enconding when importing. if you use Office 2007 this is where you chose it : right after you open the file.


Echo UTF-8 BOM before outputing CSV data. This fixes all character issues in Windows but doesnt work for Mac.

echo "\xEF\xBB\xBF";

It works for me because I need to generate a file which will be used on Windows PCs only.


UTF-8 doesn't work for me in office 2007 without any service pack, with or without BOM (U+ffef or 0xEF,0xBB,0xBF , neither works) installing sp3 makes UTF-8 work when 0xEF,0xBB,0xBF BOM is prepended.

UTF-16 works when encoding in python using "utf-16-le" with a 0xff 0xef BOM prepended, and using tab as seperator. I had to manually write out the BOM, and then use "utf-16-le" rather then "utf-16", otherwise each encode() prepended the BOM to every row written out which appeared as garbage on the first column of the second line and after.

can't tell whether UTF-16 would work without any sp installed, since I can't go back now. sigh

This is on windows, dunno about office for MAC.

for both working cases, the import works when launching a download directly from the browser and the text import wizard doesn't intervence, it works like you would expect.