Encoding issues for UTF8 CSV file when opening Excel and TextEdit

Tags:

javascript

csv

excel

encoding

utf-8

I recently added a CSV-download button that takes data from the database (Postgres), passes it as an array from the server (Ruby on Rails), and turns it into a CSV file on the client side (JavaScript, HTML5). I'm currently testing the CSV file and I am running into some encoding issues.

When I view the CSV file via 'less', the file appears fine. But when I open the file in Excel OR TextEdit, I start seeing weird characters like

â€”, â€, â€œ

appear in the text. Basically, I see the characters that are described here: http://digwp.com/2011/07/clean-up-weird-characters-in-database/

I read that this sort of issue can arise when the database encoding is set to the wrong one. BUT the database that I am using is set to use UTF8 encoding. And when I step through the JS code that creates the CSV file, the text appears normal. (This could just be down to what Chrome and less are able to display.)

I'm feeling frustrated because the only thing I am learning from my online search is that there could be many reasons why encoding is not working, I'm not sure which part is at fault (so excuse me as I initially tag numerous things), and nothing I tried has shed new light on my problem.

For reference, here's the JavaScript snippet that creates the CSV file!

    $(document).ready(function() {
        var csvData = <%= raw to_csv(@view_scope, clicks_post).as_json %>;
        var csvContent = "data:text/csv;charset=utf-8,";
        csvData.forEach(function(infoArray, index){
            var dataString = infoArray.join(",");
            csvContent += dataString+ "\n";
        });

        var encodedUri = encodeURI(csvContent);
        var button = $('<a>');
        button.text('Download CSV');
        button.addClass("button right");
        button.attr('href', encodedUri);
        button.attr('target','_blank');
        button.attr('download','<%=title%>_25_posts.csv');
        $("#<%=title%>_download_action").append(button);
    });

748

asked Jan 24 '14 21:01

Ji Mun

1 Answer

Since @jlarson updated with the information that Mac was the biggest culprit, we might get somewhat further. Office for Mac has, at least in 2011 and earlier versions, rather poor support for reading Unicode formats when importing files.

Support for UTF-8 seems to be close to non-existent. I have read a few comments about it working, whilst the majority say it does not. Unfortunately I do not have a Mac to test on. So again: the files themselves should be OK as UTF-8, but the import is where it goes wrong.

I wrote up a quick test in JavaScript for exporting percent-escaped UTF-16, little and big endian, with or without BOM, etc.

The code should probably be refactored but should be OK for testing. It might work better than UTF-8. Of course this also usually means bigger data transfers, as every character takes two or four bytes.
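To get a feel for the size difference, here is a small sketch that counts the bytes for one sample row in both encodings. It is not part of the fiddle; the helper name byteSizes is made up for illustration, and it assumes an environment where TextEncoder is available (a current browser or Node.js).

```javascript
// Count the bytes a string would occupy as UTF-8 and as UTF-16.
function byteSizes(text) {
    return {
        // TextEncoder always encodes to UTF-8.
        utf8: new TextEncoder().encode(text).length,
        // JavaScript strings are sequences of 16-bit code units,
        // so UTF-16 needs two bytes per unit (BOM not counted).
        utf16: text.length * 2
    };
}

// Sample row: four non-ASCII characters followed by plain ASCII.
const sizes = byteSizes('\u0500\u05E1\u0E01\u1054,seattle,washington');
console.log(sizes);
```

For mostly-ASCII data like this, UTF-16 comes close to doubling the payload; for text dominated by characters that need three bytes in UTF-8 the balance can tip the other way.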

You can find a fiddle here:

Unicode export sample Fiddle

Note that it does not handle CSV in any particular way. It is mainly meant for pure conversion to a data URL in UTF-8 or UTF-16 big/little endian, with or without BOM. There is one option in the fiddle to replace commas with tabs, but I believe that would be a rather hackish and fragile solution, if it works at all.
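Since neither the fiddle nor a plain join(",") handles CSV quoting, here is a minimal sketch of what that step could look like before the data is encoded. The function names toCsvField and toCsvRow are hypothetical, and the quoting rule (wrap a field in double quotes when it contains a comma, quote or line break, and double any embedded quotes) follows the common RFC 4180 convention rather than anything in the original code.

```javascript
// Quote a single field if it contains a delimiter, quote or line break.
function toCsvField(value) {
    const text = String(value);
    if (/[",\r\n]/.test(text)) {
        // Embedded quotes are escaped by doubling them.
        return '"' + text.replace(/"/g, '""') + '"';
    }
    return text;
}

// Join one array of values into a CSV line (without the trailing newline).
function toCsvRow(fields) {
    return fields.map(toCsvField).join(',');
}

console.log(toCsvRow(['a', 'b,c', 'say "hi"']));
```

This is independent of the encoding question, but it keeps a comma inside a field from shifting the columns once the file does open correctly.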

Typical usage:

    // Initiate
    var encoder = new DataEnc({
        mime   : 'text/csv',
        charset: 'UTF-16BE',
        bom    : true
    });

    // Convert data to percent escaped text
    encoder.enc(data);

    // Get result
    var result = encoder.pay();

There are two result properties on the object:

1.) encoder.lead

This is the MIME type, charset etc. for the data URL. It is built from the options passed to the initializer; one can also call .config({ ... new conf ...}).intro() to rebuild it.

    data:[<MIME-type>][;charset=<encoding>][;base64]

You can specify base64, but there is no base64 conversion (at least not so far).

2.) encoder.buf

This is a string with the percent escaped data.

The .pay() function simply returns 1.) and 2.) as one string.

Main code:

    function DataEnc(a) {
        this.config(a);
        this.intro();
    }
    /*
     * http://www.iana.org/assignments/character-sets/character-sets.xhtml
     */
    DataEnc._enctype = {
        u8    : ['u8', 'utf8'],
        // RFC-2781, Big endian should be presumed if none given
        u16be : ['u16', 'u16be', 'utf16', 'utf16be', 'ucs2', 'ucs2be'],
        u16le : ['u16le', 'utf16le', 'ucs2le']
    };
    DataEnc._BOM = {
        'none'     : '',
        'UTF-8'    : '%ef%bb%bf', // Discouraged
        'UTF-16BE' : '%fe%ff',
        'UTF-16LE' : '%ff%fe'
    };
    DataEnc.prototype = {
        // Basic setup
        config : function(a) {
            var opt = {
                charset: 'u8',
                mime   : 'text/csv',
                base64 : 0,
                bom    : 0
            };
            a = a || {};
            this.charset = typeof a.charset !== 'undefined' ?
                            a.charset : opt.charset;
            this.base64 = typeof a.base64 !== 'undefined' ? a.base64 : opt.base64;
            this.mime = typeof a.mime !== 'undefined' ? a.mime : opt.mime;
            this.bom = typeof a.bom !== 'undefined' ? a.bom : opt.bom;

            this.enc = this.utf8;
            this.buf = '';
            this.lead = '';
            return this;
        },
        // Create lead based on config
        // data:[<MIME-type>][;charset=<encoding>][;base64],<data>
        intro : function() {
            var
                g = [],
                c = this.charset || '',
                b = 'none'
            ;
            if (this.mime && this.mime !== '')
                g.push(this.mime);
            if (c !== '') {
                // Normalize e.g. 'UTF-16 BE' to 'utf16be' before the lookup
                c = c.replace(/[-\s]/g, '').toLowerCase();
                if (DataEnc._enctype.u8.indexOf(c) > -1) {
                    c = 'UTF-8';
                    if (this.bom)
                        b = c;
                    this.enc = this.utf8;
                } else if (DataEnc._enctype.u16be.indexOf(c) > -1) {
                    c = 'UTF-16BE';
                    if (this.bom)
                        b = c;
                    this.enc = this.utf16be;
                } else if (DataEnc._enctype.u16le.indexOf(c) > -1) {
                    c = 'UTF-16LE';
                    if (this.bom)
                        b = c;
                    this.enc = this.utf16le;
                } else {
                    // Unknown charset: pass the data through untouched
                    if (c === 'copy')
                        c = '';
                    this.enc = this.copy;
                }
            }
            if (c !== '')
                g.push('charset=' + c);
            if (this.base64)
                g.push('base64');
            this.lead = 'data:' + g.join(';') + ',' + DataEnc._BOM[b];
            return this;
        },
        // Deliver
        pay : function() {
            return this.lead + this.buf;
        },
        // UTF-16BE
        utf16be : function(t) { // U+0500 => %05%00
            var i, c, buf = [];
            for (i = 0; i < t.length; ++i) {
                if ((c = t.charCodeAt(i)) > 0xff) {
                    buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
                    buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                } else {
                    buf.push('00');
                    buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                }
            }
            // Skip empty input so that no stray '%' is appended
            if (buf.length)
                this.buf += '%' + buf.join('%');
            // Note the hex array is returned, not string with '%'
            // Might be useful if one wants to loop over the data.
            return buf;
        },
        // UTF-16LE
        utf16le : function(t) { // U+0500 => %00%05
            var i, c, buf = [];
            for (i = 0; i < t.length; ++i) {
                if ((c = t.charCodeAt(i)) > 0xff) {
                    buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                    buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
                } else {
                    buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                    buf.push('00');
                }
            }
            // Skip empty input so that no stray '%' is appended
            if (buf.length)
                this.buf += '%' + buf.join('%');
            // Note the hex array is returned, not string with '%'
            // Might be useful if one wants to loop over the data.
            return buf;
        },
        // UTF-8
        utf8 : function(t) {
            this.buf += encodeURIComponent(t);
            return this;
        },
        // Direct copy
        copy : function(t) {
            this.buf += t;
            return this;
        }
    };

Previous answer:

I do not have any setup to replicate yours, but if your case is the same as @jlarson's then the resulting file should be correct.

This answer became somewhat long (fun topic, you say?), but it discusses various aspects around the question: what is (likely) happening, and how to actually check what is going on in various ways.

TL;DR:

The text is likely imported as ISO-8859-1, Windows-1252, or the like, and not as UTF-8. Force the application to read the file as UTF-8 by using import or other means.
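One possible "other means", offered as an untested sketch rather than a confirmed fix, is to prefix the data with a UTF-8 byte order mark so that applications which sniff for it get a hint. The helper name csvDataUrlWithBom is hypothetical, and whether a given Excel or TextEdit version honours the BOM is an assumption that has to be checked on the target machine.

```javascript
// Build a data URL whose payload starts with the UTF-8 BOM (EF BB BF).
function csvDataUrlWithBom(csvText) {
    const lead = 'data:text/csv;charset=utf-8,';
    const bom = '%EF%BB%BF';
    // encodeURIComponent percent-escapes the text as UTF-8 bytes.
    return lead + bom + encodeURIComponent(csvText);
}

console.log(csvDataUrlWithBom('a,b\n'));
```

The resulting string can be used as the href of a download link in the same way as the URL built in the question.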

PS: The UniSearcher is a nice tool to have available on this journey.

The long way around

The "easiest" way to be 100% sure what we are looking at is to use a hex-editor on the result. Alternatively use hexdump, xxd or the like from command line to view the file. In this case the byte sequence should be that of UTF-8 as delivered from the script.

As an example, jlarson's script takes the data array:

    data = [['name', 'city', 'state'],
            ['\u0500\u05E1\u0E01\u1054', 'seattle', 'washington']];

This one is merged into the string:

    name,city,state<newline>
    \u0500\u05E1\u0E01\u1054,seattle,washington<newline>

which translates by Unicode to:

    name,city,state<newline>
    Ԁסกၔ,seattle,washington<newline>

As UTF-8 uses ASCII as base (bytes with highest bit not set are the same as in ASCII) the only special sequence in the test data is "Ԁסกၔ" which in turn, is:

    Code-point  Glyph      UTF-8
    ----------------------------
        U+0500    Ԁ        d4 80
        U+05E1    ס        d7 a1
        U+0E01    ก     e0 b8 81
        U+1054    ၔ     e1 81 94
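The table can be reproduced in a console. This is a minimal sketch, assuming TextEncoder is available (current browsers and Node.js); the helper name utf8Hex is made up for illustration.

```javascript
// Return the UTF-8 bytes of a string as space-separated lowercase hex.
function utf8Hex(text) {
    const bytes = new TextEncoder().encode(text);
    return Array.from(bytes, byte => byte.toString(16).padStart(2, '0')).join(' ');
}

// Print one line per character of the test string.
for (const glyph of '\u0500\u05E1\u0E01\u1054') {
    const codePoint = glyph.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
    console.log('U+' + codePoint, utf8Hex(glyph));
}
```

Each printed line should match the corresponding row of the table, which makes it easy to check other characters the same way.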

Looking at the hex-dump of the downloaded file:

    0000000: 6e61 6d65 2c63 6974 792c 7374 6174 650a  name,city,state.
    0000010: d480 d7a1 e0b8 81e1 8194 2c73 6561 7474  ..........,seatt
    0000020: 6c65 2c77 6173 6869 6e67 746f 6e0a       le,washington.

On the second line we find d480 d7a1 e0b8 81e1 8194, which matches the above:

    0000010: d480  d7a1  e0b8 81  e1 8194 2c73 6561 7474  ..........,seatt
             |   | |   | |     |  |     |  | |  | |  | |
             +-+-+ +-+-+ +--+--+  +--+--+  | |  | |  | |
               |     |      |        |     | |  | |  | |
               Ԁ     ס      ก        ၔ     , s  e a  t t

None of the other characters is mangled either.

Do similar tests if you want. The result should be similar.

By sample provided `â€”, â€, â€œ`

We can also have a look at the sample provided in the question. It is reasonable to assume that Excel / TextEdit interpret the text as code page 1252.

To quote Wikipedia on Windows-1252:

Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages. In LaTeX packages, it is referred to as "ansinew".

Retrieving the original bytes

To translate it back into its original form we can look at the code page layout, from which we get:

    Character:   <â>  <€>  <”>  <,>  < >  <â>  <€>  < >  <,>  < >  <â>  <€>  <œ>
    U.Hex    :    e2 20ac 201d   2c   20   e2 20ac   9d   2c   20   e2 20ac  153
    T.Hex    :    e2   80   94   2c   20   e2   80   9d*  2c   20   e2   80   9c

U is short for Unicode
T is short for Translated

For example:

    â => Unicode 0xe2   => CP-1252 0xe2
    ” => Unicode 0x201d => CP-1252 0x94
    € => Unicode 0x20ac => CP-1252 0x80

Special cases such as 9d (marked * in the table) have no corresponding character in CP-1252; these we simply copy directly.

Note: If one looks at the mangled string by copying the text to a file and doing a hex-dump, save the file with, for example, UTF-16 encoding to get the Unicode values as represented in the table. E.g. in Vim:

    set fenc=utf-16
    " Or
    set fenc=ucs-2

Bytes to UTF-8

We then combine the result, the T.Hex line, into UTF-8. In a UTF-8 sequence the leading byte tells us how many bytes make up the character. For example, if a byte has the binary value 110x xxxx we know that this byte and the next represent one code point: a total of two bytes. 1110 xxxx tells us it is three, and so on. ASCII values do not have the high bit set, so any byte matching 0xxx xxxx stands alone: a total of one byte.

    0xe2 = 1110 0010_bin => 3 bytes => 0xe28094 (em-dash)  —
    0x2c = 0010 1100_bin => 1 byte  => 0x2c     (comma)    ,
    0x20 = 0010 0000_bin => 1 byte  => 0x20     (space)
    0xe2 = 1110 0010_bin => 3 bytes => 0xe2809d (right-dq) ”
    0x2c = 0010 1100_bin => 1 byte  => 0x2c     (comma)    ,
    0x20 = 0010 0000_bin => 1 byte  => 0x20     (space)
    0xe2 = 1110 0010_bin => 3 bytes => 0xe2809c (left-dq)  “
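The lead-byte rule used in the listing above can be written down as a short sketch. The function names utf8SequenceLength and splitUtf8 are made up for illustration, and the code only groups bytes; it does not validate that the continuation bytes are well-formed.

```javascript
// Number of bytes in the UTF-8 sequence that starts with leadByte.
// Returns 0 for a continuation byte (10xx xxxx) or an invalid lead byte.
function utf8SequenceLength(leadByte) {
    if ((leadByte & 0x80) === 0x00) return 1; // 0xxx xxxx
    if ((leadByte & 0xe0) === 0xc0) return 2; // 110x xxxx
    if ((leadByte & 0xf0) === 0xe0) return 3; // 1110 xxxx
    if ((leadByte & 0xf8) === 0xf0) return 4; // 1111 0xxx
    return 0;
}

// Split an array of bytes into one group per character.
function splitUtf8(bytes) {
    const groups = [];
    let i = 0;
    while (i < bytes.length) {
        // A stray byte is kept as its own one-byte group.
        const length = utf8SequenceLength(bytes[i]) || 1;
        groups.push(bytes.slice(i, i + length));
        i += length;
    }
    return groups;
}

// The T.Hex line from the table.
const tHex = [0xe2, 0x80, 0x94, 0x2c, 0x20, 0xe2, 0x80, 0x9d, 0x2c, 0x20, 0xe2, 0x80, 0x9c];
console.log(splitUtf8(tHex).map(group => group.length));
```

The group lengths printed for the T.Hex line correspond to the "3 bytes" / "1 byte" column of the listing.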

Conclusion: the original UTF-8 string was:

—, ”, “

Mangling it back

We can also do the reverse. The original string as bytes:

UTF-8: e2 80 94 2c 20 e2 80 9d 2c 20 e2 80 9c

Corresponding values in cp-1252:

    e2 => â
    80 => €
    94 => ”
    2c => ,
    20 => <space>
    ...

and so on, result:

â€”, â€, â€œ
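Both directions can be sketched in JavaScript. This is an illustration under stated assumptions, not something from the fiddle: the names CP1252_HIGH, cp1252Byte, cp1252Char, demangle and mangle are made up, the table holds the CP-1252 assignments for bytes 0x80 to 0x9f (0 marks the five unassigned bytes, which are copied directly as described above), and TextEncoder / TextDecoder are assumed to be available.

```javascript
// Code points CP-1252 assigns to bytes 0x80..0x9f; 0 = unassigned byte.
const CP1252_HIGH = [
    0x20ac, 0,      0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
    0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017d, 0,
    0,      0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
    0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0,      0x017e, 0x0178
];

// Character as displayed -> the byte a CP-1252 reader must have seen.
function cp1252Byte(char) {
    const code = char.charCodeAt(0);
    if (code <= 0xff) return code; // ASCII, Latin-1 and directly copied bytes
    const index = CP1252_HIGH.indexOf(code);
    if (index === -1) throw new RangeError('Not a CP-1252 character: ' + char);
    return 0x80 + index;
}

// Byte -> the character a CP-1252 reader displays for it.
function cp1252Char(byte) {
    if (byte < 0x80 || byte > 0x9f) return String.fromCharCode(byte);
    return String.fromCharCode(CP1252_HIGH[byte - 0x80] || byte);
}

// Mangled text -> original text: recover the bytes, then read them as UTF-8.
function demangle(mangled) {
    const bytes = Uint8Array.from(mangled, cp1252Byte);
    return new TextDecoder('utf-8').decode(bytes);
}

// Original text -> mangled text: take the UTF-8 bytes, read them as CP-1252.
function mangle(text) {
    const bytes = new TextEncoder().encode(text);
    return Array.from(bytes, cp1252Char).join('');
}

const original = '\u2014, \u201d, \u201c'; // em-dash, right and left double quote
console.log(mangle(original));
console.log(demangle(mangle(original)) === original);
```

If demangle turns a suspicious string back into sensible text, that is a strong hint that the file itself is valid UTF-8 and only the reader picked the wrong encoding.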

Importing to MS Excel

In other words: The issue at hand could be how to import UTF-8 text files into MS Excel, and some other applications. In Excel this can be done in various ways.

Method one:

Do not save the file with an extension recognized by the application, like .csv, or .txt, but omit it completely or make something up.

As an example save the file as "testfile", with no extension. Then in Excel open the file, confirm that we actually want to open this file, and voilà we get served with the encoding option. Select UTF-8, and file should be correctly read.

Method two:

Use import data instead of open file. Something like:

Data -> Import External Data -> Import Data

Select encoding and proceed.

Check that Excel and selected font actually supports the glyph

We can also test the font support for the Unicode characters by using the, sometimes, friendlier clipboard. For example, copy text from this page into Excel:

page with code points U+0E00 to U+0EFF

If support for the code points exists, the text should render fine.

Linux

On Linux, which is primarily UTF-8 in userland, this should not be an issue. LibreOffice Calc, Vim, etc. show the files correctly rendered.

Why it works (or should)

The spec says this about encodeURI (also read sec-15.1.3):

The encodeURI function computes a new version of a URI in which each instance of certain characters is replaced by one, two, three, or four escape sequences representing the UTF-8 encoding of the character.

We can simply test this in our console by, for example, saying:

    >> encodeURI('Ԁסกၔ,seattle,washington')
    << "%D4%80%D7%A1%E0%B8%81%E1%81%94,seattle,washington"

As we can see, the escape sequences are equal to the ones in the hex dump above:

    %D4%80%D7%A1%E0%B8%81%E1%81%94 (encodeURI in log)
     d4 80 d7 a1 e0 b8 81 e1 81 94 (hex-dump of file)

or, testing a 4-byte code:

    >> encodeURI('󱀁')
    << "%F3%B1%80%81"

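One caveat is worth a small sketch, because the question runs encodeURI over the whole data URL while the code in this answer uses encodeURIComponent on the data. Both escape non-ASCII text as UTF-8, but encodeURI leaves reserved characters such as "#" and "," alone. The sample row below is made up, and the URL constructor is assumed to be available (browsers and Node.js).

```javascript
// A made-up row containing a '#', a ',' and one non-ASCII character.
const row = 'tag #1,\u0500';

const loose = encodeURI(row);           // reserved characters stay as they are
const strict = encodeURIComponent(row); // reserved characters are escaped too

console.log(loose);
console.log(strict);

// In a URL, '#' starts the fragment, so the data is cut off at that point.
const parsed = new URL('data:text/csv,' + loose);
console.log(parsed.pathname, parsed.hash);
```

So if the CSV data can contain a "#", escaping the data part with encodeURIComponent and adding the "data:..." prefix afterwards looks like the safer choice.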
If this does not apply

If none of this applies, it would help if you added:

Sample of expected input vs mangled output, (copy paste).
Sample hex-dump of original data vs result file.

112

answered Oct 14 '22 13:10

user13500
