
node.js readfile error with utf8 encoded file on windows

I'm trying to load a UTF-8 JSON file from disk using node.js (0.10.29) on Windows 8.1. The following is the code that runs:

var http = require('http');
var utils = require('util');
var path = require('path');
var fs = require('fs');

var myconfig;
fs.readFile('./myconfig.json', 'utf8', function (err, data) {
    if (err) {
        console.log("ERROR: Configuration load - " + err);
        throw err;
    } else {
        try {
            myconfig = JSON.parse(data);
            console.log("Configuration loaded successfully");
        }
        catch (ex) {
            console.log("ERROR: Configuration parse - " + err);
        }


    }
});

I get the following error when I run this:

SyntaxError: Unexpected token ´╗┐
    at Object.parse (native)
    ...

Now, when I change the file encoding (using Notepad++) to ANSI, it works without a problem.

Any ideas why this is the case? Whilst development is being done on Windows, the final solution will be deployed to a variety of non-Windows servers, so I'm worried that I'll run into issues on the server end if I deploy an ANSI-encoded file to Linux, for example.

According to my searches here and via Google the code should work on Windows as I am specifically telling it to expect a UTF-8 file.

Sample config I am reading:

{
    "ListenIP4": "10.10.1.1",
    "ListenPort": 8080
}
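
For reference, a quick way to see what the file actually starts with is to dump the first few raw bytes; a small sketch, assuming the same ./myconfig.json path:

var fs = require('fs');

// Read the raw bytes (no encoding argument) and print the first three in hex;
// the UTF-8 save from Notepad++ shows "efbbbf" here, the ANSI save does not.
var buf = fs.readFileSync('./myconfig.json');
console.log(buf.slice(0, 3).toString('hex'));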
asked Jun 22 '14 by Dominik

3 Answers

Per "fs.readFileSync(filename, 'utf8') doesn't strip BOM markers #1918", fs.readFile is working as designed: BOM is not stripped from the header of the UTF-8 file, if it exists. It at the discretion of the developer to handle this.

Possible workarounds:

  • data = data.replace(/^\uFEFF/, ''); per https://github.com/joyent/node/issues/1918#issuecomment-2480359
  • Transform the incoming stream to remove the BOM header with the npm module bomstrip, per https://github.com/joyent/node/issues/1918#issuecomment-38491548 (see the stream sketch after the code below)

What you are getting is the byte order mark (BOM) at the start of the UTF-8 file. When JSON.parse sees this, it throws a syntax error (read: an "unexpected character" error). You must strip the byte order mark from the data before passing it to JSON.parse:

fs.readFile('./myconfig.json', 'utf8', function (err, data) {
    // data is already a string here (not a Buffer), because the 'utf8' encoding
    // was supplied; a UTF-8 BOM decodes to the single character U+FEFF, which
    // the regex below removes before parsing
    myconfig = JSON.parse(data.replace(/^\uFEFF/, ''));
});
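
For the stream-based workaround, here is a hand-rolled sketch of the idea (this is not the bomstrip module's actual API, just an illustration, and it uses the simplified stream constructor available in Node 4+ rather than the subclassing style needed on 0.10):

var fs = require('fs');
var stream = require('stream');

// Transform stream that drops a leading UTF-8 BOM (the bytes EF BB BF) from
// the first chunk and passes everything else through untouched.
function createBomStripper() {
    var checked = false;
    return new stream.Transform({
        transform: function (chunk, encoding, callback) {
            if (!checked) {
                checked = true;
                if (chunk.length >= 3 &&
                    chunk[0] === 0xEF && chunk[1] === 0xBB && chunk[2] === 0xBF) {
                    chunk = chunk.slice(3);
                }
            }
            callback(null, chunk);
        }
    });
}

// usage: pipe the file through the stripper before collecting and parsing it
fs.createReadStream('./myconfig.json').pipe(createBomStripper());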
answered by zamnuts

To get this to work I had to change the encoding from "UTF-8" to "UTF-8 without BOM" using Notepad++ (I assume any decent text editor - not Notepad - has the ability to choose this encoding type).

This solution meant that the deployment guys could deploy to Unix without a hassle, and I could develop without errors during the reading of the file.
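
If you want an automated guard against a BOM sneaking back in before deployment, a small check script along these lines could be run (the file list here is just an example):

var fs = require('fs');

// Hypothetical list of config files to verify before deployment
var files = ['./myconfig.json'];
var bomFound = false;

files.forEach(function (file) {
    var buf = fs.readFileSync(file);
    if (buf.length >= 3 && buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF) {
        console.error('BOM found in ' + file + ' - re-save as UTF-8 without BOM');
        bomFound = true;
    }
});

if (bomFound) {
    process.exit(1); // fail the build/deploy step
}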

In terms of reading the file, the other symptom I sometimes saw in my travels, when trying various encoding options, was a question mark prepended to the start of the file contents. Naturally, with a question mark or stray ANSI characters prepended, JSON.parse fails.

Hope this helps someone!

answered by Dominik

New answer
As I had the same problem with several different formats, I went ahead and made an npm package that tries to read text files and parse them as text, no matter the original encoding (as the original question was about reading a .json file, it fits perfectly). Files without a BOM, or with an unknown BOM, are handled as ASCII/latin1.

https://www.npmjs.com/package/textfilereader

So change the code to

var http = require('http');
var utils = require('util');
var path = require('path');
var fs = require('textfilereader');

var myconfig;
fs.readFile('./myconfig.json', 'utf8', function (err, data) {
    if (err) {
        console.log("ERROR: Configuration load - " + err);
        throw err;
    } else {
        try {
            myconfig = JSON.parse(data);
            console.log("Configuration loaded successfully");
        }
        catch (ex) {
            console.log("ERROR: Configuration parse - " + err);
        }
    }
});

Old answer

Ran into this problem today and created a function to take care of it. It should have a very small footprint, and I assume it's better than the accepted replace solution.

function removeBom(input) {
  // Once the file has been decoded to a string (e.g. by readFileSync with 'utf8'),
  // any BOM that survives decoding shows up as a single leading code unit:
  // U+FEFF (a UTF-8 BOM, or a correctly decoded UTF-16/UTF-32 BOM) or
  // U+FFFE (a byte-swapped UTF-16 BOM).
  // The raw byte sequences for all encodings are listed on
  // https://en.wikipedia.org/wiki/Byte_order_mark, but multi-byte signatures
  // can never match here, since charCodeAt(0) yields a single 16-bit code unit.
  const fc = input.charCodeAt(0).toString(16);
  switch (fc) {
    case 'feff': // BOM from UTF-8 / UTF-16 (BE) / UTF-32 (BE), after decoding
    case 'fffe': // byte-swapped BOM from UTF-16 (LE)
      return input.slice(1); // drop the single BOM code unit
    default:
      return input;
  }
}

// readFileSync with the 'utf8' encoding returns a string, not a Buffer
const fileContents = removeBom(fs.readFileSync(filePath, "utf8"));
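
Applied to the original question's config loading, a minimal usage sketch (assuming the removeBom function above and the asker's ./myconfig.json path) might look like this:

const fs = require('fs');

fs.readFile('./myconfig.json', 'utf8', function (err, data) {
    if (err) throw err;
    // strip any BOM before handing the text to JSON.parse
    const myconfig = JSON.parse(removeBom(data));
    console.log('Configuration loaded successfully', myconfig);
});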
answered by Griffin