Problems with parsing UTF8 characters in request body?

Tags: node.js

When implementing HTTP services in node.js, there is a lot of sample code like the one below, used to get the whole request entity (data uploaded by the client, for example a POST with JSON data):

var http = require('http');

var server = http.createServer(function(req, res) {
    var data = '';
    req.setEncoding('utf8');

    req.on('data', function(chunk) {
        data += chunk;
    });

    req.on('end', function() {
        // parse data
    });
});

Using req.setEncoding('utf8') automatically decodes the input bytes into strings, assuming the input is UTF8-encoded. But I get the feeling that this can break. What if we receive a chunk of data that ends in the middle of a multi-byte UTF8 character? We can simulate this:

> new Buffer("café")
<Buffer 63 61 66 c3 a9>
> new Buffer("café").slice(0,4)
<Buffer 63 61 66 c3>
> new Buffer("café").slice(0,4).toString('utf8')
'caf�'

So we get a replacement character instead of waiting for the next bytes to properly decode the last character.

Therefore, unless the request object takes care of this, making sure that only completely decoded characters are pushed into chunks, this ubiquitous code sample is broken.
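
To make the concern concrete, here is what naive per-chunk decoding would do to that hand-made split (a REPL sketch; real chunk boundaries depend on the network and can fall anywhere):

var buf = new Buffer('café');  // <Buffer 63 61 66 c3 a9>
var first = buf.slice(0, 4);   // the boundary falls inside 'é'
var second = buf.slice(4);

// Decoding each chunk independently mangles the split character:
first.toString('utf8') + second.toString('utf8');  // 'caf��' instead of 'café'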

The alternative would be to use buffers, which means dealing with buffer size limits:

var http = require('http');
var MAX_REQUEST_BODY_SIZE = 16 * 1024 * 1024;

var server = http.createServer(function(req, res) {
    // A better way to do this could be to start with a small buffer
    // and grow it geometrically until the limit is reached
    // (a sketch of that appears below).
    var requestBody = new Buffer(MAX_REQUEST_BODY_SIZE);
    var requestBodyLength = 0;

    req.on('data', function(chunk) {
        // Once the limit has been hit, ignore the rest of the body;
        // otherwise a later, smaller chunk could still be copied in
        // and silently corrupt the data.
        if (res.statusCode == 413) {
            return;
        }
        if (requestBodyLength + chunk.length > MAX_REQUEST_BODY_SIZE) {
            res.statusCode = 413; // Request Entity Too Large
            return;
        }
        chunk.copy(requestBody, requestBodyLength, 0, chunk.length);
        requestBodyLength += chunk.length;
    });

    req.on('end', function() {
        if (res.statusCode == 413) {
            // handle 413 error
            return;
        }

        requestBody = requestBody.toString('utf8', 0, requestBodyLength);
        // process requestBody as string
    });
});
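
For completeness, the geometric growth mentioned in the comment could look something like this (a sketch under the same assumptions; the 16 KB initial size is an arbitrary choice):

var http = require('http');
var MAX_REQUEST_BODY_SIZE = 16 * 1024 * 1024;
var INITIAL_SIZE = 16 * 1024;

var server = http.createServer(function(req, res) {
    var requestBody = new Buffer(INITIAL_SIZE);
    var requestBodyLength = 0;

    req.on('data', function(chunk) {
        if (res.statusCode == 413) {
            return; // already over the limit, ignore the rest
        }
        if (requestBodyLength + chunk.length > MAX_REQUEST_BODY_SIZE) {
            res.statusCode = 413; // Request Entity Too Large
            return;
        }
        // Double the buffer until the chunk fits, capping at the limit.
        var size = requestBody.length;
        while (size < requestBodyLength + chunk.length) {
            size = Math.min(size * 2, MAX_REQUEST_BODY_SIZE);
        }
        if (size > requestBody.length) {
            var bigger = new Buffer(size);
            requestBody.copy(bigger, 0, 0, requestBodyLength);
            requestBody = bigger;
        }
        chunk.copy(requestBody, requestBodyLength);
        requestBodyLength += chunk.length;
    });

    req.on('end', function() {
        if (res.statusCode == 413) {
            res.end(); // handle 413 error
            return;
        }
        var body = requestBody.toString('utf8', 0, requestBodyLength);
        // process body as string
        res.end();
    });
});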

Am I right, or is this already taken care of by the HTTP request class?

asked Jan 28 '12 by Nicolas Lehuen

2 Answers

This is taken care of automatically. There is a string_decoder module in node which is loaded when you call setEncoding. The decoder checks the last few bytes received and stores them between emits of 'data' if they do not form a complete character, so 'data' always gets a correct string. If you do not call setEncoding, and don't use string_decoder yourself, then the buffers emitted can have the issue you mentioned, though.

The docs aren't much help (http://nodejs.org/docs/latest/api/string_decoder.html), but you can see the module itself here: https://github.com/joyent/node/blob/master/lib/string_decoder.js

The implementation of setEncoding and the logic for emitting also make it clearer:

  • setEncoding: https://github.com/joyent/node/blob/master/lib/http.js#L270
  • _emitData: https://github.com/joyent/node/blob/master/lib/http.js#L306
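
You can check the buffering behaviour from the REPL; the decoder holds back trailing bytes that do not yet form a complete character (same hand-made split as in the question):

var StringDecoder = require('string_decoder').StringDecoder;
var decoder = new StringDecoder('utf8');

var buf = new Buffer('café');     // <Buffer 63 61 66 c3 a9>

decoder.write(buf.slice(0, 4));   // 'caf' -- the trailing 0xc3 is held back
decoder.write(buf.slice(4));      // 'é'   -- emitted once 0xa9 completes it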
answered Oct 22 '22 by loganfsmyth

Just add response.setEncoding('utf8'); to the request.on('response') callback function. In my case that was sufficient.
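
For context, that is the client side of the HTTP API; a minimal sketch (the host is a placeholder):

var http = require('http');

var request = http.get({ host: 'example.com', path: '/' });

request.on('response', function(response) {
    response.setEncoding('utf8'); // chunks now arrive as correctly decoded strings
    var body = '';
    response.on('data', function(chunk) { body += chunk; });
    response.on('end', function() { console.log(body); });
});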

answered Oct 22 '22 by seukim