Problems with parsing UTF8 characters in request body?

Tags: node.js

When implementing HTTP services in node.js, there is a lot of sample code like the one below, used to get the whole request entity (data uploaded by the client, for example a POST with JSON data):

var http = require('http');

var server = http.createServer(function(req, res) {
    var data = '';
    req.setEncoding('utf8');

    req.on('data', function(chunk) {
        data += chunk;
    });

    req.on('end', function() {
        // parse data
    });
});

Using req.setEncoding('utf8') automatically decodes the input bytes into strings, assuming the input is UTF8-encoded. But I get the feeling that this can break. What if we receive a chunk of data that ends in the middle of a multi-byte UTF8 character? We can simulate this:

> new Buffer("café")
<Buffer 63 61 66 c3 a9>
> new Buffer("café").slice(0,4)
<Buffer 63 61 66 c3>
> new Buffer("café").slice(0,4).toString('utf8')
'caf�'

So we get a replacement character instead of waiting for the next bytes to properly decode the last character.

Therefore, unless the request object takes care of this, making sure that only completely decoded characters are pushed into chunks, this ubiquitous code sample is broken.
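
To make the concern concrete, here is what naive per-chunk decoding would do to that hand-made split (a REPL sketch; real chunk boundaries depend on the network and can fall anywhere):

var buf = new Buffer('café');  // <Buffer 63 61 66 c3 a9>
var first = buf.slice(0, 4);   // the boundary falls inside 'é'
var second = buf.slice(4);

// Decoding each chunk independently mangles the split character:
first.toString('utf8') + second.toString('utf8');  // 'caf��' instead of 'café'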

The alternative would be to use buffers, which means dealing with buffer size limits:

var http = require('http');
var MAX_REQUEST_BODY_SIZE = 16 * 1024 * 1024;

var server = http.createServer(function(req, res) {
    // A better way to do this could be to start with a small buffer
    // and grow it geometrically until the limit is reached
    // (a sketch of that appears below).
    var requestBody = new Buffer(MAX_REQUEST_BODY_SIZE);
    var requestBodyLength = 0;

    req.on('data', function(chunk) {
        // Once the limit has been hit, ignore the rest of the body;
        // otherwise a later, smaller chunk could still be copied in
        // and silently corrupt the data.
        if (res.statusCode == 413) {
            return;
        }
        if (requestBodyLength + chunk.length > MAX_REQUEST_BODY_SIZE) {
            res.statusCode = 413; // Request Entity Too Large
            return;
        }
        chunk.copy(requestBody, requestBodyLength, 0, chunk.length);
        requestBodyLength += chunk.length;
    });

    req.on('end', function() {
        if (res.statusCode == 413) {
            // handle 413 error
            return;
        }

        requestBody = requestBody.toString('utf8', 0, requestBodyLength);
        // process requestBody as string
    });
});
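
For completeness, the geometric growth mentioned in the comment could look something like this (a sketch under the same assumptions; the 16 KB initial size is an arbitrary choice):

var http = require('http');
var MAX_REQUEST_BODY_SIZE = 16 * 1024 * 1024;
var INITIAL_SIZE = 16 * 1024;

var server = http.createServer(function(req, res) {
    var requestBody = new Buffer(INITIAL_SIZE);
    var requestBodyLength = 0;

    req.on('data', function(chunk) {
        if (res.statusCode == 413) {
            return; // already over the limit, ignore the rest
        }
        if (requestBodyLength + chunk.length > MAX_REQUEST_BODY_SIZE) {
            res.statusCode = 413; // Request Entity Too Large
            return;
        }
        // Double the buffer until the chunk fits, capping at the limit.
        var size = requestBody.length;
        while (size < requestBodyLength + chunk.length) {
            size = Math.min(size * 2, MAX_REQUEST_BODY_SIZE);
        }
        if (size > requestBody.length) {
            var bigger = new Buffer(size);
            requestBody.copy(bigger, 0, 0, requestBodyLength);
            requestBody = bigger;
        }
        chunk.copy(requestBody, requestBodyLength);
        requestBodyLength += chunk.length;
    });

    req.on('end', function() {
        if (res.statusCode == 413) {
            res.end(); // handle 413 error
            return;
        }
        var body = requestBody.toString('utf8', 0, requestBodyLength);
        // process body as string
        res.end();
    });
});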

Am I right, or is this already taken care of by the HTTP request class?

asked Jan 28 '12 by Nicolas Lehuen

2 Answers

This is taken care of automatically. There is a string_decoder module in node which is loaded when you call setEncoding. The decoder checks the last few bytes received and stores them between emits of 'data' if they do not form a complete character, so 'data' always gets a correct string. If you do not call setEncoding, and don't use string_decoder yourself, then the buffers emitted can have the issue you mentioned, though.

The docs aren't much help (http://nodejs.org/docs/latest/api/string_decoder.html), but you can see the module itself here: https://github.com/joyent/node/blob/master/lib/string_decoder.js

The implementation of setEncoding and the logic for emitting also make it clearer:

  • setEncoding: https://github.com/joyent/node/blob/master/lib/http.js#L270
  • _emitData: https://github.com/joyent/node/blob/master/lib/http.js#L306
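
You can check the buffering behaviour from the REPL; the decoder holds back trailing bytes that do not yet form a complete character (same hand-made split as in the question):

var StringDecoder = require('string_decoder').StringDecoder;
var decoder = new StringDecoder('utf8');

var buf = new Buffer('café');     // <Buffer 63 61 66 c3 a9>

decoder.write(buf.slice(0, 4));   // 'caf' -- the trailing 0xc3 is held back
decoder.write(buf.slice(4));      // 'é'   -- emitted once 0xa9 completes it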
answered Oct 22 '22 by loganfsmyth

Just add response.setEncoding('utf8'); to the request.on('response') callback function. In my case that was sufficient.
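
For context, that is the client side of the HTTP API; a minimal sketch (the host is a placeholder):

var http = require('http');

var request = http.get({ host: 'example.com', path: '/' });

request.on('response', function(response) {
    response.setEncoding('utf8'); // chunks now arrive as correctly decoded strings
    var body = '';
    response.on('data', function(chunk) { body += chunk; });
    response.on('end', function() { console.log(body); });
});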

answered Oct 22 '22 by seukim