Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can MongoDB store and manipulate strings of UTF-8 with code points outside the basic multilingual plane?

In MongoDB 2.0.6, when attempting to store documents or query documents that contain string fields, where the value of a string include characters outside the BMP, I get a raft of errors like: "Not proper UTF-16: 55357", or "buffer too small"

What settings, changes, or recommendations are there to permit storage and query of multi-lingual strings in Mongo, particularly ones that include these characters above 0xFFFF?

Thanks.

like image 998
Eli Avatar asked Jul 31 '12 19:07

Eli


People also ask

Does MongoDB support UTF-8?

MongoDB uses UTF-8 character encoding which is part of the Unicode Standard.

In which format data is stored in MongoDB?

MongoDB stores data in BSON format both internally, and over the network, but that doesn't mean you can't think of MongoDB as a JSON database. Anything you can represent in JSON can be natively stored in MongoDB, and retrieved just as easily in JSON.

What is UTF-8 in node JS?

Overview. In this guide, you can learn how to enable or disable the Node. js driver's UTF-8 validation feature. UTF-8 is a character encoding specification that ensures compatibility and consistent presentation across most operating systems, applications, and language character sets.

What UTF-8 means?

UTF-8 (UCS Transformation Format 8) is the World Wide Web's most common character encoding. Each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character.


1 Answers

There are several issues here:

1) Please be aware that MongoDB stores all documents using the BSON format. Also note that the BSON spec referes to a UTF-8 string encoding, not a UTF-16 encoding.

Ref: http://bsonspec.org/#/specification

2) All of the drivers, including the JavaScript driver in the mongo shell, should properly handle strings that are encoded as UTF-8. (If they don't then it's a bug!) Many of the drivers happen to handle UTF-16 properly, as well, although as far as I know, UTF-16 isn't officially supported.

3) When I tested this with the Python driver, MongoDB could successfully load and return a string value that contained a broken UTF-16 code pair. However, I couldn't load a broken code pair using the mongo shell, nor could I store a string containing a broken code pair into a JavaScript variable in the shell.

4) mapReduce() runs correctly on string data using a correct UTF-16 code pair, but it will generate an error when trying to run mapReduce() on string data containing a broken code pair.

It appears that the mapReduce() is failing when MongoDB is trying to convert the BSON to a JavaScript variable for use by the JavaScript engine.

5) I've filed Jira issue SERVER-6747 for this issue. Feel free to follow it and vote it up.

like image 184
William Z Avatar answered Oct 14 '22 06:10

William Z